Python3 の文字列　覚書

Python3ではstring = unicode

string をアレーとして考えてslice するときもマルチバイト文字を気にする必要がない。

>>> s = 'pythonに、どっぷりつかる'
>>> len(s)
15
>>> s[8:]
'どっぷりつかる'
>>> s[:6]
'python'
>>> s[:7]+s[8:12]
'pythonにどっぷり'
>>>

ただ、よいことばかりではなく、fileなどからのIO処理では読み込まれるのがバイト列になるため変換をしてあげる必要がある。　Python 2.7では必要なかったのでPython2.xのプログラムを移植するときに注意が必要。encode()としてunicodeのバイト列表示、decode()でUnicodeのバイト列をUnicodeの文字列表示

>>> s.encode()
b'python\xe3\x81\xab\xe3\x80\x81\xe3\x81\xa9\xe3\x81\xa3\xe3\x81\xb7\xe3\x82\x8a\xe3\x81\xa4\xe3\x81\x8b\xe3\x82\x8b'
>>> b'python\xe3\x81\xab\xe3\x80\x81\xe3\x81\xa9\xe3\x81\xa3\xe3\x81\xb7\xe3\x82\x8a\xe3\x81\xa4\xe3\x81\x8b\xe3\x82\x8b'.decode()
'pythonに、どっぷりつかる'
>>>

最初のバイト列がUnicode以外の文字列の場合はencodingを指定

>>> chara = b'\x63\x6c\x69\x63\x68\xe9'
>>> print(chara)
b'clich\xe9'
>>> print(chara.decode('latin-1'))
cliché

A2life

Yikes, there is no time to write, because I am busy doing most of them rather than writing about them

Python3 の文字列　覚書

Leave a Reply