Unicode and character encodings¶
Special characters and escape sequences¶
\n stands for the newline character and \t for the tab character.
Character sequences that begin with a backslash and are used to represent other
characters are called escape sequences. Escape sequences are generally used to
represent special characters, in other words, characters for which there is no
single-character printable representation.
Here are other characters you can get with the escape character:
Escape sequence | Output | Description |
---|---|---|
\\ | \ | Backslash |
\' | ' | single quote character |
\" | " | double quote character |
\b | | Backspace (BS) |
\n | | ASCII Linefeed (LF) |
\r | | ASCII Carriage Return (CR) |
\t | | Tabulator (TAB) |
\u00B5 | µ | Unicode 16 bit |
\U000000B5 | µ | Unicode 32 bit |
\N{SNAKE} | 🐍 | Unicode Emoji name |
- Lines 1–7
The ASCII character set, which is used by Python and is the standard character set on almost all computers, defines a whole range of other special characters.
- Lines 8–9
Unicode escape sequences.
- Line 10
Unicode names for specifying a Unicode character.
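A short sketch may make the behaviour of escape sequences more concrete. The variable names and the choice of U+00B5 (MICRO SIGN) as sample character are only illustrative assumptions:

```python
# Each escape sequence is written with several characters in the source
# but produces exactly one character in the resulting string.
newline = "\n"
tab = "\t"
backslash = "\\"
print(len(newline), len(tab), len(backslash))  # 1 1 1

# Three spellings of the same code point, U+00B5 (MICRO SIGN).
micro_16 = "\u00B5"
micro_32 = "\U000000B5"
micro_name = "\N{MICRO SIGN}"
print(micro_16 == micro_32 == micro_name)  # True

# A raw string switches escape processing off: backslash and "n" remain.
raw = r"\n"
print(len(raw))  # 2
```

The raw string at the end is the usual way to keep backslashes literal, for example in regular expressions.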
There are dozens of character encodings. For an overview of Python’s encodings, see Encodings and Unicode.
The string module¶
Python’s string module distinguishes the following string constants, all of which fall into the ASCII character set:
# Some strings for ctype-style character classification
whitespace = " \t\n\r\v\f"
ascii_lowercase = "abcdefghijklmnopqrstuvwxyz"
ascii_uppercase = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
ascii_letters = ascii_lowercase + ascii_uppercase
digits = "0123456789"
hexdigits = digits + "abcdef" + "ABCDEF"
octdigits = "01234567"
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable = digits + ascii_letters + punctuation + whitespace
Most of these constants are self-explanatory from their names. hexdigits and octdigits contain the digits used in hexadecimal and octal notation, respectively. You can use these constants for everyday string manipulation:
>>> import string
>>> hepy = "Hello Pythonistas!"
>>> hepy.rstrip(string.punctuation)
'Hello Pythonistas'
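Note, however, that these constants are str objects, that is, Unicode text, even though all of their characters fall into the ASCII range. Encoded text, by contrast, is represented as binary data (bytes).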
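The constants also lend themselves to simple membership tests. The following is a minimal sketch; is_hex is a hypothetical helper introduced here for illustration and is not part of the string module:

```python
import string


def is_hex(text: str) -> bool:
    """Return True if text is non-empty and consists only of hex digits."""
    return bool(text) and all(char in string.hexdigits for char in text)


print(is_hex("e28099"))  # True
print(is_hex("e2 80"))   # False, the space is not a hex digit

# The constants are ordinary str objects, so set operations work as well.
octal_is_subset = set(string.octdigits) <= set(string.hexdigits)
print(octal_is_subset)  # True
```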
Unicode¶
It is obvious that the ASCII character set is not nearly large enough to cover all languages, dialects, symbols and glyphs; it is not even large enough for English.
While ASCII is a subset of Unicode – the first 128 characters in the Unicode table correspond exactly to the ASCII characters – Unicode encompasses a much larger set of characters. Unicode itself is not an encoding but is implemented by various character encodings, with UTF-8 probably being the most commonly used encoding scheme.
Note
The Python help documentation has an entry for Unicode: enter help() and then UNICODE. The various options for creating Python strings are described in detail.
Unicode and UTF-8¶
While Unicode is an abstract standard, UTF-8 is a concrete encoding scheme. The Unicode standard maps characters to code points and defines several different encodings for this single character set. UTF-8 is one of them: it represents each Unicode character as binary data with one to four bytes.
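The distinction between code points and their UTF-8 bytes can be sketched as follows. The four sample characters are arbitrary choices made for this illustration:

```python
# Code points are what Unicode assigns; the byte sequences are what UTF-8
# produces for them.
samples = ["A", "µ", "€", "🐍"]

for char in samples:
    code_point = ord(char)
    encoded = char.encode("utf-8")
    print(f"U+{code_point:04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# Number of UTF-8 bytes needed for each sample character.
utf8_lengths = [len(char.encode("utf-8")) for char in samples]
print(utf8_lengths)  # [1, 2, 3, 4]
```

ASCII characters need a single byte, which is why ASCII-only text is byte-for-byte identical in ASCII and UTF-8.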
Encoding and decoding¶
The str type is intended for the representation of human-readable text and can contain all Unicode characters. The bytes type, on the other hand, represents binary data that is not inherently encoded. str.encode() and bytes.decode() are the methods of transition from one to the other:
>>> "You’re welcome!".encode("utf-8")
b'You\xe2\x80\x99re welcome!'
>>> b"You\xe2\x80\x99re welcome!".decode("utf-8")
'You’re welcome!'
The result of str.encode() is a bytes object. Both bytes literals (such as b'You\xe2\x80\x99re welcome!') and representations of bytes only allow ASCII characters. For this reason, when calling "You’re welcome!".encode("utf-8"), the ASCII-compatible 'You' may be represented as it is, but the ’ becomes '\xe2\x80\x99'. This chaotic-looking sequence represents three bytes, e2, 80 and 99, as hexadecimal values.
Tip
In .encode() and .decode(), the encoding parameter is "utf-8" by default; however, it is recommended to specify it explicitly.
With bytes.fromhex() you can convert the hexadecimal values into bytes:
>>> bytes.fromhex("e2 80 99")
b'\xe2\x80\x99'
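As a small sketch of the way back, bytes.hex() reverses bytes.fromhex(), and decoding the three bytes as UTF-8 yields a single character again. The variable names are illustrative only:

```python
raw = bytes.fromhex("e2 80 99")

# bytes.hex() is the inverse of bytes.fromhex().
hex_text = raw.hex()
print(hex_text)  # e28099

# Three bytes of UTF-8 decode to one character, U+2019.
char = raw.decode("utf-8")
print(char)                 # ’
print(len(raw), len(char))  # 3 1
```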
UTF-16 and UTF-32¶
The difference between these and UTF-8 is considerable in practice. The following brief example shows that text encoded with one scheme cannot simply be decoded with another:
>>> hepy = "Hello Pythonistas!"
>>> hepy.encode("utf-8")
b'Hello Pythonistas!'
>>> len(hepy.encode("utf-8"))
18
>>> hepy.encode("utf-8").decode("utf-16")
'效汬\u206f祐桴湯獩慴ⅳ'
>>> len(hepy.encode("utf-8").decode("utf-16"))
9
Encoding Latin letters in UTF-8 and then decoding them as UTF-16 produced a text that contains characters from the Chinese, Japanese or Korean scripts as well as a Roman numeral. Decoding the same bytes object with different encodings can therefore yield results that are neither in the same language nor of the same length.
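For contrast, here is a minimal sketch of round trips that do work, because each bytes object is decoded with the codec that produced it. It also shows how the encoded size differs between the three schemes; the sizes for "utf-16" and "utf-32" include the byte order mark that these codecs prepend:

```python
hepy = "Hello Pythonistas!"

sizes = {}
for codec in ("utf-8", "utf-16", "utf-32"):
    data = hepy.encode(codec)
    # Decoding with the same codec restores the original text.
    assert data.decode(codec) == hepy
    sizes[codec] = len(data)

print(sizes)  # {'utf-8': 18, 'utf-16': 38, 'utf-32': 76}

# A mismatch is not always silent: an odd number of bytes cannot be UTF-16.
try:
    b"Hello".decode("utf-16")
except UnicodeDecodeError as error:
    mismatch_detected = True
    print("decoding failed:", error.reason)
else:
    mismatch_detected = False
```

Whether a mismatch raises an error or silently yields wrong text depends on the bytes at hand, so the encoding should always be known rather than guessed.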
Python 3 and Unicode¶
Python 3 relies fully on Unicode and specifically on UTF-8:
- Python 3 source code is assumed to be UTF-8 by default.
- Texts (str) are Unicode by default. Encoded Unicode text is represented as binary data (bytes).
- Python 3 accepts many Unicode code points in identifiers.
- Python’s re module uses the re.UNICODE flag by default, not re.ASCII. This means that, for example, r"\w" matches Unicode word characters, not just ASCII letters.
- The default encoding in str.encode() and bytes.decode() is UTF-8.
The only exception could be open(), whose default encoding is platform dependent and therefore depends on the value of locale.getpreferredencoding():
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
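The defaults described above can be checked with a short sketch. The sample text and the file name greeting.txt are assumptions made for this illustration; passing encoding= explicitly to open() is one way to make file access independent of the platform setting:

```python
import re
import tempfile
from pathlib import Path

text = "Grüße, 世界"

# \w matches Unicode word characters by default; re.ASCII restricts it.
unicode_words = re.findall(r"\w+", text)
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
print(unicode_words)  # ['Grüße', '世界']
print(ascii_words)    # ['Gr', 'e']

# str.encode() without arguments uses UTF-8.
same_bytes = text.encode() == text.encode("utf-8")
print(same_bytes)  # True

# An explicit encoding makes open() independent of the locale.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "greeting.txt"  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        restored = f.read()

print(restored == text)  # True
```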
Built-in Python Functions¶
Python has a number of built-in functions that relate to character encodings in some way:
- ascii(), bin(), hex(), oct() return a string.
- bytes, str, int are class constructors for their respective types, converting the input to the desired type.
- ord(), chr() are inverses of each other in that the Python function ord() converts a str character to its base=10 code point, while chr() does the opposite.
Below is a more detailed look at each of these nine functions:
Function | Return type | Description |
---|---|---|
ascii() | str | ASCII representation of an object, escaping non-ASCII characters. |
bin() | str | binary representation of an integer with the prefix 0b |
hex() | str | hexadecimal representation of an integer with the prefix 0x |
oct() | str | octal representation of an integer with the prefix 0o |
bytes | bytes | converts the input to bytes type |
str | str | converts the input to str type |
int | int | converts the input to int |
ord() | int | converts a single Unicode character to its integer code point |
chr() | str | converts an integer code point into a single Unicode character |
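The nine functions can be tried out together in a short sketch. The micro sign µ (code point 181) and the word jalapeño are arbitrary sample values chosen for this illustration:

```python
# Functions that return a string representation
escaped = ascii("jalapeño")
binary = bin(181)
hexadecimal = hex(181)
octal = oct(181)
print(escaped, binary, hexadecimal, octal)
# 'jalape\xf1o' 0b10110101 0xb5 0o265

# Class constructors that convert their input
as_bytes = bytes("µ", "utf-8")
as_str = str(b"\xc2\xb5", "utf-8")
as_int = int("b5", base=16)
print(as_bytes, as_str, as_int)  # b'\xc2\xb5' µ 181

# ord() and chr() are inverses of each other.
code_point = ord("µ")
char = chr(181)
print(code_point, char)  # 181 µ
```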