I run this snippet twice in the Ubuntu terminal (encoding set to utf-8), once with ./test.py and then with the output redirected:
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni
Without redirection it prints garbage. With redirection I get a UnicodeDecodeError. Can someone explain why I get the error only in the second case, or even better give a detailed explanation of what’s going on behind the curtain in both cases?
The whole key to such encoding problems is to understand that there are in principle two distinct concepts of “string”: (1) string of characters, and (2) string/array of bytes. This distinction has been mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes); the relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable, as most programs could ignore the fact that there were multiple encodings as long as they produced text that remained on the same operating system: such programs would simply treat text as bytes (through the encoding used by the operating system). The correct, modern view properly separates these two string concepts, based on the following two points:
Characters are mostly unrelated to computers: one can draw them on a chalk board, etc. “Characters” for machines also include “drawing instructions” like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.
On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 included), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it be sent to a terminal (which expects characters encoded in a specific way), or saved in a file. In order to be displayed or properly “understood” (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression “Unicode encoding” as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).
In summary, computers need to internally represent characters with bytes, and they do so through two operations:
Encoding: characters → bytes
Decoding: bytes → characters
Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique, because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).
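As a minimal sketch of these two operations, here is how they look in Python 3, where str holds characters and bytes holds bytes. The sample text is an arbitrary choice for illustration.

```python
import unicodedata

text = "caf\u00e9"                     # 'café', with é stored as one code point

# Encoding: characters -> bytes. The result depends on the encoding chosen.
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'

# Decoding: bytes -> characters. It must use the encoding that produced the bytes.
round_trip = utf8_bytes.decode("utf-8")

# ASCII cannot encode every character.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The same user-perceived text can be made of different code points
# ('e' followed by a combining accent), which gives different bytes.
decomposed = unicodedata.normalize("NFD", text)
decomposed_bytes = decomposed.encode("utf-8")  # b'cafe\xcc\x81'
```

Decoding utf8_bytes with "latin-1" would not raise an error, but it would silently produce the wrong characters, which is the usual source of "garbage" output.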
Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python’s universal newline file reading mode).
Some more information on Unicode, characters and code points, if you are interested:
Now, what I have called “character” above is what Unicode calls a “user-perceived character”. A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called “code points”; these code points can be combined together to form a “grapheme cluster”.
Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them “Unicode strings” (like in Python 2).
While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points, not of user-perceived characters. The code point values are the ones used in Python’s \u and \U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).
This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives 각 len 3 despite s having a single user-perceived (Korean) character (because it is represented with 3 code points, even if it does not have to be, as print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
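The relationship between the two representations of this Korean syllable can be checked with the standard unicodedata module. This is a small Python 3 sketch; the variable names are only illustrative.

```python
import unicodedata

decomposed = "\u1100\u1161\u11a8"    # three conjoining jamo code points
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 3 1
print(composed == "\uac01")            # True
```

Normalizing to NFC before calling len() brings the count closer to the number of user-perceived characters, although it still does not guarantee one code point per grapheme cluster.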
In Python 2, Unicode strings are called… “Unicode strings” (unicode type, literal form u"…"), while byte arrays are “strings” (str type, where the array of bytes can for instance be constructed with string literals "…"). In Python 3, Unicode strings are simply called “strings” (str type, literal form "…"), while byte arrays are “bytes” (bytes type, literal form b"…"). As a consequence, something like "🐍"[0] gives a different result in Python 2 ('\xf0', a byte) and Python 3 ("🐍", the first and only character).
With these few key points, you should be able to understand most encoding related questions!
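To make the distinction between the two Python 3 types concrete, the following sketch compares a str with its UTF-8 encoded bytes. The code point U+1F40D is chosen only for illustration; any character outside the Basic Multilingual Plane behaves the same way.

```python
snake = "\U0001F40D"             # one code point outside the BMP
encoded = snake.encode("utf-8")  # bytes object: b'\xf0\x9f\x90\x8d'

first_char = snake[0]            # indexing a str gives a one-character str
first_byte = encoded[0]          # indexing bytes gives an int in Python 3

print(len(snake), len(encoded))  # 1 4
print(hex(first_byte))           # 0xf0
```

The 0xf0 value is the same first byte that Python 2 exposes when indexing a plain "…" literal saved in a UTF-8 source file.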
Normally, when you print u"…" to a terminal, you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:
% python
Python 2.7.6 (default, Nov 15 2013, 15:20:37)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding
UTF-8
If your input characters can be encoded with the terminal’s encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).
If your input characters cannot be encoded with the terminal’s encoding, then it means that the terminal is not configured for displaying these characters. Python will complain (with a UnicodeEncodeError, since the character string cannot be encoded in a way that suits your terminal). The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user’s terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.
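For programs that must run in unknown environments, one possible approach is to test whether a message can be encoded before printing it. The sketch below is in Python 3; can_encode and printable_version are hypothetical helper names introduced here, not part of any library.

```python
def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def printable_version(text, encoding):
    """Return text unchanged if encodable, else with '?' in place of
    each character that the encoding cannot represent."""
    if can_encode(text, encoding):
        return text
    return text.encode(encoding, errors="replace").decode(encoding)
```

A caller could pass sys.stdout.encoding as the encoding argument, falling back to "ascii" when that attribute is None.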
However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):
% python2.7 -c "import sys; print sys.stdout.encoding" | cat
None
% python3.4 -c "import sys; print(sys.stdout.encoding)" | cat
UTF-8
The encoding of stdin, stdout and stderr can however be set through the PYTHONIOENCODING environment variable, if needed:
% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat
UTF-8
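The same experiment can be scripted. The following Python 3 sketch starts a child interpreter with its stdout connected to a pipe and reports the encoding the child sees; child_stdout_encoding is a hypothetical helper name, and the exact spelling of the reported encoding name may vary between Python versions.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Run a child interpreter with piped stdout and return the value
    of its sys.stdout.encoding as a string."""
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)   # start from a clean setting
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        stdout=subprocess.PIPE,
        env=env,
        check=True,
    )
    return result.stdout.decode("ascii").strip()
```

Calling child_stdout_encoding("latin-1") should report a Latin-1 encoding, while child_stdout_encoding() shows whatever default the interpreter picks for a pipe.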
If printing to a terminal does not produce what you expect, you can check that the code points you entered manually are correct; for instance, your first character (\u001A) is not printable, if I’m not mistaken.
At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:
import codecs
import locale
import sys

# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni
For Python 3, you can check one of the questions asked previously on StackOverflow.
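As one possible Python 3 counterpart of the wrapper above, a binary stream can be wrapped in an io.TextIOWrapper with an explicit encoding. This is only a sketch: utf8_text_writer is a hypothetical helper name, and an in-memory buffer stands in for sys.stdout.buffer so that the result can be inspected.

```python
import io


def utf8_text_writer(binary_stream):
    """Wrap a binary stream so that str written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# Demonstration on an in-memory buffer standing in for sys.stdout.buffer.
buffer = io.BytesIO()
writer = utf8_text_writer(buffer)
writer.write("\u0BC3\u1451\U0001D10C\n")
writer.flush()
written = buffer.getvalue()   # the UTF-8 bytes that would reach the pipe
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") achieves a similar effect on the real standard output without creating a new wrapper.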
Python always encodes Unicode strings when writing to a terminal, file, pipe, etc. When writing to a terminal, Python can usually determine the encoding of the terminal and use it correctly. When writing to a file or pipe, Python 2 defaults to the ‘ascii’ encoding unless explicitly told otherwise. Python can be told which encoding to use for piped output through the PYTHONIOENCODING environment variable. A shell can set this variable before redirecting Python output to a file or pipe so that the correct encoding is known.
In your case you’ve printed 4 uncommon characters that your terminal didn’t support in its font. Here are some examples to help explain the behavior, with characters that are actually supported by my terminal (which uses cp437, not UTF-8).
Note that the #coding comment indicates the encoding in which the source file is saved. I chose utf8 so I could support characters in source that my terminal could not. The encoding is printed to stderr so that it can still be seen when stdout is redirected to a file.
#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ'
print >>sys.stderr,sys.stdout.encoding
print uni
Output (run directly from terminal)
Python correctly determined the encoding of the terminal.
Output (redirected to file)
None
Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
Python could not determine encoding (None) so used ‘ascii’ default. ASCII only supports converting the first 128 characters of Unicode.
Output (redirected to file, PYTHONIOENCODING=cp437)
and my output file was correct:
C:\>type out.txt
αßΓπΣσµτΦΘΩδ∞φ
Now I’ll throw in a character in the source that isn’t supported by my terminal:
#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ马' # added Chinese character at end.
print >>sys.stderr,sys.stdout.encoding
print uni
Output (run directly from terminal)
cp437
Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
  File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u9a6c' in position 14: character maps to <undefined>
My terminal didn’t understand that last Chinese character.
Output (run directly, PYTHONIOENCODING=437:replace)
Error handlers can be specified with the encoding. In this case unknown characters were replaced with ?; ignore and xmlcharrefreplace are some other options. When using UTF8 (which supports encoding all Unicode characters) replacements will never be made, but the font used to display the characters must still support them.
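The effect of the different error handlers can be seen directly with str.encode, which accepts the same handler names. This is a short Python 3 sketch using a three-character sample string chosen for illustration.

```python
text = "\u03b1\u00df\u9a6c"   # alpha, sharp s, and a Chinese character

# cp437 can represent the first two characters but not the third.
replaced = text.encode("cp437", errors="replace")            # b'\xe0\xe1?'
ignored = text.encode("cp437", errors="ignore")              # b'\xe0\xe1'
xml_refs = text.encode("cp437", errors="xmlcharrefreplace")  # b'\xe0\xe1&#39532;'

# UTF-8 can encode every character, so no handler is ever triggered.
utf8 = text.encode("utf-8", errors="replace")
```

The replace and ignore handlers lose information, while xmlcharrefreplace keeps the code point in a form that can be recovered later.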
Encode it while printing
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni.encode("utf-8")
This is needed because when you run the script directly, Python encodes the string with the terminal’s encoding before writing it out; when you pipe or redirect the output, Python 2 cannot determine an encoding and falls back to ASCII, so you have to encode explicitly when doing I/O.