UnicodeDecodeError, invalid continuation byte

Why is the below item failing? Why does it succeed with “latin-1” codec?

o = "a test of \xe9 char"  # I want this to remain a string, as this is what I am receiving
v = o.decode("utf-8")

Which results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte

Asked By: RuiDC

Answer #1:

In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:

>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'
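
The lead-byte rule described above can be checked by hand. The following is a minimal Python 3 sketch; the helper names `expected_continuation_bytes` and `is_continuation` are made up for illustration and are not part of any library. It looks only at bit patterns, so it does not reject the few lead values that UTF-8 forbids for other reasons.

```python
def expected_continuation_bytes(lead: int) -> int:
    """Return how many 10xxxxxx bytes must follow a UTF-8 lead byte.

    Hypothetical helper for illustration; it checks the bit pattern only.
    """
    if lead & 0b1000_0000 == 0:            # 0xxxxxxx: plain ASCII
        return 0
    if lead & 0b1110_0000 == 0b1100_0000:  # 110xxxxx: one byte follows
        return 1
    if lead & 0b1111_0000 == 0b1110_0000:  # 1110xxxx: two bytes follow
        return 2
    if lead & 0b1111_1000 == 0b1111_0000:  # 11110xxx: three bytes follow
        return 3
    raise ValueError("not a UTF-8 lead byte: 0x%02x" % lead)


def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return byte & 0b1100_0000 == 0b1000_0000


data = b"a test of \xe9 char"
lead = data[10]                              # the 0xE9 byte from the question
needed = expected_continuation_bytes(lead)   # 0xE9 is 1110 1001
following = data[11:11 + needed]             # the bytes that actually follow
all_ok = all(is_continuation(b) for b in following)
```

Here `following` is a space and the letter `c`, neither of which has the form 10xx xxxx, which is why the decoder reports an invalid continuation byte.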

But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in Latin-1. You can see how UTF-8 and Latin-1 look different:

>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'

(Note, I’m using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)
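
For readers on Python 3, where the received data would be a `bytes` object, the same behaviour can be reproduced as follows. This is a sketch under the assumption that the input really is Latin-1; the variable names are illustrative.

```python
raw = b"a test of \xe9 char"    # bytes as received, one byte per character

# Decoding as UTF-8 fails, because 0xE9 announces a three-byte sequence.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Latin-1 assigns a character to every byte value, so this always succeeds.
text = raw.decode("latin-1")

# Re-encoding the decoded text gives the UTF-8 form of the same sentence.
as_utf8 = text.encode("utf-8")
```

The exception object carries the offending position in `utf8_error.start` and the explanation in `utf8_error.reason`, which can be handy for logging.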

Answered By: Josh Lee

Answer #2:

I had the same error when I tried to open a CSV file with the pandas.read_csv method.

The solution was to change the encoding to latin-1:

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, encoding='latin-1')
Answered By: Mazen Aly

Answer #3:

It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.

If you don’t know the codeset you’re receiving strings in, you’re in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) would be chosen for your protocol/application and then you’d just reject ones that didn’t decode.

If you can’t do that, you’ll need heuristics.
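
One very simple heuristic is to try strict UTF-8 first and fall back to Latin-1 only when that fails. The sketch below assumes those two candidates; `decode_with_fallback` is a hypothetical helper, not a library function. Because Latin-1 accepts every byte sequence, it has to come last, and a successful Latin-1 decode is only a guess about what the sender meant.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "latin-1")):
    """Return (text, encoding_name) for the first candidate that decodes.

    Hypothetical helper: the candidate list and its order are assumptions.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate rejected the bytes, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


latin1_result = decode_with_fallback(b"a test of \xe9 char")
utf8_result = decode_with_fallback(b"a test of \xc3\xa9 char")
```

Both calls return the same text; only the reported encoding name differs.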

Answered By: Sami J. Lehtinen

Answer #4:

Because UTF-8 is a multibyte encoding, and no character corresponds to the combination of \xe9 plus the following space.

Why should it succeed in both utf-8 and latin-1?

Here is how the same sentence should look in utf-8:

>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'
Answered By: neurino

Answer #5:

If this error arises when manipulating a file that was just opened, check to see if you opened it in 'rb' mode
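
To illustrate the difference between the modes, here is a small Python 3 sketch that writes a temporary sample file and reads it back both ways. The file contents and the choice of Latin-1 are assumptions made for the example.

```python
import os
import tempfile

# Create a small sample file containing one Latin-1 encoded byte (0xE9).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"a test of \xe9 char\n")

# Binary mode returns bytes and performs no decoding, so reading cannot
# raise UnicodeDecodeError; decoding is left to the caller.
with open(path, "rb") as f:
    raw = f.read()

# Text mode decodes while reading, so the encoding must match the file.
with open(path, "r", encoding="latin-1") as f:
    text = f.read()

os.remove(path)  # clean up the sample file
```

Opening the same file in text mode with `encoding="utf-8"` would raise the error from the question as soon as the file is read.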

Answered By: Patrick Mutuku

Answer #6:

Use this if you get the UTF-8 decoding error:

pd.read_csv('File_name.csv',encoding='latin-1')

Answer #7:

A UTF-8 decoding error usually occurs when the data contains byte values outside the range 0 to 127 that do not form a valid UTF-8 sequence.

For comparison, the rules of the ASCII encoding are:

1) If the code point is < 128, each byte is the same as the value of the code point.
2) If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

To get past the 0 to 127 limit there is a whole set of other encodings; one of the most widely used is Latin-1, also known as ISO-8859-1.

Unicode code points 0–255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can’t be encoded into Latin-1.
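
The one-to-one mapping between code points 0–255 and Latin-1 bytes can be checked in a few lines of Python 3. This is an illustrative sketch and the variable names are arbitrary.

```python
# Every code point from 0 to 255 encodes to the single byte of the same value.
round_trip_ok = all(
    chr(cp).encode("latin-1") == bytes([cp]) for cp in range(256)
)

# A code point above 255 has no Latin-1 byte, so encoding fails.
try:
    "\u20ac".encode("latin-1")   # EURO SIGN, U+20AC
    euro_error = None
except UnicodeEncodeError as exc:
    euro_error = exc

# In the other direction, any byte sequence decodes as Latin-1,
# whether or not the data was really written as Latin-1.
decoded = bytes(range(256)).decode("latin-1")
```

The last point is why switching to Latin-1 always makes the exception go away: the decode cannot fail, but the resulting text is only right if the data was actually Latin-1.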

When this exception occurs while you are loading a data set, try this form:

df=pd.read_csv("top50.csv",encoding='ISO-8859-1')

Passing the encoding argument to read_csv tells pandas how to decode the file, so the data set can be loaded.

Answered By: surya

Answer #8:

This happened to me too, while I was reading text containing Hebrew from a .txt file.

I clicked File -> Save As and saved the file with UTF-8 encoding.
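
The same conversion can be done in code when the file's original encoding is known. The sketch below assumes the file was saved as Windows-1255 (`cp1255`), a common legacy encoding for Hebrew; the answer does not say which encoding the file actually used, and `convert_to_utf8` is a hypothetical helper.

```python
import os
import tempfile


def convert_to_utf8(path: str, source_encoding: str) -> None:
    """Rewrite a text file in place as UTF-8.

    Hypothetical helper: source_encoding must be the file's real encoding.
    newline="" keeps the original line endings untouched.
    """
    with open(path, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


# Build a sample file holding the Hebrew word "shalom" in cp1255 (assumed).
hebrew = "\u05e9\u05dc\u05d5\u05dd"
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(hebrew.encode("cp1255"))

convert_to_utf8(path, "cp1255")

# The file can now be read with the utf-8 codec.
with open(path, "r", encoding="utf-8") as f:
    converted = f.read()

os.remove(path)  # clean up the sample file
```

If the source encoding is guessed wrongly, the conversion may still succeed but produce the wrong characters, so it is worth inspecting the result before overwriting important files.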

Answered By: Alon Gouldman
