UnicodeDecodeError when reading CSV file in Pandas with Python

Solving problems is about exposing yourself to as many situations as possible, like this UnicodeDecodeError when reading a CSV file in Pandas with Python, and practicing these strategies over and over. With time it becomes second nature and the natural way you approach any problem, big or small: always start with a plan, and use the other strategies mentioned here until you are confident and ready to code the solution.
In this post, my aim is to share an overview of the UnicodeDecodeError you can hit when reading a CSV file in Pandas with Python, which you can refer back to at any time.

UnicodeDecodeError when reading CSV file in Pandas with Python

I’m running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error…

  File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
    data = pd.read_csv(filepath, names=fields)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
    return parser.read()
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
  File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
  File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
  File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
  File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
  File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid continuation byte

The source/creation of these files all come from the same place. What’s the best way to correct this to proceed with the import?

Answer #1:

read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding="ISO-8859-1"), or alternatively encoding="utf-8" for reading, and generally utf-8 for to_csv.

You can also use one of several alias options like 'latin' instead of 'ISO-8859-1' (see python docs, also for numerous other encodings you may encounter).

See the relevant Pandas documentation, the Python docs examples on CSV files, and the many related questions here on SO. A good background resource is What Every Developer Should Know About Unicode and Character Sets.

To detect the encoding (assuming the file contains non-ASCII characters), you can use enca (see man page) or file -i (Linux) or file -I (macOS) (see man page).
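If those tools are not at hand, a stdlib-only fallback is to try a shortlist of likely encodings against the raw bytes. This is a heuristic, not true detection like enca or chardet, and the candidate list below is just an assumption to adjust for your data:

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate encoding that decodes raw without error.

    Heuristic only: latin1 accepts every possible byte, so keep it
    last as a catch-all, otherwise it masks the other candidates.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# 0xda is an invalid UTF-8 continuation byte, as in the traceback above
print(guess_encoding(b"name;city\nJo\xdao;Lisbon\n"))  # cp1252
```

Running this over the first few kilobytes of each of the 30,000 files would quickly show which ones are not UTF-8.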

Answered By: Stefan

Answer #2:

Simplest of all Solutions:

import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')

Alternate Solution:

  • Open the csv file in Sublime text editor or VS Code.
  • Save the file in utf-8 format.

In Sublime, click File -> Save with Encoding -> UTF-8

Then, you can read your file as usual:

import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')

Other encodings that often work are:

encoding = "cp1252"
encoding = "ISO-8859-1"
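The editor re-save can also be done in code. A minimal sketch, assuming the source file is cp1252 (swap in whatever encoding your file actually uses; the file names here are hypothetical, and the first line only builds a stand-in for the real file):

```python
from pathlib import Path

# Build a small cp1252-encoded csv to stand in for the real file
src, dst = Path("file_name.csv"), Path("file_name_utf8.csv")
src.write_bytes("name,city\nJoão,Porto\n".encode("cp1252"))

# The programmatic equivalent of "Save with Encoding -> UTF-8":
# decode with the source encoding, re-write as utf-8
dst.write_text(src.read_text(encoding="cp1252"), encoding="utf-8")
```

After that, pd.read_csv('file_name_utf8.csv', encoding='utf-8') reads the converted copy without errors.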
Answered By: Gil Baggio

Answer #3:

Pandas allows you to specify the encoding, but it does not allow you to ignore errors or automatically replace the offending bytes. So there is no one-size-fits-all method; the right approach depends on the actual use case.

  1. You know the encoding, and there is no encoding error in the file.
    Great: you have just to specify the encoding:

    file_encoding = 'cp1252'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    pd.read_csv(input_file_and_path, ..., encoding=file_encoding)
    
  2. You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. OK, you only have to use Latin1 encoding, because it accepts any possible byte as input (and converts it to the Unicode character with the same code point):

    pd.read_csv(input_file_and_path, ..., encoding='latin1')
    
  3. You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real-world example is a UTF-8 file that has been edited with a non-UTF-8 editor and contains some lines in a different encoding. Pandas has no provision for special error processing, but Python's open function has (assuming Python 3), and read_csv accepts a file-like object. Typical errors parameters to use here are 'ignore', which just suppresses the offending bytes, or (IMHO better) 'backslashreplace', which replaces the offending bytes with their Python backslashed escape sequence:

    file_encoding = 'utf8'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    input_fd = open(input_file_and_path, encoding=file_encoding, errors='backslashreplace')
    pd.read_csv(input_fd, ...)
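To see concretely what the two errors= handlers from option 3 do, here is the offending byte from the question's traceback run through each (plain Python, no file needed):

```python
# b'\xda' is the invalid UTF-8 continuation byte from the traceback above
bad = b"Jo\xdao"

print(bad.decode("utf8", errors="ignore"))            # Joo  (bad byte dropped)
print(bad.decode("utf8", errors="backslashreplace"))  # Jo\xdao  (escape kept)
```

With 'backslashreplace' the garbage stays visible in the loaded DataFrame, which makes it much easier to track down which rows were mis-encoded.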
    
Answered By: Serge Ballesta

Answer #4:

with open('filename.csv') as f:
   print(f)

After executing this code you will see the encoding Python opened 'filename.csv' with (note: this is the platform's default encoding, not a detection of the file's actual encoding). Then read the file as follows:

data = pd.read_csv('filename.csv', encoding="encoding as you found earlier")

there you go

Answered By: bhavesh

Answer #5:

In my case, a file had UCS-2 LE BOM encoding, according to Notepad++.
For Python, that is encoding="utf_16_le".
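One subtlety worth knowing: with the byte-order-specific codec the BOM itself is decoded as a \ufeff character (which can end up glued to your first column name), while plain "utf-16" detects and consumes it. A small sketch:

```python
# Notepad++'s "UCS-2 LE BOM" = little-endian UTF-16 prefixed with \xff\xfe
raw = b"\xff\xfe" + "name".encode("utf-16-le")

print(raw.decode("utf-16"))           # name (codec reads and strips the BOM)
print(repr(raw.decode("utf-16-le")))  # '\ufeffname' (BOM survives as a char)
```

So if a stray \ufeff shows up in your headers, try encoding="utf-16" instead of "utf_16_le".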

Hope, it helps to find an answer a bit faster for someone.

Answer #6:

Try specifying engine='python'.
It worked for me, but I'm still trying to figure out why.

df = pd.read_csv(input_file_path, ..., engine='python')
Answered By: Jan33

Answer #7:

In my case this worked for Python 2.7:

data = read_csv(filename, encoding="ISO-8859-1", dtype={'name_of_column': unicode}, low_memory=False)

And for Python 3, only:

data = read_csv(filename, encoding="ISO-8859-1", low_memory=False)
Answered By: Victor Villacorta

Answer #8:

Please try to add

encoding='unicode_escape'

This will help. Worked for me. Also, make sure you’re using the correct delimiter and column names.

You can start by loading just 1000 rows to test the file quickly.
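While iterating on delimiter and encoding choices, the nrows parameter keeps each attempt fast. A sketch, where StringIO stands in for a hypothetical file (pass your real path plus encoding= instead):

```python
import io
import pandas as pd

# Parse only the first 2 rows while pinning down delimiter/encoding,
# instead of waiting on the full file every attempt
sample = io.StringIO("name;city\nAna;Porto\nLuc;Lyon\nMia;Rome\n")
df = pd.read_csv(sample, sep=";", nrows=2)

print(df.shape)  # (2, 2)
```

Once the small read looks right, drop nrows and load the whole file with the same settings.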

Answered By: Prakhar Rathi
