Error in Reading a csv file in pandas [CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.]

Question:

So I tried reading all the csv files from a folder, concatenating them into one big csv (the structure of all the files was the same), saving it, and reading it again, all with Pandas. The error occurs while reading the saved file. I am attaching the code and the error below.

import pandas as pd
import numpy as np
import glob

path =r'somePath' # use your path
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)
store = pd.concat(list_)
store.to_csv("C:\work\DATA\Raw_data\store.csv", sep=',', index=False)
store1 = pd.read_csv("C:\work\DATA\Raw_data\store.csv", sep=',')

Error:-

CParserError                              Traceback (most recent call last)
<ipython-input-48-2983d97ccca6> in <module>()
----> 1 store1 = pd.read_csv("C:\work\DATA\Raw_data\store.csv", sep=',')

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473 
--> 474         return _read(filepath_or_buffer, kwds)
    475 
    476     parser_f.__name__ = name

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    258         return parser
    259 
--> 260     return parser.read()
    261 
    262 _parser_defaults = {

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    719                 raise ValueError('skip_footer not supported for iteration')
    720 
--> 721         ret = self._engine.read(nrows)
    722 
    723         if self.options.get('as_recarray'):

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1168 
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7544)()

pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7784)()

pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8401)()

pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8275)()

pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:20691)()

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

I tried using the csv reader as well:-

import csv
with open("C:workDATARaw_data\store.csv", 'rb') as f:
    reader = csv.reader(f)
    l = list(reader)

Error:-

Error                                     Traceback (most recent call last)
<ipython-input-36-9249469f31a6> in <module>()
      1 with open('C:\work\DATA\Raw_data\store.csv', 'rb') as f:
      2     reader = csv.reader(f)
----> 3     l = list(reader)

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

Answer #1:

Not an answer, but too long for a comment (not speaking of code formatting)

Since it also breaks when you read it with the csv module, you can at least locate the line where the error occurs:

import csv
with open(r"C:workDATARaw_datastore.csv", 'rb') as f:
    reader = csv.reader(f)
    linenumber = 1
    try:
        for row in reader:
            linenumber += 1
    except Exception as e:
        print (("Error line %d: %s %s" % (linenumber, str(type(e)), e.message)))

Then look at that line in store.csv to see what is going on there.
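For example (not part of the original answer), once the failing line number is known, you could dump the raw bytes around it so that stray \r or \n characters become visible; the path is the one from the question and bad_line is just a placeholder for the number printed above:

# Hypothetical follow-up: print the raw lines around the reported line number
bad_line = 12345  # placeholder: replace with the line number printed above
with open(r"C:\work\DATA\Raw_data\store.csv", 'rb') as f:
    for i, raw in enumerate(f, start=1):
        if bad_line - 2 <= i <= bad_line + 2:
            print(repr(raw))  # repr() makes embedded \r and \n visible
        elif i > bad_line + 2:
            break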

Answered By: Serge Ballesta

Answer #2:

I hit this error too; the cause was that there were some carriage returns ('\r') in the data that pandas was using as a line terminator, as if they were '\n'. I thought I'd post here, as that might be a common reason this error comes up.

The solution I found was to add lineterminator='\n' to the read_csv call, like this:

df_clean = pd.read_csv('test_error.csv',
                       lineterminator='\n')
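If you want to confirm that bare carriage returns really are the culprit before changing the parser, a rough check (my addition, not part of the original answer; the filename is the same example as above) is to count \r bytes that are not part of a \r\n pair:

# Rough diagnostic: count carriage returns that do not belong to a \r\n pair
with open('test_error.csv', 'rb') as f:
    raw = f.read()
print(raw.count(b'\r') - raw.count(b'\r\n'))  # a positive count suggests stray \r line breaks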
Answered By: Louise Fallon

Answer #3:

If you are using Python and it is a big file, you can pass engine='python' as below, and it should work.

df = pd.read_csv(file_, index_col=None, header=0, engine='python')
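For example (assuming the same file as in the question), the final read that failed could be retried with this engine; the python engine is slower than the default C parser but more tolerant of irregular lines:

# Sketch: re-read the concatenated file with the more forgiving python engine
store1 = pd.read_csv(r"C:\work\DATA\Raw_data\store.csv", sep=',', engine='python')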

Answered By: Firas Aswad
