Python CSV error: line contains NULL byte

Posted on

Question :

Python CSV error: line contains NULL byte

I’m working with some CSV files, with the following code:

reader = csv.reader(open(filepath, "rU"))
try:
    for row in reader:
        print 'Row read successfully!', row
except csv.Error, e:
    sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))

And one file is throwing this error:

file my.csv, line 1: line contains NULL byte

What can I do? Google seems to suggest that it may be an Excel file that’s been saved as a .csv improperly. Is there any way I can get round this problem in Python?

== UPDATE ==

Following @JohnMachin’s comment below, I tried adding these lines to my script:

print repr(open(filepath, 'rb').read(200)) # dump 1st 200 bytes of file
data = open(filepath, 'rb').read()
print data.find('x00')
print data.count('x00')

And this is the output I got:

'xd0xcfx11xe0xa1xb1x1axe1x00x00x00x00x00x00x00x00 .... <snip>
8
13834

So the file does indeed contain NUL bytes.

Asked By: AP257

||

Answer #1:

As @S.Lott says, you should be opening your files in ‘rb’ mode, not ‘rU’ mode. However that may NOT be causing your current problem. As far as I know, using ‘rU’ mode would mess you up if there are embedded r in the data, but not cause any other dramas. I also note that you have several files (all opened with ‘rU’ ??) but only one causing a problem.

If the csv module says that you have a “NULL” (silly message, should be “NUL”) byte in your file, then you need to check out what is in your file. I would suggest that you do this even if using ‘rb’ makes the problem go away.

repr() is (or wants to be) your debugging friend. It will show unambiguously what you’ve got, in a platform independant fashion (which is helpful to helpers who are unaware what od is or does). Do this:

print repr(open('my.csv', 'rb').read(200)) # dump 1st 200 bytes of file

and carefully copy/paste (don’t retype) the result into an edit of your question (not into a comment).

Also note that if the file is really dodgy e.g. no r or n within reasonable distance from the start of the file, the line number reported by reader.line_num will be (unhelpfully) 1. Find where the first x00 is (if any) by doing

data = open('my.csv', 'rb').read()
print data.find('x00')

and make sure that you dump at least that many bytes with repr or od.

What does data.count('x00') tell you? If there are many, you may want to do something like

for i, c in enumerate(data):
    if c == 'x00':
        print i, repr(data[i-30:i]) + ' *NUL* ' + repr(data[i+1:i+31])

so that you can see the NUL bytes in context.

If you can see x00 in the output (or in your od -c output), then you definitely have NUL byte(s) in the file, and you will need to do something like this:

fi = open('my.csv', 'rb')
data = fi.read()
fi.close()
fo = open('mynew.csv', 'wb')
fo.write(data.replace('x00', ''))
fo.close()

By the way, have you looked at the file (including the last few lines) with a text editor? Does it actually look like a reasonable CSV file like the other (no “NULL byte” exception) files?

Answered By: AP257

Answer #2:

data_initial = open("staff.csv", "rb")
data = csv.reader((line.replace('','') for line in data_initial), delimiter=",")

This works for me.

Answered By: John Machin

Answer #3:

Reading it as UTF-16 was also my problem.

Here’s my code that ended up working:

f=codecs.open(location,"rb","utf-16")
csvread=csv.reader(f,delimiter='t')
csvread.next()
for row in csvread:
    print row

Where location is the directory of your csv file.

Answered By: double

Answer #4:

I bumped into this problem as well. Using the Python csv module, I was trying to read an XLS file created in MS Excel and running into the NULL byte error you were getting. I looked around and found the xlrd Python module for reading and formatting data from MS Excel spreadsheet files. With the xlrd module, I am not only able to read the file properly, but I can also access many different parts of the file in a way I couldn’t before.

I thought it might help you.

Answered By: User

Answer #5:

Converting the encoding of the source file from UTF-16 to UTF-8 solve my problem.

How to convert a file to utf-8 in Python?

import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "utf-16") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)
Answered By: ayaz

Answer #6:

You could just inline a generator to filter out the null values if you want to pretend they don’t exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.

with open(filepath, "rb") as f:
    reader = csv.reader( (line.replace('','') for line in f) )

    try:
        for row in reader:
            print 'Row read successfully!', row
    except csv.Error, e:
        sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
Answered By: Patrick Halley

Answer #7:

Why are you doing this?

 reader = csv.reader(open(filepath, "rU"))

The docs are pretty clear that you must do this:

with open(filepath, "rb") as src:
    reader= csv.reader( src )

The mode must be “rb” to read.

http://docs.python.org/library/csv.html#csv.reader

If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.

Answered By: woot

Answer #8:

appparently it’s a XLS file and not a CSV file as http://www.garykessler.net/library/file_sigs.html confirm

Answered By: S.Lott

Leave a Reply

Your email address will not be published.