pandas.read_csv: how to skip comment lines

Question :

I think I misunderstand the intention of read_csv. If I have a file ‘j’ like

# notes
a,b,c
# more notes
1,2,3

How can I read this file with pandas.read_csv, skipping any ‘#’ commented lines? The help for ‘comment’ suggests that commenting out whole lines is not supported, but it indicates that an empty line should be returned. Instead I get an error:

df = pandas.read_csv('j', comment='#')

CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3

I’m currently on

In [15]: pandas.__version__
Out[15]: '0.12.0rc1'

On version '0.12.0-199-g4c8ad82':

In [43]: df = pandas.read_csv('j', comment='#', header=None)

CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3

Asked By: mathtick


Answer #1:

I believe that in recent releases of pandas (as of version 0.16.0) you can pass the comment='#' parameter to pd.read_csv, and this should skip commented-out lines.
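As a minimal sketch of that call, assuming a reasonably recent pandas and using an in-memory string in place of the file (the sample data mirrors the string used in the answer below; the variable names are only illustrative):

```python
from io import StringIO

import pandas as pd

# In-memory stand-in for the file; two lines are pure comments.
data = "# notes\na,b,c\n# more notes\n1,2,3\n"

# comment='#' tells the parser to ignore everything from '#' to end of line.
# On recent pandas versions, fully commented lines are dropped entirely.
df = pd.read_csv(StringIO(data), comment="#")
print(df)
```

On such versions the header should be read from the first non-comment line, leaving a single data row.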

These GitHub issues show that you can do this:

See the documentation on read_csv:

Answered By: mathtick

Answer #2:

One workaround is to specify skiprows to ignore the first few entries:

In [11]: s = '# notes\na,b,c\n# more notes\n1,2,3'

In [12]: pd.read_csv(StringIO(s), sep=',', comment='#', skiprows=1)
    a   b   c
0 NaN NaN NaN
1   1   2   3

Otherwise read_csv gets a little confused:

In [13]: pd.read_csv(StringIO(s), sep=',', comment='#')
        Unnamed: 0
a   b            c
NaN NaN        NaN
1   2            3

This seems to be the case in 0.12.0; I’ve filed a bug report.

As Viktor points out, you can use dropna to remove the NaN rows after the fact (there is a recent open issue to have commented lines be ignored completely):

In [14]: pd.read_csv(StringIO(s2), comment='#', sep=',').dropna(how='all')
   a  b  c
1  1  2  3

Note: the default index will “give away” the fact there was missing data.
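If the gap in the index matters, one option is to renumber the rows after dropping them. The sketch below is only an illustration: a row of empty fields stands in for the all-NaN row that older pandas produced for a commented line, and the sample data is made up for the example.

```python
from io import StringIO

import pandas as pd

# The ',,' line parses as a row of three missing values, standing in for
# the NaN row that a commented line produced in older pandas versions.
data = "a,b,c\n,,\n1,2,3\n"

df = pd.read_csv(StringIO(data))

# Drop rows where every value is missing, then rebuild a 0-based index so
# the result no longer reveals that a row was removed.
cleaned = df.dropna(how="all").reset_index(drop=True)
print(cleaned)
```

Note that the columns stay floating point here, since the NaN row forced that dtype at parse time.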

Answered By: hlin117

Answer #3:

I am on Pandas version 0.13.1 and this comments-in-csv problem still bothers me.

Here is my present workaround:

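from io import StringIO
import pandas as pd

def read_csv(filename, comment='#', sep=','):
    # Drop lines that start with the comment marker, then parse the rest.
    with open(filename) as f:
        lines = "".join(line for line in f if not line.startswith(comment))
    return pd.read_csv(StringIO(lines), sep=sep)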

Otherwise with pd.read_csv(filename, comment='#') I get

pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 16, saw 3.
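The pre-filtering workaround above can be tried against a small temporary file. This is a sketch under a few assumptions: the helper name read_csv_skipping_comments is hypothetical, and the lstrip() call is an addition not in the original workaround, so that indented comment lines are also dropped.

```python
import os
import tempfile
from io import StringIO

import pandas as pd


def read_csv_skipping_comments(filename, comment="#", sep=","):
    # Hypothetical helper: filter out comment lines before pandas sees them.
    # lstrip() is an added assumption so indented comments are caught too.
    with open(filename) as f:
        kept = [line for line in f if not line.lstrip().startswith(comment)]
    return pd.read_csv(StringIO("".join(kept)), sep=sep)


# Write a throwaway file with plain and indented comment lines.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("# notes\na,b,c\n  # indented note\n1,2,3\n")
    path = tmp.name

try:
    df = read_csv_skipping_comments(path)
finally:
    os.remove(path)  # clean up the temporary file

print(df)
```

Filtering in Python like this reads the whole file into memory, which is fine for small files but may be slow for very large ones.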

Answered By: Andy Hayden
