python pandas dataframe csv comments

Handle comment lines when reading CSV using pandas


Here is a simple example:

import pandas as pd
from io import StringIO
s = """a   b   c
------------
A1    1    2
A-2  -NA-  3
------------
B-1   2   -NA-
------------
"""
df = pd.read_csv(StringIO(s), sep=r'\s+', comment='-')
df

     a    b    c
0   A1  1.0  2.0
1    A  NaN  NaN
2    B  NaN  NaN

When a line contains the comment character but does not start with it, pandas treats everything from the first - to the end of the line as a comment, so A-2 is truncated to A and B-1 to B.


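My question is the one in the title: how can I get pandas to treat only the lines of dashes as comments, without truncating values such as A-2?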

Not important, but out of curiosity: can pandas handle two different kinds of comment lines, those starting with # and those starting with -?

import pandas as pd
from io import StringIO
s = """a   b   c
# comment line
------------
A1   1    2
A2  -NA-  3
------------
B1   2   -NA-
------------
"""
df = pd.read_csv(StringIO(s), sep=r'\s+', comment='#-')
df

This raises ValueError: Only length-1 comment characters supported


Solution

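  • You can preprocess the text before passing it to .read_csv: remove the runs of dashes at the start of a line with a regular expression, then let comment="#" handle the remaining comment lines. For example: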

    import re
    import pandas as pd
    from io import StringIO
    
    
    s = """a   b   c
    # comment line
    ------------
    A1    1    2
    A-2  -NA-  3
    ------------
    B-1   2   -NA-
    ------------
    """
    
    df = pd.read_csv(
        StringIO(re.sub(r"^-{2,}", "", s, flags=re.M)), sep=r"\s+", comment="#"
    )
    print(df)
    

    Prints:

         a     b     c
    0   A1     1     2
    1  A-2  -NA-     3
    2  B-1     2  -NA-
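
The same preprocessing idea can also be applied line by line, which may be handier when the data comes from a file object rather than a string. The following is only a sketch under a few assumptions: the helper name read_table_skipping_comments and the COMMENT_LINE pattern are made up for illustration, comment lines are assumed to either start with # or consist solely of two or more dashes, and -NA- is assumed to mean a missing value (hence na_values).

```python
import re
from io import StringIO

import pandas as pd

# Assumption: a comment line either starts with "#" or is made up
# entirely of two or more dashes (optionally followed by whitespace).
COMMENT_LINE = re.compile(r"^\s*(#|-{2,}\s*$)")


def read_table_skipping_comments(lines, **kwargs):
    """Hypothetical helper: drop comment lines, then parse what is left."""
    kept = [line for line in lines if not COMMENT_LINE.match(line)]
    return pd.read_csv(StringIO("".join(kept)), **kwargs)


s = """a   b   c
# comment line
------------
A1    1    2
A-2  -NA-  3
------------
B-1   2   -NA-
------------
"""

# splitlines(keepends=True) mimics iterating over an open text file.
df = read_table_skipping_comments(
    s.splitlines(keepends=True), sep=r"\s+", na_values=["-NA-"]
)
print(df)
```

With a real file, passing the open file object as lines should work the same way, since iterating over it yields one line at a time. Because the pattern only matches whole lines, values such as A-2 and B-1 are left intact, and na_values turns -NA- into NaN instead of leaving it as a string.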