python pandas concatenation glob

Pandas: Error tokenizing data--when using glob.glob


I am using the following code to concatenate several files (candidate master files) that I downloaded; they can also be found here:

https://github.com/108michael/ms_thesis/blob/master/cn06.txt
https://github.com/108michael/ms_thesis/blob/master/cn08.txt
https://github.com/108michael/ms_thesis/blob/master/cn10.txt
https://github.com/108michael/ms_thesis/blob/master/cn12.txt
https://github.com/108michael/ms_thesis/blob/master/cn14.txt

import numpy as np
import pandas as pd
import glob


df = pd.concat((pd.read_csv(f, header=None, names=['feccandid','candname',\
'party','date', 'state', 'chamber', 'district', 'incumb.challeng', \
'cand_status', '1', '2','3','4', '5', '6'  ], usecols=['feccandid', \
'party', 'date', 'state', 'chamber'])for f in glob.glob\
        ('/home/jayaramdas/anaconda3/Thesis/FEC/cn_data/cn**.txt')))

I am getting the following error:

CParserError: Error tokenizing data. C error: Expected 2 fields in line 58, saw 4

Does anyone have a clue on this?


Solution

  • The default delimiter for pd.read_csv is the comma (,), but the fields in these files are separated by |. Since all of your candidates have names listed in the format Last, First, pandas splits each line into two columns: everything before the comma and everything after it. At least one line (line 58, according to the error) contains additional commas, so pandas sees 4 fields where it expected 2. That's the parser error.

    To use | as the delimiter instead of the comma, pass the keyword delimiter="|" or sep="|". According to the pandas docs, delimiter is an alias for sep.

    New code:

    names = ['feccandid', 'candname', 'party', 'date', 'state', 'chamber',
             'district', 'incumb.challeng', 'cand_status',
             '1', '2', '3', '4', '5', '6']
    usecols = ['feccandid', 'party', 'date', 'state', 'chamber']
    files = glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC/cn_data/cn**.txt')

    df = pd.concat(pd.read_csv(f, header=None, delimiter="|", names=names, usecols=usecols)
                   for f in files)
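
To see the effect without the real files, here is a minimal sketch that uses two made-up pipe-delimited rows held in memory (the IDs, names, and six-column layout are illustrative assumptions, not taken from the FEC data). It counts how many fields each line appears to have when split on commas, shows that the default separator rejects the inconsistent lines, and then parses the same text with sep="|".

```python
import io

import pandas as pd

# Made-up sample rows (hypothetical, not from the FEC files): pipe-delimited,
# where the second name contains an extra comma.
sample = (
    "H0AK00097|COX, JOHN R.|REP|2014|AK|H\n"
    "S0AL00156|SMITH, JANE, JR.|DEM|2014|AL|S\n"
)

# Number of fields each line appears to have when split on commas.
fields_per_line = [line.count(",") + 1 for line in sample.splitlines()]

# With the default comma separator, the lines have inconsistent field counts,
# so the parser raises a tokenizing error.
caught = None
try:
    pd.read_csv(io.StringIO(sample), header=None)
except pd.errors.ParserError as exc:
    caught = exc

# With sep="|", every line splits into the same six fields.
names = ["feccandid", "candname", "party", "date", "state", "chamber"]
usecols = ["feccandid", "party", "date", "state", "chamber"]
df = pd.read_csv(
    io.StringIO(sample), header=None, sep="|", names=names, usecols=usecols
)

print(fields_per_line)
print(caught)
print(df)
```

The commas inside the names are then just ordinary characters in the candname field, so they no longer affect the column count.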