Read a term-document matrix from csv using python

The classic csv reader doesn't work directly on term-document arrays because the first column of the CSV file contains terms, not values. The file has the following syntax:

"";"label1";"label2";"label3" ...

I need to build a dictionary whose keys are label1, label2, etc. and whose values are the column vectors (here it would be: dict[label1] -> 1,0, dict[label2] -> 0,0, etc.), meaning that the terms are completely useless to me.

I have implemented a custom solution which goes something like this:

keys = f.readline().split('";"') #1st line of the csv
keys = keys[1:]                  #skipping ""
zeros = [0] * len(keys)          #dicts initial values will be 0
d = OrderedDict(zip(keys, zeros))
lines = f.readlines()
for line in lines:
    # splitting, stripping, etc. gives a list of values (e.g. 1, 0, 8; see the example above)
    for value in values:

However, reading 8 CSV files (12 MB in total) takes over 90 minutes on my laptop.

Does anyone know a more efficient way to deal with this?


  • You could use the csv module anyway to read the CSV files into memory, then transpose the rows using zip(*rows) or itertools.izip(*rows):

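    import csv

    with open(somecsv, 'rb') as infile:  # Python 2; on Python 3 use open(somecsv, newline='')
        reader = csv.reader(infile, delimiter=';')
        headers = next(reader)                 # first row: '' followed by the labels
        data = list(reader)                    # remaining rows: a term followed by its values
        data = dict(zip(headers, zip(*data)))  # transpose the rows into columns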

    This creates a data dictionary with the headers as keys and the columns as values. The terms column ends up under the empty-string key ''; you can delete it from the dictionary if needed.

    For your input example, the data dictionary looks like this after executing the above code:

    {'': ('term1', 'term2'), 'label1': ('1', '0'), 'label2': ('0', '0'), 'label3': ('8', '3')}
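
If you also want the terms dropped and the values converted to numbers, the same transpose idea can be sketched for Python 3 as below. This is only an illustration under some assumptions: every value is an integer, the function name read_columns is made up for this example, and the sample data is reconstructed from the output shown above rather than taken from a real file.

```python
import csv
import io
from collections import OrderedDict


def read_columns(infile, delimiter=';'):
    """Return an OrderedDict mapping each label to a tuple of int values."""
    reader = csv.reader(infile, delimiter=delimiter)
    headers = next(reader)      # ['', 'label1', 'label2', ...]
    columns = zip(*reader)      # transpose: one tuple per column
    next(columns, None)         # discard the first column (the terms)
    return OrderedDict(
        (label, tuple(int(value) for value in column))
        for label, column in zip(headers[1:], columns)
    )


# Hypothetical sample mirroring the example data in this answer.
sample = io.StringIO(
    '"";"label1";"label2";"label3"\n'
    '"term1";1;0;8\n'
    '"term2";0;0;3\n'
)
result = read_columns(sample)
print(result['label1'])  # (1, 0)
```

For a real file, pass open(somecsv, newline='') instead of the StringIO object. The OrderedDict keeps the labels in file order, as in the question's own code.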