Search code examples
pythonpandasimportchunks

Python & Pandas. How to work with a subset of the "chunks" in a TextFileReader object?


I need to import a large .txt file (approx. 10GB) to do some calculations. I'm using Pandas in Python 2.7.

Basically, I need to construct sums and averages of certain series (columns), conditional on the value of other series. To be more precise: I have basic information on individuals living in a country, and, for example, I want to take the average age of the people in each municipality.

I cannot import the whole file (because it is too big), so I am doing it in "chunks" (using read_table, chunksize). For each calculation, I don't need all of the chunks, just a subset of them.

Since the information might not be ordered, I first iterate over all chunks to identify which of them have information for each of the municipalities. Hence, for each municipality I have a list with the indices of the chunks that contain at least one observation belonging to it.

I would then like to use this list to select only those chunks, but I'm not being able to do it in a fast way. The only thing that seems to work is to iterate over all chunks again.

Is there a way to directly select a subset of the "chunks" in a TextFileReader object without having to iterate over all of them?


Solution

  • I would try to do it this way:

    res = \
    pd.concat([df.assign(age=(pd.datetime.now() - df.dob).astype('m8[Y]').astype(int))
                 .groupby(['country','municipality'])['age'].agg(['size','sum']).reset_index()
               for df in pd.read_csv('/path/to/file.txt', sep=..., chunksize=10**5) ],
              ignore_index=True)
    
    res = res.groupby(['country','municipality'], as_index=False).sum()
    

    This will give you a total number of individuals in each municipality (size column) and a sum of their ages (sum column).

    sum/size - will give you an average age per municipality

    UPDATE: you can use the following trick in order to calculate the age on the fly:

    In [164]: df
    Out[164]:
       country municipality        dob
    0  Ukraine           m1 1950-01-01
    1  Ukraine           m1 1960-12-14
    2      USA           m2 1971-11-27
    3      USA           m2 1982-11-09
    4      USA           m3 1993-10-22
    5  Germany           m1 2004-10-04
    6  Germany           m2 2015-09-17
    
    In [165]: df.assign(age=(pd.datetime.now() - df.dob).astype('m8[Y]').astype(int))
    Out[165]:
       country municipality        dob  age
    0  Ukraine           m1 1950-01-01   67
    1  Ukraine           m1 1960-12-14   56
    2      USA           m2 1971-11-27   45
    3      USA           m2 1982-11-09   34
    4      USA           m3 1993-10-22   23
    5  Germany           m1 2004-10-04   12
    6  Germany           m2 2015-09-17    1
    

    UPDATE2: pd.read_csv() returns a pandas.io.parsers.TextFileReader instead of DataFrame as soon as you specify chunksize:

    In [6]: reader = pd.read_csv(r'D:\temp\.data\1.csv', chunksize=3, sep='\s+')
    
    In [7]: type(reader)
    Out[7]: pandas.io.parsers.TextFileReader
    

    get 2 rows

    In [12]: reader.get_chunk(2)
    Out[12]:
          foo  foo.1   bar  bar.1  spam  spam.1
    foo  0.00   0.35  0.83   0.84  0.90    0.89
    foo  0.35   0.00  0.86   0.85  0.92    0.91
    

    get next 3 rows

    In [13]: reader.get_chunk(3)
    Out[13]:
           foo  foo.1   bar  bar.1  spam  spam.1
    bar   0.83   0.86  0.00   0.25  0.88    0.87
    bar   0.84   0.85  0.25   0.00  0.82    0.86
    spam  0.90   0.92  0.88   0.82  0.00    0.50