Split a large pandas dataframe


I have a large DataFrame with 423244 rows, and I want to split it into 4 parts. I tried the following code, which raised an error: ValueError: array split does not result in an equal division

for item in np.split(df, 4):
    print(item)

How can I split this DataFrame into 4 groups?


Solution

  • Use np.array_split:

    Docstring:
    Split an array into multiple sub-arrays.
    
    Please refer to the ``split`` documentation.  The only difference
    between these functions is that ``array_split`` allows
    `indices_or_sections` to be an integer that does *not* equally
    divide the axis.
    
    In [1]: import numpy as np
    
    In [2]: import pandas as pd
    
    In [3]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
       ...:                          'foo', 'bar', 'foo', 'foo'],
       ...:                    'B': ['one', 'one', 'two', 'three',
       ...:                          'two', 'two', 'one', 'three'],
       ...:                    'C': np.random.randn(8),
       ...:                    'D': np.random.randn(8)})
    
    In [4]: print(df)
         A      B         C         D
    0  foo    one -0.174067 -0.608579
    1  bar    one -0.860386 -1.210518
    2  foo    two  0.614102  1.689837
    3  bar  three -0.284792 -1.071160
    4  foo    two  0.843610  0.803712
    5  bar    two -1.514722  0.870861
    6  foo    one  0.131529 -0.968151
    7  foo  three -1.002946 -0.257468
    
    In [5]: np.array_split(df, 3)
    Out[5]: 
    [     A    B         C         D
    0  foo  one -0.174067 -0.608579
    1  bar  one -0.860386 -1.210518
    2  foo  two  0.614102  1.689837,
          A      B         C         D
    3  bar  three -0.284792 -1.071160
    4  foo    two  0.843610  0.803712
    5  bar    two -1.514722  0.870861,
          A      B         C         D
    6  foo    one  0.131529 -0.968151
    7  foo  three -1.002946 -0.257468]
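
The difference between split and array_split described in the docstring can be seen on a small example. The following is a minimal sketch, not the only way to do it: instead of passing the DataFrame to NumPy directly, it splits the row positions with np.array_split and selects each group with iloc. The helper name split_frame and the 10-row stand-in DataFrame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def split_frame(df, n_parts):
    """Split df row-wise into n_parts nearly equal DataFrames (hypothetical helper)."""
    # Split the row positions rather than the DataFrame itself; array_split
    # accepts a section count that does not divide the length evenly.
    position_groups = np.array_split(np.arange(len(df)), n_parts)
    return [df.iloc[positions] for positions in position_groups]


# Small stand-in for the large DataFrame: 10 rows do not divide evenly by 4.
df = pd.DataFrame({"A": range(10), "B": list("abcdefghij")})

# np.split insists on an equal division and raises ValueError otherwise.
try:
    np.split(np.arange(len(df)), 4)
    split_failed = False
except ValueError:
    split_failed = True

parts = split_frame(df, 4)
sizes = [len(part) for part in parts]
print(split_failed, sizes)  # True [3, 3, 2, 2]
```

When the row count is not a multiple of the number of parts, array_split gives the earlier parts one extra row each, so the part sizes differ by at most one. Each part keeps its original index labels, so pd.concat(parts) reassembles the original DataFrame.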