Search code examples
pythonpandassparse-matrix

Create Pandas DataFrame from (row, column, value) data


I have a Pandas Dataframe with three columns: row, column, value. The row values are all integers below some N, and the column values are all integers below some M. The values are all positive integers.

How do I efficiently create a Dataframe with N rows and M columns, with at index i, j the value val if (i, j , val) is a row in my original Dataframe, and some default value (0) otherwise? Furthermore, is it possible to create a sparse Dataframe immediately, since the data is already quite large, but N*M is still about 10 times the size of my data?


Solution

  • A NumPy solution would suit here for performance -

    a = df.values
    m,n = a[:,:2].max(0)+1
    out = np.zeros((m,n),dtype=a.dtype)
    out[a[:,0], a[:,1]] = a[:,2]
    df_out = pd.DataFrame(out)
    

    Sample run -

    In [58]: df
    Out[58]: 
        row  col  val
    0     7    1   30
    1     3    3    0
    2     4    8   30
    3     5    8   18
    4     1    3    6
    5     1    6   48
    6     0    2    6
    7     4    7    6
    8     5    0   48
    9     8    1   48
    10    3    2   12
    11    6    8   18
    
    In [59]: df_out
    Out[59]: 
        0   1   2  3  4  5   6  7   8
    0   0   0   6  0  0  0   0  0   0
    1   0   0   0  6  0  0  48  0   0
    2   0   0   0  0  0  0   0  0   0
    3   0   0  12  0  0  0   0  0   0
    4   0   0   0  0  0  0   0  6  30
    5  48   0   0  0  0  0   0  0  18
    6   0   0   0  0  0  0   0  0  18
    7   0  30   0  0  0  0   0  0   0
    8   0  48   0  0  0  0   0  0   0