Search code examples
pythonpandasnumpy-ndarrayunpivotpandas-melt

Convert x same size 2D numpy arrays to a 2+x column data frame


I have two ndarrays of size (m x n), and two lists of length m and n respectively. I want to convert the two matrices to a dataframe with four columns. The first two columns correspond to the m and n dimensions, and contain the values from the lists. The next two columns should contain the values from the two matrices. In total, the resulting dataframe should have m times n rows.

Example: If these are the two matrices and two lists,

a1 = np.array([[1, 2], [3, 4],[5,6]])
a2 = np.array([[10, 20], [30, 40],[50,60]])
l1 = [5,7,99]
l2 = [2,3]

then the resulting dataframe should look like this:

"l1" "l2" "a1" "a2"
5    2    1    10
7    2    3    30
99   2    5    50
5    3    2    20
7    3    4    40
99   3    6    60

The order of the rows does not matter.

Although I only have two matrices in this specific case, I am curious about a solution which is easily applicable to any number of same size matrices.


Solution

  • Use np.vstack for join arrays created by numpy.tile, numpy.repeat and numpy.ravel and pass to DataFrame cosntructor:

    a = np.vstack((np.tile(l1, len(l2)),
                   np.repeat(l2, len(l1)),
                   np.ravel(a1, 'F'), 
                   np.ravel(a2, 'F'))).T
    print (a)
    [[ 5  2  1 10]
     [ 7  2  3 30]
     [99  2  5 50]
     [ 5  3  2 20]
     [ 7  3  4 40]
     [99  3  6 60]]
    
    
    df = pd.DataFrame(a, columns=['l1','l2','a1','a2'])
    print (df)
       l1  l2  a1  a2
    0   5   2   1  10
    1   7   2   3  30
    2  99   2   5  50
    3   5   3   2  20
    4   7   3   4  40
    5  99   3   6  60
    

    For multiple arrays:

    arrays =  [a1, a2]
    
    arr = [np.ravel(a, 'F') for a in arrays]
    a = np.vstack((np.tile(l1, len(l2)), 
                   np.repeat(l2, len(l1)),
                   arr)).T
    print (a)
    [[ 5  2  1 10]
     [ 7  2  3 30]
     [99  2  5 50]
     [ 5  3  2 20]
     [ 7  3  4 40]
     [99  3  6 60]]
    
    
    df = pd.DataFrame(a, columns=['l1','l2'] + [f'a{x+1}' for x in range(len(arrays))])
    print (df)
       l1  l2  a1  a2
    0   5   2   1  10
    1   7   2   3  30
    2  99   2   5  50
    3   5   3   2  20
    4   7   3   4  40
    5  99   3   6  60
    

    Pandas only solution with concat and DataFrame.unstack:

    df = (pd.concat([pd.DataFrame(a1, columns=l2, index=l1).unstack(),
                    pd.DataFrame(a2, columns=l2, index=l1).unstack()],
                   axis=1, keys=['a1','a2'])
            .rename_axis(['l2','l1']).swaplevel(1,0).reset_index())
    print (df)
       l1  l2  a1  a2
    0   5   2   1  10
    1   7   2   3  30
    2  99   2   5  50
    3   5   3   2  20
    4   7   3   4  40
    5  99   3   6  60
    

    For multiple arrays:

    arrays =  [a1, a2]
    df = (pd.concat([pd.DataFrame(a, columns=l2, index=l1).unstack() for a in arrays],
                   axis=1)
            .rename_axis(['l2','l1'])
            .swaplevel(1,0)
            .rename(columns=lambda x: f'a{x+1}')
            .reset_index())
    print (df)
       l1  l2  a1  a2
    0   5   2   1  10
    1   7   2   3  30
    2  99   2   5  50
    3   5   3   2  20
    4   7   3   4  40
    5  99   3   6  60