Tags: python, pandas, dataframe, nan

Compacting data in a pandas DataFrame by removing NaNs and shifting values left to reduce number of columns


I have a DataFrame that looks like this:

5.29559     NaN     2.38176     NaN     0.51521     NaN     0.04454     0.00000     None    None    None    None    None    None    None    None
0   NaN     NaN     NaN     NaN     0   NaN     NaN     0   NaN     NaN     0   2   None    None    None
4.32454     NaN     1.77600     NaN     0.04454     NaN     0.00000     None    None    None    None    None    None    None    None    None
0   NaN     NaN     NaN     NaN     0   NaN     NaN     0   NaN     NaN     2   None    None    None    None    

I am trying to remove all the NaN values and shift the remaining values left, so that the DataFrame looks like this:

5.29559     2.38176     0.51521     0.04454     0.00000     
      0           0           0           0           2         
4.32454     1.77600     0.04454     0.00000     
      0           0           0           2     

Can someone please help? I tried the dropna() method but it did not help.


Solution

  • Let's try stacking to eliminate the NaNs, then resetting the inner index level within each row group, and finally unstacking again:

    (df.stack()
       .groupby(level=0)
       .apply(lambda s: s.reset_index(drop=True))
       .unstack())
    
             0        1        2        3    4
    0  5.29559  2.38176  0.51521  0.04454  0.0
    1  0.00000  0.00000  0.00000  0.00000  2.0
    2  4.32454  1.77600  0.04454  0.00000  NaN
    3  0.00000  0.00000  0.00000  2.00000  NaN
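
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The frame below is an assumption: it is rebuilt from the column positions visible in the stacked output (13 columns, 0 to 12), not copied from the asker's real data. The explicit `.dropna()` is also an addition, included in case the installed pandas version keeps NaNs when stacking.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Assumed reconstruction of the question's data: values sit at the
# column positions shown in the stacked output; everything else is NaN.
df = pd.DataFrame([
    [5.29559, nan, 2.38176, nan, 0.51521, nan, 0.04454, 0.0, nan, nan, nan, nan, nan],
    [0.0, nan, nan, nan, nan, 0.0, nan, nan, 0.0, nan, nan, 0.0, 2.0],
    [4.32454, nan, 1.77600, nan, 0.04454, nan, 0.0, nan, nan, nan, nan, nan, nan],
    [0.0, nan, nan, nan, nan, 0.0, nan, nan, 0.0, nan, nan, 2.0, nan],
])

compact = (
    df.stack()
      .dropna()          # explicit, so the result does not rely on stack() dropping NaNs
      .groupby(level=0)  # one group per original row
      .apply(lambda s: s.reset_index(drop=True))  # renumber positions 0, 1, 2, ...
      .unstack()         # positions become columns again
)

print(compact)
```

Rows with fewer values than the longest row are padded with NaN on the right, since a DataFrame has to stay rectangular.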
    

    Explanation:

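    First, stack to remove the NaNs. In the pandas versions this answer was written for, stack() drops missing values by default, so only the real entries remain, indexed by (row, original column):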

    df.stack()
    
    0  0     5.29559
       2     2.38176
       4     0.51521
       6     0.04454
       7     0.00000
    1  0     0.00000
       5     0.00000
       8     0.00000
       11    0.00000
       12    2.00000
    2  0     4.32454
       2     1.77600
       4     0.04454
       6     0.00000
    3  0     0.00000
       5     0.00000
       8     0.00000
       11    2.00000 
    dtype: float64
    

    You'll notice the inner level of the index still holds the original column positions (0, 2, 4, ...) rather than a consecutive 0, 1, 2, ... range. Let's fix that with groupby.apply:

    # `_` is the previous output in an IPython/Jupyter session
    _.groupby(level=0).apply(lambda s: s.reset_index(drop=True))
    
    0  0    5.29559
       1    2.38176
       2    0.51521
       3    0.04454
       4    0.00000
    1  0    0.00000
       1    0.00000
       2    0.00000
       3    0.00000
       4    2.00000
    2  0    4.32454
       1    1.77600
       2    0.04454
       3    0.00000
    3  0    0.00000
       1    0.00000
       2    0.00000
       3    2.00000
    dtype: float64
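
The same renumbering can probably be done without `apply`, using `cumcount` to number the entries within each row group. This is a sketch of an alternative, not the method from the answer above; the small frame and the names `stacked` and `position` are made up for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Small made-up frame, just to show the mechanics.
df = pd.DataFrame([
    [1.5, nan, 2.5, nan],
    [nan, 3.0, nan, nan],
])

stacked = df.stack().dropna()

# Number the surviving values 0, 1, 2, ... within each original row.
position = stacked.groupby(level=0).cumcount()

# Replace the inner index level (original column labels) with those positions.
stacked.index = pd.MultiIndex.from_arrays(
    [stacked.index.get_level_values(0), position.to_numpy()]
)

compact = stacked.unstack()
print(compact)
```

This avoids calling a Python lambda once per group, which may matter on frames with many rows.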
    

    Now we unstack, which moves the new inner level back into columns; shorter rows are padded with NaN on the right:

    _.unstack()
    
             0        1        2        3    4
    0  5.29559  2.38176  0.51521  0.04454  0.0
    1  0.00000  0.00000  0.00000  0.00000  2.0
    2  4.32454  1.77600  0.04454  0.00000  NaN
    3  0.00000  0.00000  0.00000  2.00000  NaN
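
If the stack/unstack round trip feels indirect, a row-wise variant is another option. This is a sketch under the assumption that readability matters more than speed; `compact_row` is a hypothetical helper name and the frame is made up for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan

def compact_row(row):
    """Drop NaNs from one row and renumber what is left from 0."""
    return pd.Series(row.dropna().to_numpy())

# Small made-up frame, just to show the mechanics.
df = pd.DataFrame([
    [1.5, nan, 2.5, nan],
    [nan, 3.0, nan, nan],
])

# Each row returns a Series of a different length; pandas aligns them
# into one frame and fills the gaps on the right with NaN.
compact = df.apply(compact_row, axis=1)
print(compact)
```

This calls a Python function for every row, so it is likely slower than the stacked version on large frames.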