Search code examples
pythonpython-3.xpandasconcatenationnonetype

How to compress a multi dimentional dataframe into a single column?


I have the following dataframe:

    0   1   2   3   4   5   6   7   8
0   Twitter (True 01/21/2015)   None    None    None    None    None    None    None    None
1   Google, Inc. (True 11/07/2016)  None    None    None    None    None    None    None    None
2   Microsoft, (True 07/01/2016)    Facebook (True 11/01/2016)  None    None    None    None    None    None    None
3   standard & poors, Inc. (True 11/08/2016)    None    None    None    None    None    None    None    None
8   apple (True 11/10/2016)     apple (True 11/01/2016)     None    None    None    None    None    apple (True 11/01/2016)     None

How can I compress the above dataframe into a single dataframe?:

    0 
0   Twitter (True 01/21/2015)
1   Google, Inc. (True 11/07/2016)
2   Microsoft, (True 07/01/2016) \ Facebook (True 11/01/2016)
3   standard & poors, Inc. (True 11/08/2016) \ 
8   apple (True 11/10/2016) \ apple (True 11/01/2016) \ apple (True 11/01/2016)

I tried to:

df = df.iloc[:,0].join('\')

However, I do not understand how to add a separator. How should I compress the dataframe with a separator character?.


Solution

  • I think you need replace None to NaN and then remove NaN by stack, last groupby with apply join:

    df = df.replace({None: np.nan, 'None': np.nan}).stack()
    df = df.groupby(level=0).apply(' \\ '.join)
    print (df)
    0                            Twitter (True 01/21/2015)
    1                       Google, Inc. (True 11/07/2016)
    2    Microsoft, (True 07/01/2016) \ Facebook (True ...
    3             standard & poors, Inc. (True 11/08/2016)
    8    apple (True 11/10/2016) \ apple (True 11/01/20...
    dtype: object
    

    Another solution with list comprehension:

    df = df.replace({None: np.nan, 'None': np.nan})
    #python 3 use str, python 2 basestring
    df = df.apply(lambda x : ' \\ '.join([y for y in x if isinstance(y, str)]), axis=1)
    
    print (df)
    0                            Twitter (True 01/21/2015)
    1                       Google, Inc. (True 11/07/2016)
    2    Microsoft, (True 07/01/2016) \ Facebook (True ...
    3             standard & poors, Inc. (True 11/08/2016)
    8    apple (True 11/10/2016) \ apple (True 11/01/20...
    dtype: object
    

    Timings:

    #[50000 rows x 9 columns]
    df = pd.concat([df]*10000).reset_index(drop=True)
    
    In [43]: %timeit (df.replace({None: np.nan, 'None': np.nan}).apply(lambda x : ''.join([y for y in x if isinstance(y, str)]), axis=1))
    1 loop, best of 3: 820 ms per loop
    
    In [44]: %timeit (df.replace({None: np.nan, 'None': np.nan}).stack().groupby(level=0).apply(' \\ '.join))
    1 loop, best of 3: 4.62 s per loop