Search code examples
pythonpandasconcatenation

Pandas - combine columns and put one after another?


I have the following dataframe:

a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6

The desirable output is:

a,b
1,3
2,4
3,5
2,4
3,5
4,6

There is a lot of "a" and "b" named headers in the dataframe, the maximum is a50 and b50. So I am looking for the way to combine them all into just "a" and "b".

I think it's possible to do with concat, but I have no idea how to combine it all, putting all the values under each other. I'll be grateful for any ideas.


Solution

  • First we read the dataframe:

    import pandas as pd
    from io import StringIO
    
    s = """a1,a2,b1,b2
    1,2,3,4
    2,3,4,5
    3,4,5,6"""
    
    df = pd.read_csv(StringIO(s), sep=',')
    

    Then we stack the columns, and separate the number of the columns from the letter 'a' or 'b':

    stacked = df.stack().rename("val").reset_index(1).reset_index()
    cols_numbers = pd.DataFrame(stacked
                                .level_1
                                .str.split('(\d)')
                                .apply(lambda l: l[:2])
                                .tolist(), 
                                columns=["col", "num"])
    x = cols_numbers.join(stacked[['val', 'index']])
    print(x)
    
       col num  val  index
    0    a   1    1      0
    1    a   2    2      0
    2    b   1    3      0
    3    b   2    4      0
    4    a   1    2      1
    5    a   2    3      1
    6    b   1    4      1
    7    b   2    5      1
    8    a   1    3      2
    9    a   2    4      2
    10   b   1    5      2
    11   b   2    6      2
    

    Finally, we group by index and num to get two columns a and b, and we fill the first row of the b column with the second value, to get what was expected:

    result = (x
             .set_index("col", append=True)
             .groupby(["index", "num"])
             .val
             .apply(lambda g: 
                    g
                    .unstack()
                    .fillna(method="bfill")
                    .head(1))
             .reset_index(-1, drop=True))
    
    print(result)
    
    col          a    b
    index num          
    0     1    1.0  3.0
          2    2.0  4.0
    1     1    2.0  4.0
          2    3.0  5.0
    2     1    3.0  5.0
          2    4.0  6.0
    

    To get rid of the multiindex at the end: result.reset_index(drop=True)