Search code examples
pythonpandasmergemelt

Melt and Merge on Substring - Python & Pandas


I have data which has data like

id      name    model_#   ms   bp1   cd1    sf1    sa1   rq1   bp2   cd2   sf2   sa2   rq2 ... 
1       John    23984     1    23    234    124     25   252   252    62   194    234   234 ... 
2       John    23984     2    234   234    242     62   262   622   262   622    26    262 ... 

for hundreds of models with up to 10 ms and variables counting up to 21.

I have usually used pd.melt for doing my analysis where i look at bp1:bp21 or whatever. I currently have a need to create a melt where I look at bp1 values along with rq 1 values.

I am looking to effectively create something like this:

              id  model_#  ms  variable_x  value_x variable_y  value_y
0            113    77515   1        bp1     23        rq1      252
1            113    77515   1        bp2     252       rq2      262
2            113    77515   1        bp3     26        rq3      311

Right now the best I have been able to do is:

              id  model_#  ms variable_x  value_x variable_y  value_y
0            113    77515   1        bp1     23        rq1      252
1            113    77515   1        bp1     23        rq2      262
2            113    77515   1        bp1     23        rq3      311
3            113    77515   1        bp1     23        rq4      246

via:

df = pd.melt(dat, id_vars=['id', 'mod_req', 'ms'], value_vars=bp)
df1 = pd.melt(dat, id_vars=['id', 'mod_req', 'ms'], value_vars=rq)
df2 = pd.merge(df,df1, on=['id', 'mod_req', 'ms'])

Is there an easy way to merge on substring such that bp1 will connect with rq1 and so forth? This would mean taking a melted dataframe which only looks at bp1:bp21 and a other melted dataframe rq1:rq21 and merging based on the substring values( bp1 rq1, not bp1 rq2)


Solution

  • Solution

    Set the index...
    Use a clever column groupby...
    Another clever function to apply...

    d1 = df.set_index(['id', 'name', 'model_#', 'ms'])
    
    def melt_(df):
        id_vars = df.index.names
        return df.reset_index().melt(id_vars=id_vars).set_index(id_vars)
    
    
    d2 = d1.groupby(d1.columns.str.extract('(\D+)', expand=False), axis=1).apply(melt_)
    
    d2.columns = d2.columns.swaplevel(0, 1).map('_'.join)
    d2.reset_index()
    
       id  name  model_#  ms variable_bp  value_bp variable_cd  value_cd variable_rq  value_rq variable_sa  value_sa variable_sf  value_sf
    0   1  John    23984   1         bp1        23         cd1       234         rq1       252         sa1        25         sf1       124
    1   2  John    23984   2         bp1       234         cd1       234         rq1       262         sa1        62         sf1       242
    2   1  John    23984   1         bp2       252         cd2        62         rq2       234         sa2       234         sf2       194
    3   2  John    23984   2         bp2       622         cd2       262         rq2       262         sa2        26         sf2       622
    

    Overly Functionalized

    e = lambda d, n: dict(zip(n, d.dtypes))
    i = lambda d, n: pd.DataFrame(d.values, d.index, n).astype(e(d, n))
    h = lambda d: i(d, d.columns.map(fmt)).reset_index()
    m = lambda d: d.reset_index().melt(cols).set_index(cols)
    fmt = '{0[1]}_{0[0]}'.format
    
    cols = ['id', 'name', 'model_#', 'ms']
    
    d1 = df.set_index(cols)
    g = d1.columns.str.extract('(\D+)', expand=False)
    d1.groupby(g, axis=1).apply(m).pipe(h)
    
       id  name  model_#  ms variable_bp  value_bp variable_cd  value_cd variable_rq  value_rq variable_sa  value_sa variable_sf  value_sf
    0   1  John    23984   1         bp1        23         cd1       234         rq1       252         sa1        25         sf1       124
    1   2  John    23984   2         bp1       234         cd1       234         rq1       262         sa1        62         sf1       242
    2   1  John    23984   1         bp2       252         cd2        62         rq2       234         sa2       234         sf2       194
    3   2  John    23984   2         bp2       622         cd2       262         rq2       262         sa2        26         sf2       622
    

    Old Answer

    This is far from pretty and I'm not even sure this is what you want.

    d1 = df.set_index(['id', 'name', 'model_#', 'ms'])
    
    cidx = pd.MultiIndex.from_tuples(
        d1.columns.to_series().str.extract('(\D+)(\d+)', expand=False).values.tolist(),
        names=[None, 'variable']
    )
    
    d1.columns = cidx
    
    d2 = d1.sort_index(axis=1).stack()
    
    variables = pd.DataFrame(
        (d2.columns + d2.index.get_level_values('variable')[:, None]).tolist(),
        d2.index, d2.columns
    )
    
    d3 = pd.concat(
        [variables, d2], axis=1, keys=['variable', 'value']
    ).reset_index('variable', drop=True).sort_index(axis=1, level=1, sort_remaining=False)
    
    d3.columns = d3.columns.map('_'.join)
    
    d3.reset_index()
    
       id  name  model_#  ms variable_bp  value_bp variable_cd  value_cd variable_rq  value_rq variable_sa  value_sa variable_sf  value_sf
    0   1  John    23984   1         bp1        23         cd1       234         rq1       252         sa1        25         sf1       124
    1   1  John    23984   1         bp2       252         cd2        62         rq2       234         sa2       234         sf2       194
    2   2  John    23984   2         bp1       234         cd1       234         rq1       262         sa1        62         sf1       242
    3   2  John    23984   2         bp2       622         cd2       262         rq2       262         sa2        26         sf2       622