Search code examples
pythonpandasdataframepandas-melt

Data Preparation Advanced Pandas Melt Multiple Column Sets. Merge some Columns to make them one


I have a dataset like:

Mileage Operator 1 16 Seater 29-seater Operator 2 16-seater.1 29-seater.1
80 BusRus £199 £299 Wheels £190 £290

I want it in the following format to build a price prediction model with Operator and Seats as Features

Mileage Operator Seats Price
80 BusRus 16 Seater £199
80 BusRus 29 Seater £299
80 Wheels 16 Seater £190
80 Wheels 29 Seater £290

Removed columns for simplicity

Data Set Has Nulls as sometimes you only get quotes for a particular seater vehicle or less Operators

I've tried the following code aswell:

import pandas as pd

# create the original dataframe
df = pd.DataFrame({
    'Mileage': [80],
    'Operator 1': ['BusRus'],
    '16-Seater': ['£199'],
    '29-Seater': ['£299'],
    'Operator 2': ['Wheels'],
    '16-Seater.1': ['£190'],
    '29-Seater.1': ['£290']
})

# melt the dataframe from wide to long format
df_melt = pd.melt(df, id_vars=['Mileage', 'Operator 1', 'Operator 2'],
                  var_name='Seats', value_name='Price')

# split the 'Seats' column into two columns - one for the number of seats and one for the type of vehicle
df_melt[['NumSeats', 'VehicleType']] = df_melt['Seats'].str.split('-', expand=True)

# determine the correct operator for each row based on which 'Seats' column the price belongs to
df_melt['Operator'] = df_melt.apply(lambda x: x['Operator 1'] if '1' in x['Seats'] else x['Operator 2'], axis=1)

# drop the original 'Seats' column and rename the remaining columns
df_melt = df_melt.drop('Seats', axis=1).rename(columns={'Operator 1': 'Dummy1', 'Operator 2': 'Dummy2'})

# drop the 'Dummy1' and 'Dummy2' columns and reorder the columns
df_melt = df_melt.drop(['Dummy1', 'Dummy2'], axis=1)[['Mileage', 'Operator', 'NumSeats', 'Price']]

# output the final dataframe
print(df_melt)

This gets me close to what I need but the Operators are incorrect

index Mileage Operator NumSeats Price
0 80 BusRus 16 £199
1 80 Wheels 29 £299
2 80 BusRus 16 £190
3 80 BusRus 29 £290

Solution

  • One option is to reorder the columns before flipping with a combination of stack and melt:

    temp = df.set_index('Mileage')
    regex = r"(?P<num1>\d+)\.(?P<num2>\d)(?P<text>-Seater).*"
    repl = lambda g: f"{g['num2']}_{g['num1']}{g['text']}"
    temp.columns = temp.columns.str.replace(regex, repl, regex=True)
    temp.columns = ["_".join(ent.split()[::-1]) 
                   if ent.startswith('Operator') 
                   else ent 
                   for ent in temp]
    temp.columns = temp.columns.str.split("_", expand=True)
    (temp
    .stack(level=0)
    .melt(id_vars='Operator', value_name = 'Price', var_name = 'Seats')
    )
    
      Operator      Seats Price
    0   BusRus  16-Seater  £199
    1   Wheels  16-Seater  £190
    2     Taxi  16-Seater  £200
    3   BusRus  29-Seater  £299
    4   Wheels  29-Seater  £290
    5     Taxi  29-Seater  £300