Search code examples
pythonpandasregexunpivotmelt

How to use pd.melt to unpivot a dataframe where columns share a prefix?


I'm trying to unpivot my data using pd.melt but no success so far. Each row is a business, and the data contains information about the business and multiple reviews. I want my data to have every review as a row.

My first 150 columns are in groups of 15, each group column name shares the same pattern reviews/n/ for 0 < n < 9. (reviews/0/text, reviews/0/date, ... , reviews/9/date). The next 65 columns in the dataframe include more data about the business (e.g. business_id, address) that should remain as id_variables.

My current data looks like this:

business_id address reviews/0/date reviews/0/text reviews/1/date reviews/1/text
12345 01 street 1/1/1990 "abc" 2/2/1995 "def"

and my new dataframe should have every review as a row instead of every business, and look like this:

business_id address review_number review_date review_text
12345 01 street 0 1/1/1990 "abc"
12345 01 street 1 2/2/1995 "def"

I tried using pd.melt but could not succeed in making code that produced something valuable to me.


Solution

  • You can use pandas.wide_to_long() to do what you want.

    However, you will need to rename your columns from the pattern reviews/N/COL to reviews/COL/N (or something similar) first, as wide_to_long() can only unpivot based on prefixes, whereas in your column names, you have a prefix and a suffix.

    You could do this manually or e.g. using the re module and an appropriate regex:

    df = df.rename(columns=lambda x: re.sub('reviews/(\d)/(.*)', r'review_\2\1', x))
    

    After that, your data should look like this (note the changed colnames):

    business_id address review_date0 review_text0 review_date1 review_text1
    12345 01 street 1/1/1990 abc 2/2/1995 def

    Now you can use pandas.wide_to_long() and use the stubnames parameter to specify the prefix of the columns that should be grouped when you unpivot.

    df = pd.wide_to_long(df,
                         stubnames=['review_date','review_text'],
                         i=['business_id', 'address'], 
                         j='review_number')
    

    Finally, call .reset_index() to achieve the result you asked for.

    Full example:

    import re
    import pandas as pd
    
    df = pd.DataFrame({'business_id': 12345, 
                       'address': '01 street', 
                       'reviews/0/date': '1/1/1990', 
                       'reviews/0/text': 'abc', 
                       'reviews/1/date': '2/2/1995', 
                       'reviews/1/text': 'def'}, index = [0])
    
    df = df.rename(columns=lambda x: re.sub('reviews/(\d)/(.*)', r'review_\2\1', x))
    
    df = pd.wide_to_long(df,
                         stubnames=['review_date','review_text'],
                         i=['business_id', 'address'], 
                         j='review_number').reset_index()
    

    Result:

    business_id address review_number review_date review_text
    12345 01 street 0 1/1/1990 abc
    12345 01 street 1 2/2/1995 def