I'm trying to unpivot my data using pd.melt
but no success so far. Each row is a business, and the data contains information about the business and multiple reviews. I want my data to have every review as a row.
My first 150 columns are in groups of 15, each group column name shares the same pattern reviews/n/
for 0 < n < 9
. (reviews/0/text
, reviews/0/date
, ... , reviews/9/date
).
The next 65 columns in the dataframe include more data about the business (e.g. business_id
, address
) that should remain as id_variables.
My current data looks like this:
business_id | address | reviews/0/date | reviews/0/text | reviews/1/date | reviews/1/text |
---|---|---|---|---|---|
12345 | 01 street | 1/1/1990 | "abc" | 2/2/1995 | "def" |
and my new dataframe should have every review as a row instead of every business, and look like this:
business_id | address | review_number | review_date | review_text |
---|---|---|---|---|
12345 | 01 street | 0 | 1/1/1990 | "abc" |
12345 | 01 street | 1 | 2/2/1995 | "def" |
I tried using pd.melt
but could not succeed in making code that produced something valuable to me.
You can use pandas.wide_to_long()
to do what you want.
However, you will need to rename your columns from the pattern reviews/N/COL
to reviews/COL/N
(or something similar) first, as wide_to_long()
can only unpivot based on prefixes, whereas in your column names, you have a prefix and a suffix.
You could do this manually or e.g. using the re
module and an appropriate regex:
df = df.rename(columns=lambda x: re.sub('reviews/(\d)/(.*)', r'review_\2\1', x))
After that, your data should look like this (note the changed colnames):
business_id | address | review_date0 | review_text0 | review_date1 | review_text1 |
---|---|---|---|---|---|
12345 | 01 street | 1/1/1990 | abc | 2/2/1995 | def |
Now you can use pandas.wide_to_long()
and use the stubnames
parameter to specify the prefix of the columns that should be grouped when you unpivot.
df = pd.wide_to_long(df,
stubnames=['review_date','review_text'],
i=['business_id', 'address'],
j='review_number')
Finally, call .reset_index()
to achieve the result you asked for.
Full example:
import re
import pandas as pd
df = pd.DataFrame({'business_id': 12345,
'address': '01 street',
'reviews/0/date': '1/1/1990',
'reviews/0/text': 'abc',
'reviews/1/date': '2/2/1995',
'reviews/1/text': 'def'}, index = [0])
df = df.rename(columns=lambda x: re.sub('reviews/(\d)/(.*)', r'review_\2\1', x))
df = pd.wide_to_long(df,
stubnames=['review_date','review_text'],
i=['business_id', 'address'],
j='review_number').reset_index()
Result:
business_id | address | review_number | review_date | review_text |
---|---|---|---|---|
12345 | 01 street | 0 | 1/1/1990 | abc |
12345 | 01 street | 1 | 2/2/1995 | def |