I have a large csv file of unordered data. It consists of music tags. I am trying to group all of the similar tags together for easier analysis.
An example of what I have:
Band1, hiphop, pop, rap
Band2, rock, rap, pop
band3, hiphop, rap
The output I am looking for would be like:
Band1, hiphop, pop, rap
Band2, NaN, pop, rap, rock
Band3 hiphop, NaN, rap
What is the best way to sort the data like this?
I have tried using pandas and doing basic sorts in excel.
Here's an optional way that avoids for
loops, just melting and pivoting the data to get to your output:
import pandas as pd
import numpy as np
df = pd.read_csv("./test.csv", names=['col1','col2','col3','col4'])
#melt on all but the first column
df = pd.melt(df, id_vars='col1', value_vars=df.columns[1:], value_name='genres')
#pivot using the new genres column as column names
df = pd.pivot_table(df, values='variable', index='col1', columns='genres', aggfunc='count').reset_index()
#swap non-null values with the column name
cols = df.columns[1:]
df[cols] = np.where(df[cols].notnull(), cols, df[cols])
+--------+-------+--------+-----+-----+------+
| genres | col1 | hiphop | pop | rap | rock |
+--------+-------+--------+-----+-----+------+
| 0 | Band1 | hiphop | pop | rap | NaN |
| 1 | Band2 | NaN | pop | rap | rock |
| 2 | band3 | hiphop | NaN | rap | NaN |
+--------+-------+--------+-----+-----+------+