Search code examples
pythonexcelsortingdata-analysis

Ordering data in python or excel


I have a large csv file of unordered data. It consists of music tags. I am trying to group all of the similar tags together for easier analysis.

An example of what I have:

Band1, hiphop, pop, rap    
Band2, rock, rap, pop    
band3, hiphop, rap

The output I am looking for would be like:

Band1, hiphop, pop, rap    
Band2, NaN,    pop, rap, rock    
Band3 hiphop,   NaN, rap

What is the best way to sort the data like this?

I have tried using pandas and doing basic sorts in excel.


Solution

  • Here's an optional way that avoids for loops, just melting and pivoting the data to get to your output:

    import pandas as pd
    import numpy as np
    
    df = pd.read_csv("./test.csv", names=['col1','col2','col3','col4'])
    
    #melt on all but the first column
    df = pd.melt(df, id_vars='col1', value_vars=df.columns[1:], value_name='genres')
    
    #pivot using the new genres column as column names
    df = pd.pivot_table(df, values='variable', index='col1', columns='genres', aggfunc='count').reset_index()
    
    #swap non-null values with the column name
    cols = df.columns[1:]
    df[cols] = np.where(df[cols].notnull(), cols, df[cols])
    

    +--------+-------+--------+-----+-----+------+
    | genres | col1  | hiphop | pop | rap | rock |
    +--------+-------+--------+-----+-----+------+
    |      0 | Band1 | hiphop | pop | rap | NaN  |
    |      1 | Band2 | NaN    | pop | rap | rock |
    |      2 | band3 | hiphop | NaN | rap | NaN  |
    +--------+-------+--------+-----+-----+------+