Search code examples
pythonpandasscikit-learndata-cleaningmultivalue

Python pandas - extracting multi-value attributes


I just started with Python and wanted to do data preparation with the numpy/pandas package on the Movielens dataset (especially the the file with MovieID, Movie Name and Year as well as Genre).

Screenshot: movielens - movie dataset

The column Genre is a multi-value column which is a problem for me since I want to try using machine learning algorithms on the datasets.

Aim: I want to have a yes/no or 0/1 information about which genre the movie falls in and which not.

Idea: Check if the 'Genre' column contains the column name of the appended columns (single genre names). If so, write yes, otherwise write now in the cell. And this iterate over all the new columns and all the rows.

Done so far: I appended empty/NaN columns to the dataframe for each Genre. And I also tried with dataframe.iloc['Genre'].str.contains(list(dataframe)[4]) which gave me the result TRUE or FALSE if the names matched or not. But how can I iterate and write in the cells in an elegant way?

Many thanks in advance. Best, Marcel

EDIT: Here you will find what I achieved so far. I split the data in the Genre column with the pipe separator, renamed the columns and appended the new columns and deleted the old column. If I now use the get_dummies function on all the columns, it creates e.g. a 'Genre1_Action', 'Genre1_Adventure', ..., 'Genre3Thriller', according to the text values displayed in the cell of the Genre cells. What I want to achieve is that each Genre gets its single columns for each movie.

# create a small test subset
subset1 = movie_data [0:9]
print("Original Dataset")
print(subset1)
# Split movie year and year in separate values -> append them to the df -> clean the Year column
tempY = subset1['MovieNameYear'].str.split('(').apply(pd.Series)
tempY.columns = ['MovieName','Year']
subset1 = pd.concat([subset1,tempY], axis=1, join='inner')
subset1['Year'] = subset1['Year'].str.replace(')','')
del subset1['MovieNameYear']

# split the column 'Genre' with the with the pipe separator in seperate columns
# name the columns of the temp value with the splitted values
# join the through split created columns to the existing subset and delete the original multi value column
tempG = subset1['Genre'].str.split('|').apply(pd.Series)
tempG.columns = ['Genre1','Genre2','Genre3']
subset1 = pd.concat([subset1, tempG], axis=1, join='inner')
del subset1['Genre']
print("Cleaned Dataset")
print(subset1)

dummiesTemp = pd.get_dummies(data=subset1, columns=['Genre1','Genre2','Genre3'])
print(dummiesTemp)

Solution

  • If I understand you well, you want to have a column per genre, indicating T/F. I would advice you to look at the get_dummies function

    import pandas as pd
    s = pd.Series(list('abca'))
    pd.get_dummies(s)
    

    Update - if you have columns with double values, you can split them before or after. Example of splitting after (would guess its quickest, but one should test). Code could be prettier, but hope its clear.

    import pandas as pd
    import numpy as np
    
    s = pd.Series(['a', 'b', 'c', 'a|b', 'a|d'])
    d = pd.get_dummies(s)
    
    columns = list(d)
    for col in columns:
        if '|' in col:
            for l in col.split('|'):
                if l in columns:
                    d[l] = np.maximum(d[l].values, d[col].values)
                else:
                    d[l] = d[col]