Search code examples
python-3.xjupyterhub

Im working on a python problem to optimize the script


  1. to make the row values of option_labels column into column headers
  2. if the option_labels is present for a particular user_id, I would apply the value of option_values in the created new columns, else it would be 0.

sample data is: (data.csv)

 user_id       country        option_values        option_labels

 abc456         Germany        256gb                  SSD
 abc123         Brazil         i5                    intel 
 xyz456         France         128gb                  SSD
 xyz123         Turkey         i7                    intel 
 abc123         Brazil         2gb                   nvidia
 abc456         Germany        32gb                   RAM
 xyz123         Turkey         4gb                   nvidia
 xyz456         France         16gb                   RAM

sample output would be:

 user_id       country        option_values     option_labels     intel         nvidia       SSD        RAM 

 abc456         Germany        256gb             SSD                0              0        256gb        0
 abc123         Brazil         i5                intel              i5             0          0          0
 xyz456         France         256gb             SSD                0              0        128gb        0
 xyz123         Turkey         i7                intel              i7             0          0          0
 abc123         Brazil         2gb               nvidia             0              2gb        0          0  
 abc456         Germany        32gb              RAM                0              0          0          32gb
 xyz123         Turkey         4gb               nvidia             0              4gb        0          0
 xyz456         France         16gb              RAM                0              0          0          16gb

I have done this process with the below sample code,

 import pandas as pd
 import numpy as np

 data_sample = pd.read_csv("data.csv")
 feature_list = data_sample["option_label"].unique().tolist()
 user_list = data_sample["user_id"].unique().tolist()
 country_list = data_sample["country"].unique().tolist()
 opt_val_list = data_sample["opt_val"].unique().tolist()

 def filterd_id(check_id):
     single_id_data= data_sample[data_sample['user_id'] == check_id]
     return single_id_data

 def finding_features(single_id_data):
     user_features = single_id_data["option_labels"].unique().tolist()
     return user_features

 def check_feature(feature_list, user_features): 
     feature_prs_not = []
     for i in feature_list:
         if(i in user_features):
             result = opt_val_list
         else:
             result = 0 
         feature_prs_not.append(result)          
     return feature_prs_not 

 user_id = []
 country = []

 for i in user_list: 
     check_id = i
     user_id.append(i)
     single_id_data = filterd_id(check_id)
     c = single_id_data["country"].unique().tolist()
     country.append(c)
     user_features = finding_features(single_id_data)
     feature_prst_not = check_feature(feature_list,user_features)    
     df = pd.DataFrame([feature_prst_not], columns = feature_list)
     df_feature = df_feature.append(df)
 df_user_id = pd.DataFrame(user_id, columns=['all_user_id'])
 df_country = pd.DataFrame(country, columns=['country_name'])

its taking much more time to run (for eg.. 8-9 hours) for my original data of nearly 100k ids. I'm still in the learning phase in Python, Im trying to optimize now to reduce the run time of the script.


Solution

  • If you want it faster you need to vectorize. I believe this code produces the same output as yours

    import numpy as np
    
    for val in df['option_labels'].unique():
        df[val] = np.where(df['option_labels']==val, df['option_values'], 0)
    

    That's how I reproduced your data

    from io import StringIO
    
    df = pd.read_csv(StringIO(''' 
    "user_id","country","option_values","option_labels"
    "abc456","Germany","256gb","SSD"
    "abc123","Brazil","i5","intel" 
    "xyz456","France","128gb","SSD"
    "xyz123","Turkey","i7","intel" 
    "abc123","Brazil","2gb","nvidia"
    "abc456","Germany","32gb","RAM"
    "xyz123","Turkey","4gb","nvidia"
    "xyz456","France","16gb","RAM"'''))