sample data is: (data.csv)
user_id country option_values option_labels
abc456 Germany 256gb SSD
abc123 Brazil i5 intel
xyz456 France 128gb SSD
xyz123 Turkey i7 intel
abc123 Brazil 2gb nvidia
abc456 Germany 32gb RAM
xyz123 Turkey 4gb nvidia
xyz456 France 16gb RAM
sample output would be:
user_id country option_values option_labels intel nvidia SSD RAM
abc456 Germany 256gb SSD 0 0 256gb 0
abc123 Brazil i5 intel i5 0 0 0
xyz456 France 256gb SSD 0 0 128gb 0
xyz123 Turkey i7 intel i7 0 0 0
abc123 Brazil 2gb nvidia 0 2gb 0 0
abc456 Germany 32gb RAM 0 0 0 32gb
xyz123 Turkey 4gb nvidia 0 4gb 0 0
xyz456 France 16gb RAM 0 0 0 16gb
I have done this process with the below sample code,
import pandas as pd
import numpy as np
data_sample = pd.read_csv("data.csv")
feature_list = data_sample["option_label"].unique().tolist()
user_list = data_sample["user_id"].unique().tolist()
country_list = data_sample["country"].unique().tolist()
opt_val_list = data_sample["opt_val"].unique().tolist()
def filterd_id(check_id):
single_id_data= data_sample[data_sample['user_id'] == check_id]
return single_id_data
def finding_features(single_id_data):
user_features = single_id_data["option_labels"].unique().tolist()
return user_features
def check_feature(feature_list, user_features):
feature_prs_not = []
for i in feature_list:
if(i in user_features):
result = opt_val_list
else:
result = 0
feature_prs_not.append(result)
return feature_prs_not
user_id = []
country = []
for i in user_list:
check_id = i
user_id.append(i)
single_id_data = filterd_id(check_id)
c = single_id_data["country"].unique().tolist()
country.append(c)
user_features = finding_features(single_id_data)
feature_prst_not = check_feature(feature_list,user_features)
df = pd.DataFrame([feature_prst_not], columns = feature_list)
df_feature = df_feature.append(df)
df_user_id = pd.DataFrame(user_id, columns=['all_user_id'])
df_country = pd.DataFrame(country, columns=['country_name'])
its taking much more time to run (for eg.. 8-9 hours) for my original data of nearly 100k ids. I'm still in the learning phase in Python, Im trying to optimize now to reduce the run time of the script.
If you want it faster you need to vectorize. I believe this code produces the same output as yours
import numpy as np
for val in df['option_labels'].unique():
df[val] = np.where(df['option_labels']==val, df['option_values'], 0)
That's how I reproduced your data
from io import StringIO
df = pd.read_csv(StringIO('''
"user_id","country","option_values","option_labels"
"abc456","Germany","256gb","SSD"
"abc123","Brazil","i5","intel"
"xyz456","France","128gb","SSD"
"xyz123","Turkey","i7","intel"
"abc123","Brazil","2gb","nvidia"
"abc456","Germany","32gb","RAM"
"xyz123","Turkey","4gb","nvidia"
"xyz456","France","16gb","RAM"'''))