I have a dataframe productusage like:
featureSk | PersonNumber |
---|---|
A | 1001 |
B | 1001 |
C | 1003 |
C | 1004 |
A | 1002 |
B | 1005 |
I need to create a Python function that takes a list of person numbers as input and outputs a dataframe which has the values of the featureSk column from productusage as columns. Basically, there should be one column for each featureSk value, with a 1 in a row if that PersonNumber has that featureSk in productusage and a 0 if it does not.
The output should be a pandas dataframe like:
PersonNumber | A | B | C |
---|---|---|---|
1001 | 1 | 1 | 0 |
1002 | 0 | 0 | 0 |
1003 | 0 | 0 | 1 |
This is what I tried:

def add_featureSk_to_dataframe(Person_list):
    Person_list = pd.DataFrame(Person_list)
    df = productusage
    unique_values = df[featureSk].unique()
    for value in unique_vaues:
        for person in Persons_list:
            df = df.withColumn(value, lambda person: 1 if person in Persons_list else 0)
    return df

person_test = [1001,1002,1003]
add_featureSk_to_dataframe(person_test)
I get an error that featureSk is not defined, even though productusage is defined.
from pyspark.sql.functions import col, when

def person_has_product(person_list):
    # dfPersonQuery is assumed to be an existing Spark DataFrame
    # with the columns personnumber and featureSk
    df = dfPersonQuery
    # Filter df for the required persons
    filtered_df = df.filter(col("personnumber").isin(person_list))
    # Perform crosstab on the person and product columns; the first result
    # column is named "personnumber_featureSk", so rename it
    cross_tab_result = filtered_df.crosstab("personnumber", "featureSk").withColumnRenamed("personnumber_featureSk", "personnumber")
    # Turn the counts in each product column into 0/1 flags
    for column in cross_tab_result.drop("personnumber").columns:
        cross_tab_result = cross_tab_result.withColumn(column, when(col(column) > 0, 1).otherwise(0))
    # Note: persons with no rows in df do not appear in the result
    return cross_tab_result.toPandas()

person_lst = [1001, 1002, 1003]
print(person_has_product(person_lst))
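
If productusage is a plain pandas DataFrame rather than a Spark one, the same idea can be sketched with pd.crosstab. Two things about the attempt in the question are worth noting: df[featureSk] raises a NameError because the column name has to be a quoted string ("featureSk"), and withColumn is a PySpark method that pandas DataFrames do not have. The function name feature_flags and the inline sample data below are assumptions made for illustration; this is one possible approach, not the only one.

```python
import pandas as pd

# Sample data copied from the table in the question
productusage = pd.DataFrame({
    "featureSk": ["A", "B", "C", "C", "A", "B"],
    "PersonNumber": [1001, 1001, 1003, 1004, 1002, 1005],
})

def feature_flags(person_list, usage=productusage):
    """Return one row per requested person and one 0/1 column per featureSk.

    feature_flags is a hypothetical name used for this sketch.
    """
    # Every feature seen anywhere in the data, so the columns stay the same
    # no matter which persons are requested
    features = sorted(usage["featureSk"].unique())
    # Count how often each (person, feature) pair occurs
    counts = pd.crosstab(usage["PersonNumber"], usage["featureSk"])
    # Keep only the requested persons; persons missing from the data
    # get a row of zeros via fill_value
    flags = (
        counts.reindex(index=person_list, columns=features, fill_value=0)
        .gt(0)
        .astype(int)
    )
    flags.index.name = "PersonNumber"
    flags.columns.name = None
    return flags.reset_index()

result = feature_flags([1001, 1002, 1003])
print(result)
```

With the sample rows from the question, person 1002 has used feature A, so that cell comes out as 1. A person number that never appears in productusage still gets a row, filled with zeros.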