
Python function to add binary columns to a pyspark df


I have a dataframe productusage like:

featureSk PersonNumber
A 1001
B 1001
C 1003
C 1004
A 1002
B 1005

I need to create a Python function that takes a list of person numbers as input and outputs a dataframe whose columns are the distinct values of the featureSk column from productusage. Basically there should be a column for each featureSk value, with a 1 on a person's row if that (PersonNumber, featureSk) pair exists in productusage and a 0 if it doesn't.

The output should be a pandas dataframe like:

PersonNumber A B C
1001 1 1 0
1002 0 0 0
1003 0 0 1
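For reference, the requested shape can be built in pure pandas with crosstab; this is a sketch, assuming productusage were loaded as a pandas frame with the question's sample rows (the function name features_for_persons is made up here). Note the sample data contains an (A, 1002) row, so 1002 would actually come out with A = 1.

```python
import pandas as pd

# Sample data from the question, as a pandas frame for illustration
productusage_pd = pd.DataFrame({
    "featureSk": ["A", "B", "C", "C", "A", "B"],
    "PersonNumber": [1001, 1001, 1003, 1004, 1002, 1005],
})

def features_for_persons(person_list):
    subset = productusage_pd[productusage_pd["PersonNumber"].isin(person_list)]
    return (
        pd.crosstab(subset["PersonNumber"], subset["featureSk"])
        .clip(upper=1)                                    # counts -> 0/1 flags
        .reindex(person_list, fill_value=0)               # all-zero rows for persons with no usage
        .reindex(columns=sorted(productusage_pd["featureSk"].unique()), fill_value=0)
        .reset_index()
    )

print(features_for_persons([1001, 1002, 1003]))
```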

This is what I tried:

def add_featureSk_to_dataframe(Person_list):
    Person_list = pd.DataFrame(Person_list)
    df = productusage
    unique_values = df[featureSk].unique()
    for value in unique_vaues:
      for person in Persons_list:
        df = df.withColumn(value, lambda person: 1 if person in Persons_list else 0)
    return df
person_test = [1001,1002,1003]
add_featureSk_to_dataframe(person_test)

I'm getting an error that featureSk is not defined, even though productusage is defined.
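The NameError comes from df[featureSk]: featureSk is written as a bare Python name rather than the string column name, so Python looks for a variable called featureSk and fails before Spark is ever involved. A minimal reproduction outside Spark:

```python
# featureSk below is an undefined Python variable, not the column-name string
row = {"featureSk": "A", "PersonNumber": 1001}

try:
    row[featureSk]            # NameError: name 'featureSk' is not defined
except NameError as e:
    print(e)

print(row["featureSk"])       # quoting the name looks up the column correctly
```

Separately, .unique() is a pandas method; on a Spark DataFrame the equivalent is selecting the column and calling .distinct(), as the solution below does.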


Solution

    from pyspark.sql.functions import col, when

    def person_has_product(person_list):
        # dfPersonQuery is the Spark DataFrame (productusage in the question)
        df = dfPersonQuery

        # Filter df for the required persons
        filtered_df = df.filter(col("personnumber").isin(person_list))

        # Cross-tabulate person against product; crosstab names its first
        # column "personnumber_featureSk", so rename it back
        cross_tab_result = (
            filtered_df.crosstab("personnumber", "featureSk")
            .withColumnRenamed("personnumber_featureSk", "personnumber")
        )

        # Cap each count at 1 so every product column is a binary flag
        for column in cross_tab_result.drop("personnumber").columns:
            cross_tab_result = cross_tab_result.withColumn(
                column, when(col(column) > 0, 1).otherwise(0)
            )

        return cross_tab_result.toPandas()

    person_lst = [1001, 1002, 1003]
    print(person_has_product(person_lst))
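One caveat with the crosstab approach: it only emits rows for persons that survive the filter, so a person in person_list with no rows in the source DataFrame gets no row at all rather than an all-zero row, and Spark's crosstab returns the key column as strings. If all-zero rows are wanted (as in the question's expected output), the pandas result can be reindexed afterwards; result_pd below is a hypothetical stand-in for cross_tab_result.toPandas():

```python
import pandas as pd

person_list = [1001, 1002, 1006]      # assume 1006 has no rows in the source data

# Hypothetical stand-in for cross_tab_result.toPandas(); Spark's crosstab
# returns the personnumber key column as strings
result_pd = pd.DataFrame({
    "personnumber": ["1001", "1002"],
    "A": [1, 1],
    "B": [1, 0],
})

completed = (
    result_pd
    .set_index("personnumber")
    .reindex([str(p) for p in person_list], fill_value=0)  # add all-zero rows
    .reset_index()
)
print(completed)
```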