Search code examples
pythonsqlpyspark

How to make a table from given pets to aggregate the count of the other pets?


I am writing a PySpark code from the following table.

user pets
aa [["dog"], ["cat"], ["lizard"]]
bb [["dog"], ["spider"]]
cc [["dog"], ["cat"], ["monkey"]]
... ...

Using explode_outer, I unnested the table to below.

user pets
aa "dog"
aa "cat"
aa "lizard"
bb "dog"
bb "spider"
cc "dog"
cc "cat"
cc "monkey"
... ...

I would like to make another table that contains these data below.

pet dog cat lizard spider monkey
dog 0 2 1 1 1
cat 2 0 1 0 1
lizard 1 1 0 0 0
spider 1 0 0 0 0
monkey 1 1 0 0 0

For instance, in terms of spider, it only lives with dog in the user 'bb'. So that if the data is meaningfully large, I can answer the quote, "Find top 3 pets that are the best suits with your dog."

However, I have no idea how to design appropriate code from the second table to the third one.


Solution

  • The solution based only on the Spark engine. Finding all the combinations, separating them into columns and doing a crosstab. The result omits repeating entries.

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [('aa', [["dog"], ["cat"], ["lizard"]]),
         ('bb', [["dog"], ["spider"]]),
         ('cc', [["dog"], ["cat"], ["monkey"]])],
        ['user', 'pets'])
    
    df = df.withColumn('pets', F.flatten('pets'))
    combinations = (F.filter(
        F.transform(
            F.flatten(F.transform(
                'pets',
                lambda x: F.arrays_zip(F.array_repeat(x, F.size('pets')), 'pets')
            )),
            lambda x: F.array(x['0'], x['pets'])
        ),
        lambda x: x[0] < x[1]
    ))
    df = df.withColumn('pets', F.explode(combinations))
    df = df.withColumn('pet0', F.col('pets')[0])
    df = df.withColumn('pet1', F.col('pets')[1])
    
    df = df.crosstab("pet0", "pet1")
    
    df.show()
    # +---------+---+------+------+------+
    # |pet0_pet1|dog|lizard|monkey|spider|
    # +---------+---+------+------+------+
    # |      dog|  0|     1|     1|     1|
    # |      cat|  2|     1|     1|     0|
    # +---------+---+------+------+------+