PySpark: create a vector from values in a group


I currently have a dataset of transaction histories of users in the following format:

+---------+------------+------------+
| user_id | order_date | product_id |
+---------+------------+------------+
|       1 |   20190101 |        123 |
|       1 |   20190102 |        331 |
|       1 |   20190301 |       1029 |
+---------+------------+------------+

I'm trying to transform the dataset to be used for an Item2Vec model -- which I believe has to look like this:

+---------+-------------------+
| user_id |      seq_vec      |
+---------+-------------------+
|    1    |  [123, 331, 1029] |
+---------+-------------------+

I'm assuming the dataset has to be formatted this way from looking at examples of Word2Vec (https://spark.apache.org/docs/2.2.0/ml-features.html#word2vec).

Is there a built-in PySpark method for creating a vector from the values in the product_id column when I group by user_id?


Solution

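  • collect_list does the trick: used as an aggregate after groupBy("user_id"), it gathers each user's product_id values into a single array column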

    import pyspark.sql.functions as F
    
    rawData = [(1, 20190101, 123),
               (1, 20190102, 331),
               (1, 20190301, 1029)]
    
    df = spark.createDataFrame(rawData, ["user_id", "order_date", "product_id"])
    
    df.groupBy("user_id").agg(F.collect_list("product_id").alias("vec")).show()
    
    +-------+----------------+
    |user_id|             vec|
    +-------+----------------+
    |      1|[123, 331, 1029]|
    +-------+----------------+