I currently have a dataset of transaction histories of users in the following format:
+---------+------------+------------+
| user_id | order_date | product_id |
+---------+------------+------------+
| 1 | 20190101 | 123 |
| 1 | 20190102 | 331 |
| 1 | 20190301 | 1029 |
+---------+------------+------------+
I'm trying to transform the dataset to be used for an Item2Vec model -- which I believe has to look like this:
+---------+-------------------+
| user_id | seq_vec |
+---------+-------------------+
| 1 | [123, 331, 1029] |
+---------+-------------------+
I'm assuming the dataset has to be formatted this way from looking at examples of Word2Vec (https://spark.apache.org/docs/2.2.0/ml-features.html#word2vec).
Is there a built-in PySpark method for creating a vector from the values in the product_id column if I'm grouping by user_id?
collect_list does the trick:
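from pyspark.sql import SparkSession
import pyspark.sql.functions as F
# Reuse the active session (e.g. in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()
rawData = [(1, 20190101, 123),
           (1, 20190102, 331),
           (1, 20190301, 1029)]
df = spark.createDataFrame(rawData).toDF("user_id", "order_date", "product_id")
# Collect each user's product_id values into one array column
df.groupBy("user_id").agg(F.collect_list("product_id").alias("vec")).show()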
+-------+----------------+
|user_id| vec|
+-------+----------------+
| 1|[123, 331, 1029]|
+-------+----------------+