I'm new to PySpark and I'm trying to do some tokenization on my data. My first DataFrame has the schema: reviewID | text | stars
I tokenized the "text" column following the PySpark documentation:
from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

tokenizer = Tokenizer(inputCol="text", outputCol="words")
countTokens = udf(lambda words: len(words), IntegerType())
tokenized = tokenizer.transform(df2)
tokenized.select("text", "words") \
    .withColumn("howmanywords", countTokens(col("words"))).show(truncate=False)
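For anyone reproducing this: df2 is my review DataFrame. A minimal stand-in could be built like this (the sample rows are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical sample rows matching the reviewID | text | stars schema
df2 = spark.createDataFrame(
    [(1, "great food and friendly staff", 5),
     (2, "slow service", 2)],
    ["reviewID", "text", "stars"],
)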
I got my tokens, but now I would like the transformed DataFrame to look like this:
words | stars
where "words" are my tokens.
So I thought I needed to join my first DataFrame with the tokenized one to get that result. Could you please help me? How can I add a column to another DataFrame?
OK, I got it now. I just needed to do:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokenized = tokenizer.transform(df2)
tokenized.select("text", "words", "stars").show(truncate=False)
It works!
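For what it's worth, this works because Tokenizer.transform returns the input DataFrame with the "words" column appended, so all the original columns (including "stars") are still there and no join is needed. If the tokens and stars really did live in two separate DataFrames, a join on reviewID would do it; here is a minimal sketch assuming reviewID uniquely identifies each review:

# Hypothetical alternative: join the tokens back onto the stars column
# (only needed if the two DataFrames were actually separate)
result = tokenized.select("reviewID", "words") \
    .join(df2.select("reviewID", "stars"), on="reviewID")
result.show(truncate=False)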