python | dataframe | pyspark | nlp | tokenize

How to add column to one dataframe from another in pyspark?


I'm new to PySpark and I was trying to do some tokenization on my data. My first DataFrame has the columns: reviewID|text|stars

I tokenized "text" following the PySpark documentation:

from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

# Split "text" into a list of tokens in a new "words" column
tokenizer = Tokenizer(inputCol="text", outputCol="words")

# UDF that counts the tokens in each row
countTokens = udf(lambda words: len(words), IntegerType())

tokenized = tokenizer.transform(df2)
tokenized.select("text", "words") \
    .withColumn("howmanywords", countTokens(col("words"))).show(truncate=False)

I got my tokens, but now I would like the transformed DataFrame to look like this:

words|stars

"words" are my tokens.

So I need to join my first DataFrame with the tokenized DataFrame to get something like that. Could you please help me? How can I add a column from one DataFrame to another?


Solution

  • OK, I got it now. I just needed to do:

    tokenizer = Tokenizer(inputCol="text", outputCol="words")

    tokenized = tokenizer.transform(df2)
    tokenized.select("text", "words", "stars").show(truncate=False)


    It works!
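    This works because Tokenizer.transform() returns the input DataFrame with the new "words" column appended: all original columns, including "stars", are still there, so no join is needed. A minimal self-contained sketch combining this with the token count from the question (the SparkSession and sample row are assumptions for illustration):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import IntegerType
    from pyspark.ml.feature import Tokenizer

    spark = SparkSession.builder.master("local[1]").appName("tok").getOrCreate()

    # Hypothetical sample row with the question's schema
    df2 = spark.createDataFrame(
        [(1, "pretty good pizza", 4)], ["reviewID", "text", "stars"]
    )

    countTokens = udf(lambda words: len(words), IntegerType())

    # transform() appends "words" while keeping "reviewID", "text", "stars"
    tokenized = Tokenizer(inputCol="text", outputCol="words").transform(df2)
    result = tokenized.select("words", "stars") \
        .withColumn("howmanywords", countTokens(col("words")))
    result.show(truncate=False)
    ```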