Tags: python, machine-learning, pyspark, token, apache-spark-ml

How can I print my tokens when using pyspark.ml.feature.Tokenizer?


I would like to look at the tokens that were created when I used pyspark.ml.feature.Tokenizer. How can I do that? If I have this piece of code:

tokenizer = Tokenizer(inputCol="SystemInfo", outputCol="words")

I tried to print the tokens using print(vars(tokenizer)), but of course that returns only the attributes. The full code can be found here: https://learn.microsoft.com/de-de/azure/hdinsight/spark/apache-spark-ipython-notebook-machine-learning


Solution

  • You just need to call transform on a DataFrame and then show the result; the tokens end up in the output column. Here is a quick example to guide you. I hope it helps.

    from pyspark.ml.feature import Tokenizer
    
    df = spark.createDataFrame([
        (0, 'Hello and good day'),
        (1, 'This is a simple demonstration'),
        (2, 'Natural and unnatural language processing')
        ], ['id', 'sentence'])
    
    df.show(truncate=False)
    # +---+-----------------------------------------+
    # |id |sentence                                 |
    # +---+-----------------------------------------+
    # |0  |Hello and good day                       |
    # |1  |This is a simple demonstration           |
    # |2  |Natural and unnatural language processing|
    # +---+-----------------------------------------+
    
    tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
    tokenized = tokenizer.transform(df)
    
    tokenized.select('words').show(truncate=False)
    # +-----------------------------------------------+
    # |words                                          |
    # +-----------------------------------------------+
    # |[hello, and, good, day]                        |
    # |[this, is, a, simple, demonstration]           |
    # |[natural, and, unnatural, language, processing]|
    # +-----------------------------------------------+
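    For reference, Tokenizer itself does little more than lowercase the text and split it on whitespace. A plain-Python sketch of that logic (an approximation for quick inspection, not the actual Spark implementation):

    ```python
    def tokenize(sentence):
        """Mimic pyspark.ml.feature.Tokenizer: lowercase, then split on whitespace."""
        return sentence.lower().split()

    print(tokenize('Hello and good day'))
    # ['hello', 'and', 'good', 'day']
    ```

    And if you want the tokens back in the driver as Python lists rather than a printed table, `tokenized.select('words').collect()` returns `Row` objects whose `words` field is exactly such a list.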