apache-spark pyspark parquet

Spark does not recognize new lines, &amp, etc. from String


I'm trying to process text data (Twitter tweets) with PySpark. Emojis and special characters are read correctly, but sequences such as "\n" and "&amp" appear escaped: Spark does not interpret them, and probably others as well. One example tweet in my Spark DataFrame looks like this:

  • "Hello everyone\n\nHow is it going? 😉 Take care & enjoy"

I would like Spark to read them correctly. The files are stored as parquet and I'm reading them like this:

tweets = spark.read.format('parquet')\
.option('header', 'True')\
.option('encoding', 'utf-8')\
.load(path)

Below is some sample input data, taken from the original JSONL files (I converted the data to Parquet later).

  • "full_text": "RT @OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026"

  • "full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \n\u2b55\ufe0fThat means they can be PREVENTED"

Reading directly from the JSONL files gives the same problem:

tweets = spark.read\
.option('encoding', 'utf-8')\
.json(path)

How can I get Spark to interpret them correctly? Thank you in advance.


Solution

  • The code below might help: it strips the unwanted sequences from each line with regexp_replace.

    Input taken:

    "Hello everyone\n\nHow is it going? 😉 Take care & enjoy"
    
    "full_text": "RT @OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026 &"
    "full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \n\u2b55\ufe0fThat means they can be PREVENTED @theNCI @NCIprevention @AmericanCancer @cancereu @uicc @IARCWHO @EuropeanCancer @KanserSavascisi @AUTF_DEKANLIK @OncoAlert"
    
    

    code to solve the problem:

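    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_replace
    
    spark = SparkSession.builder.getOrCreate()
    
    # spark.read.csv gives the single unnamed column the default name _c0.
    df = spark.read.csv("file:///home/sathya/Desktop/stackoverflo/raw-data/input.tweet")
    
    # The pattern matches "&" or a literal backslash followed by "n"; matches are replaced with an empty string.
    df1 = df.withColumn("cleandata", regexp_replace("_c0", "&|\\\\n", ""))
    df1.select("cleandata").show(truncate=False)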
    
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |cleandata                                                                                                                                                                                                                                                                                                                    |
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |Hello everyoneHow is it going? 😉 Take care & enjoy                                                                                                                                                                                                                                                                          |
    |"full_text": "RT @OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026 &"                                                                                                                                                           |
    |"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \u2b55\ufe0fThat means they can be PREVENTED @theNCI @NCIprevention @AmericanCancer @cancereu @uicc @IARCWHO @EuropeanCancer @KanserSavascisi @AUTF_DEKANLIK @OncoAlert"|
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
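
If the goal is to decode these sequences rather than delete them, a plain Python helper may be enough. The sketch below assumes the column values really contain HTML entities such as "&amp;" and the literal two-character sequence backslash + "n", as the question describes. The name decode_tweet_text and the sample string are hypothetical and not part of the original answer.

```python
import html


def decode_tweet_text(text):
    """Turn HTML entities and literal backslash-n sequences into real characters."""
    if text is None:
        # Spark passes nulls to Python functions as None; keep them as None.
        return None
    # html.unescape converts entities such as "&amp;" into "&".
    unescaped = html.unescape(text)
    # Replace the two-character sequence backslash + "n" with a real newline.
    return unescaped.replace("\\n", "\n")


# Hypothetical sample resembling the tweet from the question.
sample = "Hello everyone\\n\\nHow is it going? Take care &amp; enjoy"
print(decode_tweet_text(sample))
```

In PySpark, one option is to wrap the helper with udf(decode_tweet_text, StringType()), using udf from pyspark.sql.functions and StringType from pyspark.sql.types, and apply it with withColumn. A Python UDF is slower than built-in functions, so for large datasets regexp_replace with a replacement string may be preferable where only a few sequences need handling.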