
Transform data into an RDD and analyze it


I am new to Spark and have the data below in CSV format, which I want to convert into a proper tabular format.

CSV file with no header:

Student_name=abc, student_grades=A, Student_gender=female
Student_name=Xyz, student_grades=B, Student_gender=male

Now I want to load it into an RDD and create a header:

Student_Name   student_grades   student_gender 
abc            A                female
Xyz            B                male

I also want to get the list of students whose grades are A, B, or C.


Solution

  • You could infer the column names from the first line of the file and then transform the dataframe accordingly, that is:

    1. Remove the column name from the row values.
    2. Rename the columns.

    Here is how you could do it. First, let's read your data from a file and display it.

    // the options are here to get rid of potential spaces around the ",".
    val df = spark.read
        .option("ignoreTrailingWhiteSpace", true)
        .option("ignoreLeadingWhiteSpace", true)
        .csv("path/your_file.csv")
    
    df.show(false)
    +----------------+----------------+---------------------+
    |_c0             |_c1             |_c2                  |
    +----------------+----------------+---------------------+
    |Student_name=abc|student_grades=A|Student_gender=female|
    |Student_name=Xyz|student_grades=B|Student_gender=male  |
    +----------------+----------------+---------------------+
    

    Then, we extract a mapping between the default names and the new names using the first row of the dataframe.

    val row0 = df.head
    val cols = df
        .columns
        .map(c => c -> row0.getAs[String](c).split("=").head )
    

    Finally, we strip the column name from each value with a split on "=" and rename the columns using our mapping:

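    // split and col come from the Spark SQL functions object
    import org.apache.spark.sql.functions.{col, split}
    val new_df = df
        .select(cols.map{ case (old_name, new_name) =>
            // keep the part after "=" and alias it with the extracted name
            split(col(old_name), "=")(1) as new_name
        } : _*)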
    
    new_df.show(false)
    +------------+--------------+--------------+
    |Student_name|student_grades|Student_gender|
    +------------+--------------+--------------+
    |abc         |A             |female        |
    |Xyz         |B             |male          |
    +------------+--------------+--------------+
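
The question is tagged pyspark and also asks for the students with grades A, B, or C, which the steps above do not cover. Below is a minimal sketch in plain Python of the per-line logic one might use for that, so it runs without a Spark installation. The helper names `parse_line` and `has_wanted_grade` are hypothetical and not part of any Spark API, and the sample lines are copied from the question.

```python
# Sample lines copied from the question. With Spark they would instead be
# read from the file, e.g. sc.textFile("path/your_file.csv").
lines = [
    "Student_name=abc, student_grades=A, Student_gender=female",
    "Student_name=Xyz, student_grades=B, Student_gender=male",
]


def parse_line(line):
    """Turn 'k1=v1, k2=v2' into {'k1': 'v1', 'k2': 'v2'} (hypothetical helper)."""
    record = {}
    for field in line.split(","):
        # strip() removes the spaces around ","; partition splits on the first "="
        key, _, value = field.strip().partition("=")
        record[key] = value
    return record


def has_wanted_grade(record, wanted=("A", "B", "C")):
    """Return True when the record's grade is one of the wanted grades."""
    return record.get("student_grades") in wanted


records = [parse_line(line) for line in lines]

# The keys of the first record play the role of the header.
header = list(records[0].keys())

# Names of the students whose grade is A, B, or C.
selected = [r["Student_name"] for r in records if has_wanted_grade(r)]

print(header)    # ['Student_name', 'student_grades', 'Student_gender']
print(selected)  # ['abc', 'Xyz']
```

Assuming a SparkContext named `sc`, the same two functions could probably be passed to the RDD API as `sc.textFile("path/your_file.csv").map(parse_line).filter(has_wanted_grade)`, which would give an RDD of dictionaries rather than a dataframe.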