apache-spark pyspark apache-spark-sql rdd key-value

Transform data into an RDD and analyze it

I am new to Spark and have the data below in a CSV file, which I want to convert into a proper tabular format.

CSV file with no header:

Student_name=abc, student_grades=A, Student_gender=female
Student_name=Xyz, student_grades=B, Student_gender=male

Now I want to load it into an RDD and create a header:

Student_Name   student_grades   student_gender 
abc            A                female
Xyz            B                male

I also want to get the list of students whose grade is A, B, or C.

Solution

You could infer the column names from the first line of the file and then transform the DataFrame accordingly, that is:

Remove the column name from the row values.
Rename the columns
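
To make these two steps concrete before bringing Spark in, here is a minimal plain-Scala sketch applied to a single line of the file. `parseLine` is a hypothetical helper name chosen for illustration, and the sketch assumes every field has the form `name=value`.

```scala
// Plain Scala, no Spark: split one raw line into (column name, value) pairs.
// `parseLine` is a hypothetical helper, not part of any library.
def parseLine(line: String): Seq[(String, String)] =
  line.split(",").toSeq.map { field =>
    // limit = 2 keeps a value that itself contains "=" in one piece;
    // a field without "=" would fail this pattern match (assumed not to occur)
    val Array(name, value) = field.trim.split("=", 2)
    name -> value
  }

val parsed = parseLine("Student_name=abc, student_grades=A, Student_gender=female")
val header = parsed.map(_._1) // Seq("Student_name", "student_grades", "Student_gender")
val values = parsed.map(_._2) // Seq("abc", "A", "female")
```

The Spark version applies the same split to every row, one column at a time.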

Here is how you could do it. First, let's read your data from a file and display it.

// the options are here to get rid of potential spaces around the ",".
val df = spark.read
    .option("ignoreTrailingWhiteSpace", true)
    .option("ignoreLeadingWhiteSpace", true)
    .csv("path/your_file.csv")

df.show(false)
+----------------+----------------+---------------------+
|_c0             |_c1             |_c2                  |
+----------------+----------------+---------------------+
|Student_name=abc|student_grades=A|Student_gender=female|
|Student_name=Xyz|student_grades=B|Student_gender=male  |
+----------------+----------------+---------------------+

Then, we extract a mapping between the default names and the new names using the first row of the dataframe.

val row0 = df.head
val cols = df
    .columns
    .map(c => c -> row0.getAs[String](c).split("=").head )

Finally, we strip the column name from each value with a split on "=" and rename the columns using our mapping:

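// col and split come from the SQL functions object
import org.apache.spark.sql.functions.{col, split}

val new_df = df
    .select(cols.map{ case (old_name, new_name) =>
        // keep the part after "=" and name the column after the part before it
        split(col(old_name), "=")(1) as new_name
    } : _*)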

new_df.show(false)
+------------+--------------+--------------+
|Student_name|student_grades|Student_gender|
+------------+--------------+--------------+
|abc         |A             |female        |
|Xyz         |B             |male          |
+------------+--------------+--------------+
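
The question also asks for the students whose grade is A, B, or C, and for an RDD. Here is a minimal sketch of one way to do both. To keep it runnable on its own, it rebuilds the cleaned table inline as `students` (a stand-in for `new_df`) and starts a local session; the third row is made up purely to show a row being filtered out.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session; in spark-shell, getOrCreate reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("grades").getOrCreate()
import spark.implicits._

// Stand-in for new_df. The "pqr" row is hypothetical, not from the original data.
val students = Seq(
  ("abc", "A", "female"),
  ("Xyz", "B", "male"),
  ("pqr", "D", "male")
).toDF("Student_name", "student_grades", "Student_gender")

// Keep only the rows whose grade is one of the wanted values.
val wanted = Seq("A", "B", "C")
val selected = students.filter(col("student_grades").isin(wanted: _*))

// Collect the matching names to the driver (fine for small results only).
val names = selected.select("Student_name").collect().map(_.getString(0))

// If an RDD is really needed, a DataFrame exposes its rows as an RDD[Row].
val namesRdd = selected.rdd.map(_.getString(0))
```

With the real data, replacing `students` with `new_df` should be enough. Staying with the DataFrame API for the filter is a deliberate choice here, since the column names are then available by name rather than by position.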