I am new to Spark and have the data below in CSV format, which I want to convert into a proper tabular format.
The CSV file has no header:
Student_name=abc, student_grades=A, Student_gender=female
Student_name=Xyz, student_grades=B, Student_gender=male
Now I want to load it into an RDD and create a header, like this:
Student_Name student_grades student_gender
abc A female
Xyz B male
I also want to get the list of students whose grades are A, B, or C.
You could infer the column names from the first line of the file and then transform the dataframe accordingly.
Here is how you could do it. First, let's read your data from a file and display it.
// The options are here to get rid of potential spaces around the ",".
val df = spark.read
  .option("ignoreTrailingWhiteSpace", true)
  .option("ignoreLeadingWhiteSpace", true)
  .csv("path/your_file.csv")
df.show(false)
+----------------+----------------+---------------------+
|_c0             |_c1             |_c2                  |
+----------------+----------------+---------------------+
|Student_name=abc|student_grades=A|Student_gender=female|
|Student_name=Xyz|student_grades=B|Student_gender=male  |
+----------------+----------------+---------------------+
Then, we extract a mapping between the default names and the new names using the first row of the dataframe.
val row0 = df.head
// Pair each default column name (_c0, _c1, ...) with the key found before "=" in the first row.
val cols = df
  .columns
  .map(c => c -> row0.getAs[String](c).split("=").head)
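To make the shape of that mapping concrete, here is a minimal plain-Scala sketch of the same step, with no Spark involved. The names defaultNames, firstRow and mapping are illustrative stand-ins for df.columns, the first row and cols. They are assumptions for this example only.

```scala
// Stand-ins for df.columns and the cells of the first row (taken from the sample data).
val defaultNames = Array("_c0", "_c1", "_c2")
val firstRow = Array("Student_name=abc", "student_grades=A", "Student_gender=female")

// Keep the part before "=" as the new column name, paired with the default name.
val mapping = defaultNames.zip(firstRow).map { case (c, cell) =>
  c -> cell.split("=").head
}

// mapping holds ("_c0","Student_name"), ("_c1","student_grades"), ("_c2","Student_gender")
mapping.foreach(println)
```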
Finally, we strip the "key=" prefix from every value with a split
on "=" and rename the columns using our mapping:
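import org.apache.spark.sql.functions.{col, split}

// Keep only the part after "=" in each cell and give the column its new name.
val new_df = df
  .select(cols.map { case (old_name, new_name) =>
    split(col(old_name), "=")(1) as new_name
  } : _*)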
new_df.show(false)
+------------+--------------+--------------+
|Student_name|student_grades|Student_gender|
+------------+--------------+--------------+
|abc         |A             |female        |
|Xyz         |B             |male          |
+------------+--------------+--------------+
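For the second part of the question (students with grades A, B or C), one possible approach is to filter on the grade column with isin. The sketch below is self-contained and makes several assumptions: it builds a small stand-in dataframe instead of reusing new_df, it adds a hypothetical third student with grade D to show the filter dropping a row, and it starts a local SparkSession (in spark-shell the existing spark session could be used instead).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for the sketch; spark-shell already provides one.
val spark = SparkSession.builder()
  .appName("grade-filter-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Stand-in for new_df; the "pqr" row is hypothetical and not part of the original data.
val students = Seq(
  ("abc", "A", "female"),
  ("Xyz", "B", "male"),
  ("pqr", "D", "male")
).toDF("Student_name", "student_grades", "Student_gender")

// Keep only the students whose grade is A, B or C.
val wanted = students.filter(col("student_grades").isin("A", "B", "C"))

// Collect the matching names to the driver as a plain Scala list.
val names = wanted.select("Student_name").as[String].collect().toList
println(names)

// If an RDD is really needed, the dataframe exposes one as an RDD[Row].
val wantedRdd = wanted.rdd
```

Staying with the dataframe is usually simpler than dropping down to an RDD, since the column names travel with the data; the .rdd call at the end is only there because the question mentions RDDs.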