I am starting with Scala, Spark and MLlib. I want to implement an example from a Kaggle
The data format is awful and I have lots of problem cleaning and preparing the data to process them. I ask for your help.
The data is:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
I got errors in the empty fields such as
...,"",... (first line, field "cabin")
...,"A/5 21171",.... (also first line, field "ticket")
I would like to filter the lines with empty fields (remove them from my RDD) and also those lines which have the ticket like this A/5 21171 (I just want the number).
Thanks for your help again! ;)
Rather than RDD
s, you should consider using DataSet
s both for performance and ease of use--particularly if you are new to Scala. Taking the DataSet
approach, you can do this:
val titanicDs = sparkSession.read
.option("header", true)
.csv("titanic.csv")
.na
.drop
.withColumn("TicketSplit", split($"Ticket", " "))
.withColumn("Ticket", when(size($"TicketSplit") === "2", $"TicketSplit".getItem(1)).otherwise($"TicketSplit".getItem(0)))
.drop("TicketSplit")
There is a lot going on here:
header
option to true so Spark realizes the first row is a header imposing structure on the data and uses those column names in the DataFrame
.na
method returns a DataFrameNaFunctions
object that is very helpful for working with missing data. In this case, the combination of na.drop
eliminates all rows containing any data that are null
.TicketSplit
where I use the wonderful functions
library to split the raw Ticket
data on the space character into an array of either length 1 (if there is only a number) or 2 (if there is text followed by a space and number). when
and otherwise
from the functions
library to modify the raw Ticket
column according to the size of the array in the TicketSplit
column. Whatever the size of the array in the TicketSplit
column, ultimately only the number is preserved by either getting the first element of a 1-element array at index 0 or the second element of a 2-element array at index 1.TicketSplit
column because it has served its purpose.