Search code examples
scalaapache-sparkapache-spark-mllibkaggle

Cleaning and preparind data for Spark


I am starting with Scala, Spark and MLlib. I want to implement an example from a Kaggle

The data format is awful and I have lots of problem cleaning and preparing the data to process them. I ask for your help.

The data is:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S

I got errors in the empty fields such as

...,"",... (first line, field "cabin")

...,"A/5 21171",.... (also first line, field "ticket")

I would like to filter the lines with empty fields (remove them from my RDD) and also those lines which have the ticket like this A/5 21171 (I just want the number).

Thanks for your help again! ;)


Solution

  • Rather than RDDs, you should consider using DataSets both for performance and ease of use--particularly if you are new to Scala. Taking the DataSet approach, you can do this:

    val titanicDs = sparkSession.read
        .option("header", true)
        .csv("titanic.csv")
        .na
        .drop
        .withColumn("TicketSplit", split($"Ticket", " "))
        .withColumn("Ticket", when(size($"TicketSplit") === "2", $"TicketSplit".getItem(1)).otherwise($"TicketSplit".getItem(0)))
        .drop("TicketSplit")
    

    There is a lot going on here:

    • Set the header option to true so Spark realizes the first row is a header imposing structure on the data and uses those column names in the DataFrame.
    • The na method returns a DataFrameNaFunctions object that is very helpful for working with missing data. In this case, the combination of na.drop eliminates all rows containing any data that are null.
    • I add a new temporary column called TicketSplit where I use the wonderful functions library to split the raw Ticket data on the space character into an array of either length 1 (if there is only a number) or 2 (if there is text followed by a space and number).
    • I use when and otherwise from the functions library to modify the raw Ticket column according to the size of the array in the TicketSplit column. Whatever the size of the array in the TicketSplit column, ultimately only the number is preserved by either getting the first element of a 1-element array at index 0 or the second element of a 2-element array at index 1.
    • Drop that TicketSplit column because it has served its purpose.
    • Enjoy a nice cold drink.