Tags: apache-spark, hdfs, pyspark, data-cleaning, data-integration

How to do data cleaning with PySpark on data stored in HDFS


I am currently working on data preprocessing for a data mining project. Specifically, I want to do data cleaning with PySpark on data stored in HDFS. I'm very new to these tools, so how should I go about it?

For example, there is a table in the HDFS containing the following entries:

attrA   attrB   attrC      label
1       a       abc        0
2               abc        0
4       b       abc        1
4       b       abc        1
5       a       abc        0

After cleaning, row 2 <2, , abc, 0> should get a default or imputed value for attrB, and one of the duplicate rows 3 and 4 <4, b, abc, 1> should be eliminated. How can I implement that with PySpark?


Solution

  • Based on what you asked, there are two things you want to achieve. The first is removing duplicate rows, which can be done with the distinct function:

    df2 = df.distinct()
    df2.show()
    

    This gives you the distinct rows of the DataFrame. Note that show() only prints the rows and returns None, which is why it is called separately rather than chained onto the assignment.
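
    If the duplicates should instead be detected on a subset of columns rather than on whole rows, dropDuplicates accepts a column list. A minimal sketch, using the column names from your example:

    # Keep one row per unique (attrA, attrB) pair; values for the
    # remaining columns come from an arbitrary surviving row.
    df2 = df.dropDuplicates(['attrA', 'attrB'])
    df2.show()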

    The second is imputing missing values, which can be done with the fillna function (na.fill is an alias):

    df2 = df.na.fill({'attrB': 'm'})
    df2.show()

    This replaces every null value in attrB with the default value 'm'.
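
    Finally, since your table lives in HDFS: reading and writing it are ordinary Spark I/O calls pointed at an hdfs:// URI. Below is a minimal end-to-end sketch, assuming the table is stored as a tab-separated file with a header row at the hypothetical path hdfs:///user/data/table.tsv, and that 'm' is an acceptable default for attrB; adjust the path, separator, and default to your actual data.

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session.
    spark = SparkSession.builder.appName('data-cleaning').getOrCreate()

    # Hypothetical HDFS path and file format; change to match your table.
    df = spark.read.csv('hdfs:///user/data/table.tsv',
                        sep='\t', header=True, inferSchema=True)

    # Remove duplicate rows, then impute the missing attrB values.
    cleaned = df.distinct().na.fill({'attrB': 'm'})

    # Write the cleaned table back to HDFS (again, a hypothetical path).
    cleaned.write.csv('hdfs:///user/data/table_cleaned',
                      sep='\t', header=True)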