I'm currently working on data preprocessing for a Data Mining project. Specifically, I want to do data cleaning with PySpark on data stored in HDFS. I'm very new to these tools, so how should I go about it?
For example, suppose a table in HDFS contains the following entries:
attrA attrB attrC label
1     a     abc   0
2           abc   0
4     b     abc   1
4     b     abc   1
5     a     abc   0
After cleaning, row 2 <2, , abc, 0> should get a default or imputed value for attrB, and one of the duplicate rows (row 3 or row 4) should be eliminated. How can I implement that with PySpark?
Based on what you asked, there are two things you want to achieve. First, removing duplicate rows, which can be done with the distinct function:

df2 = df.distinct()
df2.show()

This keeps only the distinct rows of the DataFrame. (Note that show() returns None, so call it separately rather than assigning its result to df2.)
Second, imputing missing values, which can be done with the fillna / na.fill function:

df2 = df.na.fill({'attrB': 'm'})
df2.show()

This replaces null values in attrB with the default 'm'. Note that na.fill only targets nulls; if the missing entry was loaded as an empty string, convert it to null first before filling.