I am trying to move the rows with NULL values in all 4 columns DOB, Address1, address2 and Postcode to a new data frame and keep original datafarme with clean records
i have tried solving it by using the following code
import numpy as np
import pandas as pd
BadRecords = Data.dropna(subset=['DOB','Address1','Address2','PostCode'], how='any')
print(BadRecords)
The current code is printing the entire dataset. It should only filter the records where DOB, Address1, Address2 and postcode all 4 are NULLs
To get records with null values you can filter the original set like this:
from pyspark.sql.functions import col, isnull
badRecords = Data.filter(isnull(col('DOB')) & isnull(col('Address1')) & isnull(col('Address2')) & isnull(col('PostCode')))
display(badRecords)
dropna
function returns a new dataframe omitting rows with null values, so with that you can only get "good" records
goodRecords = Data.dropna(subset=['DOB','Address1','Address2','PostCode'], how='all')
Also notice that how='any'
will drop rows with at least one of values null, so if you want to filter rows only when all of them are null, you need to use 'all' setting.