python-2.7 apache-spark-sql azure-databricks

Python - Move the rows to new dataframe as Badrecord if DOB and Address1 and Address2 and PostCode have NULL

I am trying to move the rows with NULL values in all 4 columns DOB, Address1, address2 and Postcode to a new data frame and keep original datafarme with clean records

i have tried solving it by using the following code

import numpy as np
import pandas as pd
BadRecords = Data.dropna(subset=['DOB','Address1','Address2','PostCode'], how='any') 
print(BadRecords)

The current code is printing the entire dataset. It should only filter the records where DOB, Address1, Address2 and postcode all 4 are NULLs

Solution

To get records with null values you can filter the original set like this:

from pyspark.sql.functions import col, isnull
badRecords = Data.filter(isnull(col('DOB')) & isnull(col('Address1')) & isnull(col('Address2')) & isnull(col('PostCode')))
display(badRecords)

dropna function returns a new dataframe omitting rows with null values, so with that you can only get "good" records

goodRecords = Data.dropna(subset=['DOB','Address1','Address2','PostCode'], how='all')

Also notice that how='any' will drop rows with at least one of values null, so if you want to filter rows only when all of them are null, you need to use 'all' setting.