Search code examples
python-2.7apache-spark-sqlazure-databricks

Python - Move the rows to new dataframe as Badrecord if DOB and Address1 and Address2 and PostCode have NULL


I am trying to move the rows with NULL values in all 4 columns DOB, Address1, address2 and Postcode to a new data frame and keep original datafarme with clean records

i have tried solving it by using the following code

import numpy as np
import pandas as pd
BadRecords = Data.dropna(subset=['DOB','Address1','Address2','PostCode'], how='any') 
print(BadRecords)                           

The current code is printing the entire dataset. It should only filter the records where DOB, Address1, Address2 and postcode all 4 are NULLs


Solution

  • To get records with null values you can filter the original set like this:

    from pyspark.sql.functions import col, isnull
    badRecords = Data.filter(isnull(col('DOB')) & isnull(col('Address1')) & isnull(col('Address2')) & isnull(col('PostCode')))
    display(badRecords)
    

    dropna function returns a new dataframe omitting rows with null values, so with that you can only get "good" records

    goodRecords = Data.dropna(subset=['DOB','Address1','Address2','PostCode'], how='all') 
    

    Also notice that how='any' will drop rows with at least one of values null, so if you want to filter rows only when all of them are null, you need to use 'all' setting.