Search code examples
pythonpandascsvdata-cleaningpseudocode

In Python, run each row in a csv through tests and output a new csv showing which test each row failed


In python, I would like to run a csv through test cases checking for data anomalies while keep track of each test it fails.

This is my first big project in Python, though I have Python experience and can do basic one-liners using pandas like df.drop_duplicates(subset=['UniqueID']), I just am not sure what the right direction would be.

  • Data Example:
MnLast MnFist MnDead? MnInactive? SpLast SpFirst SPInactive? SpDead? Addee Sal
Doe John No No Doe Jane No No Mr. John Doe Mr. John

Main(Mn) record isn't blank, Spouse(Sp) record isn't blank, neither record is marked deceased but Addee or Sal doesn't have '&' or 'and'. This indicates the Addressee(Addee) or Salutation(Sal) is incorrect, as Addressee or Salutation should have a variation of: "Mr. and Mrs. John doe"

  • Pseudo code:
Read csv

for each row in csv

  #test case 1
  if [ {( (MNFirst AND MnLast) =! BLANK ) AND ( (SpLast AND SpFirst) =! BLANK )} AND 
  (( SpDead? AND MnDead?) =! Yes)] AND [(Addee OR Sal) Does not contains ('&' or 'and')] 
  
     output failing row to new csv tracking what case it failed

  else 

      nothing
  • My goal

Read a csv file, run the file through several test cases(there are several). Then output a new csv, with a new column indicating each case it failed. So if my Data Example failed 3 different cases, the new column would show a number corresponding to the case it failed. The csv output would show the following:

CaseFailed MnLast MnFist MnDead? MnInactive? SpLast SpFirst SPInactive? SpDead? Addee Sal
1, 5, 8 Doe john No No Doe Jane No No Mr. John Doe Mr. John

Any help to point me in the right direction would be greatly appreciated.


Solution

  • import pandas as pd 
    import numpy as np
    
    data = pd.read_csv(csv_file, encoding='latin-1')
    
    # Create array to track failed cases.
    data['Failed Test']= ''
        data = data.replace(np.nan,'')
        data.insert(0, 'ID', range(0, len(data)))
    
    # Test 1: The spouse shows a deceased date, but martial status is not marked as widowed
      testcase1 = data[((data['SRDeceasedDate'] != '') & (data['MrtlStat'] != 'Widowed'))]
        ids = testcase1.index.tolist()
        for i in ids:
          data.at[i,'Failed Test']+=', 1'
    
    # Test 2: Spouse name information is filled in but marital status shows single. 
     df = data[((data['SRLastName'] != '') | (data['SRFirstName'] != ''))]
        testcase2 = df[df['MrtlStat'] == 'single']
        ids = testcase2.index.tolist()
        for i in ids:
          data.at[i,'Failed Test']+=', 2'
    
    # sort and separate  which rows have failed a test
    failed = data[(data['Failed test'] != '')]
    passed = data[(data['Failed test'] == '')]
    failed['Failed Test'] = failed['Failed Test'].str[1:]
    failed = failed[(failed['Failed Test'] != '')]
    
    # Clean up
    del failed["ID"]
    del passed["ID"]
    
    # Print results 
    failed['Test Case Failed'].value_counts()
    print("There was a total of",data.shape[0], "rows.", "There were" ,data.shape[0] - failed.shape[0], "rows passed and" ,failed.shape[0], "rows failed at least one test case")
    
    # output failed rows
    failed.to_csv("C:/Users/Output/failed.csv", index=False,) 
    
    # output passed rows
    passed.to_csv("C:/Users/Output/passed.csv", index=False,)