Search code examples
pythonpandasdataframecontainstolist

In Python, I am comparing dataframes containing strings to decide if it should pass or fail. How can I stop data from passing when it should fail?


I have over 20 test cases that check a CSV for data anomalies due to data entry. This test case (#15) compares the salutation and addressee to marital status.

# Test case  15
# Compares MrtlStat to  PrimAddText and PrimSalText
df = data[data['MrtlStat'].str.contains("Widow|Divorced|Single")]
df = df[df['PrimAddText'].str.contains("AND|&", na=False)]
data_15 = df[df['PrimSalText'].str.contains("AND|&", na=False)]

# Adds row to list of failed data
ids = data_15.index.tolist()

# Keep track of data that failed test case 15 
for i in ids:
  data.at[i,'Test Case Failed']+=', 15'

If MrtlStat contains Widow, Divorced, or Single while PrimAddText or PrimSalTexts contains AND or &, it should fail the test. This test works only if BOTH PrimSalTexts and PrimAddText contain AND or &.

Table showing data that passes but should fail:

PrimAddText PrimSalText MrtlStat
Mrs. Judith Elfrank Mr. & Mrs. Elfrank & Michael Widowed
Mr. & Mrs.Karl Magnusen Mr. Magnusen Widowed

Table showing data that fails as expected:

PrimAddText PrimSalText MrtlStat
Mr. & Mrs. Elfrank Mr. & Mrs. Elfrank & Michael Widowed

How can I adjust the test to work if only one of the columns (PrimSalTexts or PrimAddText) contains AND or &?


Solution

  • You have an AND condition b/w the second and third condition, you can separate these out and capturing the result from each condition. finally combine the two lists together

    # Test case  15
    # Compares MrtlStat to  PrimAddText and PrimSalText
    df = data[data['MrtlStat'].str.contains("Widow|Divorced|Single")]
    data_15_A = df[df['PrimAddText'].str.contains("AND|&", na=False)]
    data_15_B = df[df['PrimSalText'].str.contains("AND|&", na=False)]
    
    # Adds row to list of failed data
    ids = data_15_A.index.tolist() + data_15_B.index.tolist()
    
    # Keep track of data that failed test case 15 
    for i in ids:
      data.at[i,'Test Case Failed']+=', 15'