Tags: python, csv, validation, enumeration

Validate CSV by checking if enumeration columns contain any invalid coded values


We receive many different CSV files from external labs and centers. When receiving such a file, we first need to do some QA checks before further processing, to make sure the data is correct, at least on a technical level.

We have some Python scripts to check the number of columns, check date values, min/max ranges, etc. But now we also want to check whether the enumerated columns are correct. For example, if the column visit is a coded value and may only contain baseline, fup_6_m, fup_12_m, then it shouldn't contain anything else like fup_36_m.

We have the metadata specifications, so the column names and the lists of coded values (a.k.a. enumerations) are known beforehand.
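
For illustration, a minimal sketch of how the allowed_enum dictionary used below could be built from such a specification (assuming a hypothetical two-column CSV layout; adjust to your actual metadata format):

# Hypothetical metadata file 'metadata_spec.csv', e.g.:
#   column,allowed_values
#   sex,M|F
#   score,0|1|2|3|4
#   visit,baseline|fup_6_m|fup_12_m
import pandas as pd

df_meta = pd.read_csv("metadata_spec.csv")
allowed_enum = {
    row["column"]: row["allowed_values"].split("|")
    for _, row in df_meta.iterrows()
}
# Note: read_csv yields strings here; cast the values of numeric
# columns (e.g. score) so they match the dtypes in the data.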

This is the Python script I've got so far:

# check if coded values are correct
import pandas as pd
import io

## load data from csv files
##df = pd.read_csv(r'patlist_mcl2017.csv', sep=",", decimal=".")

# TESTING: create data frame from text
str_patients = """patid,dob,sex,height,score,visit
1072,16-01-1981,M,154,1,fup_12_m
1091,20-12-1991,M,168,4,baseline
1126,25-12-1999,M,181,3,fup_6_m
1139,14-04-1980,Y,165,1,baseline
1171,05-11-1984,M,192,2,fup_12_m
1237,17-08-1983,F,170,3,fup_6_m
1334,26-08-1985,F,160,5,fup_6_m
1365,14-09-1976,M,184,3,fup_24_m
1384,28-12-1993,F,152,1,baseline
1456,27-09-1998,F,164,5,fup_12_m
"""
df = pd.read_csv(io.StringIO(str_patients), sep=",", decimal=".")

print(df)

# allowed values for enumeration columns
allowed_enum = {
    'sex': ['M', 'F'],
    'score': [0, 1, 2, 3, 4],
    'visit': ['baseline', 'fup_6_m', 'fup_12_m']
}

# check enumeration
for column_name, allowed_values in allowed_enum.items():
    df_chk = df[~df[column_name].isin(allowed_values)].groupby(column_name).size().reset_index(name='Count')
    if not df_chk.empty:
        print("Found invalid values for column '%s':" % column_name)
        print(df_chk)

It works and the output is like this:

Found invalid values for column 'sex':
  sex  Count
0   Y      1
Found invalid values for column 'score':
   score  Count
0      5      2
Found invalid values for column 'visit':
      visit  Count
0  fup_24_m      1

But the different files can contain many columns, and for better reporting we'd like to get the output as one dataframe, something like this:

  Column_name   Invalid  Count
0         sex         Y      1
1       score         5      2
2       visit  fup_24_m      1

So my question is:

  • What is the best way to collect the invalid values in a dataframe, like above?
  • Or, is there maybe a better way of checking/validating these kinds of coded values?

Solution

  • You could try

    ...
    dfs = {
        # keep only the rows holding a value outside the allowed list,
        # then count how often each invalid value occurs
        column_name: df[~df[column_name].isin(allowed_values)]
                     .value_counts(subset=column_name)
                     .to_frame().reset_index(names="Invalid")  # names= needs pandas >= 1.5
        for column_name, allowed_values in allowed_enum.items()
    }
    # concatenate with the dict keys as outer index level "Column_name",
    # then drop the meaningless inner integer level
    out = pd.concat(dfs, names=("Column_name", None)).droplevel(1)
    

    to get

                  Invalid  count
    Column_name                 
    sex                 Y      1
    score               5      2
    visit        fup_24_m      1
    

    for the sample dataframe (another .reset_index would give you the format in the question).
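
    For completeness, a minimal sketch of that last step (assuming the out from the snippet above, and renaming pandas' default count column to the question's Count):

    final = out.reset_index().rename(columns={"count": "Count"})
    print(final)

    which yields one row per invalid value with the columns Column_name, Invalid and Count.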

    Or, similar to Zach Young's proposal, you could do

    ...
    columns = (
        # per column: select only the rows that hold an invalid value
        df.loc[~df[column_name].isin(allowed_values), column_name]
        for column_name, allowed_values in allowed_enum.items()
    )
    # align the selections side by side on the original row index
    out = pd.concat(columns, axis=1, sort=True)
    

    to get a sub-dataframe which contains only the invalid values

       sex  score     visit
    3    Y    NaN       NaN
    6  NaN    5.0       NaN
    7  NaN    NaN  fup_24_m
    9  NaN    5.0       NaN
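
    If you also want that sub-dataframe as a long, one-row-per-finding report, a melt plus dropna would get you there (a sketch, assuming the out from the snippet above):

    report = (
        out.melt(var_name="Column_name", value_name="Invalid", ignore_index=False)
           .dropna(subset=["Invalid"])
    )

    Here ignore_index=False keeps the original row numbers, so the report also tells you where in the file each invalid value occurs.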