Validate CSV by checking if enumeration columns contain any invalid coded values

We receive many different CSV files from external labs and centers. When receiving such a file, we first need to do some QA checks before further processing, to make sure the data is correct, at least on a technical level.

We have some Python scripts to check the number of columns, check date values, min/max ranges etc. But now we also want to check whether the enumerated columns are correct. For example, if a column visit is a coded value and may only contain baseline, fup_6_m and fup_12_m, then it shouldn't contain anything else, like fup_36_m.

We have the metadata specifications, so the column names and the lists of coded values (aka enumeration) are known beforehand.
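Since those specifications are machine-readable, the allowed-value mapping used in the script below could be built from them instead of hard-coded. A minimal sketch, assuming a hypothetical spec CSV with column_name and pipe-separated coded_values columns (the format is an assumption, not our actual spec):

```python
import io
import pandas as pd

# Hypothetical metadata spec: one row per enumerated column,
# allowed codes stored as a pipe-separated string.
str_spec = """column_name,coded_values
sex,M|F
score,0|1|2|3|4
visit,baseline|fup_6_m|fup_12_m"""

spec = pd.read_csv(io.StringIO(str_spec))

# build {column_name: [allowed values, ...]} from the spec rows
allowed_enum = {
    row.column_name: row.coded_values.split("|")
    for row in spec.itertuples(index=False)
}
print(allowed_enum["visit"])  # ['baseline', 'fup_6_m', 'fup_12_m']
```

Note that the codes come back as strings here, so a numeric enumeration like score would need a cast before comparing against an integer column.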

This is the Python script I've got so far:

# check if coded values are correct
import pandas as pd
import io

## load data from csv files
## df = pd.read_csv(r'patlist_mcl2017.csv', sep=",", decimal=".")

# TESTING: create data frame from text
# (the data rows below are reconstructed samples, chosen so that the
#  invalid values Y, 5 and fup_24_m appear as in the output further down)
str_patients = """patid,dob,sex,height,score,visit
1,1980-01-01,M,180,0,baseline
2,1975-05-12,F,165,1,baseline
3,1962-11-30,M,172,2,baseline
4,1990-07-08,Y,169,3,fup_6_m
5,1983-03-21,F,158,4,fup_6_m
6,1971-09-14,M,175,0,fup_6_m
7,1988-12-02,M,182,5,fup_12_m
8,1979-06-27,F,163,1,fup_24_m
9,1965-04-19,M,170,2,fup_12_m
10,1992-10-05,F,168,5,fup_12_m"""
df = pd.read_csv(io.StringIO(str_patients), sep=",", decimal=".")


# allowed values for enumeration columns
allowed_enum = {
    'sex': ['M', 'F'],
    'score': [0, 1, 2, 3, 4],
    'visit': ['baseline', 'fup_6_m', 'fup_12_m'],
}

# check enumeration: report counts of values outside the allowed list
for column_name, allowed_values in allowed_enum.items():
    df_chk = df[~df[column_name].isin(allowed_values)].groupby(column_name).size().reset_index(name='Count')
    if not df_chk.empty:
        print("Found invalid values for column '%s':" % column_name)
        print(df_chk)

It works and the output is like this:

Found invalid values for column 'sex':
  sex  Count
0   Y      1
Found invalid values for column 'score':
   score  Count
0      5      2
Found invalid values for column 'visit':
      visit  Count
0  fup_24_m      1

But the different files can contain many columns, and for better reporting we'd like to get the output as one dataframe, so something like this:

  Column_name   Invalid  Count
0         sex         Y      1
1       score         5      2
2       visit  fup_24_m      1

So my question is:

  • What is the best way to collect the invalid values in a dataframe, like above?
  • Or, is there maybe a better way of checking/validating these kinds of coded values?


  • You could try

    dfs = {
        column_name: df.loc[~df[column_name].isin(allowed_values), column_name]
                       .value_counts().rename_axis("Invalid").reset_index(name="count")
        for column_name, allowed_values in allowed_enum.items()
    }
    out = pd.concat(dfs, names=("Column_name", None)).droplevel(1)

    to get

                  Invalid  count
    Column_name
    sex                 Y      1
    score               5      2
    visit        fup_24_m      1

    for the sample dataframe (another .reset_index would give you the format in the question).

    Or, similar to Zach Young's proposal, you could do

    columns = (
        df.loc[~df[column_name].isin(allowed_values), column_name]
        for column_name, allowed_values in allowed_enum.items()
    )
    out = pd.concat(columns, axis=1, sort=True)

    to get a sub-dataframe which contains only the invalid values

       sex  score     visit
    3    Y    NaN       NaN
    6  NaN    5.0       NaN
    7  NaN    NaN  fup_24_m
    9  NaN    5.0       NaN
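If you still want the long Column_name/Invalid/Count report from the question, that wide sub-dataframe can be reshaped with stack and then counted. A self-contained sketch; the small test frame below is an assumption, built to contain the same invalid values as the question's sample:

```python
import pandas as pd

# trimmed-down test frame with the same invalid values as in the question
df = pd.DataFrame({
    "sex":   ["M", "F", "M", "Y", "F", "M", "M", "F", "M", "F"],
    "score": [0, 1, 2, 3, 4, 0, 5, 1, 2, 5],
    "visit": ["baseline"] * 7 + ["fup_24_m", "fup_12_m", "fup_12_m"],
})
allowed_enum = {
    "sex": ["M", "F"],
    "score": [0, 1, 2, 3, 4],
    "visit": ["baseline", "fup_6_m", "fup_12_m"],
}

# wide sub-dataframe that contains only the invalid cells (as above)
out = pd.concat(
    (df.loc[~df[c].isin(v), c] for c, v in allowed_enum.items()),
    axis=1, sort=True,
)

# stack + dropna gives one row per invalid cell; then count
# occurrences per (column, value) pair and relabel the columns
report = (
    out.stack().dropna().rename("Invalid").reset_index()
       .groupby(["level_1", "Invalid"]).size()
       .reset_index(name="Count")
       .rename(columns={"level_1": "Column_name"})
)
print(report)
```

Note that the score value is reported as 5.0 rather than 5, because the NaN padding in the wide sub-dataframe turns that column into floats.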