Search code examples
pythonpandascsv

How to distinguish between "" (Empty String) and "missing entry" values in CSV file with python?


This might be an old problem. Let's say, there is a CSV file, for example:

col1,col2,col3,col4
"Empty String","","None_Value",
"None_Value",,"Empty String",""

As you can see, there maybe ("") Empty String Data or (,,) missing entries. While reading this CSV file, I am trying to keep them as it is. So, after reading the csv file ("") --> ("") and (,,) --> as (,,) or None.

I am using python to read the CSV files and before Python 3.12 version, csv.reader() treated both empty string and missing entry value as "". In python 3.12, new quoting parameters were added for csv.reader() and csv.writer(). One of them is: csv.QUOTE_NOTNULL

Instructs writer objects to quote all fields which are not None. This is similar to QUOTE_ALL, except that if a field value is None an empty (unquoted) string is written.

Instructs reader objects to interpret an empty (unquoted) field as None and to otherwise behave as QUOTE_ALL.

However, I noticed that it only works with csv.writer(), not csv.reader(). Apparently, this is a bug in python 3.12. GitHub Issue Link

Does anyone know how to handle this correctly, perhaps with Pandas or another library? I tried with Pandas a little bit, but I think even Pandas can't separate them (I might be wrong).


Solution

  • or another library

    This works out of the box with polars:

    import polars as pl
    
    df = pl.read_csv("tmp.csv")
    
    print(df)
    

    Output:

    ┌──────────────┬──────┬──────────────┬──────┐
    │ col1         ┆ col2 ┆ col3         ┆ col4 │
    │ ---          ┆ ---  ┆ ---          ┆ ---  │
    │ str          ┆ str  ┆ str          ┆ str  │
    ╞══════════════╪══════╪══════════════╪══════╡
    │ Empty String ┆      ┆ None_Value   ┆ null │
    │ None_Value   ┆ null ┆ Empty String ┆      │
    └──────────────┴──────┴──────────────┴──────┘
    

    You can convert it back to Python data structures if you want:

    print(df.to_dicts())
    

    Output:

    [{'col1': 'Empty String', 'col2': '', 'col3': 'None_Value', 'col4': None}, {'col1': 'None_Value', 'col2': None, 'col3': 'Empty String', 'col4': ''}]