This might be an old problem. Say there is a CSV file like this:
col1,col2,col3,col4
"Empty String","","None_Value",
"None_Value",,"Empty String",""
As you can see, a field may contain an empty string ("") or be missing entirely (,,). While reading this CSV file, I am trying to keep them as they are: a quoted empty string ("") should stay an empty string, and a missing entry (,,) should stay missing, i.e. become None.
I am using Python to read the CSV files, and before Python 3.12, csv.reader() treated both an empty string and a missing entry as "". Python 3.12 added new quoting parameters for csv.reader() and csv.writer(). One of them is csv.QUOTE_NOTNULL:
Instructs writer objects to quote all fields which are not None. This is similar to QUOTE_ALL, except that if a field value is None an empty (unquoted) string is written.
Instructs reader objects to interpret an empty (unquoted) field as None and to otherwise behave as QUOTE_ALL.
However, I noticed that it only works with csv.writer(), not csv.reader(). Apparently this is a bug in Python 3.12: GitHub Issue Link
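For reference, this is roughly how I expected the reader call to look, going by the documented behavior (a minimal sketch; on 3.12 the missing fields still come back as "" because of the bug above):
import csv

with open("tmp.csv", newline="") as f:
    reader = csv.reader(f, quoting=csv.QUOTE_NOTNULL)
    for row in reader:
        # Documented behavior: an unquoted (missing) field should become None,
        # while a quoted "" should stay an empty string.
        print(row)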
Does anyone know how to handle this correctly, perhaps with Pandas or another library? I tried with Pandas a little bit, but I think even Pandas can't tell the two apart (I might be wrong).
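For what it's worth, this is roughly what I tried with pandas (a sketch; I may be missing an option), and in both cases the two kinds of empty fields come out identical:
import pandas as pd

# Default settings: both the quoted "" and the missing field are read as NaN.
print(pd.read_csv("tmp.csv"))

# With keep_default_na=False, both are read as '' instead, so the distinction is still lost.
print(pd.read_csv("tmp.csv", keep_default_na=False))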
or another library
This works out of the box with polars:
import polars as pl
df = pl.read_csv("tmp.csv")  # quoted "" stays an empty string, a missing field becomes null
print(df)
Output:
┌──────────────┬──────┬──────────────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞══════════════╪══════╪══════════════╪══════╡
│ Empty String ┆ ┆ None_Value ┆ null │
│ None_Value ┆ null ┆ Empty String ┆ │
└──────────────┴──────┴──────────────┴──────┘
You can convert it back to Python data structures if you want:
print(df.to_dicts())
Output:
[{'col1': 'Empty String', 'col2': '', 'col3': 'None_Value', 'col4': None}, {'col1': 'None_Value', 'col2': None, 'col3': 'Empty String', 'col4': ''}]
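And since csv.QUOTE_NOTNULL does work on the writer side, you can write the data back out and keep the distinction; a minimal sketch assuming Python 3.12+ (the output file name is just an example):
import csv

with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NOTNULL)
    writer.writerow(df.columns)  # header row (fields will be quoted)
    for row in df.to_dicts():
        # None is written as an unquoted empty field; '' is written back as ""
        writer.writerow(row[col] for col in df.columns)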