Specifically, if a value is enclosed in quotes (e.g., '123'), it should be treated as a string; if it is unquoted (e.g., 123), it should be treated as a number.
import pandas as pd
# Creating a DataFrame with mixed data types
data = {'columns': [1, 2, 3, '4', '5', '6']}
df = pd.DataFrame.from_dict(data)
# Checking unique data types before saving to CSV
selected_column = 'columns'
unique_types_before = {selected_column: set(df[selected_column].apply(type))}
print(unique_types_before) # Output: {'columns': {<class 'int'>, <class 'str'>}}
# Saving DataFrame to CSV
df.to_csv('example.csv', index=False)
# Loading DataFrame from the saved CSV file
dataframe = pd.read_csv('example.csv')
# Checking unique data types after loading from CSV
unique_types_after = {selected_column: set(dataframe[selected_column].apply(type))}
print(unique_types_after) # Output: {'columns': {<class 'int'>}}
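The distinction is already lost at write time, not at read time: to_csv serializes the int 1 and the string '2' identically, because pandas only quotes a field when the CSV dialect requires it. A minimal sketch of the round-trip loss:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'columns': [1, '2']})
buf = StringIO()
df.to_csv(buf, index=False)

# Both cells are written as bare digits; nothing marks '2' as a string.
print(buf.getvalue().splitlines())  # ['columns', '1', '2']
```

So any fix has to involve the writing side (quoting string fields) or a custom interpretation on the reading side.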
I encountered an issue where the data types in my DataFrame were not preserved correctly upon saving to a CSV file. After investigating, I manually created a CSV file named "manual_example.csv" with the following content:
columns
1
2
3
4
5
6
"7"
To address this, I'm looking for guidance on how to load "manual_example.csv" into a DataFrame while ensuring that the data types in the rows are correctly interpreted. The goal is to treat values enclosed in quotes (e.g., '123') as strings and those without quotes (e.g., 123) as numerical formats.
Here is the code I've used so far:
import pandas as pd
from io import StringIO
# Replace 'manual_example.csv' with the actual path to your CSV file
csv_file_path = 'manual_example.csv'
# Read the content of the CSV file
with open(csv_file_path, 'r', encoding='utf-8') as file:
    csv_content = file.read()
# Use StringIO to create a file-like object
csv_file = StringIO(csv_content)
# Read CSV with appropriate data types
df1 = pd.read_csv(csv_file, quotechar='"', dtype={'columns': object})
unique_types_df1 = {selected_column: set(df1[selected_column].apply(type))}
print(unique_types_df1) # Output: {'columns': {<class 'str'>}}
I appreciate any insights or suggestions on how to handle the data types correctly during the CSV file loading process.
pandas strips the " characters while reading the CSV file, because that is necessary to correctly unescape quoted text (think ""5" was all he said!"). You can use the converters parameter (instead of dtype) to manually convert the raw CSV text into objects of various kinds, but you'll need to disable quote processing, which can cause problems if the CSV file contains text fields that must be quoted.
import csv
# ...
def converter(obj: str):
    if obj.startswith('"') and obj.endswith('"'):
        # `"7"` is processed like `str(7)`
        return obj[1:-1]
    # everything else had better be an int
    return int(obj)
df1 = pd.read_csv(csv_file, quoting=csv.QUOTE_NONE, converters={'columns': converter})
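Putting it together, a minimal end-to-end run (using an in-memory StringIO as a stand-in for manual_example.csv) shows both types surviving the load:

```python
import csv
from io import StringIO

import pandas as pd

def converter(obj: str):
    # Quote characters survive because quote processing is disabled below.
    if obj.startswith('"') and obj.endswith('"'):
        return obj[1:-1]  # "7" -> the string '7'
    return int(obj)       # unquoted values become ints

# In-memory stand-in for manual_example.csv
csv_file = StringIO('columns\n1\n2\n3\n4\n5\n6\n"7"\n')

df1 = pd.read_csv(csv_file, quoting=csv.QUOTE_NONE, converters={'columns': converter})
print(df1['columns'].tolist())  # [1, 2, 3, 4, 5, 6, '7'] -- int and str mixed
```

If you also control the writing side, passing quoting=csv.QUOTE_NONNUMERIC to to_csv makes pandas quote every non-numeric field, so the string/number distinction is preserved in the file itself and this converter can round-trip it.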