Tags: python, pandas, string, dataframe, types

How to fix column with numeric values that is taken as a string field because of empty strings in Pandas dataframe?


I have a df with some columns that are supposed to be numeric. However, these columns contain empty strings, so pandas assigns them the 'object' dtype. How do I remove those empty strings and convert the columns to the correct data type (int/float)?

Suppose column A looks like this:

2
1
0
''
NULL
Name: A, dtype: object

Expected column A should be

2
1
0
NULL
NULL 
Name: A, dtype: int

I tried the code below, but the data type does not change; the numbers are possibly still being treated as strings.

print(df[col].dtypes)
df[col].replace(r'^\s*$', np.nan, regex=True, inplace=True)
print(df[col].dtypes)

Output:

object
object

I don't want to use str.isnumeric() to convert the data type first, because it would affect other columns that really do hold strings. Is there a better approach?


Solution

  • I'm assuming the columns that hold numeric data contain only empty strings ("" or " ") alongside the numbers, since you're trying to .replace(r'^\s*$', np.nan, ...).

    If this is the case, you can call pandas.to_numeric, which converts all the numbers to a numeric type and the empty strings to NaN.

    Empty strings do not raise an error, but other non-numeric strings raise a ValueError.

    To leave columns that contain real (non-numeric) strings untouched, wrap the conversion in a try...except:

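    import pandas as pd

    df = pd.DataFrame(
        {"number_col": [1, ""], "string_col": ["some string", "other string"]}
    )
    print(df)
    #   number_col    string_col
    # 0          1   some string
    # 1             other string

    # Try to convert every column; a column with non-numeric strings
    # raises ValueError and is left unchanged.
    for col in df.columns:
        try:
            df[col] = pd.to_numeric(df[col])
        except ValueError:
            continue
    print(df)
    #    number_col    string_col
    # 0         1.0   some string
    # 1         NaN  other string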
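
Note that the converted column above comes out as float, because NaN cannot be stored in a plain int column. If you want integers with missing values, as in the expected output in the question, one option is pandas' nullable "Int64" dtype. Below is a minimal sketch of that idea, not a definitive implementation. The helper names blank_to_nan and to_numeric_where_possible are made up for illustration. Blanking out whitespace-only strings before the conversion is a precaution on my part, since I'm not certain every pandas version treats " " the same way as "".

```python
import numpy as np
import pandas as pd


def blank_to_nan(value):
    """Map empty or whitespace-only strings to NaN; pass everything else through."""
    if isinstance(value, str) and not value.strip():
        return np.nan
    return value


def to_numeric_where_possible(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric-looking object/string columns converted."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        # Only look at object/string columns, so datetimes etc. are not touched.
        if not (pd.api.types.is_object_dtype(series)
                or pd.api.types.is_string_dtype(series)):
            continue
        cleaned = series.map(blank_to_nan)
        try:
            converted = pd.to_numeric(cleaned)
        except (ValueError, TypeError):
            continue  # a real string column: leave the original values alone
        # If every non-missing value is a whole number, use the nullable
        # Int64 dtype so integers and missing values can coexist.
        if (converted.dropna() % 1 == 0).all():
            converted = converted.astype("Int64")
        out[col] = converted
    return out


# Example data modelled on column A from the question (hypothetical values for B).
example = pd.DataFrame(
    {"A": [2, 1, 0, "", None], "B": ["x", "y", " ", "z", "w"]}
)
result = to_numeric_where_possible(example)
print(result.dtypes)
```

Column A should end up with dtype Int64 and missing values in place of the empty string and the null, while column B stays as it was because it fails the conversion. Columns that mix whole numbers and decimals stay float.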