I have a df with some columns that are supposed to be numeric fields. However, these columns contain empty strings, which leads to their dtype being incorrectly assigned as 'object'. How do I remove those empty strings and convert the columns to the correct data type (int/float)?
Suppose column A looks like this:
2
1
0
''
NULL
Name: A, dtype: object
Expected column A should be
2
1
0
NULL
NULL
Name: A, dtype: int
I tried the code below, but the data type does not change; the numbers are potentially still treated as strings.
print(df[col].dtypes)
df[col].replace(r'^\s*$', np.nan, regex=True, inplace=True)
print(df[col].dtypes)
Output:
object
object
I don't want to use str.isnumeric() to convert the data type first because it will impact other real string columns. Is there a better approach?
I'm assuming the columns that have numeric data only have empty strings ("" or " ") along with the numbers, since you're trying to .replace(r'^\s*$', np.nan...).
If this is the case, you can call pandas.to_numeric and it will convert all numbers to a numeric type, and the empty strings to NaN.
Empty strings will not raise an error, but other strings will.
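As a minimal sketch of that conversion on a single column: the Series below is a made-up stand-in for column A from the question (its values and the variable names are assumptions, not the asker's data). To be safe, whitespace-only strings are turned into NaN first with the regex from the question, since to_numeric may not treat " " the same way as "".

```python
import numpy as np
import pandas as pd

# Hypothetical column mirroring the question: numbers stored as strings,
# plus an empty string, a whitespace-only string, and a missing value.
s = pd.Series(["2", "1", "0", "", " ", None], name="A", dtype=object)

# Normalize empty or whitespace-only strings to NaN (the question's regex),
# then let to_numeric parse what is left.
cleaned = s.replace(r"^\s*$", np.nan, regex=True)
converted = pd.to_numeric(cleaned)

print(converted.dtype)  # float64, because NaN forces a float column
```

Assigning the result back (df[col] = pd.to_numeric(...)) is what changes the column's dtype; replacing values alone leaves it as object.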
To handle other columns that contain other strings (not numerical), you can use a try...except:
import pandas as pd

df = pd.DataFrame(
    {"number_col": [1, ""], "string_col": ["some string", "other string"]}
)
   number_col    string_col
0           1   some string
1              other string
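for col in df.columns:
    try:
        # Numbers and empty strings convert; empty strings become NaN
        df[col] = pd.to_numeric(df[col])
    except ValueError:
        # Column holds real strings, so leave it untouched
        continue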
   number_col    string_col
0         1.0   some string
1         NaN  other string
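Note that number_col comes out as float, because NaN is a float value. If an integer dtype with missing values is wanted, as in the expected output of the question, one option is pandas' nullable integer dtype "Int64". The sketch below assumes every non-missing value is a whole number; the Series is again a made-up stand-in for column A.

```python
import pandas as pd

# Hypothetical column: integer-like strings, an empty string, a missing value.
s = pd.Series(["2", "1", "0", "", None], name="A", dtype=object)

as_float = pd.to_numeric(s)        # float64; "" and None become NaN
as_int = as_float.astype("Int64")  # nullable integer; NaN becomes <NA>

print(as_int.dtype)  # Int64
```

The capital "I" matters: "Int64" is the nullable extension dtype, while plain "int64" cannot hold missing values and would raise on the NaN entries.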