
Pandas: convert a Series that contains strings like "10%" and "0.10" to numeric


What is the best way to convert a Pandas series that contains strings of the type "10%" and "0.10" into numeric values?

I know that if I have a series with just "0.10" type strings I can just do pd.to_numeric.

I also know that if I have a series of "10%" type strings I can do str.replace("%","") and then do pd.to_numeric and divide by 100.
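For reference, a minimal sketch of those two single-format cases (the variable names are only illustrative, and `regex=False` is added so that "%" is treated as a literal character):

```python
import pandas as pd

# Case 1: plain decimal strings convert directly.
decimals = pd.Series(["0.10", "0.02563"])
decimals_numeric = pd.to_numeric(decimals)

# Case 2: percent strings need the "%" stripped first,
# then the result is divided by 100 to get a fraction.
percents = pd.Series(["10%", "5.5%"])
percents_numeric = pd.to_numeric(percents.str.replace("%", "", regex=False)) / 100
```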

The issue I have is with a series that mixes "0.10" and "10%" type strings. How do I best convert it into a series with the correct numeric values?

I think I could do it by first making a temporary True / False series that records whether each string contains "%", and then applying a function based on that. But this seems inefficient.

Is there a better way?

What I have tried, for reference:

import pandas as pd

mixed = pd.Series(["10%","0.10","5.5%","0.02563"])
mixed.str.replace("%","").astype("float")/100

0    0.100000
1    0.001000
2    0.055000
3    0.000256
dtype: float64
# This doesn't work, because even the 0.10 and 0.02563 are divided by 100.

Solution

  • A straightforward solution is to build a boolean mask for the "%" entries and convert each group in bulk:

    from pandas import Series, to_numeric
    
    mixed = Series(["10%", "0.10", "5.5%", "0.02563"])
    
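    # create a float series with the same index as mixed (all NaN for now)
    converted = Series(index=mixed.index, dtype='float')
    
    # boolean mask: True where the string contains a percent sign
    mask = mixed.str.contains("%")
    
    # percent strings: strip the "%", convert, and scale to a fraction
    converted.loc[mask] = to_numeric(mixed.loc[mask].str.replace("%", "", regex=False)) / 100
    # plain decimal strings: convert as they are
    converted.loc[~mask] = to_numeric(mixed.loc[~mask])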
    
    print(converted)
    # 0    0.10000
    # 1    0.10000
    # 2    0.05500
    # 3    0.02563
    # dtype: float64
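
A more compact variant of the same mask idea, offered only as a sketch: strip the percent sign from every entry, convert the whole series once, and then divide only the rows that originally carried a "%". It assumes that every entry is either a plain decimal or a number followed by a single trailing "%", and the names `is_percent` and `values` are illustrative.

```python
import pandas as pd

mixed = pd.Series(["10%", "0.10", "5.5%", "0.02563"])

# True where the original string ends with a percent sign
is_percent = mixed.str.endswith("%")

# Strip any trailing "%" and convert every entry to float in one pass
values = pd.to_numeric(mixed.str.rstrip("%"))

# Series.where keeps values where the condition is True (plain decimals)
# and substitutes values / 100 where it is False (the percent rows)
converted = values.where(~is_percent, values / 100)

print(converted)
```

This avoids pre-allocating an empty series, at the cost of computing `values / 100` for every row even though only the percent rows use it.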