
Series with mixed elements such as NaNs and alphanumeric values, from which the numbers need to be kept and converted to float


I have a dataframe column which appears as follows and has the following characteristics:

>>> df.dtypes
location     object
sensor_1     object
sensor_2    float64

>>> df['sensor_1'].head(4)
0    3 m3/h
1       NaN
2       NaN
3       NaN
Name: sensor_1, dtype: object

>>> type(df['sensor_1'][0])
str

>>> type(df['sensor_1'][1])
float

My goal is to keep the numeric part of "sensor_1" and have it recognised as float, bearing in mind that the column contains nulls, which, as I understand it, are already treated as numeric (float).

I tried a few things which did not work:

# it did not change anything
pd.to_numeric(df['sensor_1'], errors='coerce')

# tried to strip the last 5 characters if not null and then convert the remaining part to float
df['sensor_1'].apply(lambda x: x.str[:-5].astype(float) if pd.notnull(x) else x)
AttributeError: 'str' object has no attribute 'str'

# unsure how to go on from there
df['sensor_1'].to_string()

So... running out of ideas really and asking for your help. Thank you ^_^


Solution

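  • Use Series.str.extract: first cast the values to strings, then extract the leading number, and finally cast the result to float. Rows with no digits (including the NaNs) come out as NaN:

    df['sensor_1'] = (df['sensor_1'].astype(str)
                                    .str.extract(r'(\d+\.?\d*)', expand=False)
                                    .astype(float))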
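
For anyone who wants to try this without the original data, here is a minimal self-contained sketch. The sample values are made up to resemble the column in the question (only "3 m3/h" and the NaNs come from the post; "12.5 m3/h" is an assumed extra row to show that decimals are handled).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data resembling the question's column:
# strings with a unit suffix, mixed with NaN (which pandas stores as float).
df = pd.DataFrame({"sensor_1": ["3 m3/h", np.nan, "12.5 m3/h", np.nan]})

# Cast to str so the .str accessor sees only strings, pull out the first
# number (integer or decimal), then convert the extracted text to float.
# Rows without any digits yield NaN from str.extract.
df["sensor_1"] = (
    df["sensor_1"]
    .astype(str)
    .str.extract(r"(\d+\.?\d*)", expand=False)
    .astype(float)
)

print(df["sensor_1"].dtype)
print(df["sensor_1"].tolist())
```

Note that the pattern takes the first number in each string, so the "3" inside the "m3/h" unit is ignored as long as the reading comes first. It also does not handle negative values or scientific notation; the pattern would need extending for those.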