Search code examples
pythonpandasstringdataframeextract

How to extract numbers from strings on ALL the columns in python


Using str.extract() to extract numbers from strings works (see below), but it's time-consuming when I have to repeat this on 30+ columns.

Originally like this:

yyyy
2014    110.7438016528926\K
2015    103.7533512064343\O
2016    111.6531165311653\L
2017                  nan\L
2018                  nan\E
Name: dips_cert_earning_premium_weekly, dtype: object

Then I wrote these codes:

df['dips_cert_earning_premium_weekly'] = df['dips_cert_earning_premium_weekly'].str.extract(r'(\d+(?:\.\d+)?)').astype(float)

Output:

yyyy
2014    110.743802
2015    103.753351
2016    111.653117
2017           NaN
2018           NaN
Name: dips_cert_earning_premium_weekly, dtype: float64

Is there a more efficient way to do this?


Solution

  • I'm not sure how your whole dataset looks like or what you mean by slow: is it taking minutes? seconds? hours?. However, if it's possible you could try an approach not using regexp. For instance:

    letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\\"
    def remove_letters(x):
        try:
            return float(x.strip(letters))
        except:
            return None
    
    df['dips_cert_earning_premium_weekly'] = list(map(remove_letters,  df['dips_cert_earning_premium_weekly']))
    

    If the dataset only has one column with 31 rows, your version takes 0.0008417534828186036 seconds while the version I propose takes 0.0003292644023895264 seconds so it's a slight difference but it's worth trying. (I calculated this times by taking the times for N=1000 runs and then I took the average of all times)

    Otherwise, try to profile your code to see where the bottleneck is and try different implementations until you find the fastest one.

    Edit:

    To fully answer the question with the new data: To iterate over the dataframe and apply this to each column, you can try the following code:

    df = pd.DataFrame.from_dict(data)
    for (columnName, columnData) in df.iteritems():
       df[columnName] = list(map(remove_letters,  df[columnName])) # This is the same line from before. Only difference is that we are applying it to all columns.