Search code examples
pythonregexpandasdataframefindall

Pandas - Choose several floats from the same string in Pandas to operate with them


I have a dataframe extracted with Pandas for which one of the colums looks something like this:

Image: A few elements of the column of interest of the Dataframe

What I want to do is to extract the numerical values (floats) in this column, which by itself I could do. The issue comes because I have some cells, like the cell 20 in the image, in which I have more than one number, so I would like to make an average of these values. I think that for that I would first need to recognize the different groups of numerical values in the string (each float number) and then extract them as floats to then operate with them. I don't know how to do this.

Edit: I have found an solution to this using the re.findall command from regex. This is based on the answer of a question in this thread Find all floats or ints in a given string.

for index,value in z.iteritems():
z[index]=statistics.mean([float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))',value)])

Note that I haven't included match for integers, and only account for values up to 99, just due to the type of data that I have.

However, I get a warning with this approach, due to the loop (there is no warning when I do it only for one element of the series):

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

Although I don't see any issue happening with my data, is this warning important?


Solution

  • I have found a solution based on what I wrote previously in the Edit of the original post:

    It consists on using the re.findall() command with regex, as posted in this thread Find all floats or ints in a given string:

    statistics.mean([float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))',string)])
    

    Then, to loop over the dataframe column, just use the lambda x: method with the pandas apply command (df.apply). For this, I have defined a function (redshift_to_num) executing the operation above, and then apply this function to each element in the dataframe column:

    import re
    import pandas as pd
    import statistics
    
    def redshift_to_num(string):
        measures=[float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))',string)]
        mean=statistics.mean(measures)
        return mean
    
    df.Redshift=df.Redshift.apply(lambda x: redshift_to_num(x))
    

    Notes:

    • The data of interest in my case is stored in the dataframe column df.Redshift.
    • In the re.findall command I haven't included match for integers, and only account for values up to 99, just due to the type of data that I have.