Search code examples
pythonpandassplitfractions

Splitting a mixed number string from a dataframe column and converting it to a float


I have a dataframe with a column of strings that are a mix of whole numbers and mixed fractions. I would like to convert column 'y' to floats.

x            y         z
0            4      Info
1        8 1/2      Info
2          3/4      Info
3           10      Info
4            4      Info
5        6 1/4      Info

The logic I am considering is to split column 'y' by ' ' and '/' to create three separate columns that would look like this.

x         base        b        c         z
0            4        0        0      Info
1            8        1        2      Info
2            0        3        4      Info
3           10        0        0      Info
4            4        0        0      Info
5            6        1        4      Info

From here I could

def convertReplace(df):
    convert = lambda x: float(x)
    df['base'].apply(convert)
    df['b'].apply(convert)
    df['c'].apply(convert)
    decimal = lambda x,y: x/y        
    try:
        df['d'] = decimal(df['b'],df['c'])
        df['y'] = df['base'] + df['d']
    except:
        df['y'] = df['base']
    return df

This might work but I can't get the column to split using the method found here.

df = pd.DataFrame(df.y.str.split(' ',1).str.split('/',1).tolist(),columns = ['base','b','c'])

The error says it expects 3 arguments each time when it may be 1, 2, or 3. Even this thread doesn't use multiple separators.

The actual dataframe has over 400k rows. Efficiency would be great but I'm more interested in just getting it done. Is this logic correct or is there a more concise way to do this? Any help is appreciated.


Solution

  • You could try the fractions module. Here's a one-liner:

    import fractions
    df['y_float'] = df['y'].apply(lambda frac: float(sum([fractions.Fraction(x) for x in frac.split()])))
    

    This gives:

           y     z  y_float
    0      4  Info     4.00
    1  8 1/2  Info     8.50
    2    3/4  Info     0.75
    3     10  Info    10.00
    4      4  Info     4.00
    5  6 1/4  Info     6.25
    

    [EDIT] Corrected version accounting for negative fractions, as well as invalid text:

    I realized the above approach would not work for negative fractions, so here is that taken in to account. As it turns out, a one-liner for this would be very tricky!

    def get_sign(num_str):
        """
        Verify the sign of the fraction
        """
        return 1-2*num_str.startswith('-')
    
    def is_valid_fraction(text_str):
        """
        Check if the string provided is a valid fraction.
        Here I just used a quick example to check for something of the form of the fraction you have. For something more robust based on what your data can potentially contain, a regex approach would be better.
        """
        return text_str.replace(' ', '').replace('-', '').replace('/', '').isdigit()
    
    def convert_to_float(text_str):
        """
        Convert an incoming string to a float if it is a fraction
        """
         if is_valid_fraction(text_str):
             sgn = get_sign(text_str)
             return sgn*float(sum([abs(fractions.Fraction(x)) for x in text_str.split()]))
         else:
             return pd.np.nan # Insert a NaN if it is invalid text
    

    So now you will have this:

    >>> df['y_float'] = df['y'].apply(lambda frac: convert_to_float(frac))
    >>> df
                  y     z  y_float
    0             4  Info     4.00
    1         8 1/2  Info     8.50
    2           3/4  Info     0.75
    3            10  Info    10.00
    4             0  Info     0.00
    5         6 1/4  Info     6.25
    6        -3 2/5  Info    -3.40
    7          -4/5  Info    -0.80
    8  gibberish100  Info      NaN