python pandas data-cleaning data-wrangling

Not getting decimals when extracting values


I am practicing data wrangling and have run into an issue. When I inspect the unique values of the GPA column with

food['GPA'].unique()

the output is

array(['2.4', '3.654', '3.3', '3.2', '3.5', '2.25', '3.8', '3.904', '3.4',
       '3.6', '3.1', nan, '4', '2.2', '3.87', '3.7', '3.9', '2.8', '3',
       '3.65', '3.89', '2.9', '3.605', '3.83', '3.292', '3.35',
       'Personal ', '2.6', '3.67', '3.73', '3.79 btch', '2.71', '3.68',
       '3.75', '3.92', 'Unknown', '3.77', '3.63', '3.882'], dtype=object)

My idea is to convert them to strings first and then extract the floats and integers from them. But when I run the code

food['GPA'] = food['GPA'].astype(str).str.extract('(\d*\.\d+|\d+)', expand=False)
food['GPA'] = pd.to_numeric(food['GPA'], errors='coerce')

every value in the GPA column comes out as a whole number (2.0, 3.0 or 4.0) instead of keeping its decimal part.

food['GPA'].unique()
[3. 2. 4.]

Can anyone help me figure out why the decimals are being lost, and how to preserve them?


Solution

  • Write the pattern as a raw string (prefix it with r) so that the backslashes are passed to the regex engine unchanged instead of being processed as Python string escapes:

    food['GPA'] = food['GPA'].astype(str).str.extract(r'(\d*\.\d+|\d+)', expand=False)
    food['GPA'] = pd.to_numeric(food['GPA'], errors='coerce')
    print(food['GPA'].unique())
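
To check the behaviour in isolation, here is a minimal, self-contained sketch. The `food` DataFrame below is a hypothetical stand-in built from a handful of the values shown in the question, not the asker's real data, and the `pattern` and `extracted` names are introduced only for illustration.

```python
import pandas as pd

# Hypothetical sample: a few of the GPA values from the question,
# including a missing value and some non-numeric entries.
food = pd.DataFrame(
    {"GPA": ["2.4", "3.654", float("nan"), "4", "Personal ", "3.79 btch", "Unknown"]}
)

# Raw-string pattern: either a decimal number (leading digits optional)
# or a plain run of digits. The first alternative is tried first, so
# "3.654" is captured whole rather than as "3".
pattern = r"(\d*\.\d+|\d+)"

# Rows with no numeric text do not match and come back as NaN.
extracted = food["GPA"].astype(str).str.extract(pattern, expand=False)

# Convert the captured text to floats; anything unparseable becomes NaN.
food["GPA"] = pd.to_numeric(extracted, errors="coerce")

print(food["GPA"].tolist())
```

On this sample, entries such as '3.654' and '3.79 btch' should keep their decimal part, while 'Personal ', 'Unknown' and the missing value end up as NaN. If whole numbers still appear on the real data, it may be worth checking whether the column was already overwritten by an earlier run and reloading the original file before trying again.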