Search code examples
pythonpandasdataframedata-cleaning

Dealing with numeric range-like data in a dataframe


I have a dataframe that has values like 1,362 - 2,037, first, how can I clean them? like getting the mean of these two numbers (10 - 2015), normalizing them or anything else? second, how can I remove the comma in 1,362 to 1362 in all rows?

Currently Have:

col1 col2 col3
Yes 579 402 - 1,120
No 1,082 1,361 - 2,037

Expected Output:

(mean, normalization, etc.)

col1 col2 col3
Yes 579 761
No 1082 1699

shape of dataframe:

df = pd.read_csv("/content/cleanedNASA.csv")
df.describe
...
> [31 rows x 9 columns]

thanks.


Solution

  • df['y'] = pd.to_numeric(df['x']
        .str.replace('[^0-9\- ]', '') # 1. remove non-(digit/dash/space) characters
        .str.split(' - ')             # 2. split by ' - '
        .explode()                    # 3. explode the lists
    ).mean(level=0)                   # 4. calculate average by index
    
    df
    

    Output:

                   x     y
    0    402 - 1,120   761
    1            674   674
    2          3,196  3196
    3          1,304  1304
    4            853   853
    5  1,361 - 2,037  1699