I have a dataframe that has values like 1,362 - 2,037
, first, how can I clean them? like getting the mean of these two numbers (10 - 20
➡ 15
), normalizing them or anything else? second, how can I remove the comma in 1,362
to 1362
in all rows?
col1 | col2 | col3 |
---|---|---|
Yes | 579 | 402 - 1,120 |
No | 1,082 | 1,361 - 2,037 |
(mean, normalization, etc.)
col1 | col2 | col3 |
---|---|---|
Yes | 579 | 761 |
No | 1082 | 1699 |
shape of dataframe:
df = pd.read_csv("/content/cleanedNASA.csv")
df.describe
...
> [31 rows x 9 columns]
thanks.
df['y'] = pd.to_numeric(df['x']
.str.replace('[^0-9\- ]', '') # 1. remove non-(digit/dash/space) characters
.str.split(' - ') # 2. split by ' - '
.explode() # 3. explode the lists
).mean(level=0) # 4. calculate average by index
df
Output:
x y
0 402 - 1,120 761
1 674 674
2 3,196 3196
3 1,304 1304
4 853 853
5 1,361 - 2,037 1699