I am working on a data cleaning exercise where I need to remove special characters like '$#@' from the 'price' column, which is of object type (string). After that, I need to convert it to float type. However, the decimal point position changes when I run the code. For example, 9.99 becomes 999.00. I have tried different sets of codes, but some of them change the values to NaN.
#Step 1 I created a data frame with special data to clean it.
#Create a dictionary of wine data
wine_data = { ' country': ['Italy ', 'It aly ', ' $Chile ', 'Sp ain', '$Spain', 'ITALY', '# Chile', ' Chile', 'Spain', ' Italy'], 'price ': [24.99, np.nan, 12.99, '$9.99', 11.99, 18.99, '@10.99', np.nan, '#13.99', 22.99], '#volume': ['750ml', '750ml', 750, '750ml', 750, 750, 750, 750, 750, 750], 'ran king': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'al cohol@': [13.5, 14.0, np.nan, 12.5, 12.8, 14.2, 13.0, np.nan, 12.0, 13.8], 'total_PHeno ls': [150, 120, 130, np.nan, 110, 160, np.nan, 140, 130, 150], 'color# _INTESITY': [10, np.nan, 8, 7, 8, 11, 9, 8, 7, 10], 'HARvest_ date': ['2021-09-10', '2021-09-12', '2021-09-15', np.nan, '2021-09-25', '2021-09-28', '2021-10-02', '2021-10-05', '2021-10-10', '2021-10-15'] }
df = pd.DataFrame(wine_data)
Step 2
To clean the 'price' column and remove special characters, a new column named 'price' was created. The str.replace() method was employed with the regular expression '\D' to remove any non-numeric characters. However, in positions 3, 6, and 8, the decimal point was shifted to the right resulting in values like 999.00 instead of 9.99.
I tried the next codes:
#Great! but, it changes the decimal point in some of the values
df['price'] = df['price'].replace({'\D': ''}, regex=True).astype(float)
#Not Working! price values are changed into NaN
df['price'] = df['price'].str.replace('\D', '')
#Not Working
#I tried to fill it with '0' NaN. But, other values were changed into NaN
df['price'] = df['price'].fillna('0').str.replace(r'\D', r'') df['price'] = df['price'].fillna('0').str.replace(r'\D', r'', regex=True).astype(float)
I make a conscious effort to practice and improve my data cleaning skills by creating problems for myself. However, there are times when I am unable to solve them on my own.your text
You could achieve this by making sure converted to str
type initially from object
type, then replacing the specific special characters by empty string and then finally converting back to float
type
df['price'] = df['price'].astype(str).str.replace("[@#/$]","" ,regex=True).astype(float)