Search code examples
pythonpandasdataframereplacedata-cleaning

How can I remove special characters in python like ('$9.99', '@10.99', '#13.99') from a string column, without moving the decimal point?


I am working on a data cleaning exercise where I need to remove special characters like '$#@' from the 'price' column, which is of object type (string). After that, I need to convert it to float type. However, the decimal point position changes when I run the code. For example, 9.99 becomes 999.00. I have tried different sets of codes, but some of them change the values to NaN.

#Step 1 I created a data frame with special data to clean it. #Create a dictionary of wine data wine_data = { ' country': ['Italy ', 'It aly ', ' $Chile ', 'Sp ain', '$Spain', 'ITALY', '# Chile', ' Chile', 'Spain', ' Italy'], 'price ': [24.99, np.nan, 12.99, '$9.99', 11.99, 18.99, '@10.99', np.nan, '#13.99', 22.99], '#volume': ['750ml', '750ml', 750, '750ml', 750, 750, 750, 750, 750, 750], 'ran king': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'al cohol@': [13.5, 14.0, np.nan, 12.5, 12.8, 14.2, 13.0, np.nan, 12.0, 13.8], 'total_PHeno ls': [150, 120, 130, np.nan, 110, 160, np.nan, 140, 130, 150], 'color# _INTESITY': [10, np.nan, 8, 7, 8, 11, 9, 8, 7, 10], 'HARvest_ date': ['2021-09-10', '2021-09-12', '2021-09-15', np.nan, '2021-09-25', '2021-09-28', '2021-10-02', '2021-10-05', '2021-10-10', '2021-10-15'] }

Create a Pandas data frame from the dictionary

df = pd.DataFrame(wine_data)

Step 2

To clean the 'price' column and remove special characters, a new column named 'price' was created. The str.replace() method was employed with the regular expression '\D' to remove any non-numeric characters. However, in positions 3, 6, and 8, the decimal point was shifted to the right resulting in values like 999.00 instead of 9.99.

I tried the next codes:

#Great! but, it changes the decimal point in some of the values df['price'] = df['price'].replace({'\D': ''}, regex=True).astype(float)

enter image description here

#Not Working! price values are changed into NaN df['price'] = df['price'].str.replace('\D', '')

enter image description here

#Not Working #I tried to fill it with '0' NaN. But, other values were changed into NaN df['price'] = df['price'].fillna('0').str.replace(r'\D', r'') df['price'] = df['price'].fillna('0').str.replace(r'\D', r'', regex=True).astype(float)

I make a conscious effort to practice and improve my data cleaning skills by creating problems for myself. However, there are times when I am unable to solve them on my own.your text


Solution

  • You could achieve this by making sure converted to str type initially from object type, then replacing the specific special characters by empty string and then finally converting back to float type

    df['price'] = df['price'].astype(str).str.replace("[@#/$]","" ,regex=True).astype(float)