Search code examples
pythonpandasdecimalnoisedtype

How to get rid of noise (redundant commas/dots) in decimal values - Python


I have a dataset df with two columns ID and Value. Both are of Dtype "object". However, I would like to convert the column Value to Dtype "double" with a dot as decimal separator. The problem is that the values of this column contain noise due to the presence of too many commas (e.g. 0,1,,) - or after replacement too many dots (e.g. 0.1..). As a result, when I try to convert the Dtype to double, I get the error message: could not convert string to float: '0.2.'

Example code:

#required packages
import pandas as pd
import numpy as np
  
# initialize list of lists
data = [[1, '0,1'], [2, '0,2,'], [3, '0,01,,']]
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'Value'])

#replace comma with dot as separator
df = df.replace(',', '.', regex=True)

#examine dtype per column
df.info()

#convert dtype from object to double
df = df.astype({'Value': np.double}) #this is where the error message appears

The preferred outcome is to have the values within the column Value as 0.1, 0.2 and 0.01 respectively.

How can I get rid of the redundant commas or, after replacement, dots in the values of the column Values?


Solution

  • One option: use string functions to convert and strip the values. For example:

    #required packages                                                                  
    import pandas as pd                                                                 
    import numpy as np                                                                  
                                                                                        
    # initialize list of lists                                                          
    data = [[1, '0,1'], [2, '0,2,'], [3, '0,01,,']]                                     
                                                                                        
    # Create the pandas DataFrame                                                       
    df = pd.DataFrame(data, columns=['ID', 'Value'])                                    
                                                                                        
    #replace comma with dot as separator                                                
    df['Value'] = df['Value'].str.replace(',', '.', 1).str.rstrip(',') 
                                                                                        
    #examine dtype per column                                                           
    df.info()                                                                           
                                                                                        
    #convert dtype from object to double                                                
    df = df.astype({'Value': np.double})
          
    print("------ df:")                                                                              
    print(df)
    

    prints:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 2 columns):
     #   Column  Non-Null Count  Dtype 
    ---  ------  --------------  ----- 
     0   ID      3 non-null      int64 
     1   Value   3 non-null      object
    dtypes: int64(1), object(1)
    memory usage: 176.0+ bytes
    ----- df:
       ID  Value
    0   1   0.10
    1   2   0.20
    2   3   0.01