python pandas dataframe type-conversion

Converting dtypes in a messy pandas DataFrame when some numeric columns contain a few string values


I have a large DataFrame and want to convert its columns to the appropriate dtypes. The problem is that several numeric columns contain a few strings. I know about convert_dtypes and to_numeric.

  • The problem with convert_dtypes is that it does not infer a column as int/float as soon as the column contains strings. to_numeric, on the other hand, has errors="coerce", which turns all invalid values into NaN.
  • The problem with to_numeric is that several columns are genuinely string columns, so I can't just run it on all columns.
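To make the two behaviours concrete, here is a minimal sketch. The sample Series is hypothetical (it is not taken from the real dataset) and is stored with dtype=object, which is an assumption about how the mixed columns are held in memory.

```python
import pandas as pd

# Hypothetical sample: a mostly numeric column with one stray string.
s = pd.Series(["1", "7", "c", "4"], dtype=object)

# errors="coerce" parses what it can and replaces everything else with NaN.
coerced = pd.to_numeric(s, errors="coerce")
print(coerced.dtype)            # float64
print(coerced.isna().tolist())  # [False, False, True, False]

# convert_dtypes() does not produce a numeric dtype for the same column.
converted = s.convert_dtypes()
print(pd.api.types.is_numeric_dtype(converted))  # False
```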

So I am looking for a function that converts a column to a numeric dtype if a certain percentage of its values are numeric. It would be great if the threshold could be set.

(As mentioned, the dataset is large, so I would prefer a solution that handles all the columns automatically.)


Solution

  • Use a custom function with DataFrame.apply that converts each column to numeric and returns the numeric column if the condition is met, otherwise the original column:

    print (df)
       a  b  c  d  e
    0  1  5  4  3  8
    1  7  8  9  f  9
    2  c  c  g  g  4
    3  4  t  r  e  4
    
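    def f(x, thresh):
        # parse the column; values that cannot be parsed become NaN
        y = pd.to_numeric(x, errors='coerce')
        # keep the numeric version only if the share of parsed values exceeds thresh
        return y if y.notna().mean() > thresh else x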
    
    thresh = 0.7
    df1 = df.apply(f, args= (thresh,))
    print (df1)
         a  b  c  d  e
    0  1.0  5  4  3  8
    1  7.0  8  9  f  9
    2  NaN  c  g  g  4
    3  4.0  t  r  e  4
    
    print (df1.dtypes)
    a    float64
    b     object
    c     object
    d     object
    e      int64
    dtype: object
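To help pick a threshold, it can be useful to look at the share of parseable values per column first. The following is a sketch under assumptions: the frame is a hypothetical reconstruction of the sample above, stored as objects, and numeric_share is a helper name introduced here for illustration.

```python
import pandas as pd

# Hypothetical reconstruction of the sample frame shown above.
df = pd.DataFrame(
    {
        "a": ["1", "7", "c", "4"],
        "b": ["5", "8", "c", "t"],
        "c": ["4", "9", "g", "r"],
        "d": ["3", "f", "g", "e"],
        "e": [8, 9, 4, 4],
    },
    dtype=object,
)

def numeric_share(col):
    """Return the fraction of values in col that pd.to_numeric can parse."""
    return pd.to_numeric(col, errors="coerce").notna().mean()

# One fraction per column: a=0.75, b=0.5, c=0.5, d=0.25, e=1.0
shares = df.apply(numeric_share)
print(shares)
```

With thresh = 0.7 only a and e exceed the threshold, which matches the dtypes printed above. Because the comparison is strict (>), a threshold of exactly 0.75 would leave column a unconverted.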
    

    Modified solution for data that already contains missing values, so that existing NaNs do not count against a column:

    print (df)
       a  b    c  d  e
    0  1  5    4  3  8
    1  7  8  NaN  f  9
    2  c  c  NaN  g  4
    3  4  t    r  e  4
    
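    def f(x, thresh):
        # parse the column; values that cannot be parsed become NaN
        y = pd.to_numeric(x, errors='coerce')
        # count values that parsed or were already missing in the original column
        return y if (y.notna() | x.isna()).mean() > thresh else x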
    
    thresh = 0.7
    df1 = df.apply(f, args= (thresh,))
    print (df1)
         a  b    c  d  e
    0  1.0  5  4.0  3  8
    1  7.0  8  NaN  f  9
    2  NaN  c  NaN  g  4
    3  4.0  t  NaN  e  4
    
    print (df1.dtypes)
    a    float64
    b     object
    c    float64
    d     object
    e      int64
    dtype: object
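Since coercion silently replaces strings with NaN, it may be worth reporting how many values each column lost. Below is a self-contained sketch of the approach above with such a report added. The names to_numeric_if_mostly_numeric and convert_frame, the default threshold, and the sample data are all assumptions made for illustration.

```python
import pandas as pd

def to_numeric_if_mostly_numeric(col, thresh):
    """Return col parsed as numeric if enough values parse, else col unchanged."""
    parsed = pd.to_numeric(col, errors="coerce")
    # A value is acceptable if it parsed or was already missing in the input.
    ok = parsed.notna() | col.isna()
    return parsed if ok.mean() > thresh else col

def convert_frame(df, thresh=0.7):
    """Convert mostly numeric columns and count the values coerced to NaN."""
    converted = df.apply(to_numeric_if_mostly_numeric, args=(thresh,))
    # Present in the input but missing in the output means the value was coerced.
    lost = (df.notna() & converted.isna()).sum()
    return converted, lost

# Hypothetical sample data, stored as objects.
df = pd.DataFrame(
    {
        "a": ["1", "7", "c", "4"],
        "c": ["4", None, None, "r"],
        "d": ["3", "f", "g", "e"],
    },
    dtype=object,
)

converted, lost = convert_frame(df)
print(converted.dtypes)
print(lost)  # a: 1, c: 1, d: 0
```

Columns that stay as strings report zero lost values, so a nonzero count points directly at the columns where data was dropped by the conversion.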