pandas, regression, data-analysis

Data standardization of features having lt/gt values among absolute values


One of the datasets I am dealing with has a few features that contain lt/gt values (such as <10 or >90) alongside absolute values. Please refer to the example below:

>>> df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
>>> df
   foo
0  <10
1   23
2   34
3   22
4  >90
5   42

Note: foo is a percentage, i.e. 0 <= foo <= 100.

How should such data be transformed so that regression models can be run on it?


Solution

  • One option is to impute the midpoint of the censored range: 5 for values reported as <10 (since 0 <= foo < 10) and, similarly, 95 for those reported as >90.

    Then add two extra indicator columns that flag which rows were imputed:

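    import pandas as pd

    df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
    # Indicator columns marking the rows that hold '<10' or '>90'
    dummies = pd.get_dummies(df, columns=['foo'])[['foo_<10', 'foo_>90']].astype(int)
    # Impute the range midpoints, then convert the column from strings to integers
    df['foo'] = df['foo'].replace({'<10': '5', '>90': '95'}).astype(int)
    df = pd.concat([df, dummies], axis=1)
    df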
    

    This will give you

      foo  foo_<10  foo_>90
    0   5        1        0
    1  23        0        0
    2  34        0        0
    3  22        0        0
    4  95        0        1
    5  42        0        0
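
If the thresholds differ from column to column, the same idea can be written without hard-coding '<10' and '>90'. The following is only a sketch: `split_censored`, the `_lt`/`_gt` column suffixes, and the `lower`/`upper` parameters are hypothetical names, and it assumes every censored value is written as a '<' or '>' sign followed by a number and that the feature is bounded (0 to 100 for foo).

```python
import pandas as pd


def split_censored(series, lower=0.0, upper=100.0):
    """Hypothetical helper: midpoint-impute '<t' / '>t' strings and flag them."""
    text = series.astype(str).str.strip()
    is_lt = text.str.startswith('<')
    is_gt = text.str.startswith('>')
    # Drop the leading comparison sign and parse what is left as a float
    number = pd.to_numeric(text.str.lstrip('<>'), errors='coerce').astype(float)
    # '<t' becomes the midpoint of [lower, t]; '>t' the midpoint of [t, upper]
    value = number.mask(is_lt, (lower + number) / 2)
    value = value.mask(is_gt, (number + upper) / 2)
    name = series.name
    return pd.DataFrame({
        name: value,
        f'{name}_lt': is_lt.astype(int),
        f'{name}_gt': is_gt.astype(int),
    })


df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
features = split_censored(df['foo'])
```

The midpoint is a heuristic rather than the only reasonable choice. Keeping the two indicator columns lets the regression fit a separate offset for the imputed rows, so it does not have to treat 5 and 95 as exact measurements.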