Search code examples
pandastype-conversion

Pandas, astype(int) applied to float column returns negative numbers


My task is to read data from excel to dataframe. The data is a bit messy and to clean that up I've done:

df_1 = pd.read_excel(offers[0])
df_1 = df_1.rename(columns={'Наименование [Дата Файла: 29.05.2019 время: 10:29:42 ]':'good_name', 
                     'Штрихкод':'barcode', 
                     'Цена шт. руб.':'price',
                     'Остаток': 'balance'
                    })
df_1 = df_1[new_columns]
# I don't know why but without replacing NaN with another char code doesn't work
df_1.barcode = df_1.barcode.fillna('_')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to numeric
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
df_1.head()

It returns column barcode with type float64 (why so?)

0    0.000000e+00
1    7.613037e+12
2    7.613037e+12
3    7.613034e+12
4    7.613035e+12
Name: barcode, dtype: float64

Then I try to convert that column to integer.

df_1.barcode = df_1.barcode.astype(int)

But I keep getting silly negative numbers.

df_1.barcode[0:5]
0             0
1   -2147483648
2   -2147483648
3   -2147483648
4   -2147483648

Name: barcode, dtype: int32

Thanks to @Will and @micric eventually I've got a solution.

df_1 = pd.read_excel(offers[0])
df_1 = df_1[new_columns]
# replacing NaN with 0, it'll help to convert the column explicitly to dtype integer
df_1.barcode = df_1.barcode.fillna('0')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to integer
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer')

Resume:

  • pd.to_numeric converts NaN to float64. As a result from column with both NaN and not-Nan values we should expect column dtype float64.
  • Check size of number you're dealing with. int32 has its limit, which is 2**32 = 4294967296. Thanks a lot for your help, guys!

Solution

  • Many questions in one.

    So your expected dtype...

    pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
    

    pd.to_numeric downcast to integer would give you an integer, however, you have NaNs in your data and pandas needs to use a float64 type to represent NaNs