python csv port weka analysis

How to read an object dtype attribute as int in a Pandas column, or "ValueError: invalid literal for int() with base 10: '0x0303'"


I'm trying to analyze a dataset with Python. The dataset has several types (int, float, string). I converted all of them except two attributes, source port and destination port, whose dtype is object.

When I explore these attributes in Python:

     Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   sport   668522 non-null  object
 1   dport   668522 non-null  object

The values are:

  sport dport
0  6226    80
1  6227    80
2  6228    80
3  6229    80
4  6230    80

As far as I can see, these are just numeric values, so why does Python treat the ports as objects?
I also tried the Weka tool, but the program can't read the values. Can anyone explain the reason, or how to solve the problem? The port is an important feature that is useful for mining the data, and I don't want to drop it from the dataset.

Update: The dataset is in CSV format. A sample of the values is shown above. There are two features: source port ("sport" for short) and destination port ("dport" for short).

In Python, to read the values:

import pandas as pd 
dt = pd.read_csv("port.csv")

Printing dt shows the values, but an ML algorithm such as k-means can't handle them.

In Weka, on the other hand, importing the CSV file displays the following message: "Attribute is neither numeric nor nominal"


Solution

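  • pandas falls back to the object dtype when at least one value in a column cannot be read as a number, such as the hex string '0x0303' in your error message. You can convert the column to another dtype with astype or pd.to_numeric, so there is no need to drop the column; change its dtype instead.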

    import pandas as pd
    
    #create DataFrame
    df = pd.DataFrame({'player': ['A', 'B', 'C', 'D', 'E'],
                       'points': ['25', '27', '0x0303', '17', '20'],
                       'assists': ['5', '7', '10', '8', '9']})
    print(df.dtypes)
    # Converting 'points' directly to integer fails:
    # df['points'] = df['points'].astype(int)
    # raises ValueError because '0x0303' is not a base-10 integer literal
    
    df['points'] = pd.to_numeric(df['points'], errors='coerce').astype('Int64')
    # errors='coerce' replaces every value that cannot be parsed as a number with NaN
    # (shown as <NA> in the nullable 'Int64' dtype)
    
    #check dtype after conversion
    print("after data type conversion \n", df.dtypes)
    
    
    # output will look like this
    player     object
    points     object
    assists    object
    dtype: object
    after data type conversion 
    player     object
    points      Int64
    assists    object
    assists    object
    dtype: object
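
Before coercing anything, it can help to see which values are actually stopping the column from being numeric. The following is a minimal sketch, assuming the problem values are strings that are not plain decimal digits; the sample DataFrame and the name bad_values are made up for illustration and stand in for the real port.csv.

```python
import pandas as pd

# Hypothetical sample standing in for port.csv: mostly decimal ports plus hex strings.
df = pd.DataFrame({"sport": ["6226", "6227", "0x0303", "6229"],
                   "dport": ["80", "80", "0x0050", "80"]})

bad_values = {}
for col in ["sport", "dport"]:
    # Compare as stripped strings so stray whitespace is not reported as a problem.
    text = df[col].astype(str).str.strip()
    # True where the whole value consists of decimal digits only.
    is_decimal = text.str.fullmatch(r"\d+")
    # Keep the distinct values that are not plain decimal integers.
    bad_values[col] = text[~is_decimal].unique().tolist()

print(bad_values)
# {'sport': ['0x0303'], 'dport': ['0x0050']}
```

If the list of offending values is short, it usually shows whether they are hex strings, placeholders, or something else, which decides how they should be converted.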
    

    This answer might also help: why does pandas use object as a dtype?
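
Note that errors='coerce' turns '0x0303' into a missing value, which discards a port that is probably valid. If the non-decimal entries really are hexadecimal port numbers (an assumption about your data, not something the question confirms), one possible alternative is to parse them explicitly. The helper parse_port and the sample values below are hypothetical.

```python
import pandas as pd

def parse_port(value):
    """Parse a port written in decimal ('80') or hex ('0x0050'); return pd.NA otherwise."""
    text = str(value).strip()
    try:
        if text.lower().startswith("0x"):
            return int(text, 16)   # hex literal, e.g. '0x0303' -> 771
        return int(text, 10)       # plain decimal, e.g. '6226'
    except ValueError:
        return pd.NA               # anything else becomes a missing value

# Hypothetical sample standing in for the sport/dport columns of port.csv.
df = pd.DataFrame({"sport": ["6226", "0x0303", "6228"],
                   "dport": ["80", "0x0050", "n/a"]})

for col in ["sport", "dport"]:
    # The nullable 'Int64' dtype keeps integers and missing values in one column.
    df[col] = df[col].map(parse_port).astype("Int64")

print(df)
print(df.dtypes)
```

The resulting columns are integer-typed, so they can be passed on once the remaining missing values have been filled or dropped, since many ML algorithms, k-means included, do not accept missing values.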