python csv typeerror one-hot-encoding

OneHotEncoding error when applying to an empty field


The code applies one-hot encoding to two fields of a binetflow file: Proto and State. I have to do this for 5 files. The code below worked perfectly on the first two, but on the third file it throws this error:

TypeError: '<' not supported between instances of 'str' and 'float'.

I'm sure the error is caused by this line of the file, in which the State field is empty: 0.000000,icmp,,60,60.0,0

I simply want to skip the one-hot encoding for that row, copy the State field as it is (empty), and move on to the next line.

df = opendataset()

df['State2'] = df['State']
df['Proto2'] = df['Proto']
df['Dur'] = df.Dur.apply(lambda n: '%.6f' % n)

le = LabelEncoder()
dfle = df
dfle.State = le.fit_transform(dfle.State)
X = dfle[['State']].values
Y = dfle[['Proto']].values
ohe = OneHotEncoder()
OnehotX = ohe.fit_transform(X).toarray()
OnehotY = ohe.fit_transform(Y).toarray()

dx = pd.DataFrame(data=OnehotX)
dy = pd.DataFrame(data=OnehotY)

dfle['State'] = (dx[dx.columns[0:]].apply(lambda x:''.join(x.dropna().astype(int).astype(str)), axis=1))
dfle['Proto'] = (dy[dy.columns[0:]].apply(lambda y:''.join(y.dropna().astype(int).astype(str)), axis=1))


08-03 Edit

Below is the traceback I get when I run the code above. As you can see, the error is raised at dfle.State = le.fit_transform(dfle.State), which in turn breaks OnehotX = ohe.fit_transform(X).toarray().

Traceback (most recent call last):
  File "C:/Users/V/PycharmProjects/PreProcess/testfile.py", line 39, in <module>
    dfle.State = le.fit_transform(dfle.State)
  File "C:\Users\V\PycharmProjects\PreProcess\venv\lib\site-packages\sklearn\preprocessing\label.py", line 236, in fit_transform
    self.classes_, y = _encode(y, encode=True)
  File "C:\Users\V\PycharmProjects\PreProcess\venv\lib\site-packages\sklearn\preprocessing\label.py", line 108, in _encode
    return _encode_python(values, uniques, encode)
  File "C:\Users\V\PycharmProjects\PreProcess\venv\lib\site-packages\sklearn\preprocessing\label.py", line 63, in _encode_python
    uniques = sorted(set(values))
TypeError: '<' not supported between instances of 'str' and 'float'
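
The last frame of the traceback, sorted(set(values)), hints at the cause. An empty CSV field is typically loaded by pandas as NaN, which is a float, so the State column ends up mixing strings with a float, and Python cannot order the two. Here is a minimal sketch in plain Python, with made-up State values and a hypothetical helper try_sort:

```python
# Hypothetical State values; float("nan") stands in for the empty field.
states = ["CON", "INT", float("nan")]


def try_sort(values):
    """Return the sorted unique values, or the TypeError message if sorting fails."""
    try:
        return sorted(set(values))
    except TypeError as exc:
        return str(exc)


message = try_sort(states)
# Prints a "'<' not supported between instances of ..." message.
print(message)
```

With only strings in the list, the same call sorts without complaint, which would explain why the first two files worked.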

NEW CODE: I tried to do what Hemerson Tacon suggested and wrapped the statements flagged by the traceback in try/except, but now a different error is thrown.

le = LabelEncoder()
dfle = df

try:
    dfle.State = le.fit_transform(dfle.State)
except TypeError:
    pass
X = dfle[['State']].values
Y = dfle[['Proto']].values
ohe = OneHotEncoder()
try:
    OnehotX = ohe.fit_transform(X).toarray()
except ValueError:
    pass

OnehotY = ohe.fit_transform(Y).toarray()

dx = pd.DataFrame(data=OnehotX)
dy = pd.DataFrame(data=OnehotY)

dfle['State'] = (dx[dx.columns[0:]].apply(lambda x:''.join(x.dropna().astype(int).astype(str)), axis=1))
dfle['Proto'] = (dy[dy.columns[0:]].apply(lambda y:''.join(y.dropna().astype(int).astype(str)), axis=1))

NEW ERROR:

Traceback (most recent call last):
  File "C:/Users/V/PycharmProjects/PreProcess/testfile.py", line 53, in <module>
    dx = pd.DataFrame(data=OnehotX)
NameError: name 'OnehotX' is not defined

LAST EDIT 09/03

The solution was simply to add a replace() call on the State column before encoding. It replaces NaN with the word "empty", so the column contains only strings, which fixes the problem.

dfle['State'].replace(np.nan,"empty", inplace=True)

df = opendataset()

df['State2'] = df['State']
df['Proto2'] = df['Proto']
df['Dur'] = df.Dur.apply(lambda n: '%.6f' % n)

le = LabelEncoder()
dfle = df

dfle['State'].replace(np.nan,"empty", inplace=True)

dfle.State = le.fit_transform(dfle.State)

X = dfle[['State']].values
Y = dfle[['Proto']].values
ohe = OneHotEncoder()

OnehotX = ohe.fit_transform(X).toarray()
OnehotY = ohe.fit_transform(Y).toarray()

dx = pd.DataFrame(data=OnehotX)
dy = pd.DataFrame(data=OnehotY)
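
The idea behind this fix can be sketched without pandas or scikit-learn: replace the missing value with a placeholder string first, after which every value is a string and can be sorted and encoded. The code below is a dependency-free illustration, not the library calls used above. The helper names is_missing and one_hot_strings and the sample values are hypothetical. It produces the same kind of 0/1 string that the join in the original code builds.

```python
import math


def is_missing(value):
    """True for None or a float NaN, which is how an empty field is assumed to appear."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def one_hot_strings(values, placeholder="empty"):
    """Replace missing values, then map each value to a string of 0s and 1s."""
    filled = [placeholder if is_missing(v) else v for v in values]
    categories = sorted(set(filled))  # safe now: every entry is a str
    return ["".join("1" if v == c else "0" for c in categories) for v in filled]


# Hypothetical State values; the NaN stands in for the empty field.
states = ["CON", float("nan"), "INT", "CON"]
encoded = one_hot_strings(states)
print(encoded)  # ['100', '001', '010', '100']
```

The placeholder becomes a category of its own, so rows with an empty State are still encoded rather than skipped.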

Solution

  • You could put the code in question inside a try block and catch the TypeError exception. In the except block, check whether this is the case where the State field is empty: if so, ignore it as you said; if not, raise the error again.

    If you had posted the actual code that applies the OneHotEncoding to your data, it would be easier to answer and to provide some code in the answer.

    Edit

    The OnehotX variable is only assigned inside the try block, so if fit_transform raises, the assignment never happens and the name does not exist afterwards. You need to define it before this block to fix the error; something like OnehotX = None would work. Also, to reinforce what I said before: in the except block it is good practice to test whether the exception is due to the problem you identified, that is, whether the State field is empty.
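
The two suggestions in this answer can be combined in a short sketch: define the result before the try block, and in the except block re-raise unless the failure matches the anticipated empty-field case. This is plain Python under stated assumptions. encode_labels is a hypothetical stand-in for the encoder call, and an empty field is assumed to appear as a float NaN.

```python
import math


def has_missing(values):
    """True if any value is a float NaN (how an empty field is assumed to appear)."""
    return any(isinstance(v, float) and math.isnan(v) for v in values)


def encode_labels(values):
    """Hypothetical stand-in for an encoder: maps each value to its sorted index."""
    classes = sorted(set(values))  # raises TypeError on mixed str/float
    return [classes.index(v) for v in values]


def safe_encode(values):
    encoded = None  # defined before the try block, so the name exists on every path
    try:
        encoded = encode_labels(values)
    except TypeError:
        if not has_missing(values):
            raise  # not the anticipated problem, so do not hide it
    return encoded


print(safe_encode(["CON", "INT", "CON"]))  # [0, 1, 0]
print(safe_encode(["CON", float("nan")]))  # None
```

A None result still has to be handled by the caller before it is passed on, for example to pd.DataFrame(data=...).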