Tags: python, nlp, one-hot-encoding

How do I one-hot encode the text in a paragraph at the sentence level?


I have sentences stored in a text file that looks like this:

radiologicalreport =1.  MDCT OF THE CHEST   History: A 58-year-old male, known case lung s/p LUL segmentectomy.  Technique: Plain and enhanced-MPR CT chest is performed using 2 mm interval.  Previous study: 03/03/2018 (other hospital)  Findings:   Lung parenchyma: The study reveals evidence of apicoposterior segmentectomy of LUL showing soft tissue thickening adjacent surgical bed at LUL, possibly post operation.

My ultimate goal is to apply LDA to classify each sentence into one topic. Before that, I want to one-hot encode the text. The problem I am facing is that I need the one-hot encoding per sentence, in a NumPy array, to be able to feed it into LDA. If I one-hot encode the full text, I can easily do it with these two lines:

sent_text = nltk.sent_tokenize(text)
hot_encode = pd.Series(sent_text).str.get_dummies(' ')
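As a quick illustration with made-up sentences (not from the report), `get_dummies(' ')` splits each string on spaces and produces one 0/1 indicator column per distinct token, one row per sentence:

```python
import pandas as pd

# Toy sentences standing in for the output of nltk.sent_tokenize(text)
sent_text = ["the cat sat", "the dog ran"]

# One row per sentence, one 0/1 column per distinct whitespace-separated token
hot_encode = pd.Series(sent_text).str.get_dummies(' ')

print(hot_encode.columns.tolist())  # ['cat', 'dog', 'ran', 'sat', 'the']
print(hot_encode.values.tolist())   # [[1, 0, 0, 1, 1], [0, 1, 1, 0, 1]]
```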

However, since my goal is a one-hot encoding per sentence in a NumPy array, I tried the following code:

import nltk
import pandas as pd
from numpy import array
from numpy import argmax
from nltk.tokenize import TweetTokenizer, sent_tokenize
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

with open('radiologicalreport.txt', 'r') as myfile:
    report = myfile.read().replace('\n', '')

# Tokenize each sentence into words
tokenizer_words = TweetTokenizer()
tokens_sentences = [tokenizer_words.tokenize(t)
                    for t in nltk.sent_tokenize(report)]
tokens_np = array(tokens_sentences)

# Integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(tokens_np)

# Binary (one-hot) encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)

I get an error at this line:

integer_encoded = label_encoder.fit_transform(tokens_np)

TypeError: unhashable type: 'list'

and hence cannot proceed further. My tokens_sentences is a nested list: one inner list of word tokens per sentence.

Please help!


Solution

  • You are trying to transform labels to numerical values using fit_transform, but in your example the labels are lists of words (the elements of tokens_sentences).

    Non-numerical labels can be transformed only if they are hashable and comparable (see the docs). Lists are not hashable, but you can convert them to tuples:

    tokens_np = array([tuple(s) for s in tokens_sentences]) 
    # also ok: tokens_np = [tuple(s) for s in tokens_sentences]
    

    and then you can get your sentences encoded in integer_encoded:

    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(tokens_np)