machine-learning deep-learning lstm

Implementation of a deep learning model on NLP tabular data?


I have tabular data in which some of the columns contain text, as you can see in the picture: [image]

After cleaning the text, I converted the text into matrices and the object-dtype columns into categories.

ids_list = ['arg_id','key_point_id']
for cont in ids_list:
    train_df[cont] = train_df[cont].astype('category')

Here I am using the tokenizer.

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
train_df["argument"]=train_df["argument"].apply(lambda x:tokenizer(x)['input_ids'])
train_df["topic"]=train_df["topic"].apply(lambda x:tokenizer(x)['input_ids'])
train_df["key_point"]=train_df["key_point"].apply(lambda x:tokenizer(x)['input_ids'])

The final result after tokenizing: [image] After that, I converted the data into trainX and trainY (the label column).

Here I am lost: how do I implement the LSTM here?

I have seen many examples, but most of them have one feature column and one target column, so I am confused about how to fit this data to an LSTM.

Here is the data if you want to look at it: link to data


Solution

  • It is not clear from your question which columns you will use as features and which column as the target. Assuming you want to predict label from the argument column, you should feed the tokenized data to an LSTM model through an embedding layer. Here is an example for one feature column and one target column with PyTorch:

    import torch as th
    import torch.nn as nn
    
    vocab_size = tokenizer.vocab_size
    EMBED_DIM = 128 # the embedding layer maps each token id to a 128-dimensional vector
    LSTM_HIDDEN_SIZE = 32
    
    class LSTMModelOneColOneTgt(nn.Module):
        def __init__(self):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, EMBED_DIM)
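            # batch_first=True assumes inputs shaped (batch, seq_len); the default expects (seq_len, batch)
            self.lstm = nn.LSTM(EMBED_DIM, LSTM_HIDDEN_SIZE, batch_first=True)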
            ...
        
        def forward(self, x):
            x = self.embedding(x)
            x, _ = self.lstm(x)  # nn.LSTM returns (output, (h_n, c_n)); keep the per-token outputs
            ...
    

    You can flatten or pool the output of the LSTM layer and then feed it into dense layers of your choice to build a classifier. If you want to use more than one column as features, you can either merge these columns into one and use the topology above, or use a separate LSTM layer for each column. For example, to input the argument and topic columns separately:

    import torch as th
    import torch.nn as nn

    vocab_size = tokenizer.vocab_size
    EMBED_DIM = 128 # using the same embedding dims for both column inputs
    LSTM_HIDDEN_SIZE = 32 # using the same hidden size for both column inputs
    
    class LSTMModelTwoColOneTgt(nn.Module):
        def __init__(self):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, EMBED_DIM)
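            # batch_first=True assumes inputs shaped (batch, seq_len)
            self.lstm_argument = nn.LSTM(EMBED_DIM, LSTM_HIDDEN_SIZE, batch_first=True)
            self.lstm_topic = nn.LSTM(EMBED_DIM, LSTM_HIDDEN_SIZE, batch_first=True)
            ...
    
        def forward(self, x_argument, x_topic):
            x_arg = self.embedding(x_argument)
            x_arg, _ = self.lstm_argument(x_arg)  # nn.LSTM returns (output, (h_n, c_n))
            x_topic = self.embedding(x_topic)
            x_topic, _ = self.lstm_topic(x_topic)
            # concatenate along the sequence axis: (batch, len_arg + len_topic, hidden)
            x = th.cat([x_arg, x_topic], dim=1)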
            ...
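
The skeletons above leave out the pooling and dense layers, and the tokenized columns hold lists of different lengths that have to be padded before they can be batched. Below is a minimal sketch of one way to fill those gaps, not a definitive implementation. Several things in it are assumptions that are not in the question: the helper `to_padded_batch`, the class name `TwoColumnLSTMClassifier`, a binary label (`NUM_CLASSES = 2`), and the hard-coded `VOCAB_SIZE` and `PAD_ID`, which match roberta-base but should be read from `tokenizer.vocab_size` and `tokenizer.pad_token_id` in real code. As the pooling step it uses the final hidden state of each LSTM, which is only one of several reasonable choices. The token ids in the usage lines are made up.

```python
import torch as th
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

VOCAB_SIZE = 50265      # roberta-base vocabulary size; prefer tokenizer.vocab_size
PAD_ID = 1              # roberta-base pad token id; prefer tokenizer.pad_token_id
EMBED_DIM = 128
LSTM_HIDDEN_SIZE = 32
NUM_CLASSES = 2         # assumption: the label column is binary


def to_padded_batch(id_lists):
    """Turn variable-length token-id lists into one (batch, max_len) LongTensor."""
    seqs = [th.tensor(ids, dtype=th.long) for ids in id_lists]
    return pad_sequence(seqs, batch_first=True, padding_value=PAD_ID)


class TwoColumnLSTMClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # One embedding table is shared by both text columns.
        self.embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM, padding_idx=PAD_ID)
        self.lstm_argument = nn.LSTM(EMBED_DIM, LSTM_HIDDEN_SIZE, batch_first=True)
        self.lstm_topic = nn.LSTM(EMBED_DIM, LSTM_HIDDEN_SIZE, batch_first=True)
        self.classifier = nn.Linear(2 * LSTM_HIDDEN_SIZE, NUM_CLASSES)

    def forward(self, x_argument, x_topic):
        # nn.LSTM returns (output, (h_n, c_n)); h_n[-1] is the final hidden
        # state of the last layer, shape (batch, hidden).
        _, (h_arg, _) = self.lstm_argument(self.embedding(x_argument))
        _, (h_topic, _) = self.lstm_topic(self.embedding(x_topic))
        features = th.cat([h_arg[-1], h_topic[-1]], dim=1)  # (batch, 2 * hidden)
        return self.classifier(features)                    # raw logits


# Usage with two rows of made-up token ids per column.
arg_batch = to_padded_batch([[0, 100, 200, 2], [0, 300, 2]])
topic_batch = to_padded_batch([[0, 400, 2], [0, 500, 600, 2]])
model = TwoColumnLSTMClassifier()
logits = model(arg_batch, topic_batch)
```

The logits can be trained against the label column with `nn.CrossEntropyLoss`. Note that in this sketch the LSTMs still step over the padding positions; `torch.nn.utils.rnn.pack_padded_sequence` is the usual way to avoid that if it turns out to matter.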