Implementing a deep learning model on NLP tabular data?

I have tabular data in which some of the columns contain text, as you can see in the picture.

After cleaning the text, I converted the text into matrices and the object data types into categories.

ids_list = ['arg_id','key_point_id']
for cont in ids_list:
    train_df[cont] = train_df[cont].astype('category')

Here I am using the tokenizer.

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
train_df["argument"]=train_df["argument"].apply(lambda x:tokenizer(x)['input_ids'])
train_df["topic"]=train_df["topic"].apply(lambda x:tokenizer(x)['input_ids'])
train_df["key_point"]=train_df["key_point"].apply(lambda x:tokenizer(x)['input_ids'])

This is the final result after tokenizing, and after splitting the data into trainX and trainY = label.

Here I am lost: how do I implement the LSTM here?

I have seen many examples, but most of them have one feature column and one target column, so I am confused about how to fit this data to an LSTM.

Here is the data if you want to look at it: link to data

Solution

It is not clear from your question which columns you will use as features and which column as the target. Assuming you want to predict label using the column argument, you should feed the tokenized data to an LSTM model through an embedding layer. Here is an example for one feature column and one target column with PyTorch:

import torch as th
import torch.nn as nn

vocab_size = tokenizer.vocab_size
EMBED_DIM = 128 # the embedding layer will convert each token into a 128-dimensional vector
LSTM_HIDDEN_SIZE = 32

class LSTMModelOneColOneTgt(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, EMBED_DIM)
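        self.lstm = nn.LSTM(EMBED_DIM, LSTM_HIDDEN_SIZE, batch_first=True)  # expects [batch, seq_len, EMBED_DIM]
        ...

    def forward(self, x):
        x = self.embedding(x)  # [batch, seq_len] -> [batch, seq_len, EMBED_DIM]
        x, _ = self.lstm(x)  # nn.LSTM returns (output, (h_n, c_n)); keep the output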
        ...
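
As a rough sketch of how the skeleton above could be filled in, here is one possible complete version that mean-pools the LSTM output and feeds it into a single dense layer. The sizes VOCAB_SIZE, PAD_ID and NUM_CLASSES, the class name LSTMClassifier and the toy token ids are my assumptions so that the snippet runs without downloading a tokenizer; with roberta-base you would take them from tokenizer.vocab_size and tokenizer.pad_token_id. Because your rows have different lengths, they are padded to a common length first.

```python
import torch as th
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# Assumed toy values (not from the question); replace with the tokenizer's values.
VOCAB_SIZE = 1000
PAD_ID = 1
EMBED_DIM = 128
LSTM_HIDDEN_SIZE = 32
NUM_CLASSES = 2  # assumed binary label

class LSTMClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM, padding_idx=PAD_ID)
        self.lstm = nn.LSTM(EMBED_DIM, LSTM_HIDDEN_SIZE, batch_first=True)
        self.fc = nn.Linear(LSTM_HIDDEN_SIZE, NUM_CLASSES)

    def forward(self, x):
        mask = (x != PAD_ID).unsqueeze(-1).float()  # [batch, seq_len, 1]
        out, _ = self.lstm(self.embedding(x))       # [batch, seq_len, hidden]
        # Mean over the real (non-padding) time steps only
        pooled = (out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.fc(pooled)                      # [batch, NUM_CLASSES]

# Token-id lists of different lengths, like the rows of train_df["argument"]
rows = [[0, 15, 27, 2], [0, 8, 2], [0, 99, 45, 63, 7, 2]]
batch = pad_sequence([th.tensor(r) for r in rows],
                     batch_first=True, padding_value=PAD_ID)

model = LSTMClassifier()
logits = model(batch)
print(batch.shape, logits.shape)
```

The logits can then go into nn.CrossEntropyLoss together with the label column. Mean pooling is only one choice; taking the last hidden state is another common option.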

You can flatten or pool the output of the LSTM layer and then feed it into dense layers of your choice to build a classifier. If you want to use more than one column as features, you can either merge these columns into one and use the topology above, or use a separate LSTM layer for each column. For example, to input the argument and topic columns separately:

import torch as th
import torch.nn as nn

vocab_size = tokenizer.vocab_size
EMBED_DIM = 128 # using the same embedding dims for both column inputs
LSTM_HIDDEN_SIZE = 32 # using the same hidden size for both column inputs

class LSTMModelTwoColOneTgt(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, EMBED_DIM)
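        self.lstm_argument = nn.LSTM(EMBED_DIM, LSTM_HIDDEN_SIZE, batch_first=True)
        self.lstm_topic = nn.LSTM(EMBED_DIM, LSTM_HIDDEN_SIZE, batch_first=True)
        ...

    def forward(self, x_argument, x_topic):
        x_arg = self.embedding(x_argument)
        # nn.LSTM returns (output, (h_n, c_n)); h_n[-1] is the final hidden state
        _, (h_arg, _) = self.lstm_argument(x_arg)
        x_topic = self.embedding(x_topic)
        _, (h_topic, _) = self.lstm_topic(x_topic)
        x = th.cat([h_arg[-1], h_topic[-1]], dim=1)  # [batch, 2 * LSTM_HIDDEN_SIZE]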
        ...
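
To show how the two-column idea might look end to end, here is a minimal sketch that pads both columns, runs them through their own LSTMs, concatenates the final hidden states and computes a loss. The names TwoColumnClassifier and pad_rows, the toy ids, the labels and the sizes are hypothetical values chosen for illustration; in your case the ids come from train_df["argument"] and train_df["topic"], and PAD_ID from tokenizer.pad_token_id.

```python
import torch as th
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# Assumed toy values (not from the question); replace with the tokenizer's values.
VOCAB_SIZE, PAD_ID = 1000, 1
EMBED_DIM, LSTM_HIDDEN_SIZE = 128, 32

class TwoColumnClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # One embedding table shared by both text columns
        self.embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM, padding_idx=PAD_ID)
        self.lstm_argument = nn.LSTM(EMBED_DIM, LSTM_HIDDEN_SIZE, batch_first=True)
        self.lstm_topic = nn.LSTM(EMBED_DIM, LSTM_HIDDEN_SIZE, batch_first=True)
        self.fc = nn.Linear(2 * LSTM_HIDDEN_SIZE, num_classes)

    def forward(self, x_argument, x_topic):
        _, (h_arg, _) = self.lstm_argument(self.embedding(x_argument))
        _, (h_topic, _) = self.lstm_topic(self.embedding(x_topic))
        # h_n[-1] has shape [batch, hidden]; concatenate along the feature axis
        x = th.cat([h_arg[-1], h_topic[-1]], dim=1)
        return self.fc(x)

def pad_rows(rows):
    """Pad variable-length id lists into one [batch, max_len] LongTensor."""
    return pad_sequence([th.tensor(r) for r in rows],
                        batch_first=True, padding_value=PAD_ID)

# Two example rows per column (made-up ids) and made-up labels
x_argument = pad_rows([[0, 12, 40, 7, 2], [0, 33, 2]])
x_topic = pad_rows([[0, 5, 2], [0, 5, 9, 2]])
labels = th.tensor([1, 0])

model = TwoColumnClassifier()
logits = model(x_argument, x_topic)
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()  # gradients are now available for an optimizer step
print(logits.shape, loss.item())
```

Note that with right-padded inputs the final hidden state also sees the padding steps. If that turns out to matter, torch.nn.utils.rnn.pack_padded_sequence is the usual way to make the LSTM skip them.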