python, python-3.x, nlp, pytorch, huggingface-transformers

Reduce the number of hidden units in Hugging Face Transformers (BERT)


I have been given a large CSV file, each line of which is a set of BERT token IDs produced with the Hugging Face BertTokenizer (https://huggingface.co/transformers/main_classes/tokenizer.html). One line of this file looks as follows:

101, 108, 31278, 90939, 70325, 196, 199, 71436, 10107, 29190, 10107, 106, 16680, 68314, 10153, 17015, 15934, 10104, 108, 10233, 12396, 14945, 10107, 10858, 11405, 13600, 13597, 169, 57343, 64482, 119, 119, 119, 100, 11741, 16381, 10109, 68830, 10110, 20886, 108, 10233, 11127, 21768, 100, 14120, 131, 120, 120, 188, 119, 11170, 120, 12132, 10884, 10157, 11490, 12022, 10113, 10731, 10729, 11565, 14120, 131, 120, 120, 188, 119, 11170, 120, 162, 11211, 11703, 12022, 11211, 10240, 44466, 100886, 102

There are 9 million lines like this in the file.

Now, I am trying to get embeddings from these tokens like this:

import torch
from transformers import BertTokenizer, BertModel

def embedding(token_ids):
    tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
    model = BertModel.from_pretrained('bert-base-multilingual-cased')
    input_ids = torch.tensor(token_ids).unsqueeze(0)  # batch size 1
    outputs = model(input_ids)
    last_hidden_states = outputs[0][0][0]  # last hidden state of the first ([CLS]) token
    return last_hidden_states

The output of this is the embedding corresponding to the line: a tensor of size 768. Semantically, everything is fine. But when I run this over the full file, the output is 9,000,000 tensors of 768 elements each, so I get a memory error even on a large machine with 768 GB of RAM. Here is how I call the function:

    tokens['embeddings'] = tokens['text_tokens'].apply(lambda x: embedding(x))

tokens is the pandas DataFrame with 9 million rows, each of which contains the BERT tokens for one line.

Is it possible to reduce the default size of the hidden units, which is 768 according to the docs here: https://huggingface.co/transformers/main_classes/model.html ?

Thank you for your help.


Solution

  • Changing the dimensionality would mean changing all the model parameters, i.e., retraining the model. This could be achieved with knowledge distillation, but it would probably still be quite computationally demanding.
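
    As a minimal sketch of why this amounts to retraining: the hidden size is a field of the model configuration, and building a BertModel from a config with a smaller hidden_size yields a randomly initialised network, not a shrunken copy of the pretrained weights. The value 384 below is an arbitrary choice that keeps the size divisible by the 12 attention heads.

        from transformers import BertConfig, BertModel

        # Start from the pretrained configuration and shrink the hidden size.
        config = BertConfig.from_pretrained('bert-base-multilingual-cased')
        config.hidden_size = 384  # must stay divisible by config.num_attention_heads (12)
        config.intermediate_size = 4 * config.hidden_size

        # The resulting model has the smaller dimensionality, but its weights are
        # randomly initialised -- it would have to be trained (or distilled) from scratch.
        small_model = BertModel(config)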

    You can also apply a dimensionality reduction technique to the BERT outputs, such as PCA (available, e.g., in scikit-learn). In that case, I would suggest taking several thousand BERT vectors, fitting the PCA on them, and then applying it to all the remaining vectors, as in the sketch below.
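
    Here is a minimal sketch of the PCA route, assuming the embedding function from the question returns a 768-dimensional torch tensor and that tokens['text_tokens'] holds lists of token IDs; the sample of 10,000 rows and the target of 128 components are arbitrary choices.

        import numpy as np
        from sklearn.decomposition import PCA

        # Fit the PCA on a sample of a few thousand BERT vectors only.
        sample = np.stack([embedding(t).detach().numpy()
                           for t in tokens['text_tokens'].head(10000)])
        pca = PCA(n_components=128)  # target dimensionality, chosen arbitrarily
        pca.fit(sample)

        # Then reduce each embedding as soon as it is computed, so only the
        # 128-dimensional vectors have to be kept, not the full 768-dimensional ones.
        tokens['embeddings'] = tokens['text_tokens'].apply(
            lambda t: pca.transform(embedding(t).detach().numpy().reshape(1, -1))[0]
        )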