python regex string nlp feature-engineering

Suggestions on the best way to combine NLP with some numerical and categorical features

I'm working with a dataset of medicinal products across different countries, with each country having its own data source. As a result, the data is not always 'standardized' (for lack of a better word), so one of the problems I'm trying to solve is getting the dosage into the same format across all countries. I've been doing it 'manually' for each country using regex, while taking into account some criteria that I want to use as features in the model. For example: the number of active substances in the product, the pharmaceutical form, and whether some specific active substance is present in the product. By doing this 'manually' for about 1/3 of the countries, I've got a reasonable number of records to train a model.

Name   ActiveSubstances   NumberOfActSubst   PharmaceuticalForm   Dosage                  DosageFinal
X      ['Y','Z']          2                  Tablet               '20mg/5mg'              '20 mg + 5 mg'
A      ['B']              1                  Tablet               '(50 microg+10mg)/ml'   '50 µg/ml + 10mg/ml'

I want this DosageFinal field to be filled automatically. What would be the best way to approach this task? I looked into parallel networks and the idea would be to use one NN to get the embeddings of the text variables, and another NN to collect the embeddings of the only numeric feature and later concatenate the embeddings. Am I overcomplicating it?

Solution

Embeddings are mainly useful for capturing the semantic meaning of text, which is not really what you need here: you are rewriting a string into a fixed format.

For your situation, I would recommend treating this as a translation task, or as simple text generation.

Generation

Use any decoder model to generate the text in the right format.
Use few-shot learning: put a few input/output examples in the prompt, and the model will usually pick up the pattern.

Do a quick test: go to any free AI chat platform (e.g. HFchat, ChatGPT, etc.), instruct it with a few examples, and check whether you get the right answers.
If you build the prompt carefully, the results can be very good.

Some ideas to help the model: transform each country independently, or each medication, so that the examples in the prompt look like the input you want converted.
The quality of the few-shot examples matters most here.
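
As a rough illustration of the few-shot idea, here is a minimal sketch that builds such a prompt from already-cleaned records. The two examples are the rows from the question; the prompt wording, the field layout, the function names and the query value '10mg/2.5mg' are my own assumptions, and the resulting string would still have to be sent to whichever model you choose.

```python
# Few-shot examples: the two rows shown in the question.
EXAMPLES = [
    {"form": "Tablet", "n_substances": 2,
     "dosage": "20mg/5mg", "final": "20 mg + 5 mg"},
    {"form": "Tablet", "n_substances": 1,
     "dosage": "(50 microg+10mg)/ml", "final": "50 µg/ml + 10mg/ml"},
]


def format_record(form, n_substances, dosage):
    """Serialize the features of one product into a single text line."""
    return f"Form: {form} | Substances: {n_substances} | Dosage: {dosage}"


def build_prompt(examples, form, n_substances, dosage):
    """Build a few-shot prompt: instruction, solved examples, then the query."""
    lines = ["Rewrite each dosage in the standard format.", ""]
    for ex in examples:
        lines.append("Input: " + format_record(ex["form"], ex["n_substances"], ex["dosage"]))
        lines.append("Output: " + ex["final"])
        lines.append("")
    # The query ends with an empty "Output:" for the model to complete.
    lines.append("Input: " + format_record(form, n_substances, dosage))
    lines.append("Output:")
    return "\n".join(lines)


# Made-up query record, only for illustration.
prompt = build_prompt(EXAMPLES, "Tablet", 2, "10mg/2.5mg")
print(prompt)
```

In practice you would probably pick the examples from the same country as the record being converted, since the raw formats differ by source.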

Translation

If you have enough data samples to fine-tune a language model, try a sequence-to-sequence model such as BART or T5.
Trained on your (Dosage, DosageFinal) pairs, it might be able to generate these texts for you.
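
For the translation route, the data preparation could look roughly like the sketch below: each row becomes a (source, target) text pair, which is the shape seq2seq models are usually fine-tuned on. The column names come from the question's table; the "normalize dosage:" prefix, the serialization format and the helper names are assumptions of mine, and the actual fine-tuning step is left out.

```python
import random

# The two rows from the question, standing in for the full labelled dataset.
ROWS = [
    {"PharmaceuticalForm": "Tablet", "NumberOfActSubst": 2,
     "Dosage": "20mg/5mg", "DosageFinal": "20 mg + 5 mg"},
    {"PharmaceuticalForm": "Tablet", "NumberOfActSubst": 1,
     "Dosage": "(50 microg+10mg)/ml", "DosageFinal": "50 µg/ml + 10mg/ml"},
]


def to_pair(row):
    """Fold the categorical and numeric features into the source text."""
    source = (
        "normalize dosage: "
        f"form={row['PharmaceuticalForm']}; "
        f"n={row['NumberOfActSubst']}; "
        f"dosage={row['Dosage']}"
    )
    return {"source": source, "target": row["DosageFinal"]}


def split_pairs(rows, val_fraction=0.2, seed=0):
    """Shuffle the pairs and hold out a fraction for validation."""
    pairs = [to_pair(r) for r in rows]
    random.Random(seed).shuffle(pairs)
    # Keep at least one validation pair when there is more than one row.
    n_val = max(1, int(len(pairs) * val_fraction)) if len(pairs) > 1 else 0
    return pairs[n_val:], pairs[:n_val]


train_pairs, val_pairs = split_pairs(ROWS)
print(train_pairs[0]["source"], "->", train_pairs[0]["target"])
```

Because the features are written into the source string, a single text model sees them all and no separate network for the numeric feature is needed. Holding out whole countries for validation, rather than random rows, would give a better idea of how the model handles a source it has never seen.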

Good luck.