python nlp regex-negation data-extraction

Find negation of particular keywords in text

I am working on information extraction from medical texts (very new to NLP!). At the moment I am interested in finding and extracting medications that appear in a predefined list of drugs. For example, consider the text:

"John was prescribed aspirin due to high temperature"

Given the list of medications (in Python):

list_of_meds = ['aspirin', 'ibuprofen', 'paracetamol']

The extracted drug is aspirin. That's fine.

Now consider another case:

"John was prescribed ibuprofen, because he could not tolerate paracetamol"

Now, if I extract the drugs using the list (for example with a regular expression), the extracted drugs are ibuprofen and paracetamol.
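For reference, here is a minimal sketch of that list-based extraction. The helper name extract_meds is only for illustration; the point is that a plain keyword match returns both drugs and carries no information about how each one is used in the sentence.

```python
import re

list_of_meds = ['aspirin', 'ibuprofen', 'paracetamol']

# One alternation over the whole list, anchored on word boundaries.
# re.escape guards against drug names that contain regex metacharacters.
pattern = re.compile(
    r'\b(' + '|'.join(map(re.escape, list_of_meds)) + r')\b',
    re.IGNORECASE,
)

def extract_meds(text):
    """Return every listed drug mentioned in text, in order of appearance."""
    return [m.group(1).lower() for m in pattern.finditer(text)]

text = "John was prescribed ibuprofen, because he could not tolerate paracetamol"
print(extract_meds(text))  # ['ibuprofen', 'paracetamol']
```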

QUESTION: How can I separate the drugs that were actually prescribed from those that were not tolerated? Is there a way to label prescribed (used) drugs separately from other mentioned drugs?

Solution

This is a complex problem. To capture the nuances of negation, you need to step into the world of dependency parsing and relationship extraction. A couple of paths you can take to add sophistication to your current approach and the add-on by @Jordan are:

Use a relationship extraction NLP library (e.g. Watson, CoreNLP, spaCy) and train it on example sentences like the one you gave, so that it extracts relation triples such as (John, prescribed, ibuprofen) and (John, not tolerate, paracetamol). This requires an investment in annotating sample data.
Roll your own relationship extractor, starting from the dependency parse, which shows how the different parts of the sentence are related. This requires both programming time and training.
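Before investing in either path, it can help to have a crude baseline to compare against. The sketch below is not a dependency parse and not relationship extraction. It is a much simpler clause-level heuristic, swapped in for illustration: split the sentence into rough clauses, then mark a drug as negated if a negation cue appears before it in the same clause. The cue list, the clause splitter, the label names and the function name label_meds are all assumptions made for this example.

```python
import re

MEDS = ['aspirin', 'ibuprofen', 'paracetamol']

# Drug names from the list, matched on word boundaries.
med_re = re.compile(r'\b(' + '|'.join(map(re.escape, MEDS)) + r')\b', re.IGNORECASE)

# Hypothetical, deliberately small set of negation cues.
cue_re = re.compile(r"\b(?:not|no|never|without|cannot)\b|n't\b", re.IGNORECASE)

# Very rough clause boundaries: punctuation and a few conjunctions.
# A dependency parser would replace this step in a serious system.
clause_split_re = re.compile(r'[,;.]|\b(?:because|but|although|however)\b', re.IGNORECASE)

def label_meds(text):
    """Return (drug, label) pairs, where label is 'affirmed' or 'negated'."""
    results = []
    for clause in clause_split_re.split(text):
        for m in med_re.finditer(clause):
            # Only cues that precede the drug inside the same clause count.
            negated = cue_re.search(clause[:m.start()]) is not None
            results.append((m.group(1).lower(), 'negated' if negated else 'affirmed'))
    return results

text = "John was prescribed ibuprofen, because he could not tolerate paracetamol"
print(label_meds(text))  # [('ibuprofen', 'affirmed'), ('paracetamol', 'negated')]
```

A heuristic like this breaks as soon as the negation scope crosses a clause boundary or is expressed without an explicit cue (for example "paracetamol was stopped"), which is exactly where the parse-based approaches above earn their cost.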

Handling negation in relations is not a solved problem. The state of the art in this area is usually associated with sentiment analysis. An introduction to using dependency parsing to identify and handle negation is available on the Stanford NLP Sentiment Analysis using RNN page.