Tags: python, pandas, nlp, spacy, n-gram

Python NLP spaCy: improve bi-gram extraction from a dataframe, and with named entities?


I am using Python and spaCy as my NLP library, working on a big dataframe that contains feedback about different cars, which looks like this:

[Screenshot of the dataframe showing the 'feedback', 'lemmatized' and 'entities' columns]

  • the 'feedback' column contains the natural language text to be processed,
  • the 'lemmatized' column contains the lemmatized version of the feedback text,
  • the 'entities' column contains the named entities extracted from the feedback text (I've trained the pipeline so that it recognises car models and brands, labelling them as 'CAR_BRAND' and 'CAR_MODEL').

I then created the following function, which applies the spaCy nlp_token pipeline to each row of my dataframe and extracts any [noun + verb], [verb + noun], [adj + noun], [adj + proper noun] combinations.

def bi_gram(x):
    doc = nlp_token(x)
    result = []
    text = ''
    for i in range(len(doc)):
        j = i+1
        if j < len(doc):
            if (doc[i].pos_ == "NOUN" and doc[j].pos_ == "VERB") or (doc[i].pos_ == "VERB" and doc[j].pos_ == "NOUN") or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "NOUN") or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "PROPN"):
                text = doc[i].text + " " + doc[j].text
                result.append(text)
        i = i+1
        return result

Then I applied this function to the 'lemmatized' column:

df['bi_gram'] = df['lemmatized'].apply(bi_gram)

This is where I have a problem...

  1. This is producing at most one bigram per row. How can I tweak the code so that more than one bigram can be extracted and put in the column? (Also, are there other linguistic combinations I should try?)

  2. Is there a way to find out what people are saying about the 'CAR_BRAND' and 'CAR_MODEL' named entities extracted in the 'entities' column? For example 'Cool Porsche'. Some brands or models are made up of more than two words, so that's tricky to tackle.

I am very new to NLP. If there is a more efficient way to tackle this, any advice will be super helpful! Many thanks in advance for your help.


Solution

  • spaCy has a built-in pattern matching engine that's perfect for your application – it's documented here and in a more extensive usage guide. It allows you to define patterns in a readable and easy-to-maintain way, as lists of dictionaries that define the properties of the tokens to be matched.

    Set up the pattern matcher

    import spacy
    from spacy.matcher import Matcher
    
    nlp = spacy.load("en_core_web_sm") # or whatever model you choose
    
    matcher = Matcher(nlp.vocab)
    
    # your patterns
    patterns = {
        "noun_verb": [{"POS": "NOUN"}, {"POS": "VERB"}],
        "verb_noun": [{"POS": "VERB"}, {"POS": "NOUN"}],
        "adj_noun": [{"POS": "ADJ"}, {"POS": "NOUN"}],
        "adj_propn": [{"POS": "ADJ"}, {"POS": "PROPN"}],
    }
    
    # add the patterns to the matcher
    for pattern_name, pattern in patterns.items():
        matcher.add(pattern_name, [pattern])
    

    Extract matches

    doc = nlp("The dog chased cats. Fast cats usually escape dogs.")
    matches = matcher(doc)
    

    matches is a list of tuples, each containing

    • the match id,
    • the start token index of the matched span, and
    • the end token index (exclusive).

    This is a test output adapted from the spaCy usage guide:

    for match_id, start, end in matches:
        
        # Get string representation
        string_id = nlp.vocab.strings[match_id]
    
        # The matched span
        span = doc[start:end]
        
        print(repr(span.text))
        print(match_id, string_id, start, end)
        print()
    

    Result

    'dog chased'
    1211260348777212867 noun_verb 1 3
    
    'chased cats'
    8748318984383740835 verb_noun 2 4
    
    'Fast cats'
    2526562708749592420 adj_noun 5 7
    
    'escape dogs'
    8748318984383740835 verb_noun 8 10
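
    Apply it to the dataframe

    To get every matched bigram into a column (your first question), wrap the matcher in a small helper and apply it to the text column. This is only a minimal sketch – it assumes your dataframe is called df and the text lives in the 'lemmatized' column as in your screenshot, and extract_bigrams is just an illustrative name:

    def extract_bigrams(text):
        doc = nlp(text)
        # collect the text of every matched span, so each row ends up with a full list of bigrams
        return [doc[start:end].text for _, start, end in matcher(doc)]

    df['bi_gram'] = df['lemmatized'].apply(extract_bigrams)

    For a big dataframe it's usually faster to let nlp.pipe process the texts in batches instead of calling nlp() row by row:

    df['bi_gram'] = [
        [doc[start:end].text for _, start, end in matcher(doc)]
        for doc in nlp.pipe(df['lemmatized'])
    ]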
    

    Some ideas for improvement

    • Named entity recognition should be able to detect multi-word expressions, so brand and model names that consist of more than one token shouldn't be an issue if everything is set up correctly (see the entity pattern sketch after this list).
    • Matching dependency patterns instead of linear patterns might slightly improve your results (see the DependencyMatcher sketch below).
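
    For your second question, the Matcher can also refer to entity labels via the ENT_TYPE token attribute, so you can look for adjectives sitting right in front of your custom car entities even when the entity spans several tokens. A rough sketch, assuming your custom NER component labels tokens as 'CAR_BRAND' / 'CAR_MODEL' (the pattern name adj_car_entity is just illustrative):

    # an adjective followed by one or more tokens that belong to a car entity,
    # e.g. "cool Porsche" or "reliable Land Rover Defender"
    entity_pattern = [
        {"POS": "ADJ"},
        {"ENT_TYPE": {"IN": ["CAR_BRAND", "CAR_MODEL"]}, "OP": "+"},
    ]

    # greedy="LONGEST" keeps only the longest match where the "+" operator
    # would otherwise also return the shorter, overlapping ones
    matcher.add("adj_car_entity", [entity_pattern], greedy="LONGEST")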

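    If you want to try the dependency idea, spaCy also ships a DependencyMatcher that matches on the parse tree instead of the linear token order. A small sketch – the verb_object pattern here is just an example, not a replacement for the bigram patterns above:

    from spacy.matcher import DependencyMatcher

    dep_matcher = DependencyMatcher(nlp.vocab)

    verb_object = [
        # anchor token: a verb
        {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
        # a noun that is the direct object of that verb, however far apart they are in the sentence
        {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "object", "RIGHT_ATTRS": {"DEP": "dobj"}},
    ]
    dep_matcher.add("verb_object", [verb_object])

    for match_id, token_ids in dep_matcher(doc):
        # token_ids are token indices in the same order as the pattern entries
        print(nlp.vocab.strings[match_id], [doc[i].text for i in token_ids])
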
    That being said, what you're trying to do – a kind of sentiment analysis – is quite a difficult task that's normally tackled with machine learning approaches and heaps of training data. So don't expect too much from simple heuristics.