
Obtaining the index of a word between two columns in pandas


I am checking which words the spaCy Spanish lemmatizer can handle, using the .has_vector attribute. My DataFrame has two columns: one holds the output of a function that indicates which words can be lemmatized, and the other holds the corresponding phrase.

I would like to know how I can extract all the words that have a False output, so that I can correct them and then lemmatize.

So I created the function:

def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([str(word.has_vector) for word in doc])

And applied it to the reviews column of the DataFrame:

df["Vectors"] = df.reviews.apply(lemmatizer)

Then I put both columns in another DataFrame:

df2 = pd.DataFrame(df[['Vectors', 'reviews']])

The output is

index             Vectors              reviews
  1     True True True False        'La pelicula es aburridora'

Solution

  • Two ways to do this:

    import pandas
    import spacy
    
    nlp = spacy.load('en_core_web_lg')
    df = pandas.DataFrame({'reviews': ["aaabbbcccc some example words xxxxyyyz"]})
    

    If you want to use has_vector:

    def get_oov1(text):
        return [word.text for word in nlp(text) if not word.has_vector]
    

    Alternatively you can use the is_oov attribute:

    def get_oov2(text):
        return [word.text for word in nlp(text) if word.is_oov]
    

    Then as you already did:

    df["oov_words1"] = df.reviews.apply(get_oov1)
    df["oov_words2"] = df.reviews.apply(get_oov2)
    

    Which will return:

                                      reviews              oov_words1              oov_words2
    0  aaabbbcccc some example words xxxxyyyz  [aaabbbcccc, xxxxyyyz]  [aaabbbcccc, xxxxyyyz]
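Since the question title asks for the index of each word, a variation of get_oov1 could return the token position alongside the text. This is only a minimal sketch and not part of the original answer: the name oov_with_positions is hypothetical, and the pipeline is passed in as a parameter that is assumed to be a loaded spaCy pipeline such as nlp above. Token.i is the token's index within the parsed document.

```python
def oov_with_positions(text, nlp):
    """Return (token index, token text) pairs for tokens without a vector.

    `nlp` is assumed to be a loaded spaCy pipeline (hypothetical parameter
    name); `token.i` is the position of the token inside the parsed doc.
    """
    return [(token.i, token.text) for token in nlp(text) if not token.has_vector]
```

It could be applied in the same way as the functions above, for example df["oov_positions"] = df.reviews.apply(lambda t: oov_with_positions(t, nlp)). The positions count spaCy tokens, so they may differ from a plain whitespace split when the text contains punctuation.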
    

    Note:

    With either approach, keep in mind that the result depends on the model. The smaller models usually ship without the word vectors that back these attributes, so every token gets the same value regardless of the word!

    That means when you run the exact same code but e.g. with en_core_web_sm you get this:

                                      reviews  oov_words1                                    oov_words2
    0  aaabbbcccc some example words xxxxyyyz          []  [aaabbbcccc, some, example, words, xxxxyyyz]
    

    This happens because the small model does not ship a word-vector table, so neither attribute carries per-word information: has_vector comes out True for every token, and is_oov comes out True for every token as well. As a result, with has_vector every word wrongly appears as known (the list is empty), and with is_oov every word wrongly appears as unknown.
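
One way to avoid being misled is to check up front whether the loaded pipeline carries a word-vector table at all. The sketch below is an illustration under stated assumptions, not part of the original answer: has_static_vectors is a hypothetical helper name, and it relies on nlp.vocab.vectors.shape, whose first entry is the number of vector rows.

```python
def has_static_vectors(nlp):
    """Return True if the pipeline's vocab holds at least one word vector.

    `nlp` is assumed to be a loaded spaCy pipeline; the helper name is
    hypothetical and only used for illustration.
    """
    n_rows = nlp.vocab.vectors.shape[0]  # shape is (number of rows, vector width)
    return n_rows > 0
```

If this returns False for a pipeline, the has_vector and is_oov results from that pipeline are best not used to decide which words need correcting.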