python, nlp, mixed-case

Extract proper nouns and corresponding sentences from a dataset using python


I have a dataset containing a list of sentences that have both proper nouns and common nouns in them. Example:

  1. Google is a website
  2. the universe is expanding constantly
  3. I wish I had bagels for bReakfasT
  4. the GUITAR sounded a bit off-key
  5. The rumors of Moira Rose's death are greatly exaggerated online
  6. Mahatma Gandhi was a national treasure
  7. i strongly believe that beyonce is overrated

The casing of the dataset can also be mixed.

I want to extract all the proper nouns and the corresponding sentences where they appear into two separate columns.

Is there any way to do this in Python? I am quite new to the concepts of NLP and to Python overall. Thanks!


Solution

  • You can try any language model, such as spaCy or NLTK, as mentioned by @ivanp.

    I just used a spaCy model:

     import pandas as pd
     import spacy
     import string

     nlp = spacy.load("en_core_web_sm")  # load the pretrained English model

     def proper_noun_extraction(x):
         # Normalize mixed casing so that every word starts with a capital letter
         doc = nlp(string.capwords(x))
         # Keep only the tokens tagged as proper nouns
         prop_noun = [tok.text for tok in doc if tok.pos_ == 'PROPN']
         if prop_noun:
             return (' '.join(prop_noun), x)
         return ('no proper noun found', None)

     # df is the DataFrame that holds the dataset in a 'sentence' column
     tuple_noun_sent = df['sentence'].apply(proper_noun_extraction)

     resultant_df = pd.DataFrame(tuple_noun_sent.tolist(), columns=['proper_noun', 'sentence'])
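
The casing step is easy to overlook, so here is a small sketch of what `string.capwords` does to a few of the sentences from the question before they reach the tagger. It uses only the standard library; the helper name `normalize_case` is made up for illustration and is not part of the answer above.

```python
import string

# Sample sentences copied from the question.
sentences = [
    "I wish I had bagels for bReakfasT",
    "the GUITAR sounded a bit off-key",
    "i strongly believe that beyonce is overrated",
]

def normalize_case(sentence):
    """Hypothetical helper: capitalize the first letter of every
    whitespace-separated word and lowercase the remaining letters."""
    return string.capwords(sentence)

normalized = [normalize_case(s) for s in sentences]

for original, fixed in zip(sentences, normalized):
    print(f"{original!r} -> {fixed!r}")
```

Since every word ends up capitalized, casing no longer separates "Beyonce" from "Believe", so the tagger may label some common nouns as PROPN as well. It is probably worth checking the output on a sample of the dataset before relying on it.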
    
