Tags: regex, python-3.x, nltk, spacy, data-extraction

Extracting the main subject from a sentence in Python


I am trying to extract the main subject from each sentence in a text file. For example, the file contains the following data:

I never used tobacco
They smoke tobacco
I do not like today's weather
Good weather
Exercise 3 to 4 times a week
No exercise
Family history of Cancer
No Cancer
,,· Alcohol use
Amazing football match
Pathetic football match
Has Depression

I have to extract the main subject and print it as follows:

I never used tobacco | Tobacco | False
They smoke tobacco | Tobacco | True
I do not like today's weather | Weather | False
Good weather | Weather | True
Exercise 3 to 4 times a week | Exercise | True
No exercise | Exercise | False
Family history of Cancer | Cancer | True
No Cancer | Cancer | False
,,· Alcohol use. | Alcohol | True
Amazing football match | Football Match | True
Pathetic football match | Football Match | False
Has Depression | Depression | True

I am trying spaCy for this but cannot get the desired output. I tokenized the sentences with spaCy and then used part-of-speech tagging to extract the nouns, but I am still not getting what is required. Can anyone explain how this could be done?


Solution

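  • There is no exact solution to this, but the code below, which I used, is somewhat helpful. For each line it writes the True/False flag first, followed by the nouns, adjectives, and proper nouns that spaCy finds in the line: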

    from collections import Counter
    import spacy

    nlp = spacy.load('en_core_web_md')

    def read_words_from_file(path):
        # The original helper is not shown; assumed here to return the
        # whitespace-separated words of the file as a set
        with open(path) as f:
            return set(f.read().split())

    negatedwords = read_words_from_file('false.txt')  # file containing all the negation words

    def process_line(line, file_write):
        # line: one sentence from the input file
        # file_write: an open, writable file object for the output
        count = Counter(line.split())
        negated_word_found = False
        for key in count:
            key = key.rstrip('.,?!\n')  # remove trailing punctuation
            if key in negatedwords:
                negated_word_found = True

        file_write.write("False" if negated_word_found else "True")
        file_write.write(" | ")

        document = nlp(line)
        for word in document:
            # 'use' is tagged as NOUN by spaCy, so it is excluded explicitly
            if word.pos_ in ("NOUN", "ADJ", "PROPN") and word.text != "use":
                file_write.write(word.text)
                file_write.write(' ')
        file_write.write('\n')  # end the record for this line
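
The snippet above writes the flag before the subject words, whereas the question asks for `sentence | Subject | True/False`. The following is a minimal sketch of only the formatting step, assuming the subject words and the negation flag have already been computed elsewhere. `format_record` is a hypothetical helper name and is not part of the original answer.

```python
def format_record(sentence, subject_words, negated):
    """Build one 'sentence | Subject | flag' line in the order the question asks for."""
    # Capitalize each subject word, e.g. ['football', 'match'] -> 'Football Match'
    subject = " ".join(w.capitalize() for w in subject_words)
    # A negated sentence maps to False, anything else to True
    return f"{sentence.strip()} | {subject} | {not negated}"


print(format_record("I never used tobacco\n", ["tobacco"], True))
print(format_record("Amazing football match", ["football", "match"], False))
```

Building the whole record as one string before writing it also makes it easier to check a line in isolation than a sequence of separate `write` calls.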
    
    false.txt (one negation word per line):
    never
    Never
    no
    No
    NO
    not
    NOT
    Not
    NEVER
    don't
    Don't
    DON'T
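
The word list above repeats each negation in several capitalizations. One possible simplification, sketched here under the assumption that matching should ignore case, is to lower-case each token before the lookup so that each word only needs to be listed once. `NEGATION_WORDS` and `is_negated` are hypothetical names introduced for illustration, not part of the original answer.

```python
# Hypothetical stand-in for false.txt: each negation word listed once, in lower case
NEGATION_WORDS = {"never", "no", "not", "don't"}


def is_negated(sentence, negation_words=NEGATION_WORDS):
    """Return True if any token, lower-cased and stripped of punctuation, is a negation word."""
    for token in sentence.split():
        # Strip leading/trailing punctuation; inner apostrophes (don't) are kept
        cleaned = token.strip(".,?!;:").lower()
        if cleaned in negation_words:
            return True
    return False


for s in ["I NEVER used tobacco", "No exercise", "They smoke tobacco", "I don't smoke."]:
    print(s, "|", not is_negated(s))
```

Like the original word-list check, this only detects explicit negation words. A line such as "Pathetic football match" would still come out as True, because its negative meaning comes from sentiment rather than from a negation word.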