python parsing nlp lexical-analysis

Parsing Meaning from Text


I realize this is a broad topic, but I'm looking for a good primer on parsing meaning from text, ideally in Python. As an example of what I'm looking to do, if a user makes a blog post like:

"Manny Ramirez makes his return for the Dodgers today against the Houston Astros",

what's a lightweight, easy way of getting the nouns out of a sentence? To start, I think I'd limit it to proper nouns, but I wouldn't want to be limited to just that (and I don't want to rely on a simple regex that assumes anything in Title Case is a proper noun).

To make this question even worse, what are the things I'm not asking that I should be? Do I need a corpus of existing words to get started? What lexical analysis stuff do I need to know to make this work? I did come across one other question on the topic and I'm digging through those resources now.


Solution

  • Use NLTK, in particular chapter 7 of the NLTK book, on Information Extraction.

    You say you want to extract meaning, and there are modules for semantic analysis, but I think information extraction (IE) is all you need, and it is honestly one of the few areas of NLP that computers handle reasonably well right now.

    See sections 7.5 and 7.6 on the subtopics of Named Entity Recognition (to chunk and categorize Manny Ramirez as a person, Dodgers as a sports organization, and Houston Astros as another sports organization, or whatever suits your domain) and Relationship Extraction. There is an NER chunker that you can plug in once you have NLTK installed. From their examples, extracting a geo-political entity (GPE) and a person:

    >>> import nltk
    >>> sent = nltk.corpus.treebank.tagged_sents()[22]
    >>> print(nltk.ne_chunk(sent))
    (S
      The/DT
      (GPE U.S./NNP)
      is/VBZ
      one/CD
      ...
      according/VBG
      to/TO
      (PERSON Brooke/NNP T./NNP Mossman/NNP)
      ...)
    

    Note you'll still need to know tokenization and tagging, as discussed in earlier chapters, to get your text in the right format for these IE tasks.
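
As a rough illustration of that tokenize-then-tag step applied to the sentence from the question, here is a minimal sketch. The helper names `tag_text` and `proper_noun_phrases` are made up for this example, and the tags in the demo list are written by hand, so a real tagger may label some words differently. `tag_text` assumes NLTK and its tokenizer and tagger data are installed; the grouping helper is plain Python and only assumes Penn Treebank tags (`NNP`, `NNPS`).

```python
def tag_text(text):
    """Tokenize and POS-tag raw text with NLTK.

    Assumes NLTK plus its tokenizer and tagger data are installed.
    """
    import nltk  # imported lazily so the helper below works without NLTK
    return nltk.pos_tag(nltk.word_tokenize(text))


def proper_noun_phrases(tagged_tokens):
    """Group runs of consecutive proper-noun tokens (NNP/NNPS) into phrases."""
    phrases, current = [], []
    for word, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            current.append(word)
        elif current:
            # A non-proper-noun token ends the current run
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hand-tagged version of the example sentence (an assumption, not tagger output)
tagged = [
    ("Manny", "NNP"), ("Ramirez", "NNP"), ("makes", "VBZ"), ("his", "PRP$"),
    ("return", "NN"), ("for", "IN"), ("the", "DT"), ("Dodgers", "NNPS"),
    ("today", "NN"), ("against", "IN"), ("the", "DT"),
    ("Houston", "NNP"), ("Astros", "NNP"),
]
print(proper_noun_phrases(tagged))
# ['Manny Ramirez', 'Dodgers', 'Houston Astros']
```

With NLTK set up, `proper_noun_phrases(tag_text(your_sentence))` would run the same grouping on real tagger output. This only finds proper-noun runs; it does not categorize them as people or organizations, which is what the NER chunker above adds.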