python, nlp, nltk, text-mining, text-classification

NLTK title classifier


Apologies in advance if this has already been asked and answered; I couldn't find any answer close to my problem. I'm also fairly new to Python, so sorry too for the long post.

I am trying to build a Python script that, based on a user-given PubMed query (e.g., "cancer"), retrieves a file with N article titles and evaluates their relevance to the subject in question.

I have successfully built the "pubmed search and save" part: it returns a .txt file containing article titles, one per line (a quick sketch of this step follows the examples). For instance:

Feasibility of an ovarian cancer quality-of-life psychoeducational intervention.

A randomized trial to increase physical activity in breast cancer survivors.
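
For context, here is a minimal sketch of that search-and-save step (assuming Biopython's Entrez module; the function name and e-mail are placeholders, and my actual script is longer):

from Bio import Entrez

Entrez.email = "my.name@example.com"  # NCBI requires an e-mail address

def fetch_titles(query, n, outfile):
    # search PubMed and collect up to n article IDs
    handle = Entrez.esearch(db="pubmed", term=query, retmax=n)
    id_list = Entrez.read(handle)["IdList"]
    handle.close()
    # fetch the records in MEDLINE format and keep one title per line
    handle = Entrez.efetch(db="pubmed", id=id_list,
                           rettype="medline", retmode="text")
    with open(outfile, "w") as out:
        for line in handle:
            if line.startswith("TI  - "):
                out.write(line[len("TI  - "):])
    handle.close()

fetch_titles("cancer", 100, "SR_titles.txt")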

Having this file, the idea is to feed it to a classifier and have it decide whether the titles in the .txt file are relevant to a subject, for which I have a "gold standard" of titles that I know are relevant (i.e., I want to know the precision and recall of the queried set of titles against my gold standard). For example: Title 1 contains the word "neoplasm" X times and "study" N times, so it is considered relevant to "cancer" (Y/N).
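
In other words, the evaluation I am after is roughly this (a toy sketch; gold_standard.txt is a made-up name for my gold-standard file):

def load_titles(path):
    # one title per line, as in the file above
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

retrieved = load_titles("SR_titles.txt")
gold = load_titles("gold_standard.txt")

true_positives = retrieved & gold
precision = float(len(true_positives)) / len(retrieved)
recall = float(len(true_positives)) / len(gold)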

For this, I have been using NLTK to (try to) classify my text. I have pursued two different approaches, both without success:

Approach 1

Loading the .txt file, preprocessing it (tokenization, lower-casing, removing stopwords), converting the text to NLTK text format, finding the N most-common words. All this runs without problems.

import nltk
from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords

f = open('SR_titles.txt')
raw = f.read()
f.close()
tokens = word_tokenize(raw)
words = [w.lower() for w in tokens]
words = [w for w in words if w not in stopwords.words("english")]
text = nltk.Text(words)
fdist = FreqDist(text)
>>><FreqDist with 116 samples and 304 outcomes>

I am also able to find collocations/bigrams in the text, which might be important later on.

text.collocations()
>>>randomized controlled; breast cancer; controlled trial; physical
>>>activity; metastatic breast; prostate cancer; randomised study; early
>>>breast; cancer patients; feasibility study; psychosocial support;
>>>group psychosocial; group intervention; randomized trial

Following NLTK's tutorial, I built a feature extractor so that the classifier will know which aspects of the data it should pay attention to.

def document_features(document):
    # word_features holds the words the classifier should look for
    # (built from the most frequent words; see below)
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

This would, for instance, return something like this:

{'contains(series)': False, 'contains(disorders)': False,
'contains(group)': True, 'contains(neurodegeneration)': False,
'contains(human)': False, 'contains(breast)': True}
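
(For completeness: word_features is the list of the most frequent words taken from the FreqDist above, along the lines of NLTK's tutorial; the cutoff of 100 is arbitrary.)

# the 100 most frequent words in the titles, from the FreqDist above
word_features = [word for word, count in fdist.most_common(100)]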

The next thing would be to use the feature extractor to train a classifier to label new article titles, and following NLTK's example, I tried this:

featuresets = [(document_features(d), c) for (d,c) in text]

Which gives me the error:

ValueError: too many values to unpack

I quickly googled this and found that it has something to do with tuples, but I couldn't figure out how to solve it (like I said, I'm fairly new to this), other than by creating a categorized corpus (I would still like to understand how to solve this tuple problem).

Therefore, I tried approach 2, following Jacob Perkins's Text Processing with NLTK Cookbook:

I started by creating a corpus and assigning categories. This time I had two different .txt files, one for each subject of article titles.

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

reader = CategorizedPlaintextCorpusReader('.', r'.*\.txt',
    cat_map={'hd_titles.txt': ['HD'], 'SR_titles.txt': ['Cancer']})

With "reader.raw()" I get something like this:

u"A pilot investigation of a multidisciplinary quality of life intervention for men with biochemical recurrence of prostate cancer.\nA randomized controlled pilot feasibility study of the physical and psychological effects of an integrated support programme in breast cancer.\n"

The categories for the corpus seem to be right:

reader.categories()
>>>['Cancer', 'HD']

Then, I try to construct a list of documents, labeled with the appropriate categories:

documents = [(list(reader.words(fileid)), category)
          for category in reader.categories()
          for fileid in reader.fileids(category)]

Which returns me something like this:

[([u'A', u'pilot', u'investigation', u'of', u'a', u'multidisciplinary',
u'quality', u'of', u'life', u'intervention', u'for', u'men', u'with', 
u'biochemical', u'recurrence', u'of', u'prostate', u'cancer', u'.'], 
'Cancer'), 
 ([u'Trends', u'in', u'the', u'incidence', u'of', u'dementia', u':', 
u'design', u'and', u'methods', u'in', u'the', u'Alzheimer', u'Cohorts', 
u'Consortium', u'.'], 'HD')]

The next step would be creating a list of labeled feature sets. For this I used the following function, which takes a corpus and a feature_detector function (the document_features function above). It constructs and returns a mapping of the form {label: [featureset]}.

import collections

def label_feats_from_corpus(corp, feature_detector=document_features):
    label_feats = collections.defaultdict(list)
    for label in corp.categories():
        for fileid in corp.fileids(categories=[label]):
            feats = feature_detector(corp.words(fileids=[fileid]))
            label_feats[label].append(feats)
    return label_feats

lfeats = label_feats_from_corpus(reader)
>>>defaultdict(<type 'list'>, {'HD': [{'contains(series)': True, 
'contains(disorders)': True, 'contains(neurodegeneration)': True, 
'contains(anilinoquinazoline)': True}], 'Cancer': [{'contains(cancer)': 
True, 'contains(of)': True, 'contains(group)': True, 'contains(After)': 
True, 'contains(breast)': True}]})

(The list is a lot bigger, and everything is set to True.)

Then I want to construct a list of labeled training instances and testing instances.

The split_label_feats() function takes a mapping returned from label_feats_from_corpus() and splits each list of feature sets into labeled training and testing instances.

def split_label_feats(lfeats, split=0.75):
    train_feats = []
    test_feats = []
    for label, feats in lfeats.items():
        cutoff = int(len(feats) * split)
        train_feats.extend([(feat, label) for feat in feats[:cutoff]])
        test_feats.extend([(feat, label) for feat in feats[cutoff:]])
    return train_feats, test_feats

train_feats, test_feats = split_label_feats(lfeats, split=0.75)
len(train_feats)
>>>0
len(test_feats)
>>>2
print(test_feats)
>>>[({'contains(series)': True, 'contains(China)': True, 
'contains(disorders)': True, 'contains(neurodegeneration)': True}, 
'HD'), ({'contains(cancer)': True, 'contains(of)': True, 
'contains(group)': True, 'contains(After)': True, 'contains(breast)': 
True}, 'Cancer')]

I should've ended up with a lot more labeled training instances and labeled testing instances, I guess.

This brings me to where I am now. I searched Stack Overflow, Biostars, etc., and could not find a way to deal with either problem, so any help would be deeply appreciated.

TL;DR: Can't label a single .txt file to classify text, and can't get a corpus correctly labeled (again, to classify text).

If you've read this far, thank you as well.


Solution

  • You're getting an error on the following line:

    featuresets = [(document_features(d), c) for (d,c) in text]
    

    Here, you are supposed to convert each document (i.e. each title) to a dictionary of features. But to train with the results, the train() method needs both the feature dictionaries and the correct answer ("label"). So the normal workflow is to have a list of (document, label) pairs, which you transform into (features, label) pairs. It looks like your variable documents has the right structure, so if you just use it instead of text, this should work correctly:

    featuresets = [(document_features(d), c) for (d,c) in documents]
    

    As you go forward, get in the habit of inspecting your data carefully and figuring out what will (and should) happen to them. If text is a list of titles, it makes no sense to unpack each title into a pair (d, c). That should have pointed you in the right direction.
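
    Once documents is in place, the rest of the usual NLTK workflow looks roughly like this (a sketch; Naive Bayes is just one possible classifier, and the 75/25 split mirrors the one you used above):

    import random
    import nltk

    random.shuffle(documents)  # don't keep all of one category together
    featuresets = [(document_features(d), c) for (d, c) in documents]

    # hold out 25% of the instances for testing
    cutoff = int(len(featuresets) * 0.75)
    train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))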