Tags: python, r, parsing, nltk, reticulate

R - Parsing Python NLTK Trees via Reticulate


I am trying to make use of Python's NLTK package from within R using the Reticulate package. For the most part, I have been successful.

Now I would like to perform named entity recognition (i.e. to determine which tokens represent named entities and what type of named entity they represent) using NLTK's ne_chunk() function. My problem is that the function returns an object of the class nltk.tree.Tree, which I cannot figure out how to parse in R.

If ne_chunk() is fed up to ten token-tag pairs, it returns a result that can be converted into a character string with as.character() and then parsed with regular expression functions (this is just a hack and I am not satisfied with it). With more than ten pairs, however, it returns a shorthand representation of the tree, from which no meaningful data can be extracted using R methods.

Here is a minimal reproducible example:

library(reticulate)
nltk <- import("nltk")

sent_tokenize <- function(text, language = "english") {
  nltk$tokenize$sent_tokenize(text, language)
}
word_tokenize <- function(text, language = "english", preserve_line = FALSE) {
  nltk$tokenize$word_tokenize(text, language, preserve_line)
}
pos_tag <- function(tokens, tagset = NULL, language = "eng") {
  nltk$pos_tag(tokens, tagset, language)
}
ne_chunk <- function(tagged_tokens, binary = FALSE) {
  nltk$ne_chunk(tagged_tokens, binary)
}

text <- "Christopher is having a difficult time parsing NLTK Trees in R."
tokens <- word_tokenize(text)
tagged_tokens <- pos_tag(tokens)
ne_tagged_tokens <- ne_chunk(tagged_tokens)

Here is the shorthand that is returned when the text from the previous example is processed:

> ne_tagged_tokens
List (11 items)

Here are the classes to which ne_tagged_tokens belongs:

> class(ne_tagged_tokens)
[1] "nltk.tree.Tree"        "python.builtin.list"   "python.builtin.object"

I am not interested in suggestions to use alternative, pre-existing R packages.


Solution

  • I guess the problem is that reticulate cannot convert custom Python objects, which is a common limitation, so whatever you pass between the R and Python interfaces should be as close to native Python types as possible.

    There's a way to change the output of ne_chunk() to a string (bracketed parse format), using Tree.pformat():

    >>> from nltk import word_tokenize, pos_tag, ne_chunk
    >>> sent = "Christopher is having a difficult time parsing NLTK Trees in R."
    >>> ne_chunk(pos_tag(word_tokenize(sent)))
    Tree('S', [Tree('GPE', [('Christopher', 'NNP')]), ('is', 'VBZ'), ('having', 'VBG'), ('a', 'DT'), ('difficult', 'JJ'), ('time', 'NN'), ('parsing', 'VBG'), Tree('ORGANIZATION', [('NLTK', 'NNP'), ('Trees', 'NNP')]), ('in', 'IN'), Tree('GPE', [('R', 'NNP')]), ('.', '.')])
    >>> ne_chunk(pos_tag(word_tokenize(sent))).pformat()
    '(S\n  (GPE Christopher/NNP)\n  is/VBZ\n  having/VBG\n  a/DT\n  difficult/JJ\n  time/NN\n  parsing/VBG\n  (ORGANIZATION NLTK/NNP Trees/NNP)\n  in/IN\n  (GPE R/NNP)\n  ./.)'
    

    And to read it back in, use Tree.fromstring():

    >>> tree_str = ne_chunk(pos_tag(word_tokenize(sent))).pformat()
    >>> from nltk import Tree
    >>> Tree.fromstring(tree_str)
    Tree('S', [Tree('GPE', ['Christopher/NNP']), 'is/VBZ', 'having/VBG', 'a/DT', 'difficult/JJ', 'time/NN', 'parsing/VBG', Tree('ORGANIZATION', ['NLTK/NNP', 'Trees/NNP']), 'in/IN', Tree('GPE', ['R/NNP']), './.'])
    

    So I would guess doing this in R might work:

    text <- "Christopher is having a difficult time parsing NLTK Trees in R."
    ne_tagged_tokens <- ne_chunk(pos_tag(word_tokenize(text)))$pformat()
    print(ne_tagged_tokens)
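
Since the bracketed string is plain text, it can be picked apart without NLTK at all. The following is only a rough sketch in Python, assuming the flat two-level shape shown in the pformat() output above (a root node whose children are either word/TAG leaves or labelled chunks). The function name entities_from_bracketed is made up for illustration, and tokens that are themselves parentheses are not handled.

```python
import re

# Bracketed output copied from the pformat() call shown earlier in this answer.
tree_str = (
    "(S\n  (GPE Christopher/NNP)\n  is/VBZ\n  having/VBG\n  a/DT\n"
    "  difficult/JJ\n  time/NN\n  parsing/VBG\n"
    "  (ORGANIZATION NLTK/NNP Trees/NNP)\n  in/IN\n  (GPE R/NNP)\n  ./.)"
)


def entities_from_bracketed(s):
    """Return (label, [(token, tag), ...]) for each chunk below the root.

    Hypothetical helper: assumes chunks are not nested inside each other.
    """
    chunks = []
    # Match "(LABEL leaf leaf ...)" groups that contain no nested parentheses;
    # the root "(S ..." is skipped because its body contains parentheses.
    for label, body in re.findall(r"\((\S+)\s+([^()]+)\)", s):
        # rsplit on the last "/" so tokens containing "/" stay intact.
        leaves = [tuple(leaf.rsplit("/", 1)) for leaf in body.split()]
        chunks.append((label, leaves))
    return chunks


entities = entities_from_bracketed(tree_str)
for label, leaves in entities:
    print(label, leaves)
```

The same pattern should carry over to R's regular expression functions, since it only relies on the parentheses and the word/TAG convention.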
    

    But reading the string back into a Tree from R probably won't help: reticulate can't convert the non-native Python Tree object, and the wrapper pattern some_func <- function(...) { nltk$some_func(...) } doesn't apply to Tree, since it is a class rather than a function.


    If you want to manipulate the output of ne_chunk Tree objects into a list of named entities, then you would have to do something like this in Python: NLTK Named Entity recognition to a Python list
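
As a rough illustration of that idea, here is a minimal Python sketch that flattens an ne_chunk()-style tree into a list of plain tuples, which are native types. The names flatten_chunks and StandInTree are hypothetical; StandInTree only mimics the two features of nltk.Tree used here (it is a list and has a label() method) so the sketch runs without NLTK installed. The "O" label for tokens outside any chunk is a convention chosen for this example.

```python
def flatten_chunks(tree):
    """Turn a two-level chunk tree into a list of (token, tag, label) tuples.

    Subtrees are recognised by having a label() method, as nltk.Tree does.
    Tokens outside any chunk get the label "O" (an assumption of this sketch).
    """
    rows = []
    for node in tree:
        if hasattr(node, "label"):
            # A chunk such as GPE or ORGANIZATION: tag every token inside it.
            rows.extend((tok, tag, node.label()) for tok, tag in node)
        else:
            # A plain (token, tag) pair that belongs to no named entity.
            tok, tag = node
            rows.append((tok, tag, "O"))
    return rows


class StandInTree(list):
    """Hypothetical stand-in for nltk.Tree, used only to make the demo runnable."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


demo = StandInTree("S", [
    StandInTree("GPE", [("Christopher", "NNP")]),
    ("is", "VBZ"),
    StandInTree("ORGANIZATION", [("NLTK", "NNP"), ("Trees", "NNP")]),
])
rows = flatten_chunks(demo)
print(rows)
```

With NLTK available, the result of ne_chunk() could be passed to flatten_chunks() directly. Saved to a file, a function like this could presumably be loaded into R with reticulate's source_python(), though whether the Tree is handed back to Python unchanged in your setup is something to check.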

    Then again, if you need so many Python functions that you don't want to recode them or use other R libraries, why not write in Python instead of sticking to R?