Search code examples
rnlpopennlppos-tagger

Extracting noun+noun or (adj|noun)+noun from Text


Is it possible to extract noun+noun or (adj|noun)+noun using the R package openNLP? That is, I would like to use linguistic filtering to extract candidate noun phrases. Could you direct me how to do? Many thanks.


Thanks for the responses. here is the code:

library("openNLP")

acq <- "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in
        pipeline and terminal operations for 12.2 mln dlrs. The company said 
        the sale is subject to certain post closing adjustments, 
        which it did not explain. Reuter." 

acqTag <- tagPOS(acq)    
acqTagSplit = strsplit(acqTag," ")
acqTagSplit

qq = 0
tag = 0

for (i in 1:length(acqTagSplit[[1]])){
    qq[i] <-strsplit(acqTagSplit[[1]][i],'/')
    tag[i] = qq[i][[1]][2]
}

index = 0

k = 0

for (i in 1:(length(acqTagSplit[[1]])-1)) {
    
    if ((tag[i] == "NN" && tag[i+1] == "NN") | 
        (tag[i] == "NNS" && tag[i+1] == "NNS") | 
        (tag[i] == "NNS" && tag[i+1] == "NN") | 
        (tag[i] == "NN" && tag[i+1] == "NNS") | 
        (tag[i] == "JJ" && tag[i+1] == "NN") | 
        (tag[i] == "JJ" && tag[i+1] == "NNS"))
    {      
            k = k +1
            index[k] = i
    }

}

index

Reader can refer index on acqTagSplit to do noun+noun or (adj|noun)+noun extraction. (The code is not optimal, but it works. If you have any idea, please let me know.)

I have an additional problem:

Justeson and Katz (1995) proposed another linguistic filtering to extract candidate noun phrases:

((Adj|Noun)+|((Adj|Noun)*(Noun-Prep)?)(Adj|Noun)*)Noun

I cannot understand its meaning well. Could you do me a favor and explain it? Or show how to code the filtering rule in the R language? Many thanks.


Solution

  • It is possible.

    EDIT:

    You got it. Use the POS tagger and split on spaces: ll <- strsplit(acqTag,' '). From there iterate on the length of the input list (length of ll) like: for (i in 1:37){qq <-strsplit(ll[[1]][i],'/')} and get the part of speech sequence you're looking for.

    After splitting on spaces it is just list processing in R.