python nltk chunking part-of-speech

Extracting the Strings from a Chunk


I am using NLTK POS tagging to extract information from a text; in this example I am looking for an IBAN. For some texts the code returns more than one chunk, but I don't mind that, since I will pick out the correct one later with a regex. Now here is my question: is there a cleaner way to get the strings out of a chunk so I can work with them or save them?

Of course I could take the manual route (i.e. iterate through all lines in ibanChunk, then ibanChunk.replace(..) etc.), but there must be a better way, at least that's what I'm hoping.

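import nltk

# corp (the tagged training corpus) and filtered_sentence (the input tokens) are defined elsewhere.
tagged_sents = list(corp.tagged_sents())
tagger = ClassifierBasedGermanTagger(train=tagged_sents)
tagged_sents = tagger.tag(filtered_sentence)

# Chunk either <VMPP> followed by any number of <CARD>, or <FM> followed by at least one <CARD>.
ibanChunkGram = r"""Chunk: {(<VMPP><CARD>*)|(<FM><CARD>+)}"""
chunkParser = nltk.RegexpParser(ibanChunkGram)
ibanChunk = chunkParser.parse(tagged_sents)

print(ibanChunk)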

Right now the printed chunk looks like this:

(Chunk DE01/FM 2345/CARD 6789/CARD 0000/CARD 0000/CARD 00/CARD)

and what I want to have is:

DE01 2345 6789 0000 0000 00

Edit: Here is a minimal example:

This is a minimal example of POS-tagging. I want to extract an IBAN (DE01 2345 6789 0000 0000 00) and I hope The Machine 01 can find it quick.

And this is the output of my code:

(S
  This/NE
  is/FM
  a/FM
  minimal/FM
  example/FM
  of/FM
  POS-tagging/FM
  ./$.
  I/FM
  want/FM
  to/FM
  extract/FM
  IBAN/FM
  (/$(
  (Chunk DE01/FM 2345/CARD 6789/CARD 0000/CARD 0000/CARD 00/CARD)
  )/$(
  and/NE
  I/NE
  hope/VAFIN
  The/NE
  Machine/NE
  01/CARD
  can/XY
  find/XY
  it/XY
  quick/XY
  ./$.)

Solution

  • Alright, I have figured it out myself by now. In case anyone else stumbles upon this problem, here is my solution. ibanChunk, as it is called in my case, behaves like a list of (word, tag) tuples (it is itself an nltk.Tree, which subclasses list), but each chunk inside it turned out to be a nested tree, not a tuple, so I used that to my advantage. Here is my code:

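    chunkList = []
    for elem in ibanChunk:
        # Plain tokens are (text, tag) tuples; only chunks are nested Trees.
        if isinstance(elem, nltk.Tree):
            # Join the words with spaces, e.g. "DE01 2345 6789 0000 0000 00".
            ibanString = " ".join(text for (text, tag) in elem)
            chunkList.append(ibanString)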
    

    That gives you the text of every chunk as a string in chunkList.
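
For anyone who wants to try this without training a tagger, here is a minimal, self-contained sketch of the same idea. It is not the exact code from above: the tagged tokens are typed in by hand (tags copied from the example output in the question), the chunks are collected with Tree.subtrees() and Tree.leaves() instead of an isinstance check, and the regex at the end is only an assumed pattern for the shape of a German IBAN (DE followed by 20 digits). It does not validate the checksum.

```python
import re
import nltk

# Hand-tagged tokens standing in for tagger.tag(...); the tags are copied
# from the example output in the question, so no trained tagger is needed.
tagged = [
    ("IBAN", "FM"), ("(", "$("),
    ("DE01", "FM"), ("2345", "CARD"), ("6789", "CARD"),
    ("0000", "CARD"), ("0000", "CARD"), ("00", "CARD"),
    (")", "$("), ("Machine", "NE"), ("01", "CARD"),
]

# Same chunk grammar as in the question.
grammar = r"""Chunk: {(<VMPP><CARD>*)|(<FM><CARD>+)}"""
tree = nltk.RegexpParser(grammar).parse(tagged)

# subtrees() walks every nested Tree, including the root, so filter on the label.
# leaves() returns the (word, tag) tuples of one chunk.
chunk_strings = [
    " ".join(word for word, _tag in chunk.leaves())
    for chunk in tree.subtrees(filter=lambda t: t.label() == "Chunk")
]

# Assumption: only the German IBAN shape is checked (DE + 20 digits), spaces ignored.
GERMAN_IBAN = re.compile(r"DE\d{20}")
ibans = [s for s in chunk_strings if GERMAN_IBAN.fullmatch(s.replace(" ", ""))]

print(chunk_strings)
print(ibans)
```

Filtering on the label rather than on the type also keeps working if the grammar later defines more than one chunk name, since each kind of chunk can then be picked out separately.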