
"list index out of range" error within nested loop structure but not outside of it


This is my first question here.

I'm trying to extract only the word forms from a text corpus and write them into a text file.

The corpus looks like this:

<corpus>
<text id="t0">
<s>
Computerlinguistik  NN  NOUN    Computerlinguistik
</s>
<s>
In  APPR    ADP In
der ART DET der
Computerlinguistik  NN  NOUN    Computerlinguistik
    _SP SPACE     
oder    KON CCONJ   oder
linguistischen  ADJA    ADJ linguistischen
Datenverarbeitung   NN  NOUN    Datenverarbeitung
...
</s>
...

So

  1. A sentence is marked with <s>...</s>
  2. Each word of a sentence is on its own line
  3. Each line has the word form (and some tab-separated annotations, e.g. part-of-speech tag)

My approach

To get to the word forms, my approach is:

  1. Make a list of all the sentences without XML tags
  2. Split each sentence of that list at '\n'
  3. Split each line at any whitespace character
  4. Write the first element of that "line list" into a .txt file

The issue

However, I'm getting a "list index out of range" error when trying to access the first element within the loop:

from bs4 import BeautifulSoup

# getting the xml-like content:
soupWiki = BeautifulSoup(open('MeinWikiKorpus.vrt'))

# getting a list of all sentences (<s>...</s>) without xml tags:
wikiSentences = [sentence.get_text() for sentence in soupWiki.find_all('s')]

for s in wikiSentences:
    # splitting each sentence by '\n'
    for line in s.splitlines():
        # splitting each line into its elements (word form, POS tag, ...)
        lElements = line.split()
        print(lElements[0])

However, when I try to access the first element outside of all the loops, it works.

I'm sure it's just a silly mistake, and by writing this question I might have figured it out already, but somehow I'm stuck here.

Thanks in advance!


Solution

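  • You're executing:

            lElements = line.split()

    There are a few things going on here.

    1. Some lines are blank, so .split() returns an empty list for them, and indexing [0] into an empty list raises the IndexError.
    2. lElements is reassigned on every iteration, so it retains its final value after the loop is done.
    3. The final line is non-blank, which is why lElements[0] works outside the loop.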
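
The three points above can be reproduced without the corpus file. This is a minimal sketch, assuming that get_text() returns the sentence text with a leading newline (the one right after the <s> tag); the sample string is made up for illustration and is not taken from MeinWikiKorpus.vrt.

```python
# Hypothetical text of one <s> element: it starts with "\n",
# so splitlines() yields a blank first line.
sentence = "\nIn\tAPPR\tADP\tIn\nder\tART\tDET\tder\n"

lines = sentence.splitlines()
first_fields = lines[0].split()   # the blank line before "In"
last_fields = lines[-1].split()   # the final, non-blank line

print(first_fields)  # []
print(last_fields)   # ['der', 'ART', 'DET', 'der']

# Indexing into the empty list is what fails inside the loop.
try:
    first_fields[0]
except IndexError as exc:
    error_message = str(exc)
    print(error_message)  # list index out of range
```
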

    Before accessing the 0th element, you will want to check with a guard that the list is non-empty.

    Verbosely: if len(lElements) > 0:

    Concisely:

            if lElements:
                print(lElements[0])
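
Putting the guard into the whole loop, one possible sketch looks like the following. The function name word_forms, the sample sentences, and the output file name wordforms.txt are assumptions for illustration; in the question's code, the list wikiSentences would be passed in instead of the sample data.

```python
def word_forms(sentences):
    """Yield the first whitespace-separated field of every non-blank line."""
    for sentence in sentences:
        for line in sentence.splitlines():
            fields = line.split()
            if fields:  # skip blank lines, where split() returns []
                yield fields[0]


# Stand-in for wikiSentences from the question (hypothetical sample data).
sample_sentences = [
    "\nComputerlinguistik\tNN\tNOUN\tComputerlinguistik\n",
    "\nIn\tAPPR\tADP\tIn\nder\tART\tDET\tder\n",
]

forms = list(word_forms(sample_sentences))
print(forms)  # ['Computerlinguistik', 'In', 'der']

# One word form per line; "wordforms.txt" is a hypothetical file name.
with open("wordforms.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(forms) + "\n")
```

One caveat, assuming the corpus excerpt is representative: on a line whose word form is itself whitespace (the _SP line), split() discards the leading whitespace, so the first field is the tag _SP rather than a word form. Splitting on '\t' instead would keep the columns apart if that matters.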