This is my first question here ever.
I'm trying to extract only the word forms from a text corpus and write them to a text file.
The corpus looks like this:
<corpus>
<text id="t0">
<s>
Computerlinguistik NN NOUN Computerlinguistik
</s>
<s>
In APPR ADP In
der ART DET der
Computerlinguistik NN NOUN Computerlinguistik
_SP SPACE
oder KON CCONJ oder
linguistischen ADJA ADJ linguistischen
Datenverarbeitung NN NOUN Datenverarbeitung
...
</s>
...
To get to the word forms, my approach is:
- Make a list of all the sentences without XML tags
- Split each sentence of that list at '\n'
- Split each line at any whitespace character
- Write the first element of that "line list" into a .txt file
However, I'm getting a "list index out of range" error when trying to access the first element within the loop:
    from bs4 import BeautifulSoup

    # getting the xml-like content:
    soupWiki = BeautifulSoup(open('MeinWikiKorpus.vrt'))
    # getting a list of all sentences (<s>...</s>) without xml tags:
    wikiSentences = [sentence.get_text() for sentence in soupWiki.find_all('s')]
    for s in wikiSentences:
        # splitting each sentence by '\n'
        for line in s.splitlines():
            # splitting each line into its elements (word form, POS tag, ...)
            lElements = line.split()
            print(lElements[0])
However, when I try to access the first element outside of all the loops, it works.
I'm sure it's just a silly mistake, and by writing this question I might have figured it out already, but somehow I'm stuck here.
Thanks in advance!
You're executing:
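    lElements = line.split()
There are a few things going on here, but the important one is that for some lines .split() finds zero elements and returns an empty list (this happens, for example, when the line is empty or contains only whitespace). Indexing an empty list with [0] raises the "list index out of range" error. Before dereferencing the 0th element, you will want to check with a guard.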
Verbosely:
    if len(lElements) > 0:
Concisely:
    if lElements:
        print(lElements[0])
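To see why the guard matters, here is a minimal sketch. It assumes that the text of one `<s>` element starts and ends with a newline, as the corpus excerpt in the question suggests. The helper name `first_columns` and the `sample` string are made up for illustration and are not part of the original code.

```python
def first_columns(sentence_text):
    """Return the first whitespace-separated field of every non-blank line."""
    forms = []
    for line in sentence_text.splitlines():
        fields = line.split()
        if fields:  # blank or whitespace-only lines give [], so skip them
            forms.append(fields[0])
    return forms


# Hypothetical text of one sentence; note the leading newline, which
# produces an empty first line once the string is split into lines.
sample = "\nIn APPR ADP In\nder ART DET der\n"

print(sample.splitlines())    # ['', 'In APPR ADP In', 'der ART DET der']
print("".split())             # []
print(first_columns(sample))  # ['In', 'der']
```

The empty first line is the likely culprit in the loop from the question: it exists only inside the loop over `splitlines()`, which would also explain why indexing works outside the loops.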