python file text-classification filewriter

How to fix 'list index out of range' when reading a large amount of data from a file?

I am working on a classifier that needs to read 200000 items from a dataset, but it only processes about 1400 items correctly and then fails with 'list index out of range'.

How can I access all of the items from the dataset?

Here is the structure of the dataset:

investing: can you profit in agricultural commodities?
bad weather is one factor behind soaring food prices. can you make hay with farm stocks? possibly: but be prepared to harvest gains on a moment's ...
http://rssfeeds.usatoday.com/~r/usatodaycommoney-topstories/~3/qbhb22sut9y/2011-05-19-can-you-make-gains-in-grains_n.htm
0
20 May 2011 15:13:57
ut
business

no tsunami but fifa's corruption storm rages on
though jack warner's threatened soccer "tsunami" remains stuck in the doldrums, the corruption storm raging around fifa shows no sign of abating after another extraordinary week for the game's governing body.
http://feeds.reuters.com/~r/reuters/sportsnews/~3/ffa6ftdsudg/us-soccer-fifa-idustre7563p620110607
1
07 Jun 2011 17:54:54
reuters
sport

critic's corner weekend: 'fringe' wraps third season
joshua jackson's show goes out with a bang. plus: amazing race nears the finish line.
http://rssfeeds.usatoday.com/~r/usatoday-lifetopstories/~3/duk9oew5auc/2011-05-05-critics-corner_n.htm
2
06 May 2011 23:36:21
ut
entertainment

Here is the code:

with open('news', 'r') as f:
    text = f.read()
    news = text.split("\n\n")
    count = {'sport': 0, 'world': 0, "us": 0, "business": 0, "health": 0, "entertainment": 0, "sci_tech": 0}
    for news_item in news:
        lines = news_item.split("\n")
        print(lines[6])
        file_to_write = open('data/' + lines[6] + '/' + str(count[lines[6]]) + '.txt', 'w+')
        count[lines[6]] = count[lines[6]] + 1
        file_to_write.write(news_item)  # python will convert \n to os.linesep
        file_to_write.close()

It shows the following output:


IndexError                                Traceback (most recent call last)
<ipython-input-1-d04a79ce68f6> in <module>
      5     for news_item in news:
      6         lines = news_item.split("\n")
----> 7         print(lines[6])
      8         file_to_write = open('data/' + lines[6] + '/' + str(count[lines[6]]) + '.txt', 'w+')
      9         count[lines[6]] = count[lines[6]] + 1

IndexError: list index out of range

Solution

You are assuming that every block has at least 7 lines. Perhaps your file ends in \n\n, which leaves an empty final block after splitting on "\n\n", or some blocks are corrupted.
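To see which blocks are too short before changing the main loop, a small diagnostic along these lines may help. This is only a sketch: `find_short_blocks` and the `sample` text are hypothetical names and data made up for illustration, not part of the original code or dataset.

```python
# Hypothetical sample: two well-formed 7-line blocks, one truncated block,
# and a trailing blank line at the end of the text.
sample = (
    "title a\nsummary a\nurl a\n0\ndate a\nsrc a\nbusiness\n"
    "\n"
    "title b\nsummary b\n"
    "\n"
    "title c\nsummary c\nurl c\n2\ndate c\nsrc c\nsport\n"
    "\n"
)


def find_short_blocks(text, min_lines=7):
    """Return (block_index, line_count) for each block with fewer than min_lines lines."""
    short = []
    for index, block in enumerate(text.split("\n\n")):
        lines = block.split("\n")
        if len(lines) < min_lines:
            short.append((index, len(lines)))
    return short


short_blocks = find_short_blocks(sample)
print(short_blocks)  # [(1, 2), (3, 1)]
```

The last entry is the empty block produced by the trailing "\n\n": splitting an empty string on "\n" gives a list with one empty string, so `lines[6]` fails on it.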

Simply test the length and skip any block that is too short:

for news_item in news:
    lines = news_item.split("\n")
    if len(lines) < 7:
        continue
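Put together with the file-writing step from the question, the guarded loop might look like the sketch below. The function name `write_blocks_by_category`, the `out_dir` parameter, the use of `Counter` in place of the fixed `count` dictionary, and the sample text are all assumptions made for illustration. Unlike the original, this version creates category directories on demand.

```python
import tempfile
from collections import Counter
from pathlib import Path


def write_blocks_by_category(text, out_dir, category_index=6):
    """Write each well-formed block to out_dir/<category>/<n>.txt.

    Returns (counts, skipped): files written per category, and the number
    of blocks skipped because they had too few lines.
    """
    counts = Counter()
    skipped = 0
    for news_item in text.split("\n\n"):
        lines = news_item.split("\n")
        if len(lines) <= category_index:
            # Too short to contain a category line (e.g. the empty trailing block).
            skipped += 1
            continue
        category = lines[category_index].strip()
        target_dir = Path(out_dir) / category
        target_dir.mkdir(parents=True, exist_ok=True)
        (target_dir / f"{counts[category]}.txt").write_text(news_item)
        counts[category] += 1
    return counts, skipped


# Hypothetical sample: two complete blocks followed by a trailing blank line.
sample = (
    "t1\ns1\nu1\n0\nd1\nsrc1\nsport\n"
    "\n"
    "t2\ns2\nu2\n1\nd2\nsrc2\nsport\n"
    "\n"
)

with tempfile.TemporaryDirectory() as tmp:
    counts, skipped = write_blocks_by_category(sample, tmp)
    written = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*.txt"))

print(dict(counts), skipped, written)
```

Returning the skipped count makes it easy to check whether only the final empty block was dropped or whether the dataset contains more damaged entries.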

Note that you don't need to read the whole file into memory here; you can instead loop over the file object directly and pull additional lines from it as needed. Personally, I'd create a separate generator function that picks out specific lines from the file:

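def block_line_at_n(fobj, n):
    """Yield the line at index n (0-based) of each blank-line-separated block.

    Yielded lines keep their trailing newline; blocks with fewer than
    n + 1 lines yield nothing.
    """
    while True:
        # enumerate() restarts at 0 for each block, while the file object
        # keeps its position between passes of the outer loop
        for i, line in enumerate(fobj):
            if line == "\n":
                # end of block, start a new block
                break
            if i == n:
                yield line
        else:
            # end of the file, exit
            return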

with open('news', 'r') as f:
    for line in block_line_at_n(f, 6):