
How to iterate through a defaultdict(list) in Python?


How do I iterate through a defaultdict(list) in Python? Is there a better way to build a dictionary of lists in Python? I tried the usual iter(dict), but I get this error:

>>> import para
>>> para.print_doc('./sentseg_en/essentials.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "para.py", line 31, in print_doc
    for para in iter(doc):
TypeError: iteration over non-sequence

The main script:

import para
para.print_doc('./foo/bar/para-lines.txt')

The para.py module:

# -*- coding: utf-8 -*-
## Modified paragraph into a defaultdict(list) structure
## Original code from http://code.activestate.com/recipes/66063/
from collections import defaultdict
class Paragraphs:
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    # Separator here refers to the paragraph separator,
    #  the default separator is '\n'.
    def __init__(self, filename, separator=None):
        # Set separator if passed into object's parameter,
        #  else set default separator as '\n'
        if separator is None:
            def separator(line): return line == '\n'
        elif not callable(separator):
            raise TypeError, "separator argument must be callable"
        self.separator = separator
        # Reading lines from files into a dictionary of lists
        self.doc = defaultdict(list)
        paraIndex = 0
        with open(filename) as readFile:
            for line in readFile:
                if self.separator(line):  # call the predicate, don't compare against it
                    paraIndex+=1
                else:
                    self.doc[paraIndex].append(line)

# Prints out populated doc from txtfile
def print_doc(filename):
    text = Paragraphs(filename)
    for para in iter(text.doc):
        for sent in text.doc[para]:
            print "Para#%d, Sent#%d: %s" % (
                para, text.doc[para].index(sent), sent)

An example ./foo/bar/para-lines.txt looks like this:

This is a start of a paragraph.
foo barr
bar foo
foo foo
This is the end.

This is the start of next para.
foo boo bar bar
this is the end.

The output of the main script should look like this:

Para#1,Sent#1: This is a start of a paragraph.
Para#1,Sent#2: foo barr
Para#1,Sent#3: bar foo
Para#1,Sent#4: foo foo
Para#1,Sent#5: This is the end.

Para#2,Sent#1: This is the start of next para.
Para#2,Sent#2: foo boo bar bar
Para#2,Sent#3: this is the end.

Solution

  • The recipe you linked to is rather old. It was written in 2001, before Python had more modern tools such as itertools.groupby (introduced in Python 2.4, released in late 2004). Here is what your code could look like using groupby:

    import itertools
    import sys
    
    with open('para-lines.txt', 'r') as f:
        paranum = 0
        for is_separator, paragraph in itertools.groupby(f, lambda line: line == '\n'):
            if is_separator:
                # we've reached a paragraph separator: emit a blank line
                # (bare `print` is a Python 2 statement; use print() on Python 3)
                print
            else:
                paranum += 1
                for n, sentence in enumerate(paragraph, start = 1):
                    sys.stdout.write(
                        'Para#{i:d},Sent#{n:d}: {s}'.format(
                            i = paranum, n = n, s = sentence))
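
If you would rather keep the defaultdict(list) structure, it can be iterated like any other dict. Below is a minimal sketch, assuming Python 3 (the code above is Python 2) and that a paragraph separator is a blank line. The names group_paragraphs and format_doc are made up for this illustration and are not part of the original para.py. It also swaps list.index(sent) for enumerate, since index() counts from 0 and returns the first match, so a repeated sentence would get the wrong number.

```python
from collections import defaultdict


def group_paragraphs(lines):
    """Collect non-blank lines into a defaultdict(list) keyed by paragraph number."""
    doc = defaultdict(list)
    para_index = 1
    for line in lines:
        if line.strip() == "":
            # A blank line closes the current paragraph, but only if that
            # paragraph already holds lines (this skips repeated blank lines).
            if para_index in doc:
                para_index += 1
        else:
            doc[para_index].append(line.rstrip("\n"))
    return doc


def format_doc(doc):
    """Return one formatted string per sentence, in paragraph order."""
    out = []
    # items() yields (key, list) pairs; sorted() makes the order explicit.
    for para, sentences in sorted(doc.items()):
        # enumerate() numbers the sentences without calling list.index().
        for n, sentence in enumerate(sentences, start=1):
            out.append("Para#%d,Sent#%d: %s" % (para, n, sentence))
    return out


# Hypothetical sample input, shortened from the file in the question.
sample = [
    "This is a start of a paragraph.\n",
    "foo barr\n",
    "\n",
    "This is the start of next para.\n",
    "this is the end.\n",
]
for row in format_doc(group_paragraphs(sample)):
    print(row)
```

To read from a file, pass the open file object instead of the list, e.g. group_paragraphs(open('para-lines.txt')), since a file object also yields one line at a time.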