Python CSV row value based flow control

I am working with a CSV that has the following structure:

"2012-09-01 20:03:15",""

The data is a cleaned-up dump of my browsing history. I am interested in counting the first five unique domains visited on each day. Here is what I have so far:

from urlparse import urlparse
import csv
from collections import Counter

domains = Counter()

with open("history.csv") as f:
    for row in csv.reader(f):
        d = row[0]
        dt = d[11:19]
        dt = dt.replace(":","")
        dd = d[0:10]
        if (dt < "090000") and (dt > "060000"):
            url = row[1]
            p = urlparse(url)
            ph = p.hostname
            print dd + "," + dt + "," + ph
            domains += Counter([ph])
t = str(domains.most_common(20))

With d, dt, and dd, I am separating the date and time. With the above example row, dt = 200315 (20:03:15 with the colons removed) and dd = 2012-09-01. The condition 'if (dt < "090000") and (dt > "060000")' just says that I am only interested in counting websites visited between 6am and 9am. How would I say "count only the first five websites that were visited before 6am, each day"? There are hundreds of rows for any given day, and the rows are in chronological order.
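
To make the slicing concrete, here is a minimal sketch of how the sample timestamp splits, assuming the first column always has the fixed "YYYY-MM-DD HH:MM:SS" layout shown above. The helper name split_timestamp is made up for illustration and is not part of my script, and the print calls use Python 3 syntax.

```python
def split_timestamp(d):
    """Split 'YYYY-MM-DD HH:MM:SS' into (date, time-without-colons)."""
    dd = d[0:10]                      # date part, e.g. '2012-09-01'
    dt = d[11:19].replace(":", "")    # time part, '20:03:15' -> '200315'
    return dd, dt


dd, dt = split_timestamp("2012-09-01 20:03:15")
print(dd, dt)

# Zero-padded strings of equal length sort in the same order as the times
# they represent, which is why a plain string comparison works as a filter.
in_window = "060000" < dt < "090000"
print(in_window)  # the sample row (20:03:15) falls outside 6am-9am
```

The comparison only behaves like a time comparison because every value is zero-padded to six digits; a timestamp in a different layout would need real parsing.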


  • import csv
    from collections import defaultdict, Counter
    from datetime import datetime
    from urlparse import urlsplit
    indiv = Counter()
    domains = defaultdict(lambda: defaultdict(int))
    with open("history.csv", "rb") as f:
        for timestr, url in csv.reader(f):
            dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
            if 6 <= dt.hour < 11: # between 6am and 11am
                today_domains = domains[dt.date()]  # domains already counted for this calendar day
                domain = urlsplit(url).hostname
                if len(today_domains) < 5 and domain not in today_domains:
                    today_domains[domain] += 1
                    indiv += Counter([domain])
    for domain in indiv:
        print '%s,%d' % (domain, indiv[domain])
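
The same idea can be sketched for Python 3, where urlsplit lives in urllib.parse and print is a function. This is only a sketch under a few assumptions: the function name count_first_domains and its parameters are made up for illustration, a per-day set replaces the nested defaultdict, rows whose URL has no hostname are skipped (the code above would count None as a domain), and the sample rows use invented example hosts in place of history.csv.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime
from urllib.parse import urlsplit  # Python 3 location of urlsplit


def count_first_domains(rows, limit=5, start_hour=6, end_hour=11):
    """Count, across all days, the first `limit` unique domains seen per day.

    `rows` is any iterable of (timestamp, url) pairs in chronological order.
    Only rows with start_hour <= hour < end_hour are considered.
    """
    totals = Counter()
    seen_per_day = defaultdict(set)   # date -> domains already counted that day
    for timestr, url in rows:
        dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
        if not (start_hour <= dt.hour < end_hour):
            continue
        domain = urlsplit(url).hostname
        if domain is None:            # assumption: ignore rows without a host
            continue
        seen = seen_per_day[dt.date()]
        if len(seen) < limit and domain not in seen:
            seen.add(domain)
            totals[domain] += 1
    return totals


# Invented sample data standing in for history.csv
sample = io.StringIO(
    '"2012-09-01 06:10:00","http://a.example/x"\n'
    '"2012-09-01 06:11:00","http://a.example/y"\n'
    '"2012-09-01 07:00:00","http://b.example/"\n'
    '"2012-09-01 20:03:15","http://c.example/"\n'
    '"2012-09-02 08:30:00","http://a.example/"\n'
)
result = count_first_domains(csv.reader(sample))
for domain, n in result.most_common():
    print("%s,%d" % (domain, n))
```

Because the rows are in chronological order, the first domains to enter each day's set are the earliest ones visited inside the time window, so no sorting is needed. Each domain is counted at most once per day, and the final totals say on how many days it was among that day's first five.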