Search code examples
pythoncsvcontrol-flowurlparse

Python CSV row value based flow control


I am working with a CSV that has the following structure:

"2012-09-01 20:03:15","http://example.com"

The data is a cleaned up dump of my browsing history. I am interested in counting the first five unique domains per a given day. Here is what I have so far:

from urlparse import urlparse
import csv
from collections import Counter

domains = Counter()

with open("history.csv") as f:
    for row in csv.reader(f):
        d = row[0]
        dt = d[11:19]
        dt = dt.replace(":","")
        dd = d[0:10]
        if (dt < "090000") and (dt > "060000"):
            url = row[1]
            p = urlparse(url)
            ph = p.hostname
            print dd + "," + dt + "," + ph
            domains += Counter([ph])
t = str(domains.most_common(20))

With d, dt, and dd, I am separating the date and time. With the above example row, dt = 20:03:15, and dd = 2012-09-01. The "if (dt < "090000") and (dt > "060000")" is just to say that I am only interested in counting websites visited between 6am and 9am. How would I say "count only the first five websites that were visited before 6am, each day"? There are hundreds of rows for any given day, and the rows are in chronological order.


Solution

  • import csv
    from collections import defaultdict, Counter
    from datetime import datetime
    from urlparse import urlsplit
    
    indiv = Counter()
    
    domains = defaultdict(lambda: defaultdict(int))
    with open("history.csv", "rb") as f:
        for timestr, url in csv.reader(f):
            dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
            if 6 <= dt.hour < 11: # between 6am and 11am
                today_domains = domains[dt.date()]
                domain = urlsplit(url).hostname
                if len(today_domains) < 5 and domain not in today_domains:
                    today_domains[domain] += 1
                    indiv += Counter([domain])
    for domain in indiv:
        print '%s,%d' % (domain, indiv[domain])