I am working with a CSV that has the following structure:
"2012-09-01 20:03:15","http://example.com"
The data is a cleaned-up dump of my browsing history. I am interested in counting the first five unique domains visited on each day. Here is what I have so far:
from urlparse import urlparse
import csv
from collections import Counter

domains = Counter()
with open("history.csv") as f:
    for row in csv.reader(f):
        d = row[0]
        dt = d[11:19]
        dt = dt.replace(":","")
        dd = d[0:10]
        if (dt < "090000") and (dt > "060000"):
            url = row[1]
            p = urlparse(url)
            ph = p.hostname
            print dd + "," + dt + "," + ph
            domains += Counter([ph])
t = str(domains.most_common(20))
With d, dt, and dd, I am separating the date and time. With the above example row, dt = "200315" (the time 20:03:15 with the colons removed), and dd = "2012-09-01". The condition (dt < "090000") and (dt > "060000") just says that I am only interested in counting websites visited between 6am and 9am. How would I say "count only the first five websites that were visited before 6am, each day"? There are hundreds of rows for any given day, and the rows are in chronological order.
import csv
from collections import defaultdict, Counter
from datetime import datetime
from urlparse import urlsplit

indiv = Counter()
domains = defaultdict(lambda: defaultdict(int))
with open("history.csv", "rb") as f:
    for timestr, url in csv.reader(f):
        dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
        if 6 <= dt.hour < 11:  # between 6am and 11am
            today_domains = domains[dt.date()]
            domain = urlsplit(url).hostname
            if len(today_domains) < 5 and domain not in today_domains:
                today_domains[domain] += 1
                indiv += Counter([domain])
for domain in indiv:
    print '%s,%d' % (domain, indiv[domain])
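The code above is written for Python 2. On Python 3, the same idea could be sketched roughly as follows. This is only a sketch under some assumptions: the rows are in chronological order (as stated in the question), and the names first_domains_per_day, limit, start_hour and end_hour are hypothetical, chosen for illustration. The sample rows are made up in the same format as history.csv.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime
from urllib.parse import urlsplit


def first_domains_per_day(rows, limit=5, start_hour=6, end_hour=11):
    """Return (per_day, totals) for chronologically ordered (timestamp, url) rows.

    per_day maps each date to the first `limit` unique hostnames seen in the
    window [start_hour, end_hour); totals counts on how many days each
    hostname made that list.
    """
    per_day = defaultdict(list)
    totals = Counter()
    for timestr, url in rows:
        dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
        if not (start_hour <= dt.hour < end_hour):
            continue  # outside the time window of interest
        host = urlsplit(url).hostname
        seen = per_day[dt.date()]
        # Skip URLs without a hostname, repeats, and anything past the limit.
        if host is None or host in seen or len(seen) >= limit:
            continue
        seen.append(host)
        totals[host] += 1
    return dict(per_day), totals


# Hypothetical sample data in the same format as history.csv.
sample = io.StringIO(
    '"2012-09-01 06:10:00","http://example.com/a"\n'
    '"2012-09-01 06:15:00","http://example.com/b"\n'
    '"2012-09-01 07:00:00","http://example.org"\n'
    '"2012-09-01 20:03:15","http://example.net"\n'
    '"2012-09-02 08:30:00","http://example.com"\n'
)
per_day, totals = first_domains_per_day(csv.reader(sample))
for domain, count in totals.most_common():
    print("%s,%d" % (domain, count))
```

To run it against the real file, pass csv.reader(f) with f opened as open("history.csv", newline="") instead of the in-memory sample. Changing start_hour and end_hour moves the time window; limit controls how many unique domains are kept per day.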