I have a data file and I perform a few operations on the data. All of the other operations work fine; the only thing I am not able to calculate is the median.
Input (a few lines from a huge input file):
00904bcabb02 00904bf7d758 676.0
0030657cc312 00904b1f1154 120.0
00306597852d 00904b48a3b6 572.0
00904b1f1154 00904bcabb02 120.0
00904b1f1154 00904bf7d758 120.0
00904b48a3b6 00904ba7a3eb 572.0
00022d1aa531 0006254f5810 2.0
00022dac729c 0006254f5810 2.0
00022dbd5c9e 0006254f5810 2.0
0006254f5810 0050dad80267 2.0
0006254f5810 00904be2b271 2.0
00022d097904 004096f41eb8 20.0
00022d2d30dd 004096f41eb8 20.0
004096f41eb8 00904b1e7852 20.0
00022d1406df 00022d36a6df 8.0
00022d36a6df 00022d8cb682 8.0
00022d36a6df 0030654a05fa 8.0
0004230dd7de 000423cbac29 33.0
0004231e4f43 000423cbac29 33.0
0030659b49f1 00904b310619 29.0
For every pair of col[0] and col[1], I find the frequency and the sum and average of the corresponding values. I am also trying to find the median of the set of values in pairtime. I am using numpy.median, but that does not seem to be working. Any suggestions are appreciated. Thanks.
Code:
from collections import defaultdict
import numpy as np

paircount = defaultdict(int)
pairtime = defaultdict(float)
pairper = defaultdict(float)
timeavg = defaultdict(float)
timefreq = defaultdict(int)

#get number of pair occurrences and total time
with open('Input.txt', 'r') as f, open('Output.txt', 'w') as o:
    for numline, line in enumerate((line.split() for line in f), start=1):
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])
        #timeavg = pairtime[pair]/paircount[pair]
    #pairper = dict((pair, c * 100.0 / numline) for (pair, c) in paircount.iteritems())
    for pair, freq in paircount.iteritems():
        timeavg = pairtime[pair] / freq
        med = np.median(np.pairtime[pair])
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s %s %s %.2f %.2f %s \n" % (pair[0], pair[1], freq, pairtime[pair], timeavg, med))
print 'done'
Error:
Traceback (most recent call last):
  File "pair_one.py", line 20, in <module>
    med = np.median(np.pairtime[pair])
AttributeError: 'module' object has no attribute 'pairtime'
Replace:
med = np.median(np.pairtime[pair])
with:
med = np.median(pairtime[pair])
pairtime is a variable defined in your own script, not an attribute of the numpy module, so it must not be prefixed with np.
EDIT
As @Fred S has pointed out, pairtime[pair] contains only the sum of the times, not the complete series, which I did not notice at first. The median of a single number is just that number, so the fix above runs but does not give the median you want. Since you are calculating several statistics from the times, I believe a better approach is to keep the whole series for each pair rather than just the sum, as @Fred S did in his answer. You can then calculate all of your statistics from that series.
Here is a shot at a possible solution:
from collections import defaultdict
import numpy as np

# Map each (col[0], col[1]) pair to the full list of its times.
pairtimelist = defaultdict(list)

with open('Input.txt', 'r') as f, open('Output.txt', 'w') as o:
    for line in f:
        fields = line.split()
        pair = fields[0], fields[1]
        pairtimelist[pair].append(float(fields[2]))
    # Compute every statistic from the stored list of times.
    for pair, times in pairtimelist.items():
        timeavg = np.mean(times)
        timemed = np.median(times)
        timesum = np.sum(times)
        freq = len(times)
        o.write("%s %s %s %.2f %.2f %s \n" % (pair[0], pair[1], freq, timesum, timeavg, timemed))
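If you want to check the approach without touching your real files, here is a minimal self-contained sketch of the same idea. The function name pair_stats and the sample lines (aa, bb, cc, dd and their values) are made up for illustration and are not taken from your data; the sketch assumes every usable line has at least three whitespace-separated fields and skips anything shorter.

```python
from collections import defaultdict

import numpy as np


def pair_stats(lines):
    """Return {(col0, col1): (freq, total, mean, median)} for the given lines.

    Hypothetical helper, not part of the original script.
    """
    times = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines (an assumption of this sketch)
        times[(fields[0], fields[1])].append(float(fields[2]))

    stats = {}
    for pair, values in times.items():
        # Every statistic is computed from the full list, not from a running sum.
        stats[pair] = (
            len(values),
            float(np.sum(values)),
            float(np.mean(values)),
            float(np.median(values)),
        )
    return stats


# Made-up sample: the pair (aa, bb) occurs three times, (cc, dd) once.
sample = [
    "aa bb 1.0",
    "aa bb 2.0",
    "cc dd 8.0",
    "aa bb 9.0",
]
result = pair_stats(sample)
for (a, b), (freq, total, mean, median) in sorted(result.items()):
    print("%s %s %d %.2f %.2f %.2f" % (a, b, freq, total, mean, median))
```

For (aa, bb) the mean is 4.0 but the median is 2.0, which is exactly the information that is lost when only the sum is stored: calling np.median on the single float 12.0 would simply give back 12.0.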