Python Gurus,
In the past, I used Perl to work through very large text files for data mining. Recently I decided to switch over, since I believe Python makes it easier for me to read my code and figure out what's going on. The unfortunate (or maybe fortunate?) thing about Python is that, compared to Perl, I find it much harder to store and organize data, since I can't create hashes of hashes via autovivification. I'm also unable to sum the elements of a dictionary of dictionaries.
Maybe there's an elegant solution to my problem.
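The closest thing I've found to autovivification in Python is a recursive collections.defaultdict, something like this (just a sketch of the idea, not my real code):

from collections import defaultdict

def tree():
    # Every missing key produces another tree, so nested keys
    # spring into existence on first access, Perl-style.
    return defaultdict(tree)

data = tree()
data['1415PA']['0']['BEC'] = 262  # no intermediate dicts built by hand

That gets me the nesting, but not the summing.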
I have hundreds of files, each with several hundred rows of data (everything fits in memory). The goal is to combine all of these files (two samples are shown below), subject to the following criteria:
For each level (only one level is shown below), create a row for every defect class that appears in any of the files. Not all files contain the same defect classes.
For each level and defect class, sum the BEC and GEC values across all the files (a minimal sketch of this summation follows the sample files below).
The final output should look like this (sample output updated to fix a typo):
Level, defectClass, BECtotals, GECtotals
1415PA, 0, 643, 1991
1415PA, 1, 1994, 6470
...and so on.....
File one:
Level, defectClass, BEC, GEC
1415PA, 0, 262, 663
1415PA, 1, 1138, 4104
1415PA, 107, 2, 0
1415PA, 14, 3, 4
1415PA, 15, 1, 0
1415PA, 2, 446, 382
1415PA, 21, 5, 0
1415PA, 23, 10, 5
1415PA, 4, 3, 16
1415PA, 6, 52, 105
File two:
level, defectClass, BEC, GEC
1415PA, 0, 381, 1328
1415PA, 1, 856, 2366
1415PA, 107, 7, 11
1415PA, 14, 4, 1
1415PA, 2, 315, 202
1415PA, 23, 4, 7
1415PA, 4, 0, 2
1415PA, 6, 46, 42
1415PA, 7, 1, 7
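Conceptually, the summation I'm after is nothing more than this sketch (rows_from_all_files is a made-up name standing in for the parsed rows of every file):

totals = {}  # (level, defectClass) -> [BEC total, GEC total]
for level, dclass, bec, gec in rows_from_all_files:  # hypothetical iterable
    key = (level, dclass)
    if key not in totals:
        totals[key] = [0, 0]
    totals[key][0] += int(bec)
    totals[key][1] += int(gec)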
The biggest problem I'm having is doing the summations on the dictionaries. This is the code I have so far (not working):
import os
import sys

class AutoVivification(dict):
    """Implementation of Perl's autovivification feature. Has features from both
    dicts and lists, dynamically generates new subitems as needed, and allows for
    working (somewhat) as a basic type.
    """
    def __getitem__(self, item):
        if isinstance(item, slice):
            # A slice asks for a range of sorted keys.
            d = AutoVivification()
            items = sorted(self.items(), reverse=True)  # items(), not Python 2's iteritems()
            k, v = items.pop(0)
            while 1:
                if item.start < k < item.stop:
                    d[k] = v
                elif k > item.stop:
                    break
                if item.step:
                    for x in range(item.step):
                        k, v = items.pop(0)
                else:
                    k, v = items.pop(0)
            return d
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            # Autovivify: create, store, and return a new empty node on first access.
            value = self[item] = type(self)()
            return value

    def __add__(self, other):
        """If attempting addition, use our length as the 'value'."""
        return len(self) + other

    def __radd__(self, other):
        """If the other type does not support addition with us, this addition method will be tried."""
        return len(self) + other

    def append(self, item):
        """Add the item to the dict, giving it a higher integer key than any currently in use."""
        largestKey = sorted(self.keys())[-1]
        if isinstance(largestKey, str):
            self.__setitem__(0, item)
        elif isinstance(largestKey, int):
            self.__setitem__(largestKey + 1, item)

    def count(self, item):
        """Count the number of keys whose value equals the specified item."""
        return sum(1 for v in self.values() if v == item)

    def __eq__(self, other):
        """Comparison to another AutoVivification is order-sensitive, while
        comparison to a regular mapping is order-insensitive."""
        if isinstance(other, AutoVivification):
            return len(self) == len(other) and self.items() == other.items()
        return dict.__eq__(self, other)

    def __ne__(self, other):
        return not self == other
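# (Hypothetical sanity check, not part of the original script: with this class,
# nested keys spring into existence on first access, e.g.
#     av = AutoVivification()
#     av['1415PA']['0']['BEC'] = ['262']   # no intermediate dicts needed
# which is exactly the behaviour the loop below leans on.)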
baseDir = '/Users/aleksarias/Desktop/DefectMatchingDatabase/'
for filename in os.listdir(baseDir):
    if filename[0] == '.' or filename == 'YieldToDefectDatabaseJan2014Continued.csv':
        continue
    path = baseDir + filename
    techData = AutoVivification()  # accumulate across every subdirectory of this technology
    for filename2 in os.listdir(path):
        if filename2[0] == '.':
            continue
        path2 = path + '/' + filename2
        for file in os.listdir(path2):
            if file.startswith('SummaryRearr_'):
                dataFile = path2 + '/' + file
                print('Location of file to read:', dataFile, '\n')
                with open(dataFile, 'r') as fh:
                    for line in fh:
                        # Skip the header row; some files say 'Level', others 'level'.
                        if line.lower().startswith('level'):
                            continue
                        elements = line.strip().split(',')
                        # Keep every value in a list so repeated (level, defectClass)
                        # pairs accumulate instead of clobbering one another.
                        if techData[elements[0]][elements[1]]['BEC']:
                            techData[elements[0]][elements[1]]['BEC'].append(elements[2])
                        else:
                            techData[elements[0]][elements[1]]['BEC'] = [elements[2]]
                        if techData[elements[0]][elements[1]]['GEC']:
                            techData[elements[0]][elements[1]]['GEC'].append(elements[3])
                        else:
                            techData[elements[0]][elements[1]]['GEC'] = [elements[3]]
                        print(elements[0], elements[1],
                              techData[elements[0]][elements[1]]['BEC'],
                              techData[elements[0]][elements[1]]['GEC'])
    techSumPath = path + '/Summary_' + filename + '.csv'
    with open(techSumPath, 'w') as fh2:
        for key1 in sorted(techData):
            for key2 in sorted(techData[key1]):
                BECtotal = sum(map(int, techData[key1][key2]['BEC']))
                GECtotal = sum(map(int, techData[key1][key2]['GEC']))
                fh2.write('%s,%s,%s,%s\n' % (key1, key2, BECtotal, GECtotal))
    print('Created file at:', techSumPath)
    input('Go check the file!!!!')
Thanks for taking a look at this!!!!!
Alex
I'm going to suggest a different approach: if you're processing tabular data, you should look at the pandas
library. Your code becomes something like
import pandas as pd

filenames = "fileone.txt", "filetwo.txt"  # or whatever

dfs = []
for filename in filenames:
    df = pd.read_csv(filename, skipinitialspace=True)
    df = df.rename(columns={"level": "Level"})  # one file's header says 'level', the other 'Level'
    dfs.append(df)

df_comb = pd.concat(dfs)
df_totals = df_comb.groupby(["Level", "defectClass"], as_index=False).sum()
df_totals.to_csv("combined.csv", index=False)
which produces
dsm@notebook:~/coding/pand$ cat combined.csv
Level,defectClass,BEC,GEC
1415PA,0,643,1991
1415PA,1,1994,6470
1415PA,2,761,584
1415PA,4,3,18
1415PA,6,98,147
1415PA,7,1,7
1415PA,14,7,5
1415PA,15,1,0
1415PA,21,5,0
1415PA,23,14,12
1415PA,107,9,11
Here I've read every file into memory simultaneously and combined them into one big DataFrame
(like an Excel sheet), but we could just as easily have done the groupby
operation file by file so we'd only need to have one file in memory at a time if we liked.
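A minimal sketch of that incremental variant (same assumed filenames as above) might look like:

running = None
for filename in filenames:
    df = pd.read_csv(filename, skipinitialspace=True)
    df = df.rename(columns={"level": "Level"})
    # Collapse this file on its own before reading the next one...
    partial = df.groupby(["Level", "defectClass"], as_index=False).sum()
    # ...then fold it into the running totals and collapse again.
    if running is None:
        running = partial
    else:
        running = pd.concat([running, partial]).groupby(
            ["Level", "defectClass"], as_index=False).sum()
running.to_csv("combined.csv", index=False)

That way only one file's rows plus the (small) running totals are ever in memory at once.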