I have a bunch of files in a directory (698, to be exact). Each file name contains a date and a unique ID, as well as a name, like so:

20151231_7801_Test_Maps.txt
20151231_7801_Test_Items.txt
20151231_7802_Test_Maps.txt
20151231_7802_Test_Items.txt
I'm looking to group them by date and identifier, open each pair of files (Maps and Items), and do a diff analysis on some IDs in the files. How would I do this?
So far I have this as my code (imports included), but I don't know how to iterate through and open each file per group:

import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import csv
import os
import re
from collections import defaultdict

groups = defaultdict(list)
for filename in os.listdir(r'F:\Desktop'):
    date = filename[:8]
    identifier = filename[10:14]
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)
My output prints some of the groups correctly but not all, for example:
('20151231', '7801'): ['20151231_7801_Test_Maps.txt', '20151231_7801_Test_Items.txt']
Some groups only print one file even though there are two files for that date and identifier.
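A less brittle way to pull out the date and identifier than fixed slices is to match them with a regular expression (a minimal sketch, assuming every file follows the 8-digit-date / 4-digit-id pattern shown above; re is already imported):

pattern = re.compile(r'^(\d{8})_(\d{4})_')  # date, then identifier, e.g. 20151231_7801_...

groups = defaultdict(list)
for filename in os.listdir(r'F:\Desktop'):
    match = pattern.match(filename)
    if match:  # skip anything that doesn't fit the naming scheme
        date, identifier = match.groups()
        groups[date, identifier].append(filename)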
This is not my main concern, but once they are broken up into groups I'd like to assign each file in the group to a dataframe, like so:
for key in groups:
    maps = pd.read_csv(file1, sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(file2, sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    # checks IDs between the two files and looks for differences
    set(maps.ID).difference(items.ID)
Can someone please help with grouping the files by date and ID, and with iterating over the groups and opening the files in each one? Thanks!
Taking from Shijo's answer, I've found a pretty good way to do this.
groups = defaultdict(list)
output = []

for filename in os.listdir(pathloc):
    date = filename[:8]
    identifier = filename[14:18]
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)

for key, fnames in groups.items():
    filedicts = {}
    print(list(fnames))
    # pick out each group's Maps and Items files by name rather than relying on listing order
    maps_name = next(f for f in fnames if 'Maps' in f)
    items_name = next(f for f in fnames if 'Items' in f)
    maps = pd.read_csv(os.path.join(pathloc, maps_name), sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(os.path.join(pathloc, items_name), sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    # IDs that appear in one file but not the other
    diffs = set(maps.ID).symmetric_difference(items.ID)
    filedicts['FileIDKey'] = list(key)
    filedicts['Missing_IDs'] = list(diffs)
    filedicts['FileNames'] = fnames
    output.append(filedicts)
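One detail worth noting: the snippet in the question used difference (IDs present in Maps but not in Items), while this version uses symmetric_difference (IDs present in exactly one of the two files). A quick illustration of the distinction:

maps_ids = {1, 2, 3}
items_ids = {2, 3, 4}

print(maps_ids.difference(items_ids))            # {1}: only in Maps
print(maps_ids.symmetric_difference(items_ids))  # {1, 4}: in one set but not the other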
This then lets me load the master list of dictionaries into a dataframe:
new = pd.DataFrame(output)
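The resulting frame has one row per (date, identifier) group, with the FileIDKey, FileNames and Missing_IDs columns. From there you could, for example, keep only the groups whose files actually disagree and write them out for review (the filtering step and the output file name here are just illustrative):

mismatches = new[new['Missing_IDs'].apply(len) > 0]   # groups with at least one differing ID
mismatches.to_csv('id_diff_report.csv', index=False)  # placeholder output file name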