I have two sample pipe delimited file in my Solaris server home directory as below:
file1.txt:
ticker|sedol|cusip|exchnage
ibm |ibm_1|ib |london
hcl | |hcl_02|london
hp |hp_01|hpm |newyork
|lkp |lk_2 |newyork
file2.txt:
exchnage|ticker|sedol|cusip
london |goo |goo_1|gm
london |hcl | |hcl_02
newyork |hp |hp_01|hpm
newyork |tyl | |ty_2
I need a result file where we will group the unique count of ticker, sedol, cusip by exchange:
expected result file as below:
exchnage|ticker|sedol|cusip
london |3 |2 |3
newyork |3 |2 |2
I know by using SQL it is easy but unfortunately database can't be involved. Each file size might go up to 300-400 MB. We have do it using Perl preferably or if at all difficult then Python. Primary environment is Solaris, but we can try it in Unix server as well. Now added need is that "exchange" Column position can by anywhere in both the file.
not the most elegant, but the fastest I crafted for your new requirements:
import glob
import os
import sys
path = "/tmp"
file_mask = "file*.txt"
results = {}
for file in glob.glob(os.path.join(path, file_mask)):
column_names = {}
exchange_col = None
with open(file, "r") as f:
for line_num, line in enumerate(f.xreadlines()):
# process header
if not line_num:
line_parsed = line.strip().split("|")
for column_num, column in enumerate(line_parsed):
if column.strip() == "exchnage":
exchange_col = column_num
else:
column_names[column_num] = column.strip()
if exchange_col is None:
print "Can't find exchnage field"
sys.exit(1)
continue
line_parsed = line.strip().split("|")
if len(line_parsed) != len(column_names) + 1:
continue
# prepare empty structure for excahnge, if not added yet
if not line_parsed[exchange_col].strip() in results:
results[line_parsed[exchange_col].strip()] = {column_name:set() for column_name in column_names.values()}
# add uniq items to exchange
for column_num, column in enumerate(line_parsed):
column_val = column.strip()
# add only non empty values
if column_val and column_num != exchange_col:
results[line_parsed[exchange_col].strip()][column_names[column_num]].add(column_val)
column_names = column_names.values()
print "exchnage|" + "|".join("%8s" %c for c in column_names)
for exchange, values in results.iteritems():
print "%8s|" % exchange + "|".join("%8s" % str(len(values[column])) for column in column_names)
program output (as input your new files with different columns order were used):
$ python parser.py
exchnage| ticker| sedol| cusip
newyork| 2| 2| 3
london| 3| 2| 3