Tags: python, perl, unix, solaris-10

Count unique values in each column of pipe delimited text file using Perl


I have two sample pipe-delimited files in my Solaris server home directory, as below:

file1.txt:

ticker|sedol|cusip|exchnage  
ibm   |ibm_1|ib   |london    
hcl   |     |hcl_02|london    
hp    |hp_01|hpm  |newyork
      |lkp   |lk_2 |newyork

file2.txt:

exchnage|ticker|sedol|cusip  
london  |goo   |goo_1|gm  
london  |hcl   |     |hcl_02  
newyork |hp    |hp_01|hpm  
newyork |tyl   |     |ty_2

I need a result file where the unique counts of ticker, sedol and cusip are grouped by exchange:

The expected result file is as below:

exchnage|ticker|sedol|cusip  
london  |3     |2    |3  
newyork |3     |2    |2

I know this would be easy with SQL, but unfortunately a database can't be involved. Each file may grow to 300-400 MB. We have to do it in Perl preferably, or in Python if Perl proves too difficult. The primary environment is Solaris, but we can try it on another Unix server as well. A newly added requirement is that the exchange column ("exchnage" in the headers) can be at any position in either file.


Solution

  • not the most elegant, but the fastest I could craft for your new requirements (a Python 3 sketch of the same idea follows after the output below):

    import glob
    import os
    import sys
    
    path = "/tmp"
    file_mask = "file*.txt"
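    # results maps exchange value -> {column name: set of unique values}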
    results = {}
    
    for file in glob.glob(os.path.join(path, file_mask)):
        column_names = {}
        exchange_col = None
        with open(file, "r") as f:
            for line_num, line in enumerate(f):
                # process header
                if not line_num:
                    line_parsed = line.strip().split("|")
                    for column_num, column in enumerate(line_parsed):
                        if column.strip() == "exchnage":
                            exchange_col = column_num
                        else:
                            column_names[column_num] = column.strip()
                    if exchange_col is None:
                        print "Can't find exchnage field"
                        sys.exit(1)
                    continue
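                # data row: split on the delimiter and skip lines whose field count doesn't match the header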
                line_parsed = line.strip().split("|")
                if len(line_parsed) != len(column_names) + 1:
                    continue
                # prepare empty structure for exchange, if not added yet
                if line_parsed[exchange_col].strip() not in results:
                    results[line_parsed[exchange_col].strip()] = {column_name: set() for column_name in column_names.values()}
                # add unique items to the exchange's per-column sets
                for column_num, column in enumerate(line_parsed):
                    column_val = column.strip()
                    # add only non-empty values, skipping the exchange column itself
                    if column_val and column_num != exchange_col:
                        results[line_parsed[exchange_col].strip()][column_names[column_num]].add(column_val)
    
    column_names = column_names.values()
    print "exchnage|" + "|".join("%8s" %c for c in column_names)
    for exchange, values in results.iteritems():
        print "%8s|" % exchange + "|".join("%8s" % str(len(values[column])) for column in column_names)
    

    program output (your new files with the different column order were used as input):

    $ python parser.py
    exchnage|  ticker|   sedol|   cusip
     newyork|       2|       2|       3
      london|       3|       2|       3
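
  • a minimal alternative sketch of the same idea, assuming Python 3 is available: the standard csv module reads the pipe-delimited rows and collections.defaultdict replaces the manual set-up of the per-exchange sets. The /tmp path, the file mask and the "exchnage" header spelling are carried over from the script above. Like that script, it streams each file line by line and keeps only the per-exchange sets of unique values in memory, so 300-400 MB inputs stay manageable:

    import csv
    import glob
    import os
    from collections import defaultdict
    
    path = "/tmp"            # assumed location, same as the script above
    file_mask = "file*.txt"
    
    # exchange value -> {column name: set of unique non-empty values}
    results = defaultdict(lambda: defaultdict(set))
    report_columns = []      # report column order (first appearance wins)
    
    for file_name in glob.glob(os.path.join(path, file_mask)):
        with open(file_name, newline="") as f:
            # read the header ourselves so the exchange column can sit anywhere
            header = [name.strip() for name in next(csv.reader(f, delimiter="|"))]
            reader = csv.DictReader(f, fieldnames=header, delimiter="|")
            for name in header:
                if name != "exchnage" and name not in report_columns:
                    report_columns.append(name)
            for row in reader:
                exchange = (row.get("exchnage") or "").strip()
                if not exchange:
                    continue        # skip rows without an exchange value
                for name, value in row.items():
                    if name is None or name == "exchnage":
                        continue    # ignore surplus fields and the key column
                    value = (value or "").strip()
                    if value:
                        results[exchange][name].add(value)
    
    print("exchnage|" + "|".join("%8s" % c for c in report_columns))
    for exchange, values in results.items():
        print("%8s|" % exchange + "|".join("%8s" % len(values[c]) for c in report_columns))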