Tags: python, perl, unix, solaris-10

Count unique values in each column of pipe delimited text file using Perl


I have two sample pipe-delimited files in my Solaris server home directory, as below:

file1.txt:

ticker|sedol|cusip|exchnage  
ibm   |ibm_1|ib   |london    
hcl   |     |hcl_02|london    
hp    |hp_01|hpm  |newyork
      |lkp   |lk_2 |newyork

file2.txt:

exchnage|ticker|sedol|cusip  
london  |goo   |goo_1|gm  
london  |hcl   |     |hcl_02  
newyork |hp    |hp_01|hpm  
newyork |tyl   |     |ty_2

I need a result file where the unique counts of ticker, sedol and cusip are grouped by exchange:

The expected result file is as below:

exchnage|ticker|sedol|cusip  
london  |3     |2    |3  
newyork |3     |2    |2

I know this would be easy with SQL, but unfortunately a database can't be involved. Each file may grow to 300-400 MB. We have to do it in Perl preferably, or in Python if Perl proves too difficult. The primary environment is Solaris, but we can try it on another Unix server as well. A newly added requirement is that the exchange column ("exchnage" in the headers) can be at any position in either file.


Solution

  • not the most elegant, but the fastest I could craft for your new requirements (a Python 3 sketch of the same idea follows after the output below):

    import glob
    import os
    import sys
    
    path = "/tmp"
    file_mask = "file*.txt"
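    # results maps exchange value -> {column name: set of unique values}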
    results = {}
    
    for file in glob.glob(os.path.join(path, file_mask)):
        column_names = {}
        exchange_col = None
        with open(file, "r") as f:
            for line_num, line in enumerate(f):
                # process header
                if not line_num:
                    line_parsed = line.strip().split("|")
                    for column_num, column in enumerate(line_parsed):
                        if column.strip() == "exchnage":
                            exchange_col = column_num
                        else:
                            column_names[column_num] = column.strip()
                    if exchange_col is None:
                        print "Can't find exchnage field"
                        sys.exit(1)
                    continue
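                # data row: split on the delimiter and skip lines whose field count doesn't match the header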
                line_parsed = line.strip().split("|")
                if len(line_parsed) != len(column_names) + 1:
                    continue
                # prepare empty structure for exchange, if not added yet
                if line_parsed[exchange_col].strip() not in results:
                    results[line_parsed[exchange_col].strip()] = {column_name: set() for column_name in column_names.values()}
                # add unique items to the exchange's per-column sets
                for column_num, column in enumerate(line_parsed):
                    column_val = column.strip()
                    # add only non-empty values, skipping the exchange column itself
                    if column_val and column_num != exchange_col:
                        results[line_parsed[exchange_col].strip()][column_names[column_num]].add(column_val)
    
    column_names = column_names.values()
    print "exchnage|" + "|".join("%8s" %c for c in column_names)
    for exchange, values in results.iteritems():
        print "%8s|" % exchange + "|".join("%8s" % str(len(values[column])) for column in column_names)
    

    program output (your new files with the different column order were used as input):

    $ python parser.py
    exchnage|  ticker|   sedol|   cusip
     newyork|       2|       2|       3
      london|       3|       2|       3
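
  • a minimal alternative sketch of the same idea, assuming Python 3 is available: the standard csv module reads the pipe-delimited rows and collections.defaultdict replaces the manual set-up of the per-exchange sets. The /tmp path, the file mask and the "exchnage" header spelling are carried over from the script above. Like that script, it streams each file line by line and keeps only the per-exchange sets of unique values in memory, so 300-400 MB inputs stay manageable:

    import csv
    import glob
    import os
    from collections import defaultdict
    
    path = "/tmp"            # assumed location, same as the script above
    file_mask = "file*.txt"
    
    # exchange value -> {column name: set of unique non-empty values}
    results = defaultdict(lambda: defaultdict(set))
    report_columns = []      # report column order (first appearance wins)
    
    for file_name in glob.glob(os.path.join(path, file_mask)):
        with open(file_name, newline="") as f:
            # read the header ourselves so the exchange column can sit anywhere
            header = [name.strip() for name in next(csv.reader(f, delimiter="|"))]
            reader = csv.DictReader(f, fieldnames=header, delimiter="|")
            for name in header:
                if name != "exchnage" and name not in report_columns:
                    report_columns.append(name)
            for row in reader:
                exchange = (row.get("exchnage") or "").strip()
                if not exchange:
                    continue        # skip rows without an exchange value
                for name, value in row.items():
                    if name is None or name == "exchnage":
                        continue    # ignore surplus fields and the key column
                    value = (value or "").strip()
                    if value:
                        results[exchange][name].add(value)
    
    print("exchnage|" + "|".join("%8s" % c for c in report_columns))
    for exchange, values in results.items():
        print("%8s|" % exchange + "|".join("%8s" % len(values[c]) for c in report_columns))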