Search code examples
pythonbashcommand-linescriptingtext-processing

How to enhance a script for look ups from multiple CSV files


I need to enhance my below script, which takes an input file that contains almost a million unique lines. Against each line, it has different values in 3 lookup files which I intend to add in my output as comma separated values.

The below script works fine, but it takes hours to finish the job. I am looking for a real fast solution which would also be less heavy on the system.

#!/bin/bash
while read -r ONT
do
{

ONSTATUS=$(grep "$ONT," lookupfile1.csv | cut -d" " -f2)
CID=$(grep "$ONT." lookupfile3.csv | head -1 | cut -d, -f2)
line1=$(grep "$ONT.C2.P1," lookupfile2.csv | head -1 | cut -d"," -f2,7 | sed 's/ //')
line2=$(grep "$ONT.C2.P2," lookupfile2.csv | head -1 | cut -d"," -f2,7 | sed 's/ //')
echo "$ONT,$ONSTATUS,$CID,$line1,$line2" >> BUwithPO.csv
} & 
done < inputfile.csv

inputfile.csv contains the lines shown below:

343OL5:LT1.PN1.ONT1
343OL5:LT1.PN1.ONT10
225OL0:LT1.PN1.ONT34
225OL0:LT1.PN1.ONT39
343OL5:LT1.PN1.ONT100
225OL0:LT1.PN1.ONT57

lookupfile1.csv contains:

343OL5:LT1.PN1.ONT100, Down,Locked,No
225OL0:LT1.PN1.ONT57, Up,Unlocked,Yes
343OL5:LT1.PN1.ONT1, Down,Unlocked,No
225OL0:LT1.PN1.ONT34, Up,Unlocked,Yes
225OL0:LT1.PN1.ONT39, Up,Unlocked,Yes

lookupfile2.csv contains:

225OL0:LT1.PN1.ONT34.C2.P1, +123125302766,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
225OL0:LT1.PN1.ONT57.C2.P1, +123125334019,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
225OL0:LT1.PN1.ONT57.C2.P2, +123125334819,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
343OL5:LT1.PN1.ONT100.C2.P11, +123128994019,REG,DigitMap,Unlocked,_media_ANT,FD_BSFU.xml,

lookupfile3.csv contains:

343OL5:LT1.PON1.ONT100.SERV1,12-654-0330
343OL5:LT1.PON1.ONT100.C1.P1,12-654-0330
343OL5:LT7.PON8.ONT75.SERV1,12-664-1186
225OL0:LT1.PN1.ONT34.C1.P1.FLOW1,12-530-2766
225OL0:LT1.PN1.ONT57.C1.P1.FLOW1,12-533-4019

the output is:

225OL0:LT1.PN1.ONT57, Up,Unlocked,Yes,12-533-4019,+123125334019,FD_BSFU.xml,+123125334819,FD_BSFU.xml
225OL0:LT1.PN1.ONT34, Up,Unlocked,Yes,12-530-2766,+123125302766,FD_BSFU.xml,
343OL5:LT1.PN1.ONT1, Down,Unlocked,No,,,
343OL5:LT1.PN1.ONT100, Down,Locked,No,,,
343OL5:LT1.PN1.ONT10,,,,
225OL0:LT1.PN1.ONT39, Up,Unlocked,Yes,,,

Solution

  • As you'll see, the bottleneck will be executing grep within the loop multiple times. You can increase the efficiency by creating a look-up table with associative arrays.
    If awk is available, please try the following:

    [Update]

    #!/bin/bash
    
    awk '
    FILENAME=="lookupfile1.csv" {
        sub(",$", "", $1);
        onstatus[$1] = $2
    }
    FILENAME=="lookupfile2.csv" {
        split($2, a, ",")
        if (sub("\\.C2\\.P1,$", "", $1)) line1[$1] = a[1]","a[6]
        else if (sub("\\.C2\\.P2,$", "", $1)) line2[$1] = a[1]","a[6]
    }
    FILENAME=="lookupfile3.csv" {
        split($0, a, ",")
        if (match(a[1], ".+\\.ONT[0-9]+")) {
            ont = substr(a[1], RSTART, RLENGTH)
            cid[ont] = a[2]
        }
    }
    FILENAME=="inputfile.csv" {
        print $0","onstatus[$0]","cid[$0]","line1[$0]","line2[$0]
    }
    ' lookupfile1.csv lookupfile2.csv lookupfile3.csv inputfile.csv > BUwithPO.csv
    

    {EDIT]

    If you need to specify absolute paths to the files, please try:

    #!/bin/bash
    
    awk '
    FILENAME ~ /lookupfile1.csv$/ {
        sub(",$", "", $1);
        onstatus[$1] = $2
    }
    FILENAME ~ /lookupfile2.csv$/ {
        split($2, a, ",")
        if (sub("\\.C2\\.P1,$", "", $1)) line1[$1] = a[1]","a[6]
        else if (sub("\\.C2\\.P2,$", "", $1)) line2[$1] = a[1]","a[6]
    }
    FILENAME ~ /lookupfile3.csv$/ {
        split($0, a, ",")
        if (match(a[1], ".+\\.ONT[0-9]+")) {
            ont = substr(a[1], RSTART, RLENGTH)
            cid[ont] = a[2]
        }
    }
    FILENAME ~ /inputfile.csv$/ {
        print $0","onstatus[$0]","cid[$0]","line1[$0]","line2[$0]
    }
    ' /path/to/lookupfile1.csv /path/to/lookupfile2.csv /path/to/lookupfile3.csv /path/to/inputfile.csv > /path/to/BUwithPO.csv
    

    Hope this helps.