Search code examples
bashcsvgit-bash

Adding data if elements from 2 different rows in 2 different CSV files matches using Bash scripting


I am very new to Bash scripting and still at the learning stage. Currently, I am working on CSV files and there has been a requirement that if 2 rows are matched in 2 different CSV files, then the data column should be added.

For example,

csv_file1.csv:

Account,Entity,Data
A1101,E101,1000

csv_file2.csv:

Account,Entity,Data
A1101,E101,2000
A1101,E102,2000

On executing bash script, the expected output should be a new CSV file with rows:

Account,Entity,Data
A1101,E101,3000
A1101,E102,2000

Since, elements in 1st row of "csv_file1" and elements in 1st row of "csv_file2" are matching, data in the "Data" column should append.

The below is the code which I have tried to print the contents of both CSV files. But, I am not able to move forward from here.

#!/bin/bash
#assigns the CSV file to a variable
FILE1="csv_file1.csv"

#reads each line and splits it into three variables

while IFS=, read -r Account Entity Data; do

echo "$Account#""$Entity#""$Data"

done < <(tail -n +2 $FILE1)

#assigns the CSV file to a variable
FILE2="csv_file2.csv"

#reads each line and splits it into three variables

while IFS=, read -r Account Entity Data; do

echo "$Account#""$Entity#""$Data"

done < <(tail -n +2 $FILE2)

It would be great if you guys can help me with your ideas. please let me know if any additional information is needed. Thanks a lot in Advance!


Solution

  • Your bash script lacks the comparison between the two inputs and the sum when rows match.

    If you absolutely want to do this with just bash, if your CSV files are as simple as what you show (no quoted fields, no newlines in fields...), if you want to also print lines that are found in only one of the two input files, and if you don't care about the final order, then you can try:

    #!/bin/bash
    
    declare -iA a=()
    declare -i Data
    declare Account Entity
    
    FILE1="csv_file1.csv"
    FILE2="csv_file2.csv"
    
    head -n1 "$FILE1"
    
    while IFS=, read -r Account Entity Data; do
      a["$Account,$Entity"]+=Data
    done < <(tail -n +2 "$FILE1"; tail -n +2 "$FILE2")
    
    for k in "${!a[@]}"; do
      printf '%s,%d\n' "$k" ${a[$k]}
    done
    

    For performance reasons awk would be a better choice than bash. With the same assumtions as for the bash version, you can try the following with any awk:

    awk '
    BEGIN { FS = OFS = "," }
    NR == 1 { print; next }
    FNR == 1 { next }
    { a[$1 OFS $2] += $3 }
    END { for(k in a) print k, a[k] }
    ' csv_file1.csv csv_file2.csv
    

    We declare the comma as input (FS) and output (OFS) field separators. We print the header line of the first file and skip the header line of the second file.

    While parsing the files we store the content in array a with 2 first fields as key and last field as value (e.g., a[A1101,E101] = 1000), adding the third field to the value if the key exists already.

    At the end (END) we print the content of array a.

    If your CSV format is more complex, with double-quoted fields, possibly containing escaped double-quotes ("") or even newline characters, you could try the --csv option of recent versions of GNU awk:

    awk --csv '
    function quote(s) {
      if(s ~ /[",\n]/) {
        gsub(/"/, "\"\"", s)
        s = "\"" s "\""
      }
      return s
    }
    
    BEGIN { OFS = "," }
    NR == 1 { print; next }
    FNR == 1 { next }
    { a[$1][$2] += $3 }
    END { for(k1 in a) for(k2 in a[k1]) print quote(k1), quote(k2), a[k1][k2] }
    ' csv_file1.csv csv_file2.csv
    

    Demo:

    $ cat csv_file1.csv 
    Account,Entity,Data
    A1101,E101,1000
    A1101,"E101,E102",3000
    A1101,"hello, ""world""",1000
    
    $ cat csv_file2.csv 
    Account,Entity,Data
    A1101,E101,2000
    A1101,E102,2000
    A1101,"E101,E102",2000
    A1101,"hello, ""world""",4000
    
    $ awk --csv -f foo2.awk csv_file*
    Account,Entity,Data
    A1101,"hello, ""world""",5000
    A1101,E101,3000
    A1101,E102,2000
    A1101,"E101,E102",5000
    

    But then, a real CSV parser would probably be safer.