I am very new to Bash scripting and still at the learning stage. Currently, I am working on CSV files, and the requirement is that if two rows in two different CSV files match, their data columns should be added together.
For example,
csv_file1.csv:
Account,Entity,Data
A1101,E101,1000
csv_file2.csv:
Account,Entity,Data
A1101,E101,2000
A1101,E102,2000
On executing the Bash script, the expected output should be a new CSV file with these rows:
Account,Entity,Data
A1101,E101,3000
A1101,E102,2000
Since the first row of "csv_file1" and the first row of "csv_file2" match on Account and Entity, the values in their "Data" columns should be added.
Below is the code I have tried, which prints the contents of both CSV files. But I am not able to move forward from here.
#!/bin/bash
#assigns the CSV file to a variable
FILE1="csv_file1.csv"
#reads each line and splits it into three variables
while IFS=, read -r Account Entity Data; do
echo "$Account#$Entity#$Data"
done < <(tail -n +2 "$FILE1")
#assigns the CSV file to a variable
FILE2="csv_file2.csv"
#reads each line and splits it into three variables
while IFS=, read -r Account Entity Data; do
echo "$Account#$Entity#$Data"
done < <(tail -n +2 "$FILE2")
It would be great if you can help me with your ideas. Please let me know if any additional information is needed. Thanks a lot in advance!
Your bash script lacks the comparison between the two inputs and the sum when rows match.
If you absolutely want to do this with just bash, if your CSV files are as simple as what you show (no quoted fields, no newlines in fields...), if you want to also print lines that are found in only one of the two input files, and if you don't care about the final order, then you can try:
#!/bin/bash

# Integer-valued associative array: key "Account,Entity" -> summed Data
declare -iA a=()
declare -i Data
declare Account Entity

FILE1="csv_file1.csv"
FILE2="csv_file2.csv"

# Print the header line once
head -n1 "$FILE1"

# Accumulate the Data column per Account,Entity key over both files
while IFS=, read -r Account Entity Data; do
    a["$Account,$Entity"]+=Data
done < <(tail -n +2 "$FILE1"; tail -n +2 "$FILE2")

# Print the merged rows (iteration order is unspecified)
for k in "${!a[@]}"; do
    printf '%s,%d\n' "$k" "${a[$k]}"
done
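As a quick sanity check, here is a self-contained sketch of the same accumulation logic, using temporary copies of the sample files from the question (the temp-dir setup is only for the demo). Since associative-array iteration order is unspecified, the data rows are piped through sort for a stable order:

```shell
#!/bin/bash
# Demo only: recreate the question's sample files in a temp dir.
tmp=$(mktemp -d)
printf 'Account,Entity,Data\nA1101,E101,1000\n' > "$tmp/csv_file1.csv"
printf 'Account,Entity,Data\nA1101,E101,2000\nA1101,E102,2000\n' > "$tmp/csv_file2.csv"

declare -iA a=()
declare -i Data

# Sum the Data column per "Account,Entity" key over both files
while IFS=, read -r Account Entity Data; do
    a["$Account,$Entity"]+=Data
done < <(tail -n +2 "$tmp/csv_file1.csv"; tail -n +2 "$tmp/csv_file2.csv")

head -n1 "$tmp/csv_file1.csv"
for k in "${!a[@]}"; do
    printf '%s,%d\n' "$k" "${a[$k]}"
done | sort    # sort only the data rows, header already printed
rm -rf "$tmp"
```

With the sample input this prints the header followed by A1101,E101,3000 and A1101,E102,2000.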
For performance reasons awk would be a better choice than bash. With the same assumptions as for the bash version, you can try the following with any awk:
awk '
BEGIN { FS = OFS = "," }
NR == 1 { print; next }
FNR == 1 { next }
{ a[$1 OFS $2] += $3 }
END { for(k in a) print k, a[k] }
' csv_file1.csv csv_file2.csv
We declare the comma as input (FS) and output (OFS) field separator. We print the header line of the first file and skip the header line of the second file.
While parsing the files we store the content in array a, with the first two fields as key and the last field as value (e.g., a["A1101,E101"] = 1000), adding the third field to the value if the key already exists.
At the end (END) we print the content of array a.
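To see it end to end, here is a hedged demo that recreates the question's sample files (temp-dir names are illustrative) and runs the awk program above, keeping the header first and sorting the data rows since for-in order is unspecified:

```shell
#!/bin/bash
tmp=$(mktemp -d)
printf 'Account,Entity,Data\nA1101,E101,1000\n' > "$tmp/csv_file1.csv"
printf 'Account,Entity,Data\nA1101,E101,2000\nA1101,E102,2000\n' > "$tmp/csv_file2.csv"

out=$(awk '
BEGIN { FS = OFS = "," }
NR == 1  { print; next }
FNR == 1 { next }
{ a[$1 OFS $2] += $3 }
END { for (k in a) print k, a[k] }
' "$tmp/csv_file1.csv" "$tmp/csv_file2.csv" \
  | { IFS= read -r header; echo "$header"; sort; })   # keep header on top

echo "$out"
rm -rf "$tmp"
```

With the sample input, out holds the header line followed by A1101,E101,3000 and A1101,E102,2000.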
If your CSV format is more complex, with double-quoted fields possibly containing escaped double quotes ("") or even newline characters, you could try the --csv option of recent versions of GNU awk:
awk --csv '
function quote(s) {
if(s ~ /[",\n]/) {
gsub(/"/, "\"\"", s)
s = "\"" s "\""
}
return s
}
BEGIN { OFS = "," }
NR == 1 { print; next }
FNR == 1 { next }
{ a[$1][$2] += $3 }
END { for(k1 in a) for(k2 in a[k1]) print quote(k1), quote(k2), a[k1][k2] }
' csv_file1.csv csv_file2.csv
Demo:
$ cat csv_file1.csv
Account,Entity,Data
A1101,E101,1000
A1101,"E101,E102",3000
A1101,"hello, ""world""",1000
$ cat csv_file2.csv
Account,Entity,Data
A1101,E101,2000
A1101,E102,2000
A1101,"E101,E102",2000
A1101,"hello, ""world""",4000
$ awk --csv -f foo2.awk csv_file*
Account,Entity,Data
A1101,"hello, ""world""",5000
A1101,E101,3000
A1101,E102,2000
A1101,"E101,E102",5000
But then, a real CSV parser would probably be safer.
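For instance, Python's standard csv module handles quoting and embedded newlines for you. A minimal sketch of the same merge, driven from the shell and assuming the three-column layout from the question (the sample files here are created only for the demo):

```shell
#!/bin/bash
# Demo only: recreate the question's simple sample files.
tmp=$(mktemp -d)
printf 'Account,Entity,Data\nA1101,E101,1000\n' > "$tmp/csv_file1.csv"
printf 'Account,Entity,Data\nA1101,E101,2000\nA1101,E102,2000\n' > "$tmp/csv_file2.csv"

out=$(python3 - "$tmp/csv_file1.csv" "$tmp/csv_file2.csv" <<'EOF'
import csv, sys

totals = {}          # (Account, Entity) -> summed Data
header = None
for path in sys.argv[1:]:
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)                     # header row
        for account, entity, data in reader:
            key = (account, entity)
            totals[key] = totals.get(key, 0) + int(data)

writer = csv.writer(sys.stdout, lineterminator='\n')
writer.writerow(header)
for (account, entity), data in totals.items():
    writer.writerow([account, entity, data])
EOF
)
echo "$out"
rm -rf "$tmp"
```

The csv module takes care of quoting fields like "hello, ""world""" on both read and write, which is exactly where the plain-bash and plain-awk versions fall short.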