My task is the following - I have two CSV files:
File 1 (~9.000.000 records):
type(text),number,status(text),serial(number),data1,data2,data3
File 2 (~6000 records): serial_range_start(number),serial_range_end(number),info1(text),info2(text)
The goal is to add to each entry in File 1 the corresponding info1 and info2 from File 2: type(text),number,status(text),serial(number),data1,data2,data3,info1(text),info2(text)
I use the following script:
#!/bin/bash
USER="file1.csv"
RANGE="file2.csv"
for SN in `cat $USER | awk -F , '{print $4}'`
do
#echo \n "$SN"
for LINE in `cat $RANGE`
do
i=`grep $LINE $RANGE| awk -F, '{print $1}'`
#echo \n "i= " "$i"
j=`grep $LINE $RANGE| awk -F, '{print $2}'`
#echo \n "j= " "$j"
k=`echo $SN`
#echo \n "k= " "$k"
if [ $k -ge $i -a $k -le $j ]
then
echo `grep $SN $USER`,`grep $i $RANGE| cut -d',' -f3-4` >> result.csv
break
#else
#echo `grep $SN $USER`,`echo 'N/A','N/A'` >> result.csv
fi
done
done
The script works rather good on small files but I'm sure there is a way to optimize it because I am running it on an i5 laptop with 4GB of RAM. I am a newbie in shell scripting and I came up with this script after hours and hours of research, trial and error but now I am out of ideas.
note: not all the info in file 1 can be found in file.
Thank you!
Adrian.
FILE EXAMPLES and additional info:
File 1 example:
prep,28620026059,Active,123452010988759,No,No,No
post,28619823474,Active,123453458466109,Yes,No,No
post,28619823474,Inactive,123453395270941,Yes,Yes,Yes
File 2 example:
123452010988750,123452010988759,promo32,1.11
123453458466100,123453458466199,promo64,2.22
123450000000000,123450000000010,standard128,3.333
Result example (currently):
prep,28620026059,Active,123452010988759,No,No,No,promo32,1.11
post,28619823474,Active,123453458466109,Yes,No,No,promo64,2.22
Result example (nice to have):
prep,28620026059,Active,123452010988759,No,No,No,promo32,1.11
post,28619823474,Active,123453458466109,Yes,No,No,promo64,2.22
post,28619823474,Inactive,123453395270941,Yes,Yes,Yes,NA,NA
File 1 is sorted after the 4th column File 2 is sorted after the first column.
File 2 does not have ranges that overlap
Not all the info in file 1 can be found in a range in file 2
Thanks again!
LE: The script provided by Jonathan seems to have an issue on some records, as follows:
file 2:
123456780737000,123456780737012,ONE 32,1.11
123456780016000,123456780025999,ONE 64,2.22
file 1:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes
The output is the following:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes,ONE 32,1.11
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes,ONE 32,1.11
and it should be:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes,ONE 32,1.11
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes,ONE 64,2.22
It seems that it returns 0 and writes the info on first record from file2...
I think this will work reasonably well:
awk -F, 'BEGIN { n = 0; OFS = ","; }
NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++ }
NR!=FNR { for (i = 0; i < n; i++)
{
if ($4 >= lo[i] && $4 <= hi[i])
{
print $1, $2, $3, $4, $5, $6, $7, i1[i], i2[i];
break;
}
}
}' file2 file1
Given file2
containing:
1,10,xyz,pqr
11,20,abc,def
21,30,ambidextrous,warthog
and file1
containing:
A,123,X2,1,data01_1,data01_2,data01_3
A,123,X2,2,data02_1,data02_2,data02_3
A,123,X2,3,data03_1,data03_2,data03_3
A,123,X2,4,data04_1,data04_2,data04_3
A,123,X2,5,data05_1,data05_2,data05_3
A,123,X2,6,data06_1,data06_2,data06_3
A,123,X2,7,data07_1,data07_2,data07_3
A,123,X2,8,data08_1,data08_2,data08_3
A,123,X2,9,data09_1,data09_2,data09_3
A,123,X2,10,data10_1,data10_2,data10_3
A,123,X2,11,data11_1,data11_2,data11_3
A,123,X2,12,data12_1,data12_2,data12_3
A,123,X2,13,data13_1,data13_2,data13_3
A,123,X2,14,data14_1,data14_2,data14_3
A,123,X2,15,data15_1,data15_2,data15_3
A,123,X2,16,data16_1,data16_2,data16_3
A,123,X2,17,data17_1,data17_2,data17_3
A,123,X2,18,data18_1,data18_2,data18_3
A,123,X2,19,data19_1,data19_2,data19_3
A,223,X2,20,data20_1,data20_2,data20_3
A,223,X2,21,data21_1,data21_2,data21_3
A,223,X2,22,data22_1,data22_2,data22_3
A,223,X2,23,data23_1,data23_2,data23_3
A,223,X2,24,data24_1,data24_2,data24_3
A,223,X2,25,data25_1,data25_2,data25_3
A,223,X2,26,data26_1,data26_2,data26_3
A,223,X2,27,data27_1,data27_2,data27_3
A,223,X2,28,data28_1,data28_2,data28_3
A,223,X2,29,data29_1,data29_2,data29_3
the output of the command is:
A,123,X2,1,data01_1,data01_2,data01_3,xyz,pqr
A,123,X2,2,data02_1,data02_2,data02_3,xyz,pqr
A,123,X2,3,data03_1,data03_2,data03_3,xyz,pqr
A,123,X2,4,data04_1,data04_2,data04_3,xyz,pqr
A,123,X2,5,data05_1,data05_2,data05_3,xyz,pqr
A,123,X2,6,data06_1,data06_2,data06_3,xyz,pqr
A,123,X2,7,data07_1,data07_2,data07_3,xyz,pqr
A,123,X2,8,data08_1,data08_2,data08_3,xyz,pqr
A,123,X2,9,data09_1,data09_2,data09_3,xyz,pqr
A,123,X2,10,data10_1,data10_2,data10_3,xyz,pqr
A,123,X2,11,data11_1,data11_2,data11_3,abc,def
A,123,X2,12,data12_1,data12_2,data12_3,abc,def
A,123,X2,13,data13_1,data13_2,data13_3,abc,def
A,123,X2,14,data14_1,data14_2,data14_3,abc,def
A,123,X2,15,data15_1,data15_2,data15_3,abc,def
A,123,X2,16,data16_1,data16_2,data16_3,abc,def
A,123,X2,17,data17_1,data17_2,data17_3,abc,def
A,123,X2,18,data18_1,data18_2,data18_3,abc,def
A,123,X2,19,data19_1,data19_2,data19_3,abc,def
A,223,X2,20,data20_1,data20_2,data20_3,abc,def
A,223,X2,21,data21_1,data21_2,data21_3,ambidextrous,warthog
A,223,X2,22,data22_1,data22_2,data22_3,ambidextrous,warthog
A,223,X2,23,data23_1,data23_2,data23_3,ambidextrous,warthog
A,223,X2,24,data24_1,data24_2,data24_3,ambidextrous,warthog
A,223,X2,25,data25_1,data25_2,data25_3,ambidextrous,warthog
A,223,X2,26,data26_1,data26_2,data26_3,ambidextrous,warthog
A,223,X2,27,data27_1,data27_2,data27_3,ambidextrous,warthog
A,223,X2,28,data28_1,data28_2,data28_3,ambidextrous,warthog
A,223,X2,29,data29_1,data29_2,data29_3,ambidextrous,warthog
This uses a linear search on the list of ranges; you can write functions in awk
and a binary search looking for the correct range would perform better on 6,000 entries. That part, though, is an optimization — exercise for the reader. Remember that the first rule of optimization is: don't. The second rule of optimization (for experts only) is: don't do it yet. Demonstrate that it is a problem. This code shouldn't take all that much longer than the time it takes to copy the 9,000,000 record file (somewhat longer, but not disastrously so). Note, though, that if the file1
data is sorted, the tail of the processing will take longer than the start because of the linear search. If the serial numbers are in a random order, then it will all take about the same time on average.
If your CSV data has commas embedded in the text fields, then awk
is no longer suitable; you need a tool with explicit support for CSV format — Perl and Python both have suitable modules.
awk -F, 'BEGIN { n = 0; OFS = ","; }
NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++ }
NR!=FNR { i = search($4)
print $1, $2, $3, $4, $5, $6, $7, i1[i], i2[i];
}
function search(i, l, h, m)
{
l = 0; h = n - 1;
while (l <= h)
{
m = int((l + h)/2);
if (i >= lo[m] && i <= hi[m])
return m;
else if (i < lo[m])
h = m - 1;
else
l = m + 1;
}
return 0; # Should not get here
}' file2 file1
Not all that hard to write the binary search. This gives the same result as the original script on the sample data. It has not been exhaustively tested, but appears to work.
Note that the code does not really handle missing ranges in file2
; it assumes that the ranges are contiguous but non-overlapping and in sorted order and cover all the values that can appear in the serial column of file1
. If those assumptions are not valid, you get erratic behaviour until you fix either the code or the data.