
Modifying two files and comparing their similarity


I have 2 files. Sample values of file1 are as follows:

1313 0 60
1313 1 60
1314 0 60
1314 1 57
1315 1 60
1316 0 60
1316 1 57
1317 1 57
1318 1 57
1333 0 57
1333 1 57
1334 0 60
1334 1 60

Sample values of file2 are as follows:

813 0 91
813 1 91
814 0 91
814 1 91
815 0 96
815 1 91
816 0 91
816 1 91
817 1 96
818 0 91
832 0 96
833 0 91
833 1 91
834 0 96

I am trying to modify file1 and create a file3 with the following values (as you can see, the values in the last column of file1 are irrelevant):

1 0 
1 1 
2 0 
2 1 
3 1 
4 0 
4 1 
5 1 
6 1 
21 0 
21 1 
22 0 
22 1 

file2 also needs to be modified to create a file4 with the following values (the values in the last column of file2 are irrelevant):

1 0
1 1
2 0
2 1
3 0 
3 1
4 0
4 1
5 1
6 0
20 0
21 0
21 1
22 0

After creating file3 and file4, I intend to check their similarity using the diff utility. I am trying to write an awk script to generate file3 and file4, but as a beginner with awk I find the task very time-consuming. Any guidance would be greatly appreciated.


Solution

  • We can capture the value of $1 from the first row and use it as the base of the offset: each output value is $1 minus that base, plus 1. This assumes the smallest $1 is in the first row.

    awk 'NR==1 { i=$1 } { print $1-i+1,$2 }'
    

    So for example, you can do:

    awk 'NR==1 { i=$1 } { print $1-i+1,$2 }' file1 > file3
    awk 'NR==1 { i=$1 } { print $1-i+1,$2 }' file2 > file4
    diff file3 file4
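
    If the smallest $1 is not guaranteed to be in the first row, one possible variation is to pass the file to awk twice: the first pass finds the minimum, the second prints the offsets. This is only a sketch; the function name renumber and the file sample.txt are made up for illustration.

    ```shell
    # Hypothetical helper: renumber FILE so the smallest first-column value maps to 1.
    # The same file is given to awk twice; NR==FNR is true only during the first pass.
    renumber() {
        awk 'NR==FNR { if (FNR==1 || $1+0 < min) min = $1+0; next }
             { print $1-min+1, $2 }' "$1" "$1"
    }

    # Demo on unsorted input (made-up rows in the question's format).
    printf '1315 1 60\n1313 0 60\n1314 1 57\n' > sample.txt
    renumber sample.txt
    # prints:
    # 3 1
    # 1 0
    # 2 1
    ```

    The row order is left unchanged; only the base of the offset differs from the one-liner above.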
    


    This was my previous version, written before I noticed you were really looking for an offset. I had assumed you just wanted the number to advance whenever $1 changes. We can keep the previous row's $1 in a variable and increment a counter only when $1 changes. This assumes that rows with the same $1 are grouped together.

    awk 'n!=$1 { i++ } { print i,$2 } { n=$1 }'
    

    So for example, you can do:

    awk 'n!=$1 { i++ } { print i,$2 } { n=$1 }' file1 > file3
    awk 'n!=$1 { i++ } { print i,$2 } { n=$1 }' file2 > file4
    diff file3 file4
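
To see how the two versions differ, here is a small check on four rows taken from file1 in the question, including the jump from 1318 to 1333. The file name gap.txt is made up for the demo.

```shell
# Four rows from file1, including the jump from 1318 to 1333.
printf '1317 1 57\n1318 1 57\n1333 0 57\n1333 1 57\n' > gap.txt

# Offset version: the gap is preserved (1333 - 1317 + 1 = 17).
awk 'NR==1 { i=$1 } { print $1-i+1,$2 }' gap.txt
# 1 1
# 2 1
# 17 0
# 17 1

# Counter version: groups are numbered consecutively, so the gap collapses.
awk 'n!=$1 { i++ } { print i,$2 } { n=$1 }' gap.txt
# 1 1
# 2 1
# 3 0
# 3 1
```

Since the expected file3 in the question jumps from 6 to 21, the offset version is the one that matches it.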