arrays, awk, split, gawk

split field into array in awk, then search each term in another file


I'm trying to decompose a field from one file into an array and then check whether each term appears in a second file (which has already been stored in another array). The goal is to merge information from both files.

The first file, file1 (the one with the field I want to split), looks like this:

data1=data2=data3 some more stuff
data4=data1 this are things
data2=data5 more text here
...

While file2 has this structure:

data1 10
data2 20
data3 35
data4 15
data5 60

I want to split the first field of file1 on =, then look up each of the resulting terms in the second file, and print everything in the following format:

output:

data1=data2=data3 some more stuff 10
data1=data2=data3 some more stuff 20
data1=data2=data3 some more stuff 35
data4=data1 this are things 15
data4=data1 this are things 10
data2=data5 more text here 20
data2=data5 more text here 60

So far, I've got this:

awk 'NR==FNR {
l[$1] = $2; next
} {
la=split($1,a,"=")
for(x=1;x<=la;x++)
  print $0,l[a[$x]]
}' file2 file1 > output

First (when NR==FNR), I store file2 data in the array l using the first field as key.

Then I parse the next file in the following manner: for each record, I split field $1 into the array a using = as the separator. The variable la stores the number of elements in a.

For each element in array a (for loop), I look for the corresponding key in array l and print the current record followed by the matching value from l.
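
The two-file pattern described above relies on how NR and FNR behave when awk reads several files in a row. Here is a minimal sketch of that behaviour in isolation; the temporary directory and the shortened file contents are assumptions made only for this illustration.

```shell
# Sketch: how NR and FNR behave when awk reads two files in sequence.
# The temporary directory and the shortened file contents are illustrative.
tmp=$(mktemp -d)
printf 'data1 10\ndata2 20\n' > "$tmp/file2"
printf 'data2=data5 more text here\n' > "$tmp/file1"

nr_fnr=$(awk '{
    # NR counts records across all inputs; FNR restarts at 1 for each file
    printf "NR=%d FNR=%d %s %s\n", NR, FNR, (NR == FNR ? "first-file" : "later-file"), $1
}' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$nr_fnr"
rm -rf "$tmp"
```

NR==FNR holds only while the first file is being read, which is why the first block can end with next. One caveat of this idiom: if the first file is empty, NR==FNR is also true for the second file.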

But, for some reason, I only get the content from file1 (current, unwanted output):

data1=data2=data3 some more stuff 
data1=data2=data3 some more stuff 
data1=data2=data3 some more stuff 
data4=data1 this are things 
data4=data1 this are things 
data2=data5 more text here 
data2=data5 more text here 

Any ideas on what might be wrong with my code?

Thanks a lot!


Solution

  • I found the answer myself. The problem was a stray $ in front of a variable name.

    This is the correct code:

    awk 'NR==FNR {
    l[$1] = $2; next
    } {
    la=split($1,a,"=")
    for(x=1;x<=la;x++)
      print $0,l[a[x]]
    }' file2 file1 > output
    

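    The key is in the print statement. It now reads print $0,l[a[x]] instead of print $0,l[a[$x]]. The loop counter is x, not $x: in awk, $x means field number x of the current record, so a[$x] used the text of that field as the index into a. No such element exists, so the lookup in l was made with an empty key and printed nothing. With a[x], the lookup uses the correct key in array l (from file2).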
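
    The difference between x and $x can be seen on a single record. This is a small sketch using one sample line from file1; the square brackets around the last value are only there to make the empty string visible.

    ```shell
    # Sketch: compare a[x] with a[$x] on one sample record from file1.
    demo=$(printf 'data4=data1 this are things\n' | awk '{
        n = split($1, a, "=")    # a[1] = "data4", a[2] = "data1"
        for (x = 1; x <= n; x++) {
            # $x is field number x of the record, so a[$x] uses that text as the key
            printf "x=%d $x=%s a[x]=%s a[$x]=[%s]\n", x, $x, a[x], a[$x]
        }
    }')
    printf '%s\n' "$demo"
    ```

    For x=1, $x is the whole first field (data4=data1) and for x=2 it is the word "this"; neither is an index of a, so a[$x] is empty both times, while a[x] yields data4 and data1.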

    I'm leaving the post because it looks like this question hasn't been posed before. Please tell me if you think it's not useful.

    Thanks!
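
    To check the fix end to end, here is a minimal sketch that recreates the sample files from the question in a temporary directory and runs the corrected script. The (a[x] in l) guard and the "NA" placeholder are additions for illustration, not part of the original code; they only matter when a term from file1 has no entry in file2.

    ```shell
    # Sketch: rebuild the sample files from the question and run the corrected script.
    tmp=$(mktemp -d)
    printf '%s\n' 'data1 10' 'data2 20' 'data3 35' 'data4 15' 'data5 60' > "$tmp/file2"
    printf '%s\n' 'data1=data2=data3 some more stuff' \
                  'data4=data1 this are things' \
                  'data2=data5 more text here' > "$tmp/file1"

    merged=$(awk 'NR == FNR {
        l[$1] = $2; next            # file2: store value under its key
    } {
        la = split($1, a, "=")      # file1: split the first field on "="
        for (x = 1; x <= la; x++) {
            # (a[x] in l) tests for the key without creating an empty entry;
            # "NA" is an illustrative placeholder for keys missing from file2
            print $0, ((a[x] in l) ? l[a[x]] : "NA")
        }
    }' "$tmp/file2" "$tmp/file1")
    printf '%s\n' "$merged"
    rm -rf "$tmp"
    ```

    With the sample data every term has a match, so the result is the seven lines listed as the desired output in the question.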