
How to print lines in one file that do not match lines in another *after transformation*


Please note: I understand how to output lines in one file that are not in another (here); my question is a little different.

In one file I have lines like:

Андреев
Барбашев
Иванов
...

In a different file there are lines:

Барбашёв
Семёнов
...

Now. I need the lines from the second file, but only if you cannot find a line in the first where you substitute ё for е. For example, Барбашёв should not be displayed, because Барбашев is in the first file.

If I do something like

comm -13 first.txt <(cat second.txt | sed 's/ё/е/g')

I get the correct lines; however, they have already been transformed by that point, which is unacceptable for what I'm trying to do.

In other words, the output is:

Барбашев
...

while it should be:

Барбашёв
...

Solution

  • You meant:

    "Now. I need the lines from the second file, but only if you cannot find a line in the first when you substitute ё for е in the second file."

    instead of

    "Now. I need the lines from the second file, but only if you cannot find a line in the first where you substitute ё for е."

    Right?

    Without using a Cyrillic character set, this solution works (here "t" stands in for "ё"):

    File test.awk:

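    #!/usr/bin/gawk -f

    # First file (NR == FNR): remember every line as an array key.
    NR == FNR { arr[$0]++; next }

    # Second file: transform a copy and test it, but print the original line.
    {
        tmp = $0
        gsub(/t/, "e", tmp)     # "t" stands in for "ё" here
        if (!(tmp in arr))
            print $0
    }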
    

    Use:

    $ ./test.awk file1 file2
    

    If you replace "t" with "ё" (and the Latin "e" with the Cyrillic "е") in the gsub call, this should also work, in my opinion. Give it a try.
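
As a minimal sketch of that last suggestion, here is the same approach with the actual Cyrillic letters, wrapped in a shell function. The function name `filter_unmatched` and the temporary file names are hypothetical, chosen only for this illustration; the sample data is taken from the question. It assumes the script is saved as UTF-8 and that the first file is not empty (the `NR == FNR` test cannot tell the two files apart otherwise).

```shell
#!/bin/sh
# Hypothetical helper: print the lines of file "$2" whose ё->е form
# does not occur in file "$1". The original line is printed unchanged.
filter_unmatched() {
    awk '
        NR == FNR { seen[$0] = 1; next }    # first file: remember every line
        {
            key = $0
            gsub(/ё/, "е", key)             # normalise a copy, not the line itself
            if (!(key in seen)) print $0    # no match in first file: print original
        }
    ' "$1" "$2"
}

# Sample data from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'Андреев' 'Барбашев' 'Иванов' > "$tmp/first.txt"
printf '%s\n' 'Барбашёв' 'Семёнов' > "$tmp/second.txt"

filter_unmatched "$tmp/first.txt" "$tmp/second.txt"   # prints: Семёнов
rm -r "$tmp"
```

Unlike the `comm` pipeline, this does not need sorted input, because the lookup goes through an array rather than a merge of two ordered streams.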