bash awk grep

Extract first date from filename using multiple patterns


I have an input file containing one string per line:

input.txt (25k lines in reality)

one
two
three

Then I have a directory with many files (50 files in reality):

2022-04-01.csv

stuff;one;more_stuff
stuff;one;more_stuff

2022-04-02.csv

stuff;one;more_stuff
stuff;two;more_stuff

2022-04-03.csv

stuff;two;more_stuff
stuff;three;more_stuff
stuff;three;more_stuff

I need to extract the earliest date on which each pattern appears. So the output in this case would be:

one:2022-04-01.csv
two:2022-04-02.csv
three:2022-04-03.csv

I can use grep -l one *.csv to get a unique list of the files the pattern appears in, but that does not handle multiple patterns and does not give the single earliest date. If I could just get a list of the files each pattern occurs in, then I think I could manually extract the earliest date, but I'm sure there must be a one-liner to do it all?


Solution

  • Using any awk:

    awk '
        BEGIN { FS=";"; OFS=":" }
        NR==FNR {
            vals[$0]
            next
        }
        $2 in vals {
            print $2, FILENAME
            delete vals[$2]
        }
    ' input.txt *.csv
    one:2022-04-01.csv
    two:2022-04-02.csv
    three:2022-04-03.csv
    

    The NR==FNR{...} block stores all of the values from input.txt as indices of the array vals[], which I'm using as a hash table. The other block executes for every line read from the CSVs and tests whether the 2nd field of that line exists as an index in vals[] (i.e. it does a hash lookup). If so, it prints that value and the current file name, then removes that index from vals[] so no later occurrence of the same value can match.

    This only works because your CSV files are named YYYY-MM-DD.csv: your shell expands *.csv in sorted order, which for these names is also date order, so awk reads the files from earliest to latest.
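If the files might reach awk in some other order (for example when they are listed explicitly or come from another command), one option is to stop relying on argument order and keep the smallest file name seen for each value, printing everything at the end. The following is only a sketch: it assumes the names still compare correctly as strings (as YYYY-MM-DD does), it recreates the question's sample data in a temporary directory, and the array name first[] is my own choice rather than part of the answer above.

```shell
#!/bin/sh
# Recreate the sample data from the question in a throwaway directory.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' one two three > input.txt
printf '%s\n' 'stuff;one;more_stuff' 'stuff;one;more_stuff' > 2022-04-01.csv
printf '%s\n' 'stuff;one;more_stuff' 'stuff;two;more_stuff' > 2022-04-02.csv
printf '%s\n' 'stuff;two;more_stuff' 'stuff;three;more_stuff' 'stuff;three;more_stuff' > 2022-04-03.csv

# The CSVs are passed in reversed order on purpose, to show that the
# result no longer depends on the order of the arguments.
result=$(awk '
    BEGIN { FS=";"; OFS=":" }
    NR==FNR { vals[$0]; next }
    # Remember the smallest (string comparison) file name per value.
    ($2 in vals) && ( !($2 in first) || FILENAME < first[$2] ) {
        first[$2] = FILENAME
    }
    END {
        # for-in order is unspecified in awk, hence the sort below.
        for (v in first) print v, first[v]
    }
' input.txt 2022-04-03.csv 2022-04-02.csv 2022-04-01.csv | LC_ALL=C sort)
printf '%s\n' "$result"
```

The trade-off is that every line of every CSV has to be read, and the output comes out sorted by value rather than in order of first appearance.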

    If and only if it's guaranteed that every value from input.txt will always appear in at least one of the CSVs then this would probably make the execution a bit faster most of the time, as suggested by @RenaudPacalet:

    awk '
        BEGIN { FS=";"; OFS=":" }
        NR==FNR {
            vals[$0]
            numVals++
            next
        }
        $2 in vals {
            print $2, FILENAME
            delete vals[$2]
            if ( --numVals == 0 ) {
                exit
            }
        }
    ' input.txt *.csv
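Whether that guarantee actually holds can be checked by printing whatever is still left in vals[] once awk has finished. The sketch below uses the question's sample data plus a hypothetical extra value, four, that appears in no CSV. Sending the leftovers to stderr through "cat 1>&2" and writing the matches to found.txt are my own choices for the demonstration, not part of the original answer.

```shell
#!/bin/sh
# Sample data from the question, plus a made-up value "four" that never matches.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' one two three four > input.txt
printf '%s\n' 'stuff;one;more_stuff' 'stuff;one;more_stuff' > 2022-04-01.csv
printf '%s\n' 'stuff;one;more_stuff' 'stuff;two;more_stuff' > 2022-04-02.csv
printf '%s\n' 'stuff;two;more_stuff' 'stuff;three;more_stuff' 'stuff;three;more_stuff' > 2022-04-03.csv

# stdout (the matches) goes to found.txt; stderr (the leftovers) is captured.
missing=$(awk '
    BEGIN { FS=";"; OFS=":" }
    NR==FNR { vals[$0]; numVals++; next }
    $2 in vals {
        print $2, FILENAME
        delete vals[$2]
        if ( --numVals == 0 ) exit    # END still runs after exit
    }
    END {
        # Anything still in vals[] was never seen in any CSV.
        for (v in vals) print v | "cat 1>&2"
    }
' input.txt *.csv 2>&1 >found.txt)
found=$(cat found.txt)
printf 'found:\n%s\nmissing:\n%s\n' "$found" "$missing"
```

When a value is missing, numVals never reaches zero, so the early exit simply does not fire and all files are read to the end.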