bash awk command-substitution

Command substitution for (GNU coreutils) date in gawk gensub


I have a data file whose lines contain a large number (~5K) of dates in yy-mm-dd format.

A typical line of the file looks like this:

bla bla 21-04-26 blabla blabla 18-01-28 bla bla bla bla 19-01-12 blabla

I need to perform this kind of conversion on every single date:

$ date --date="18-01-28" "+%A, %d %B %Y"
Sunday, 28 January 2018

I already solved this problem using sed (see the post scriptum for details).

I would like to use gawk instead, and I came up with this command:

$ gawk '{b = gensub(/([0-9]{2}-[0-9]{2}-[0-9]{2})/,"$(date --date=\"\\1\" \"+%A, %d %B %Y\")", "g")}; {print b}' 

The problem is that bash does not expand the date command inside gensub. This is what I get:

$ echo "bla bla 21-04-26 blabla blabla 18-01-28 bla bla bla bla 19-01-12 blabla" | gawk '{b = gensub(/([0-9]{2}-[0-9]{2}-[0-9]{2})/,"$(date --date=\"\\1\" \"+%A, %d %B %Y\")", "g")}; {print b}' 
bla bla $(date --date="21-04-26" "+%A, %d %B %Y") blabla blabla $(date --date="18-01-28" "+%A, %d %B %Y") bla bla bla bla $(date --date="19-01-12" "+%A, %d %B %Y") blabla

I do not see how to modify the gawk command to obtain the desired result:

bla bla Monday, 26 April 2021 blabla blabla Sunday, 28 January 2018 bla bla bla bla Saturday, 12 January 2019 blabla

Post scriptum:

As for sed, I solved the problem with this script:

#!/bin/bash

#pathFile hard-coded here
pathFile='./data.txt'

#threshold to avoid the "too many arguments" error with sed
maxCount=1000
counter=0

#list of dates in the data file
dateList=($(grep -Eo "[0-9]{2}-[0-9]{2}-[0-9]{2}" "$pathFile" | sort -u))

#string to pass multiple instruction to sed
sedCommand=''

for item in "${dateList[@]}"
do
    sedCommand+="s/$item/$(date --date="$item" "+%A, %d %B %Y")/g;"
    (( counter++ ))
    if [[ $counter -gt $maxCount ]]
    then
        sed -i "$sedCommand" "$pathFile"
        counter=0
        sedCommand=''
    fi
done
[[ -n "$sedCommand" ]] && sed -i "$sedCommand" "$pathFile"
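
The maxCount chunking is only needed because the whole sed program travels as a single command-line argument. One possible variant, sketched here under the assumption that GNU grep, sed and date are available, writes the substitutions to a temporary file and runs them with `sed -f`, which reads the program from the file instead of the command line. The function name `rewrite_dates` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: build the sed program in a temporary file and apply it with
# `sed -f`, so its size is not bounded by the command-line length.
# `rewrite_dates` is a hypothetical helper name; assumes GNU grep/sed/date.
rewrite_dates() {
    local file=$1 script item
    script=$(mktemp) || return 1

    # Emit one s/// command per distinct date found in the file.
    grep -Eo '[0-9]{2}-[0-9]{2}-[0-9]{2}' "$file" | sort -u |
    while IFS= read -r item; do
        printf 's/%s/%s/g\n' "$item" "$(date --date="$item" '+%A, %d %B %Y')"
    done > "$script"

    # Apply all substitutions in one pass, editing the file in place.
    sed -i -f "$script" "$file"
    rm -f "$script"
}
```

It would be called as `rewrite_dates ./data.txt`. The day and month names come from `date`, so they follow the current locale.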

Solution

  • Gawk has built-in time functions (mktime and strftime), which are MUCH faster than invoking the external date command once per date.

    Example input:

    # cat file
    79-03-21 | 21-01-01
    79-04-17 | 20-12-31
    

    The gawk script:

    # cat date.awk
    {
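        # Repeat until no YY-MM-DD date is left on the line.
        while (match($0, /([0-9]{2})-([0-9]{2})-([0-9]{2})/, arr) ) {
            date = sprintf("%s-%s-%s", arr[1], arr[2], arr[3])
            #                           \_YY    \_MM    \_DD
            # Two-digit years 70-99 are read as 19YY, 00-69 as 20YY.
            if (arr[1] >= 70) {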
                time = sprintf("19%s %s %s  1  0  0", arr[1], arr[2], arr[3])
                #               YYYY MM DD HH MM SS
            } else {
                time = sprintf("20%s %s %s  1  0  0", arr[1], arr[2], arr[3])
            }
            secs = mktime(time)
            new_date = strftime("%A, %d %B %Y", secs)
            $0 = gensub(date, new_date, "g")
        }
        print
    }
    

    Result:

    # gawk -f date.awk file
    Wednesday, 21 March 1979 | Friday, 01 January 2021
    Tuesday, 17 April 1979 | Thursday, 31 December 2020
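
As for why the original command printed `$(date ...)` literally: the gawk program sits inside single quotes, so bash never sees the substitution, and gawk does not interpret shell syntax that appears in a replacement string. If the external `date` has to be used anyway, awk can run a command itself with `cmd | getline`. The following is a minimal sketch of that idea, not part of the script above. The names `expand_dates`, `cache`, `out` and `rest` are invented for illustration, and GNU `date` is assumed to be on the PATH. It sticks to POSIX awk features (`match`, `substr`, `cmd | getline`, `close`), so it should run under gawk as well as other awks.

```shell
#!/usr/bin/env bash
# Sketch only: call the external GNU `date` from inside awk via `cmd | getline`.
# Each distinct date is converted once and remembered in the `cache` array.
# `expand_dates` is a hypothetical name.
expand_dates() {
    awk '
    {
        out = ""      # part of the line already processed
        rest = $0     # part of the line still to be scanned
        while (match(rest, /[0-9][0-9]-[0-9][0-9]-[0-9][0-9]/)) {
            d = substr(rest, RSTART, RLENGTH)
            if (!(d in cache)) {
                # d holds only digits and hyphens, so it is safe to embed.
                cmd = "date --date=\"" d "\" \"+%A, %d %B %Y\""
                if ((cmd | getline result) > 0)
                    cache[d] = result
                else
                    cache[d] = d    # keep the original text if date fails
                close(cmd)
            }
            out = out substr(rest, 1, RSTART - 1) cache[d]
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }' "$@"
}
```

Piping the sample line from the question through `expand_dates` should give the desired output when the locale is English. This still starts one process per distinct date, so the built-in `mktime`/`strftime` approach remains the faster option for large files.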