Search code examples
arraysawkdate-arithmetic

AWK: How to create an array of dates of the last days?


I have CSV file with millions of records. One of the fields (e.g. $2) contains date. I need the date to be from the last week.

I thought to use AWK to substitute the current date with a random date from the last 7 days. For this, I wanted to create an array indexed 1..7 and it's values will be the last 7 dates (e.g. 2013-10-08, 2013-10-07...).

Is there a function which can calculate dates from a given date? Or some code that can do the job?

My timestamp in the file is "yyyy-mm-dd hh:mm:ss".

Update - answer

This what fitted my needs at last:

BEGIN { srand() } 
{ 
  split($2, a, "-") 
  t  = mktime(a[1] " " a[2] " " substr(a[3], 1, 2) " 00 00 00") 
  $2 = strftime("%Y-%m-%d %H:%M:%S", t - int(7*rand()+1) * 3600 * 24 + int(3600*24*rand())) 
}
1

Solution

  • I assume you have something like this:

    data1,2013-10-04,data2,...
    

    and you want to change the second column date to a random day in the past 7 days.

    This is fairly straightforward with GNU awk where you have the mktime and strftime commands.

    Here is a gawk script (rnd_date.awk) that works on the input listed above:

    BEGIN { srand() }
    {
      split($2, a, /-/)
      t  = mktime(a[1] " " a[2] " " a[3] " 00 00 00")
      $2 = strftime("%F", t - int(7*rand()+1) * 3600 * 24)
    }
    1
    

    Run it like this:

    awk -f rnd_date.awk FS=, OFS=, infile
    

    Example output:

    data1,2013-09-29,data2,...
    

    There are a couple of caveats:

    1. srand seeds the pseudo-random number generator with "seconds since the epoch", so if you rapidly invoke this script there will be no randomness between each invocation. You can counter this be supplying a better seed, e.g. with GNU date:

      awk -f rnd_date -v s=$(date +%N) 'BEGIN { srand(s) } ...'

    2. Setting $2 will force a reformatting of the whole line, depending on your data format this may or may not be a problem.

    Edit

    I see, your data is differently formatted, you have:

    data1,2013-10-04 01:01:01,data2,...
    

    You can adapt the above like this (with better named variables):

    BEGIN { srand() }
    {
      split($2, date, /-| |:/)
      since_epoch = mktime(date[1] " " date[2] " " date[3] " " date[4] " " date[5] " " date[6])
      $2 = strftime("%F %T", since_epoch - int(7*rand()+1) * 3600 * 24)
    }
    1
    

    Or if you do not care about the HH:MM:SS part

    BEGIN { srand() }
    {
      split($2, a, /-| /)
      t  = mktime(a[1] " " a[2] " " a[3] " 00 00 00")
      $2 = strftime("%F %T", t - int(7*rand()+1) * 3600 * 24)
    }
    1
    

    Example output:

    data1,2013-09-27 00:00:00,data2,...