I have CSV file with millions of records. One of the fields (e.g. $2) contains date. I need the date to be from the last week.
I thought to use AWK to substitute the current date with a random date from the last 7 days. For this, I wanted to create an array indexed 1..7 and it's values will be the last 7 dates (e.g. 2013-10-08, 2013-10-07...).
Is there a function which can calculate dates from a given date? Or some code that can do the job?
My timestamp in the file is "yyyy-mm-dd hh:mm:ss".
This what fitted my needs at last:
BEGIN { srand() }
{
split($2, a, "-")
t = mktime(a[1] " " a[2] " " substr(a[3], 1, 2) " 00 00 00")
$2 = strftime("%Y-%m-%d %H:%M:%S", t - int(7*rand()+1) * 3600 * 24 + int(3600*24*rand()))
}
1
I assume you have something like this:
data1,2013-10-04,data2,...
and you want to change the second column date to a random day in the past 7 days.
This is fairly straightforward with GNU awk where you have the mktime
and strftime
commands.
Here is a gawk script (rnd_date.awk) that works on the input listed above:
BEGIN { srand() }
{
split($2, a, /-/)
t = mktime(a[1] " " a[2] " " a[3] " 00 00 00")
$2 = strftime("%F", t - int(7*rand()+1) * 3600 * 24)
}
1
Run it like this:
awk -f rnd_date.awk FS=, OFS=, infile
Example output:
data1,2013-09-29,data2,...
There are a couple of caveats:
srand
seeds the pseudo-random number generator with "seconds since the epoch", so if you rapidly invoke this script there will be no randomness between each invocation. You can counter this be supplying a better seed, e.g. with GNU date:
awk -f rnd_date -v s=$(date +%N) 'BEGIN { srand(s) } ...'
Setting $2
will force a reformatting of the whole line, depending on your data format this may or may not be a problem.
I see, your data is differently formatted, you have:
data1,2013-10-04 01:01:01,data2,...
You can adapt the above like this (with better named variables):
BEGIN { srand() }
{
split($2, date, /-| |:/)
since_epoch = mktime(date[1] " " date[2] " " date[3] " " date[4] " " date[5] " " date[6])
$2 = strftime("%F %T", since_epoch - int(7*rand()+1) * 3600 * 24)
}
1
Or if you do not care about the HH:MM:SS
part
BEGIN { srand() }
{
split($2, a, /-| /)
t = mktime(a[1] " " a[2] " " a[3] " 00 00 00")
$2 = strftime("%F %T", t - int(7*rand()+1) * 3600 * 24)
}
1
Example output:
data1,2013-09-27 00:00:00,data2,...