I'm trying to split a large log file, containing entries spanning several months, into separate log files by date. There are thousands of lines like the following:
Sep 4 11:45 kernel: Entry
Sep 5 08:44 syslog: Entry
I want to split it up so that logfile.20090904 and logfile.20090905 contain the corresponding entries.
I've written a program that reads each line and appends it to the appropriate file, but it runs pretty slowly (especially since I have to convert a month name to a number). I've also thought about running a grep for each day, which would require finding the first date in the file, but that seems slow as well.
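Roughly, I imagine the grep idea would look something like this (a sketch only; the list of days would have to be generated from the first and last dates actually in the file):

# Sketch of the grep-per-day idea: one full pass over the file per day.
for day in "Sep 4" "Sep 5"; do
    dt=$(date -d "$day" +'%Y%m%d')
    grep "^$day " logfile > "logfile.$dt"
done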
Is there a more efficient solution? Maybe I'm missing a command-line tool that would work better.
Here is my current solution:
#!/bin/bash
FILE="$1"
while IFS= read -r line; do
    # The first six characters hold the date, e.g. "Sep 4 ".
    dts="${line:0:6}"
    dt=$(date -d "$dts" +'%Y%m%d')
    # Note that I could do some caching of the date here, assuming
    # that entries for the same date are together.
    echo "$line" >> "$FILE.$dt"
done < "$FILE"
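The caching I mention in the comment would look roughly like this (it assumes entries for the same date are contiguous in the file):

#!/bin/bash
FILE="$1"
last=""
while IFS= read -r line; do
    dts="${line:0:6}"
    # Only call the external date command when the day prefix changes.
    if [ "$dts" != "$last" ]; then
        dt=$(date -d "$dts" +'%Y%m%d')
        last="$dts"
    fi
    echo "$line" >> "$FILE.$dt"
done < "$FILE"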
@OP, try not to use bash's while read loop to iterate over a big file. It is well known to be slow, and on top of that you are calling the external date command for every line you read. Here's a more efficient way, using only gawk:
gawk 'BEGIN{
    # Table of month names matching the syslog timestamp format.
    m = split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", mth, "|")
}
{
    # Look up the month number for the name in field 1.
    for (i = 1; i <= m; i++) { if (mth[i] == $1) { month = i; break } }
    # syslog timestamps carry no year, so it is hardcoded here.
    tt = "2009 " month " " $2 " 00 00 00"
    date = strftime("%Y%m%d", mktime(tt))
    print $0 > (FILENAME "." date)
}' logfile
output
$ more logfile
Sep 4 11:45 kernel: Entry
Sep 5 08:44 syslog: Entry
$ ./shell.sh
$ ls -1 logfile.*
logfile.20090904
logfile.20090905
$ more logfile.20090904
Sep 4 11:45 kernel: Entry
$ more logfile.20090905
Sep 5 08:44 syslog: Entry