
Merge log files with timestamped chunks


If every line starts with a time, it's trivial to merge the lines and then sort. I'm trying to merge together decades of chat logs, which are otherwise plain text delimited at the start and end by timestamps. Each file has several of these sections.

Session Start (Bob): Sun Nov 30 19:33:38 2003
Bob: hey what's up?
Michael: oh nothing
Session Close (Bob): Mon Dec 1 02:22:18 2003

Session Start (Bob): Thu Dec 4 09:33:38 2003
Michael: long time no hear
Session Close (Bob): Thu Dec 4 13:22:18 2003

There are multiple files for each individual representing overlapping blocks of time. If one file has sessions in November and January, another may have sessions in December and February. I'd like to combine them all into one chronological file.

Further complicating this is that sometimes there is no Session Close due to a crash and instead just another Session Start. A Session Close should be implied to have happened just before that. If there's any ambiguity or overlap, the script should not merge the blocks.

Open to solutions in any language or command line environment.


Solution

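  • awk (gawk is the GNU implementation) is well suited to this task. The main idea is to choose a record separator (RS) and a field separator (FS) so that each session becomes a single record.
    In this case, RS="Session Start " (including the trailing space) and FS="\n", so every line of a session is one field. A multi-character RS is not portable to every awk, hence gawk. For the output field separator (OFS), a pipe or any other character that does not occur in the logs can be used. The session's start date is moved to the front of its line, and the lines are then sorted by that date. This produces very long lines and does not detect overlapping sessions, but it may help you get started on a better solution.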

    #!/bin/bash

    # RS yields one record per session; FS="\n" makes each line of it a field.
    gawk 'BEGIN { RS = "Session Start "; FS = "\n"; OFS = "|" }
    $1 != "" {                  # skip the empty record before the first "Session Start "
        split($1, a, ": ")      # a[1] = "(Bob)", a[2] = "Sun Nov 30 19:33:38 2003"
        $1 = a[2] " " a[1]      # put the date first; assigning $1 rebuilds $0 with OFS
        print
    }' file1.txt file2.txt |
    sort -k 5,5n -k 2,2M -k 3,3n -k 4,4    # year, month name, day, time


    file1:

    Session Start (Bob): Sun Nov 30 19:33:38 2003
    Bob: hey what's up?
    Michael: oh nothing
    Session Close (Bob): Mon Dec 1 02:22:18 2003

    Session Start (Bob): Thu Dec 4 09:33:38 2003
    Michael: long time no hear
    Session Close (Bob): Thu Dec 4 13:22:18 2003


    file2:

    Session Start (Bob): Tue Dec 2 19:33:38 2003
    Bob: hey what's up?
    Michael: oh nothing
    Session Close (Bob): Tue Dec 2 20:22:18 2003

    Session Start (Bob): Wed Jan 15 09:33:38 2003
    Michael: long time no hear
    Session Close (Bob): Wed Jan 15 13:22:18 2003


    Output:

    Wed Jan 15 09:33:38 2003 (Bob)|Michael: long time no hear|Session Close (Bob): Wed Jan 15 13:22:18 2003|
    Sun Nov 30 19:33:38 2003 (Bob)|Bob: hey what's up?|Michael: oh nothing|Session Close (Bob): Mon Dec 1 02:22:18 2003||
    Tue Dec 2 19:33:38 2003 (Bob)|Bob: hey what's up?|Michael: oh nothing|Session Close (Bob): Tue Dec 2 20:22:18 2003||
    Thu Dec 4 09:33:38 2003 (Bob)|Michael: long time no hear|Session Close (Bob): Thu Dec 4 13:22:18 2003|
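
Since the question is open to any language, here is a rough Python sketch of the parts the awk one-liner leaves out: parsing the timestamps into real dates, implying a Session Close when a new Session Start appears first, and refusing to merge when sessions overlap. The names `parse_sessions` and `merge_logs` are made up for illustration. It assumes the timestamps always look like the samples above and use English day and month names, and it treats a crashed final session as zero-length, which is a guess rather than something the question specifies.

```python
import re
from datetime import datetime

STAMP = "%a %b %d %H:%M:%S %Y"  # e.g. "Sun Nov 30 19:33:38 2003"
MARK = re.compile(r"^Session (Start|Close) \((.*?)\): (.+)$")


def _finish(session, end):
    """Record the end time and drop blank lines trailing the session."""
    while session["lines"] and not session["lines"][-1].strip():
        session["lines"].pop()
    session["end"] = end
    return session


def parse_sessions(text):
    """Split one log into sessions, each a dict with start, end and lines."""
    sessions, cur = [], None
    for line in text.splitlines():
        m = MARK.match(line.strip())
        if m and m.group(1) == "Start":
            when = datetime.strptime(m.group(3), STAMP)
            if cur is not None:
                # No Session Close was seen: imply one at this new start.
                sessions.append(_finish(cur, when))
            cur = {"start": when, "end": None, "lines": [line]}
        elif cur is not None:
            cur["lines"].append(line)
            if m:  # an explicit Session Close line
                sessions.append(_finish(cur, datetime.strptime(m.group(3), STAMP)))
                cur = None
    if cur is not None:
        # Assumption: a crashed final session is treated as zero-length.
        sessions.append(_finish(cur, cur["start"]))
    return sessions


def merge_logs(texts):
    """Merge several logs chronologically; raise ValueError on any overlap."""
    sessions = sorted(
        (s for text in texts for s in parse_sessions(text)),
        key=lambda s: s["start"],
    )
    # With sessions sorted by start, checking neighbours finds every overlap.
    for prev, nxt in zip(sessions, sessions[1:]):
        if prev["end"] > nxt["start"]:
            raise ValueError(
                f"session starting {prev['start']} overlaps one starting {nxt['start']}"
            )
    return "\n\n".join("\n".join(s["lines"]) for s in sessions) + "\n"
```

Usage would be something like `merge_logs([open(p).read() for p in paths])`, writing the returned string to a new file. Raising on the first overlap is the bluntest reading of "should not merge"; collecting the conflicting blocks and writing them out separately would be a reasonable alternative.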