If every line starts with a time, it's trivial to merge the lines and then sort. I'm trying to merge together decades of chat logs, which are otherwise plain text delimited at the start and end by timestamps. Each file has several of these sections.
Session Start (Bob): Sun Nov 30 19:33:38 2003
Bob: hey what's up?
Michael: oh nothing
Session Close (Bob): Mon Dec 1 02:22:18 2003
Session Start (Bob): Thu Dec 4 09:33:38 2003
Michael: long time no hear
Session Close (Bob): Thu Dec 4 13:22:18 2003
There are multiple files for each individual representing overlapping blocks of time. If one file has sessions in November and January, another may have sessions in December and February. I'd like to combine them all into one chronological file.
Further complicating this is that sometimes there is no Session Close due to a crash and instead just another Session Start. A Session Close should be implied to have happened just before that. If there's any ambiguity or overlap, the script should not merge the blocks.
Open to solutions in any language or command line environment.
awk (gawk for the GNU implementation) is suitable for that task. The main idea is to find a good Record Separator and Field Separator.
In this case, RS="Session Start " (including the trailing space) and FS="\n".
For the Output Field Separator (OFS), a pipe or any other symbol that does not occur in the logs can be used. Finally, the output is sorted by the date in the first field.
This solution can produce very long lines, but it may help you get started on a better one.
#!/bin/bash
# Requires gawk: a multi-character RS is not portable to every awk.
gawk 'BEGIN { RS = "Session Start "; FS = "\n"; OFS = "|" }
$1 != "" {                  # skip the empty record before the first "Session Start "
    split($1, a, ": ")      # a[1] = "(Bob)", a[2] = "Sun Nov 30 19:33:38 2003"
    $1 = a[2] " " a[1]      # put the date first; assigning $1 rebuilds $0 with OFS
    print
}' file1.txt file2.txt |
    LC_ALL=C sort -k5,5n -k2,2M -k3,3n -k4,4   # year, month name, day, time
file1.txt:
Session Start (Bob): Sun Nov 30 19:33:38 2003
Bob: hey what's up?
Michael: oh nothing
Session Close (Bob): Mon Dec 1 02:22:18 2003
Session Start (Bob): Thu Dec 4 09:33:38 2003
Michael: long time no hear
Session Close (Bob): Thu Dec 4 13:22:18 2003
file2.txt:
Session Start (Bob): Tue Dec 2 19:33:38 2003
Bob: hey what's up?
Michael: oh nothing
Session Close (Bob): Tue Dec 2 20:22:18 2003
Session Start (Bob): Wed Jan 15 09:33:38 2003
Michael: long time no hear
Session Close (Bob): Wed Jan 15 13:22:18 2003
Output:
Wed Jan 15 09:33:38 2003 (Bob)|Michael: long time no hear|Session Close (Bob): Wed Jan 15 13:22:18 2003|
Sun Nov 30 19:33:38 2003 (Bob)|Bob: hey what's up?|Michael: oh nothing|Session Close (Bob): Mon Dec 1 02:22:18 2003||
Tue Dec 2 19:33:38 2003 (Bob)|Bob: hey what's up?|Michael: oh nothing|Session Close (Bob): Tue Dec 2 20:22:18 2003||
Thu Dec 4 09:33:38 2003 (Bob)|Michael: long time no hear|Session Close (Bob): Thu Dec 4 13:22:18 2003|
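Since the question is open to any language, here is a rough Python sketch of the same idea that keeps each session on multiple lines and also covers the two extra requirements: a missing Session Close and refusing to merge overlapping blocks. It is only a starting point under stated assumptions. The header pattern and timestamp format are inferred from the sample logs, and the names parse_sessions and merge are made up for this example. A session with no Session Close has an unknown end time, so this sketch treats its end as equal to its start and checks only its start against its neighbours.

```python
import re
from datetime import datetime

# Assumed header shape, inferred from the samples:
#   Session Start (Bob): Sun Nov 30 19:33:38 2003
HEADER = re.compile(r"^Session (Start|Close) \((.+?)\): (.+)$")
STAMP = "%a %b %d %H:%M:%S %Y"  # English day and month names, as in the samples


def parse_sessions(text):
    """Split one log into sessions: dicts with start, end, closed, lines."""
    sessions, current = [], None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m and m.group(1) == "Start":
            if current is not None:
                # No Session Close was seen (crash): the previous session
                # is implied to have ended just before this new start.
                sessions.append(current)
            when = datetime.strptime(m.group(3), STAMP)
            current = {"start": when, "end": when, "closed": False, "lines": [line]}
        elif current is not None:
            current["lines"].append(line)
            if m:  # a Session Close header
                current["end"] = datetime.strptime(m.group(3), STAMP)
                current["closed"] = True
                sessions.append(current)
                current = None
    if current is not None:  # the log ended without a Session Close
        sessions.append(current)
    return sessions


def merge(texts):
    """Return every session from every log in start order; raise on overlap."""
    sessions = sorted(
        (s for text in texts for s in parse_sessions(text)),
        key=lambda s: s["start"],
    )
    for prev, nxt in zip(sessions, sessions[1:]):
        if prev["end"] > nxt["start"]:
            raise ValueError(
                f"overlap: {prev['lines'][0]!r} / {nxt['lines'][0]!r}"
            )
    return "".join(line + "\n" for s in sessions for line in s["lines"])
```

To use it, read each file into a string, pass the list of strings to merge, and write the returned text to the combined file. A ValueError means two blocks overlap and nothing should be written. Note that a session duplicated in two files counts as an overlap here, so deduplication would have to be added separately.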