I have a file containing the ply sequences of multiple chess games. Games are separated by one or more blank lines, and the ply sequence of each game can also be split across multiple lines.
I would like to merge all lines belonging to the same game, so that there is only one line per game. I have tried different options, but none worked. Note that the file contains more than 14M games, so I need a fast solution. I work on Linux.
Example:
e4 e5 Bb5 c6 Bc4 b5 Bxf7+ Kxf7 Nf3 Qf6 d4 d6 dxe5 dxe5
Bg5 Qe6 Nc3 Be7 Be3 Nf6 b4 Rd8 Ng5+ Kg8 Nd5 Qd6 Qf3 cxd5
Bc5 Qe6 Nxe6 Bxe6 Bxe7

e4 e5 Nf3 Qf6 Bc4 Bc5 Nc3 c6 Na4 Bb4 c3 Ba5 Nc5 d6 Nb3
Bb6 d4 h6 dxe5 dxe5 O-O Ne7 Be3 Nd7 Bxb6 Nxb6 Be2 O-O
Nc5 Ng6 b4 Nf4 Nd3 Rd8 Qc2 Nc4 Nxf4 Na3 Qb3 Qxf4
Qxa3 Qxe4 Rfe1 f6 Qb3+ Kh8 Bd1 Qf4 Bc2 Bg4 Re4 Qf5 Rxe5
Qd7 Re3 Qd6 Nh4 Qd5 Ng6+ Kh7 Ne7+ f5 Nxd5 Rxd5 c4 Rd2
h3 Bh5 Bxf5+ Kh8

e4 e5 Nf3 Nc6 Bb5 Nf6 Bxc6 bxc6 O-O d6 h3 Nxe4 Re1 Bf5
d4 f6 dxe5 fxe5 Nbd2 Nxd2 Bxd2 Be7 Qc1 O-O c3 h6 c4 e4
Nd4 Qd7 b3 d5 Nxf5 Qxf5 Be3 Bf6 Rb1 d4 Bd2 c5

d4 Nf6 Nc3 d5 Bg5 Ne4 Nxe4 dxe4 c3 h6 Be3 e6 Qc2 f5 g4
Be7 Bg2 O-O O-O-O Nd7 d5 Nb6 dxe6 Qe8 gxf5 Rxf5 Bxe4 Rf8
Bh7+ Kh8 Bg6
Should become:
e4 e5 Bb5 c6 Bc4 b5 Bxf7+ Kxf7 Nf3 Qf6 d4 d6 dxe5 dxe5 Bg5 Qe6 Nc3 Be7 Be3 Nf6 b4 Rd8 Ng5+ Kg8 Nd5 Qd6 Qf3 cxd5 Bc5 Qe6 Nxe6 Bxe6 Bxe7
e4 e5 Nf3 Qf6 Bc4 Bc5 Nc3 c6 Na4 Bb4 c3 Ba5 Nc5 d6 Nb3 Bb6 d4 h6 dxe5 dxe5 O-O Ne7 Be3 Nd7 Bxb6 Nxb6 Be2 O-O Nc5 Ng6 b4 Nf4 Nd3 Rd8 Qc2 Nc4 Nxf4 Na3 Qb3 Qxf4 Qxa3 Qxe4 Rfe1 f6 Qb3+ Kh8 Bd1 Qf4 Bc2 Bg4 Re4 Qf5 Rxe5 Qd7 Re3 Qd6 Nh4 Qd5 Ng6+ Kh7 Ne7+ f5 Nxd5 Rxd5 c4 Rd2 h3 Bh5 Bxf5+ Kh8
e4 e5 Nf3 Nc6 Bb5 Nf6 Bxc6 bxc6 O-O d6 h3 Nxe4 Re1 Bf5 d4 f6 dxe5 fxe5 Nbd2 Nxd2 Bxd2 Be7 Qc1 O-O c3 h6 c4 e4 Nd4 Qd7 b3 d5 Nxf5 Qxf5 Be3 Bf6 Rb1 d4 Bd2 c5
d4 Nf6 Nc3 d5 Bg5 Ne4 Nxe4 dxe4 c3 h6 Be3 e6 Qc2 f5 g4 Be7 Bg2 O-O O-O-O Nd7 d5 Nb6 dxe6 Qe8 gxf5 Rxf5 Bxe4 Rf8 Bh7+ Kh8 Bg6
With awk, you can set the record separator to the empty string, which makes awk treat runs of blank lines as record separators. Then, for each record, you replace the newlines with spaces:
awk -v RS="" '{gsub("\n", " ")} 1' infile
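To see the paragraph-mode behaviour on something small, here is a minimal sketch that wraps the same awk command in a shell function and feeds it a made-up two-game sample. The function name `merge_games` and the sample moves are assumptions for illustration only; they are not part of the question.

```shell
# Hypothetical wrapper around the awk command above; the name is illustrative.
merge_games() {
    # RS="" puts awk in paragraph mode: runs of blank lines separate records.
    # gsub joins the lines of a record with spaces; the bare 1 prints the record.
    awk -v RS="" '{gsub("\n", " ")} 1' "$@"
}

# Made-up sample: two games, separated by two blank lines.
printf 'e4 e5\nNf3 Nc6\n\n\nd4 d5\nc4 e6\n' | merge_games
# prints:
# e4 e5 Nf3 Nc6
# d4 d5 c4 e6
```

Because several consecutive blank lines count as a single separator in this mode, no empty output lines appear. On the real file the function would be called as `merge_games infile`.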
Or, as an alternative, with sed:
sed ':a;N;/\n$/!s/\n//;ta;s/\n$//;/^$/d' infile
This works as follows:
:a              # Label to jump back to
N               # Append next line to pattern space
/\n$/! s/\n//   # If pattern space does not end with newline, remove newline
t a             # Jump back to label if we changed something
s/\n$//         # Remove trailing newline
/^$/ d          # Delete empty line
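The last command isn't strictly necessary for the given input, but without it, runs of more than two consecutive empty lines would produce empty output lines. It is there so that the sed command handles blank lines the same way the awk command does. Note that `s/\n//` deletes the newline without inserting anything, so moves are joined correctly only if every input line ends with a space. If yours do not, substitute a space instead (`s/\n/ /`), which can in turn leave a leading space on a game that follows an even number of blank lines.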
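For input whose lines do not end in a space, one possible adaptation is sketched below. It is a variant, not the command from the answer: it replaces the newline with a space, uses `$!N` so that the last line still passes through the cleanup commands, and adds `s/^ //` to strip a leading space. The function name `merge_games_sed` and the sample data are hypothetical.

```shell
# Hypothetical variant of the sed command for lines without trailing spaces.
merge_games_sed() {
    sed -e ':a' \
        -e '$!N' \
        -e '/\n$/!s/\n/ /' \
        -e 'ta' \
        -e 's/\n$//' \
        -e 's/^ //' \
        -e '/^$/d' "$@"
    # :a             label to jump back to
    # $!N            append the next line, except on the last line of input
    # /\n$/!s/\n/ /  unless a blank line was just appended, join with a space
    # ta             loop again if a join happened
    # s/\n$//        drop the newline left by the blank separator line
    # s/^ //         drop the space left when a blank line was joined to a game
    # /^$/d          discard leftover empty lines
}

# Made-up sample: two games, separated by two blank lines.
printf 'e4 e5\nNf3 Nc6\n\n\nd4 d5\nc4 e6\n' | merge_games_sed
```

Splitting the script into separate `-e` expressions avoids relying on a semicolon to end the label, which not every sed implementation accepts.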