
How to read a .pgn.bz2 file with fread in R?


I am trying to read chess game files from https://database.lichess.org/, where the games are stored as bzip2-compressed PGN files. A PGN file looks something like this:

[Event "4th Bayern-chI Bank Hofmann"]
[Site "?"]
[Date "2000.10.29"]
[Round "?"]
[White "Carlsen, Magnus"]
[Black "Cordts, Ingo"]
[ECO "A56"]
[WhiteElo "0"]
[BlackElo "2222"]
[Result "0-1"]

1. d4 Nf6 2. c4 c5 3. Nf3 cxd4 4. Nxd4 e5 5. Nb5 d5 6. cxd5 Bc5 7. N5c3 O-O 8. e3 e4 9. h3 Re8 10. g4 Re5 11. Bc4 Nbd7 12. Qb3 Ne8 13. Nd2 Nd6 14. Be2 Qh4 15. Nc4 Nxc4 16. Qxc4 b5 17. Qxb5 Rb8 18. Qa4 Nf6 19. Qc6 Nd7 20. d6 Re6 21. Nxe4 Bb7 22. Qxd7 Bxe4 23. Rh2 Bxd6 24. Bc4 Rd8 25. Qxa7 Bxh2 26. Bxe6 fxe6 27. Qa6 Bf3 28. Bd2 Qxh3 29. Qxe6+ Kh8 30. Qe7 Bc7 

I can read the files with read.csv directly from the bz2 file:

read.csv("file.pgn.bz2", nrows = 100000, stringsAsFactors = FALSE, header = FALSE)

However, read.csv is quite slow and the files have millions of lines, so I wanted to use fread, which can now read .bz2 files. When I try the following

fread("file.pgn.bz2", nrows = 1000)

the command runs for ages without returning anything. My sessionInfo():

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

I also tried an uncompressed .pgn file, and fread does read it, albeit incorrectly: it splits each line into columns and discards the game notation, so the example above comes out something like this:

V1               V2
[Event     "4th Bayern-chI Bank Hofmann"]
[Site      "?"]
[Date      "2000.10.29"]
[Round     "?"]
[White     "Carlsen, Magnus"]
[Black     "Cordts, Ingo"]
[ECO       "A56"]
[WhiteElo  "0"]
[BlackElo  "2222"]
[Result    "0-1"]

But at least it's reading it. Does anyone have advice on how to go about this? How can I use fread to read the .pgn.bz2 file correctly?


Solution

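  • This reads the file correctly, one line per row: sep = "" tells fread not to split lines into columns. (Reading .bz2 directly needs the R.utils package installed.)

    How to process the result afterwards is a different problem ;-)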

    DT <- fread( "./temp/lichess_db_standard_rated_2013-01.pgn.bz2", sep = "", header = FALSE )
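As for the processing step, here is a minimal sketch of one way to turn the one-line-per-row table into one row per game. It assumes DT has a single character column V1, as the fread call above produces. The sample lines, the second game, and the names game, is_tag, tag, value and Moves are made up for illustration.

```r
library(data.table)

# Stand-in for the result of fread(..., sep = "", header = FALSE):
# one character column V1 holding one line of the PGN file per row.
pgn_lines <- c(
  '[Event "4th Bayern-chI Bank Hofmann"]',
  '[White "Carlsen, Magnus"]',
  '[Black "Cordts, Ingo"]',
  '[Result "0-1"]',
  '',
  '1. d4 Nf6 2. c4 c5 3. Nf3 cxd4 0-1',
  '',
  '[Event "Hypothetical second game"]',
  '[White "Player A"]',
  '[Black "Player B"]',
  '[Result "1-0"]',
  '',
  '1. e4 e5 2. Nf3 Nc6',
  '3. Bb5 a6 1-0'
)
DT <- data.table(V1 = pgn_lines)

# Drop blank lines (empty or NA, depending on how they were read).
DT <- DT[!is.na(V1) & nzchar(trimws(V1))]

# Assumption: every game starts with an [Event ...] tag.
DT[, game := cumsum(startsWith(V1, "[Event "))]

# Tag lines look like [Key "value"]; everything else is treated as movetext.
tag_pattern <- '^\\[(\\w+) "(.*)"\\]$'
DT[, is_tag := grepl(tag_pattern, V1)]
tags <- DT[is_tag == TRUE,
           .(game,
             tag   = sub(tag_pattern, "\\1", V1),
             value = sub(tag_pattern, "\\2", V1))]

# One row per game, one column per tag.
games <- dcast(tags, game ~ tag, value.var = "value")

# Join the move lines of each game into a single string and attach them.
moves <- DT[is_tag == FALSE, .(Moves = paste(V1, collapse = " ")), by = game]
games <- merge(games, moves, by = "game")
```

This ignores PGN details such as escaped quotes inside tag values and comments in the movetext, so treat it as a starting point rather than a full parser.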
    

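A side note on the long wait with nrows: as far as I know, fread decompresses the whole archive to a temporary file before parsing, so nrows does not shorten that step. If only the first few lines are needed, base R's bzfile connection can stream them instead. This is a different technique from the fread call, sketched here on a tiny made-up file so that it runs on its own.

```r
# Write a tiny bzip2-compressed PGN file so the sketch is self-contained.
# The file name and contents are made up for illustration.
path <- tempfile(fileext = ".pgn.bz2")
con <- bzfile(path, open = "wt")
writeLines(c('[Event "Hypothetical game"]',
             '[Result "0-1"]',
             '',
             '1. d4 Nf6 2. c4 c5 0-1'), con)
close(con)

# Stream only the first n lines through a decompressing connection,
# without unpacking the whole archive to disk first.
con <- bzfile(path, open = "rt")
first_lines <- readLines(con, n = 2)
close(con)

print(first_lines)
```

For a real lichess dump, path would point at the downloaded .pgn.bz2 file, and n can be raised to sample as many lines as needed.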