Search code examples
tarbzip2gnuwin32

Loading a TAR file and extracting its *bz2 contents to sdout with bzcat


Following this question, I'm attempting to load a 40 GB TAR file with bz2-compressed json files into PostgreSQL in an effecient manner.

As per the answer mentioned above, I'm trying to separate the process and use external tools to create the following flow.

  • Open & Extract the files to SDOUT with TAR (bsdtar in this case, as TAR does not include extracting in its Windows build), only *.bz2 files.
  • Extract the *BZ2 files with a call to bzcat (export to sdout)
  • Open this in my python script 'file_handling', which maps each incoming line to a tweet and outputs this as a csv to stdout
  • Pipe this to PSQL to load it into one COPY command.

I'm currently getting an error when arriving at bzcat, this is what I have to build line that executes the above:

pipeline = [filename[1:3] + " && ",  # Change drive to H so that TAR can find the file without a drive name (doesn't like absolute paths, apparently).
            '"C:\\Tools\\GnuWin32\\gnuwin32\\bin\\bsdtar" vxOf ' + filename_nodrive + ' "*.bz2"',  # Call to tar, outputs to stdin
            " | C:\\Tools\\GnuWin32\\gnuwin32\\bin\\bzcat.exe"#,  # Forward its output to bzcat
            ' | python "D:\Cloud\Dropbox\Coding\GitHub\pyTwitter\pyTwitter_filehandling.py"', # Extract Tweets
            ' | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -1f copy.sql ' + secret_login_d
           ]
module_call = "".join(pipeline)
module_call = "H: && "C:\Tools\GnuWin32\gnuwin32\bin\bsdtar" vxOf "Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar" "*.bz2" | C:\Tools\GnuWin32\gnuwin32\bin\bzcat.exe | python "D:\Cloud\Dropbox\Coding\GitHub\pyTwitter\pyTwitter_filehandling.py" | "C:\Program Files\PostgreSQL\9.4in\psql.exe" -1f copy.sql "user=xxx password=xxx host=localhost port=5432 dbname=xxxxxx""

When executing the code for TAR, the TAR file is outputted to the CMD prompt, hinting me that all is well. However, the bzcat line brings an error:

x 01/29/06/39.json.bz2
bzcat.exe: Data integrity error when decompressing.
    Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

Running -tvv gives me:

huff+mtf data integrity (CRC) error in data

I've tried to extract the same archive with 7-zip (GUI): this still works. Any help on how to troubleshoot this would be greatly appreciated. I'm running Windows 8.1 with GNUWin32.


Solution

  • bsdtar.exe is translating newline bytes in the file data into the DOS CRLF sequence resulting in a corrupted bzip2 output stream.

    GNU tar worked when using relative paths but it does not handle absolute paths in Windows.

    Your best bet is to use 7-zip instead:

    7z.exe x -so -ir!*.json.bz2 archive.tar | bzcat | ...