Search code examples
pythonarchive

Python 3: extract files from tar.gz archive


I'm currently working with Semantically Enriched Wikipedia.

The resource is inside a 7.5 GB tar.gz archive and each file inside it, it's an XML whose schema is:

<text>
       Plain text
</text>

<annotation>
       Annotation for plain text
</annotation>

The current task is to extract each file and then parse the content inside the tags.

First thing I did was to use the tarfile module and its extractall() method, but during the extraction I got this error:

OSError: [Errno 22] Invalid argument: '.\\sew_conservative\\wiki384\\Live_%3F%21*%40_Like_a_Suicide.xml'

while a part of it is correctly extracted (I thought the error was due to the unicode char inside the xml file name, but I'm now seeing that each file has it inside).

So I planned to work on each file inside the archive using some of the API's methods and the code below.

Unfortunately, the TarInfo object which wraps each file doesn't allow to access the file content and the extraction, file by file, takes too much time.

def parse_sew():
    sew_path = Path("C:/Users/beppe/Desktop/Tesi/Semantically Enriched Wikipedia/sew_conservative.tar.gz")
    with tarfile.open(sew_path, mode='r') as t:
        for item in t:
           // extraction

Is the extraction mandatory to parse and use the content of the XML file or it's possible to read the archive content (on-the-fly, without extracting anything) and then parse the content?

UPDATE: I'm extracting the files via tar -xvzf filename.tar.gz command, everything is going well, but after 15 mins I was able to process only 500MB of the hundred GB.


Solution

  • I would suggest you to use 7zip for extraction. You can initiate 7zip extraction from python and then while it is extracting side by side you can read the files getting extracted. This will save quite a lot of time. You can implement is using threads.

    Secondly dont use front slashes while giving windows path. You can use \\ in place of /.