
Get file paths inside a Zip archive that can be passed to utilities for processing (python)


Using Python 3.5.

I need to find specific text that's stored in old-style 1997-2003 Windows .doc files and dump it into a CSV. My constraints are:

a) The .doc files are in a zipped archive: I can't write to disk, so I need to work in memory

b) I need to find specific text with regex, so the .doc files need to be converted to plain text

Ideally I could read the files with zipfile, pass the data on to some doc-to-txt converter (e.g. textract), and regex on the txt. This might look like

    import zipfile
    import textract
    import re

    with zipfile.ZipFile(zip_archive, 'r') as f:
        for name in f.namelist():
            data = f.read(name)
            txt = textract.process(data).decode('utf-8')
            # some regex on txt

This of course doesn't work, because the argument for textract (and any other doc-to-txt converter) is a filepath, while "data" is bytes. Using "name" as the argument gives a MissingFileError, probably because zip archives don't have directory structures, just filenames simulating paths.

Is there any way to regex through zipped doc files only in memory, without extracting the files (and therefore writing them to disk)?


Solution

  • Working with files without writing to a physical drive

    In most cases, the files within a zip have to be extracted first to be processed. But this can be done in memory. The roadblock is how to invoke a utility that takes only a mapped filesystem path as an argument to process the text in the zipped files without writing to the physical drive.

    Internally textract invokes a command line utility (antiword) that does the actual text extraction. So the approach that solves this could be applied generally to other command line tools that need access to zip contents via a filesystem path.

    Below are several possible solutions to get around this restriction on files:

    1. Mount a RAM Drive.
      • Works well, but requires a sudo prompt (which can be automated).
    2. Mount the zip file to the filesystem. (good option)
      • A good Linux tool for mounting these is fuse-zip.
    3. Use the tempfile module. (easiest)
      • Ensures files are automatically deleted.
      • Drawback: files may be written to disk.
    4. Access the XML within the .docx files.
      • Can regex through the raw XML, or use an XML reader.
      • Only a small portion of your files are .docx though.
    5. Find another extractor. (not covered)
      • I looked and couldn't find anything.
      • docx2txt is another Python module, but it looks like it will only handle .docx files (as its name implies) and not old Word .doc files.

    Why did I do all this legwork, you may wonder? I actually found it useful for one of my own projects.


    1) RAM Drive

    If tempfile doesn't satisfy the in-memory constraint and you want to ensure all files used by the tool stay in RAM, creating a RAM drive is a great option. The tool should unmount the drive when it's done, which deletes all the files it stored.

    A plus with this option is that Linux supports it natively, so it doesn't incur any additional software dependencies; Windows will probably require a 3rd party tool like ImDisk.

    These are the relevant bash commands on Linux:

    $ mkdir ./temp_drive
    $ sudo mount -t tmpfs -o size=512m temp_drive ./temp_drive
    $ 
    $ mount | tail -n 1     # To see that it was mounted.
    $ sudo umount ./temp_drive   # To unmount.
    

    On MacOS:

    $ diskutil erasevolume HFS+ 'RAM Disk' `hdiutil attach -nomount ram://1048576 `
    $ # 512M drive created: 512 * 2048 == 1048576
    

    On Windows:

    You may have to use a 3rd party application like ImDisk (see the links at the end).

    To automate the process, this short script prompts the user for their sudo password, then invokes mount to create a RAM drive:

    import subprocess as sp
    import tempfile
    import platform
    import getpass
    
    ramdrv = tempfile.TemporaryDirectory()
    
    if platform.system() == 'Linux':
    
        sudo_pw = getpass.getpass("Enter sudo password: ")
    
        # Mount RAM drive on Linux.
        p = sp.Popen(['sudo', '-S', 'bash', '-c', 
                     f"mount -t tmpfs -o size=512m tmpfs {ramdrv.name}"], 
                     stderr=sp.STDOUT, stdout=sp.PIPE, stdin=sp.PIPE, bufsize=1,
                     encoding='utf-8')
    
        print(sudo_pw, file=p.stdin)
    
        del sudo_pw
    
        print(p.stdout.readline())
    
    elif platform.system() == 'Darwin':
        pass  # And so on for the other platforms...
    

    Whatever GUI package your application uses likely has a password dialog, but getpass works well for console applications.

    To access the RAM drive, use the folder it's mounted on like any other folder in the filesystem: write files to it, read files from it, create subfolders, etc.
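    A rough end-to-end sketch of my own (not part of the original workflow; it assumes the zip_archive path from the question, the ramdrv TemporaryDirectory mounted by the script above, and that textract/antiword are installed):

    import os
    import re
    import zipfile
    import textract
    import subprocess as sp

    with zipfile.ZipFile(zip_archive, 'r') as zf:
        for name in zf.namelist():
            if not name.lower().endswith('.doc'):
                continue
            # Extract the .doc onto the RAM drive - nothing touches the physical disk.
            doc_path = zf.extract(name, path=ramdrv.name)
            txt = textract.process(doc_path).decode('utf-8', errors='replace')
            matches = re.findall(r'\(.*?\)', txt)    # your regex here
            os.remove(doc_path)                      # clean up as you go

    # Unmount when finished (needs sudo again on Linux); everything on the drive vanishes.
    sp.run(['sudo', 'umount', ramdrv.name])

    Note that zf.extract() recreates any folder structure stored in the zip under the mount point, so the returned paths stay valid for textract.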


    2) Mount the Zip file

    If the Zip file can be mounted on the OS file system, then its files will have paths that can be passed to textract. This could be the best option.

    For Linux, a utility that works well is fuse-zip. The few lines below install it, and mount a zip file.

    $ sudo apt-get install fuse-zip
    ...
    $ mkdir ~/archivedrive
    $
    $ fuse-zip ~/myarchive.zip ~/archivedrive
    $ cd ~/archivedrive/myarchive           # I'm inside the zip!
    

    From Python, create a temporary mount point, mount the zip, extract the text, then unmount the zip:

    >>> import subprocess as sp, tempfile, textract
    >>>
    >>> zf_path = '/home/me/marine_life.zip'
    >>> zipdisk = tempfile.TemporaryDirectory()           # Temp mount point.
    >>> 
    >>> cp = sp.run(['fuse-zip', zf_path, zipdisk.name])  # Mount.
    >>> cp.returncode
    0
    >>> all_text = textract.process(f"{zipdisk.name}/marine_life/octopus.doc")
    >>> 
    >>> cp = sp.run(['fusermount', '-u', zipdisk.name])   # Unmount.
    >>> cp.returncode
    0
    >>> del zipdisk                                       # Delete mount point.
    >>> all_text[:88]
    b'The quick Octopuses live in every ocean, and different species have\nadapted to different'
    >>>
    >>> # Convert bytes to str if needed.
    >>> as_string = all_text.decode('latin-1', errors='replace')
    

    A big plus with this approach is that it doesn't require sudo to mount the archive - no prompting for a password. The only drawback is that it adds a dependency to the project, which is probably not a major concern. Automating the mounting and unmounting is easy with subprocess.run(), as shown above.

    I believe that the default configuration for Linux distros allows users to mount Fuse filesystems without the need to use sudo; but that would need to be verified for the supported targets.

    For Windows, ImDisk can also mount archives and has a command line interface. So that could possibly be automated to support Windows. The XML approach and this approach both are nice because they get the information directly from the zip file without the additional step of writing it out to a file.

    Regarding character encodings: in the example I assumed that old Eastern European Word documents predating 2006 might use some encoding other than 'utf-8' (iso-8859-2, latin-1, windows-1250, a Cyrillic code page, etc.). You might have to experiment a bit to ensure that each of the files is converted to a string correctly.
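    As a rough sketch (my addition; the candidate list is only a guess for these documents), you could try a few likely encodings in order and fall back to a lossy decode:

    def decode_best_effort(raw_bytes, candidates=('utf-8', 'windows-1250', 'iso-8859-2')):
        """Try a few likely encodings in order; fall back to a lossy latin-1 decode."""
        for enc in candidates:
            try:
                return raw_bytes.decode(enc)
            except UnicodeDecodeError:
                continue
        return raw_bytes.decode('latin-1', errors='replace')

    as_string = decode_best_effort(all_text)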



    3) tempfile.NamedTemporaryFile

    This approach doesn't require any special permissions. It should just work. However, the files it creates aren't guaranteed to be in memory only.

    If the concern is that your tool will overpopulate the users' drives with files, this approach would prevent that. The temp files are reliably deleted automatically.

    Here's some sample code that creates a NamedTemporaryFile, opens a zip and extracts a file into it, then passes its path to textract:

    >>> import zipfile, tempfile, textract
    >>>
    >>> zf = zipfile.ZipFile('/temp/example.docx')
    >>> wf = zf.open('word/document.xml')
    >>> tf = tempfile.NamedTemporaryFile()
    >>>
    >>> for line in wf:
    ...     tf.file.write(line)
    ...
    >>> tf.file.seek(0)      # seeking also flushes the buffered writes to the file
    >>> textract.process(tf.name)
    
    # Lines and lines of text dumped to screen - it worked!
    
    >>> tf.close()
    >>>
    >>> # The file disappears.
    

    You can reuse the same NamedTemporaryFile object over and over: seek back to the start (and truncate) before writing the next document, as in the sketch below.
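    Here is a sketch of my own of that pattern (the archive path is hypothetical, and it assumes textract can infer the format from the '.doc' suffix):

    import re
    import zipfile
    import tempfile
    import textract

    results = []
    with zipfile.ZipFile('/temp/example.zip') as zf, \
         tempfile.NamedTemporaryFile(suffix='.doc') as tf:
        for name in zf.namelist():
            if not name.lower().endswith('.doc'):
                continue
            tf.seek(0)
            tf.truncate()                  # wipe the previous document
            tf.write(zf.read(name))
            tf.flush()                     # make sure the extractor sees all the bytes
            text = textract.process(tf.name).decode('utf-8', errors='replace')
            results.extend(re.findall(r'\(.*?\)', text))   # your regex here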

    Don't close the file until you're done with it - it vanishes when you close it. Instances of NamedTemporaryFile are automatically deleted when they're closed, when their refcount drops to zero, or when your program exits.

    If you want a temporary folder that's guaranteed to disappear when your program is done, tempfile.TemporaryDirectory is an option.

    In the same module, tempfile.SpooledTemporaryFile is a file that lives in memory. However, it's hard to get a path to one (at best you get a file descriptor), and even if you found a way to retrieve a path, textract couldn't use it.

    textract does its extraction in a separate process, but because a NamedTemporaryFile has a real name on the filesystem until it's closed, that process can open the file by its path. That's what makes it possible to share these temp files between the two.


    4) Word.docx text extraction via XML

    This approach attempts to remove the need for the 3rd party utility by doing the work within Python, or using another tool that doesn't require FS paths.

    The .docx files within the zip files are also zip files containing XML. XML is text and it can be parsed raw with regular expressions, or passed to an XML reader first.

    The Python module docx2txt does pretty much the same thing as the 2nd example below. I looked at its sources: it opens the Word document as a zip and uses an XML parser to get the text nodes, so it's not going to work on the old .doc files, for the same reason this approach doesn't.
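    For the .docx portion of the archive, though, its usage is a one-liner. A sketch (the member name is hypothetical; I believe docx2txt will also accept a file-like object because it opens the document with zipfile internally, but verify that against your version):

    import io
    import zipfile
    import docx2txt

    with zipfile.ZipFile('/temp/example.zip') as outer:
        docx_bytes = outer.read('reports/whales.docx')      # hypothetical member name
        text = docx2txt.process(io.BytesIO(docx_bytes))     # nothing written to disk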

    The two examples below read the file directly out of the .docx archive - the file isn't extracted to disk.

    If you want to convert the raw XML text to a dictionary and lists, you can use xmltodict:

    import zipfile
    import xmltodict
    
    zf        = zipfile.ZipFile('/temp/example.docx')
    data      = xmltodict.parse(zf.open('word/document.xml'))
    # Drill down to one text run - paragraph 47 (index 46) of the document body.
    some_text = data['w:document']['w:body']['w:p'][46]['w:r']['w:t']
    
    print(some_text)
    

    I found this format a bit unwieldy because of the complicated nesting structure of the XML elements, and it doesn't give you the advantages an XML reader does as far as locating nodes.

    Using xml.etree.ElementTree, an XPath expression can extract all the text nodes in one shot.

    import xml.etree.ElementTree as ET
    import zipfile

    # The default namespace used by WordprocessingML (.docx) documents.
    _NS_DICT = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

    def get_docx_text(docx_path):
        """
        Opens the .docx file at 'docx_path', parses its internal document.xml
        document, then returns its text as one (possibly large) string.
        """
        with zipfile.ZipFile(docx_path) as zf:
            tree = ET.parse(zf.open('word/document.xml'))
        # Empty <w:t> nodes have a text of None, hence the "or ''".
        all_text = '\n'.join(n.text or '' for n in tree.findall('.//w:t', _NS_DICT))
        return all_text
    

    Using the xml.etree.ElementTree module as above makes text extraction possible in only a few lines of code.

    In get_docx_text(), this line grabs all the text:

    all_text = '\n'.join(n.text or '' for n in tree.findall('.//w:t', _NS_DICT))
    

    The string './/w:t' is an XPath expression that tells the module to select all the t (text) nodes of the Word document; the join() call then concatenates all of their text with newlines.

    Once you have the text returned from get_docx_text(), you can apply your regular expressions, iterate over it line by line, or whatever you need to do. For example, a short re expression (sketched below) grabs all parenthetical phrases.
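    A minimal usage sketch (the .docx path is just an example):

    import re

    text = get_docx_text('/temp/example.docx')
    parentheticals = re.findall(r'\(([^()]*)\)', text)   # every parenthetical phrase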


    Links

    The Fuse filesystem: https://github.com/libfuse/libfuse

    fuse-zip man page: https://linux.die.net/man/1/fuse-zip

    MacOS Fuse: https://osxfuse.github.io/

    ImDisk (Windows): http://www.ltr-data.se/opencode.html/#ImDisk

    List of RAM drive software: https://en.wikipedia.org/wiki/List_of_RAM_drive_software

    MS docx file format: https://wiki.fileformat.com/word-processing/docx/

    The xml.etree.ElementTree docs: https://docs.python.org/3/library/xml.etree.elementtree.html?highlight=xml%20etree#module-xml.etree.ElementTree

    XPATH: https://docs.python.org/3/library/xml.etree.elementtree.html?highlight=xml%20etree#elementtree-xpath

    The XML example borrowed some ideas from: https://etienned.github.io/posts/extract-text-from-word-docx-simply/