
In Python, how can I get part of a docx document?


I would like to get part of a docx document (for example, 10% of its content) with Python 3. How can I do this? Thanks.


Solution

  • A simple way to read the text of .docx files in Python is the docx2txt module.

    If you have pip installed you can open your terminal and run:

    pip install docx2txt
    

    Once you have the docx2txt module installed, import it:

    import docx2txt
    

    You can then extract the text of the document and keep only the parts you want. Here the contents of filename.docx are stored as a string in the variable text.

    text = docx2txt.process("filename.docx")
    print(text)
    

    You can now manipulate that string with basic built-in functions. The snippet below gets the length of the text with the len() function and slices the string down to about 10% of its length by taking a substring.

    length = len(text)
    print(length)  # prints 1000 for my sample document
    
    text = text[:length // 10]  # start at index 0 and keep the first tenth
    print(text)  # prints the first 10% of the string
    

    My full code for this example is below. I hope this is helpful!

    import docx2txt
    
    text = docx2txt.process("/home/jared/test.docx")
    print(text)
    
    length = len(text)
    print(length)  # prints 1000 for my sample document
    
    text = text[:length // 10]  # start at index 0 and keep the first tenth
    print(text)  # prints the first 10% of the string
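
If you need a percentage other than 10%, or don't want the cut to land in the middle of a word, the slicing step can be wrapped in a small helper. This is only a sketch: leading_fraction is a hypothetical name of my own, not part of docx2txt, and it works on any string, so it assumes you already have the document text from docx2txt.process().

```python
def leading_fraction(text: str, fraction: float = 0.10) -> str:
    """Return roughly the first `fraction` of `text`, cut at a word boundary.

    Hypothetical helper for illustration; it is not part of docx2txt.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")

    cutoff = int(len(text) * fraction)
    if cutoff >= len(text):
        return text

    # Look for the last space or newline at or before the cutoff so that
    # the result does not end in the middle of a word.
    boundary = max(
        text.rfind(" ", 0, cutoff + 1),
        text.rfind("\n", 0, cutoff + 1),
    )
    # Fall back to a plain character cut if there is no usable whitespace.
    return text[:boundary] if boundary > 0 else text[:cutoff]


# Usage with the docx2txt call from the answer above:
#   text = docx2txt.process("filename.docx")
#   print(leading_fraction(text, 0.10))
```

Note that this measures the 10% in characters of extracted text, not in pages or paragraphs, so the result is only an approximation of "10% of the document".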