Tags: python, counting, python-docx

Why python-docx?


import docx
import collections
listofnames = list()
filename = 'Missing_Assignments.docx'
filehandle = docx.Document(filename)
studentinfo = filehandle.paragraphs

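# Keep only paragraphs with more than one character (skips blank lines).
# The original test "> 1 or > 20" was equivalent to "> 1".
for student in studentinfo:
    if len(student.text) > 1:
        listofnames.append(student.text)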

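# Drop the 'Assignment ...' heading lines. Build a new list rather than calling
# remove() while iterating, which skips the element after each removal.
listofnames = [name for name in listofnames if not name.startswith('Assignment')]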

counts = collections.Counter(listofnames)
counts = dict(counts)

filehandle.add_paragraph('\n')

for name, count in counts.items():
    filehandle.add_paragraph(name + ' ' + str(count))

# Save once, after all the count lines have been added.
filehandle.save(filename)

print('Complete!')

This is more of a learning/efficiency question. If it is not generally considered appropriate here, please let me know which forums may be more suitable.

  1. Question is, why do I have to use docx? I'm used to creating a simple handle like:

    filehandle = open(filename)

And being able to iterate through the file this way. I was receiving all kinds of Unicode errors before using the python-docx library. It just seems slightly more complicated because I have to use its verbiage as opposed to directly iterating through each line of text like I normally would.

  2. Also, does anyone know of a way to break up the counting shown here? I want to count the number of times a name appears for various missing assignments, but only within a given period. Other periods may have students with the same name, which would complicate the counting.
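One possible way to scope the counts to each period is sketched below. It is only a sketch under an assumption the post does not confirm: that each period's block in the document starts with a heading line beginning with 'Period'. The function name `count_names_by_period` and both prefix parameters are hypothetical and used here for illustration.

```python
import collections


def count_names_by_period(lines, period_prefix="Period", skip_prefix="Assignment"):
    """Count names separately under each period heading.

    Assumption: every period's block starts with a line that begins with
    period_prefix. Lines starting with skip_prefix are assignment headings.
    """
    counts = {}
    current = None  # Counter for the period currently being read
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank paragraphs
        if text.startswith(period_prefix):
            # Start (or resume) the counter that belongs to this period.
            current = counts.setdefault(text, collections.Counter())
        elif text.startswith(skip_prefix):
            continue  # assignment heading, not a student name
        elif current is not None:
            current[text] += 1
    return counts


# Made-up sample data standing in for the paragraph texts of the document.
lines = [
    "Period 1", "Assignment 1", "Alex Smith", "Assignment 2", "Alex Smith",
    "Period 2", "Assignment 1", "Alex Smith",
]
for period, counter in count_names_by_period(lines).items():
    print(period, dict(counter))
```

With python-docx, the `lines` argument could be built as `[p.text for p in filehandle.paragraphs]`. Because each period gets its own `Counter`, two students with the same name in different periods are tallied separately.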

Solution

  • You should use python-docx when you have a docx file.

    You can open a simple handle to parse a plain text file, but docx is not a plain text format.

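    It is actually a ZIP archive containing XML files, which is why reading it with open() in text mode produced Unicode errors: Python was trying to decode compressed binary data as text. You can read more about the format here: https://docs.fileformat.com/word-processing/docx/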

    You could write your own parser for it, since the standard is open, but there are interoperability glitches. You can read more about that here: https://brattahlid.wordpress.com/2012/05/08/is-docx-really-an-open-standard/

    To summarize, python-docx takes the burden of parsing the file format off your hands.
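To make the "ZIP archive containing XML" point concrete, here is a rough sketch of what a hand-rolled parser might look like using only the standard library. The helper names `paragraph_texts` and `make_demo_archive` are invented for this example, and the demo archive is a tiny stand-in built in memory, not a valid Word file. The sketch reads only plain paragraph text from `word/document.xml` and ignores tables, headers, styles and everything else, which is the kind of work python-docx handles for you.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_texts(docx_file):
    """Return the text of each <w:p> paragraph in a .docx path or file-like object."""
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    texts = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        pieces = paragraph.iter(f"{{{W_NS}}}t")
        texts.append("".join(piece.text or "" for piece in pieces))
    return texts


def make_demo_archive(names):
    """Build a minimal stand-in archive in memory (illustration only)."""
    body = "".join(f"<w:p><w:r><w:t>{name}</w:t></w:r></w:p>" for name in names)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


demo = make_demo_archive(["Alice", "Bob", "Alice"])
print(demo.getvalue()[:2])    # raw bytes start with the ZIP signature, not text
print(paragraph_texts(demo))  # paragraph text recovered from the XML part
```

The raw bytes begin with the ZIP signature `PK` rather than readable text, which is what a plain `open(filename)` handle ends up trying to decode.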