Why python-docx?

import docx
import collections
listofnames = list()
filename = 'Missing_Assignments.docx'
filehandle = docx.Document(filename)
studentinfo = filehandle.paragraphs

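for student in studentinfo:
    # Skip empty or one-character paragraphs
    if len(student.text) > 1:
        listofnames.append(student.text)

# Build a new list instead of calling remove() while iterating,
# which would skip the element right after each removal
listofnames = [name for name in listofnames if not name.startswith('Assignment')]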

counts = collections.Counter(listofnames)
counts = dict(counts)

filehandle.add_paragraph('\n')

for name, count in counts.items():
    filehandle.add_paragraph(name + ' ' + str(count))

# Save once, after all paragraphs have been added
filehandle.save(filename)

print('Complete!')

This is more of a learning/efficiency question. If it is not generally considered appropriate here, please let me know which forums may be more suitable.

The question is: why do I have to use python-docx? I'm used to creating a simple file handle like:

filehandle = open(filename)

and being able to iterate through the file that way. I was getting all kinds of Unicode errors before using the python-docx library. It just seems slightly more complicated because I have to use its vocabulary as opposed to directly iterating through each line of text like I normally would.

Also, does anyone know of a way to break up the counting shown here? I want to count the number of times a name appears for various missing assignments, but only within a given period. Other periods may have students with the same name, which would complicate the counting.

Solution

You should use python-docx when you have a docx file.

You can open a simple handle to parse a plain text file, but docx is not a plain text format.

It is actually a ZIP archive containing XML files. You can read more about that here: https://docs.fileformat.com/word-processing/docx/
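To make that concrete, here is a minimal sketch using only the standard library. It builds a tiny in-memory ZIP as a stand-in for a real .docx (an assumption for illustration; a file written by Word contains many more parts) and then lists and reads its members. The helper names list_archive_parts and read_part are hypothetical, not part of python-docx.

```python
import io
import zipfile


def list_archive_parts(source):
    """Return the names of the files stored inside a ZIP-based document."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def read_part(source, part_name):
    """Return the raw bytes of one archive member, e.g. 'word/document.xml'."""
    with zipfile.ZipFile(source) as archive:
        return archive.read(part_name)


# Stand-in for a real .docx: a small in-memory ZIP with a single XML part.
# In a real .docx the main body text lives in 'word/document.xml'.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', '<w:document>Alice</w:document>')

parts = list_archive_parts(buffer)
xml_bytes = read_part(buffer, 'word/document.xml')

print(parts)                  # names of the parts inside the archive
print(buffer.getvalue()[:2])  # ZIP data starts with the bytes b'PK'
print(xml_bytes)              # the XML you would still have to parse yourself
```

Passing the path of an actual .docx file to list_archive_parts should work the same way, since zipfile.ZipFile accepts either a path or a file-like object. This is also why a plain open(filename) in text mode fails: it tries to decode compressed binary data as text.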

You could write your own parser for it, since the standard is open, but there are interoperability glitches. You can read more about it here: https://brattahlid.wordpress.com/2012/05/08/is-docx-really-an-open-standard/

To summarize, python-docx takes the burden of parsing the file format off your hands.
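As for counting names per period, one possible approach is to keep a separate Counter for each period. The sketch below is only an illustration under assumptions: it supposes that each period starts with a paragraph beginning with 'Period' and that assignment headings begin with 'Assignment', which may not match your document. The function name count_names_by_period and the sample names are made up. It works on a plain list of strings, which you could get from python-docx with [p.text for p in filehandle.paragraphs].

```python
from collections import Counter, defaultdict


def count_names_by_period(lines, period_prefix='Period', skip_prefix='Assignment'):
    """Count how often each name appears, separately for each period.

    Assumes (hypothetically) that a line starting with period_prefix opens
    a new period and that lines starting with skip_prefix are headings.
    """
    counts = defaultdict(Counter)
    current_period = None
    for line in lines:
        text = line.strip()
        if not text:
            continue                      # ignore blank paragraphs
        if text.startswith(period_prefix):
            current_period = text         # every following name belongs here
        elif text.startswith(skip_prefix):
            continue                      # assignment heading, not a name
        elif current_period is not None:
            counts[current_period][text] += 1
    return counts


# Made-up sample data standing in for the paragraph texts of the document.
sample_lines = [
    'Period 1', 'Assignment 1', 'Alex Smith', 'Jordan Lee',
    'Assignment 2', 'Alex Smith',
    'Period 2', 'Assignment 1', 'Alex Smith',
]

per_period = count_names_by_period(sample_lines)
for period, names in per_period.items():
    for name, count in names.items():
        print(period, name, count)
```

Because each period has its own Counter, two students with the same name in different periods are tallied separately.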