import collections

import docx

filename = 'Missing_Assignments.docx'
filehandle = docx.Document(filename)

# Keep non-empty paragraphs, skipping the 'Assignment' heading lines.
# Filtering here avoids removing items from a list while iterating over it,
# which skips the element that follows each removal.
listofnames = []
for student in filehandle.paragraphs:
    if len(student.text) > 1 and not student.text.startswith('Assignment'):
        listofnames.append(student.text)

# Count how many times each name appears.
counts = collections.Counter(listofnames)

# Append the tally to the end of the document and save it in place.
filehandle.add_paragraph('\n')
for name, count in counts.items():
    filehandle.add_paragraph(name + ' ' + str(count))
filehandle.save(filename)
print('Complete!')
This is more of a learning/efficiency question. If it isn't considered appropriate here, please let me know which forums might be more suitable.
My question is: why do I have to use docx? I'm used to creating a simple file handle like:
filehandle = open(filename)
and being able to iterate through the file that way. I was getting all kinds of Unicode errors before I switched to the python-docx library. It just seems slightly more complicated, because I have to use its verbiage instead of iterating directly over each line of text like I normally would.
You should use python-docx when you have a docx file.
You can open a simple handle to parse a plain text file, but docx is not a plain text format.
It is actually a ZIP archive containing XML files. You can read more about that here: https://docs.fileformat.com/word-processing/docx/
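To see this for yourself, here is a minimal sketch using only the standard library's zipfile module. The helper name list_docx_parts is made up for illustration. It also explains the Unicode errors: open() in text mode tries to decode compressed binary data as text.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of the files stored inside a .docx archive.

    `path` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

Calling list_docx_parts('Missing_Assignments.docx') on a typical Word file should return entries such as '[Content_Types].xml' and 'word/document.xml'. The latter holds the main body text.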
You could write your own parser for it, since the standard is actually open, but there are interoperability glitches. You can read more about those here: https://brattahlid.wordpress.com/2012/05/08/is-docx-really-an-open-standard/
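For a rough idea of what a hand-rolled parser involves, here is a minimal sketch, not a replacement for python-docx. The function name read_docx_paragraphs is hypothetical. It assumes the document uses the common (transitional) WordprocessingML namespace. It ignores tabs, line breaks, headers and footers, and it also picks up paragraphs inside tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
# (assumption: the file is not saved in the "strict" OOXML variant).
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def read_docx_paragraphs(path):
    """Return the text of each <w:p> paragraph in the main document part."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))

    paragraphs = []
    for para in root.iter(W_NS + 'p'):
        # A paragraph's visible text is split across <w:t> elements,
        # one or more per formatting run, so join them back together.
        pieces = [t.text or '' for t in para.iter(W_NS + 't')]
        paragraphs.append(''.join(pieces))
    return paragraphs
```

Even this small version has to know about namespaces and about text being split across runs, and that is only the simplest part of the format.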
To summarize, python-docx takes the burden of parsing the file format off your hands.