I have roughly 6,000 to 6,500 Microsoft Word .docx files containing answer scripts with varying formatting, in the sequence:
Python Programming Question in Bold
Answer in form of complete, correctly-indented, single-spaced, self-sufficient code
Unfortunately, there seems to be no fixed pattern delineating the code blocks from normal text. Some examples from the first 50 or so files:
Entire Question in bold, after which code starts abruptly, in bold/italics
Question put in comments, after which code continues
Question completely missing, just code with numbered lists indicating start
Question completely missing, with C/Python-style comments indicating the start
etc.
For now, I'm extracting the entire unformatted text through python-docx like this:
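from docx import Document

# infil and outfil are the input .docx path and the output text path.
doc = Document(infil)
# For Unicode handling.
new_paragraphs = []
for paragraph in doc.paragraphs:
    new_paragraphs.append((paragraph.text).encode("utf-8"))
# convert() is a helper defined elsewhere (not shown); it has to return str
# for the join below to work.
new_paragraphs = list(map(lambda x: convert(x), new_paragraphs))
with open(outfil, 'w', encoding='utf-8') as f:
    print('\n'.join(new_paragraphs), file=f)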
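The loop above keeps only `paragraph.text`, so the bold flag that sometimes marks the question is thrown away. A minimal sketch of carrying that flag along, assuming python-docx paragraph objects (which expose `.runs`, each with `.text` and a tri-state `.bold`); the names `paragraph_is_bold` and `tag_paragraphs` are hypothetical:

```python
def paragraph_is_bold(paragraph):
    """True if every non-whitespace run in the paragraph is explicitly bold."""
    runs = [run for run in paragraph.runs if run.text.strip()]
    # run.bold is True, False, or None (None means "inherited from the style"),
    # so only an explicit True counts here.
    return bool(runs) and all(run.bold is True for run in runs)


def tag_paragraphs(paragraphs):
    """Return (is_bold, text) pairs, one per paragraph, in document order."""
    return [(paragraph_is_bold(p), p.text) for p in paragraphs]
```

With python-docx this would be called as `tag_paragraphs(Document(infil).paragraphs)`. Bold applied through a style shows up as `None` and is missed, and since some scripts have the code in bold as well, this is at best one signal among several.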
Once extracted, I'll run them using the PyPy sandboxing feature, which I understand is safe, and then assign points as in a contest.
What I'm completely stuck on is how to detect the start and end of the code programmatically. Most of the language detection APIs are unneeded, since I already know the language. This question, "How to detect source code in a text?", suggests using linters and syntax highlighters like the Google Code Prettifier, but they don't solve the issue of detecting separate programs.
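Since the language is known, one option worth sketching is to let Python's own parser decide where code starts and stops: take each non-blank line as a candidate start and keep the longest run of lines from there that `ast.parse` accepts. This is only a heuristic sketch, not a tested solution for the real files; `find_code_blocks` and its helpers are hypothetical names, and bare words or list numbers such as `2.` are treated as text even though they are technically valid Python.

```python
import ast


def _is_trivial(stmt):
    """A bare name or non-string literal, e.g. a heading word or '2.'."""
    if not isinstance(stmt, ast.Expr):
        return False
    value = stmt.value
    if isinstance(value, ast.Name):
        return True
    return isinstance(value, ast.Constant) and not isinstance(value.value, str)


def _parses_as_code(lines):
    """True if the lines parse as Python with no trivial top-level statement."""
    try:
        tree = ast.parse("\n".join(lines) + "\n")
    except (SyntaxError, ValueError):
        return False
    return bool(tree.body) and not any(_is_trivial(s) for s in tree.body)


def find_code_blocks(lines):
    """Return half-open (start, end) line ranges that parse as programs."""
    blocks = []
    i, n = 0, len(lines)
    while i < n:
        if not lines[i].strip():
            i += 1
            continue
        # Try the longest candidate first so a whole program beats a prefix.
        end = next(
            (j for j in range(n, i, -1) if _parses_as_code(lines[i:j])), None
        )
        if end is None:
            i += 1  # this line cannot start a program; treat it as text
            continue
        while not lines[end - 1].strip():
            end -= 1  # drop trailing blank lines from the block
        blocks.append((i, end))
        i = end
    return blocks


sample = [
    "Write a program to add two numbers",
    "a = 1",
    "b = 2",
    "print(a + b)",
    "Now reverse a string",
    "def rev(s):",
    "    return s[::-1]",
    "print(rev('abc'))",
]
print(find_code_blocks(sample))  # [(1, 4), (5, 8)]
```

The search re-parses many overlapping spans, which is quadratic in the number of lines but should be tolerable for short answer scripts. Two programs with nothing unparseable between them would still be merged into one block, and prose that happens to be valid Python (for example `a, b`) would be misread as code.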
A suitable solution, from this programmers.se question, seems to be training Markov chains, but I wanted some second opinions before embarking on such a vast project.
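To give a feel for the size of the Markov chain idea, here is a toy sketch: one character-level bigram model trained on code lines, one on prose lines, and a line is labelled by whichever model gives it the higher log-likelihood. Everything here is an assumption for illustration: the class and function names are hypothetical, the training lines are made up, and the add-one smoothing over a 128-symbol alphabet is an arbitrary choice. A real version would need far more training data from the actual scripts.

```python
import math
from collections import Counter

START = "\x02"        # assumed marker for "beginning of line"
ALPHABET_SIZE = 128   # assumed vocabulary size used for add-one smoothing


class CharMarkov:
    """First-order Markov chain over characters, trained on sample lines."""

    def __init__(self, samples):
        self.pair_counts = Counter()
        self.prev_counts = Counter()
        for line in samples:
            prev = START
            for ch in line:
                self.pair_counts[(prev, ch)] += 1
                self.prev_counts[prev] += 1
                prev = ch

    def log_likelihood(self, line):
        """Sum of smoothed log transition probabilities along the line."""
        total = 0.0
        prev = START
        for ch in line:
            num = self.pair_counts[(prev, ch)] + 1
            den = self.prev_counts[prev] + ALPHABET_SIZE
            total += math.log(num / den)
            prev = ch
        return total


def classify(line, code_model, prose_model):
    """Label a line 'code' or 'prose' by comparing the two likelihoods."""
    if code_model.log_likelihood(line) > prose_model.log_likelihood(line):
        return "code"
    return "prose"


# Made-up training lines, far too few for real use.
code_model = CharMarkov([
    "for i in range(10):",
    "    print(i)",
    "x = [1, 2, 3]",
    "def f(a, b):",
    "    return a + b",
])
prose_model = CharMarkov([
    "Write a program to add two numbers",
    "Explain the output of the loop",
    "What is a list in Python",
])

print(classify("print(a + b)", code_model, prose_model))
print(classify("Write the output of a program", code_model, prose_model))
```

The model itself is small; the vast part of the project would be collecting and labelling enough lines to train it, and deciding how to smooth noisy per-line labels into block boundaries.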
This extraction code will also be provided to all students after evaluation.
I apologize if the question is too broad or the answer too obvious.
Hmm, so you are looking for some kind of formatting pattern? That sounds a bit odd to me. Is there any kind of text or string pattern that you can exploit? I'm not sure whether this will help, but the VBA script below searches through all Word documents in a folder and puts "Yes" in the cell under any search term (the terms you specify in row 1) that is found in that document. It also puts a hyperlink in column A, so you can click the link and open the file rather than searching around for it. Here is a screenshot.
Script:
Sub OpenAndReadWordDoc()
    ' Run from Excel. Early binding (Word.Application) needs a reference to
    ' the Microsoft Word Object Library (Tools > References).
    Rows("2:1000000").Select
    Range(Selection, Selection.End(xlDown)).Select
    Selection.ClearContents
    Range("A1").Select

    ' assumes that the previous procedure has been executed
    Dim oWordApp As Word.Application
    Dim oWordDoc As Word.Document
    Dim blnStart As Boolean
    Dim r As Long
    Dim sFolder As String
    Dim strFilePattern As String
    Dim strFileName As String
    Dim sFileName As String
    Dim ws As Worksheet
    Dim c As Long
    Dim n As Long

    '~~> Establish a Word application object
    On Error Resume Next
    Set oWordApp = GetObject(, "Word.Application")
    If Err() Then
        Set oWordApp = CreateObject("Word.Application")
        ' We started Word for this macro
        blnStart = True
    End If
    On Error GoTo ErrHandler

    Set ws = ActiveSheet
    r = 1 ' start row for the copied text from the Word document
    ' Last column
    n = ws.Range("A1").End(xlToRight).Column

    sFolder = "C:\Users\your_path_here\"

    '~~> This is the extension you want to go in for
    strFilePattern = "*.doc*"
    '~~> Loop through the folder to get the Word files
    strFileName = Dir(sFolder & strFilePattern)
    Do Until strFileName = ""
        sFileName = sFolder & strFileName

        '~~> Open the Word doc
        Set oWordDoc = oWordApp.Documents.Open(sFileName)
        ' Increase row number
        r = r + 1
        ' Enter file name in column A
        ws.Cells(r, 1).Value = sFileName
        ActiveCell.Offset(1, 0).Select
        ' A trailing underscore continues the statement on the next line
        ws.Hyperlinks.Add Anchor:=ws.Range("A" & r), Address:=sFileName, _
            SubAddress:="A" & r, TextToDisplay:=sFileName
        ' Loop through the columns
        For c = 2 To n
            If oWordDoc.Content.Find.Execute(FindText:=Trim(ws.Cells(1, c).Value), _
                    MatchWholeWord:=True, MatchCase:=False) Then
                ' If text found, enter Yes in column number c
                ws.Cells(r, c).Value = "Yes"
            End If
        Next c
        oWordDoc.Close SaveChanges:=False

        '~~> Find next file
        strFileName = Dir()
    Loop

ExitHandler:
    On Error Resume Next
    ' close the Word application
    Set oWordDoc = Nothing
    If blnStart Then
        ' We started Word, so we close it
        oWordApp.Quit
    End If
    Set oWordApp = Nothing
    Exit Sub

ErrHandler:
    MsgBox Err.Description, vbExclamation
    Resume ExitHandler
End Sub
Function GetDirectory(path)
    GetDirectory = Left(path, InStrRev(path, "\"))
End Function