Tags: python, python-3.x, extract, sandbox, python-docx

What is a safe way to extract python code blocks from docx files and run them in a sandbox?


I have roughly 6,000 to 6,500 Microsoft Word .docx files containing answer scripts in various formats, each following the sequence:

Python Programming Question in Bold

Answer in form of complete, correctly-indented, single-spaced, self-sufficient code

Unfortunately, there seems to be no fixed pattern delineating the code blocks from normal text. Some examples from the first 50 or so files:

  1. Entire Question in bold, after which code starts abruptly, in bold/italics

  2. Question put in comments, after which code continues

  3. Question completely missing, just code with numbered lists indicating start

  4. Question completely missing, with C/Python-style comments indicating start

etc.
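The patterns listed above could be turned into per-paragraph features as a first step. The following is only a minimal sketch: `paragraph_features`, `NUMBERED` and `COMMENT_STARTS` are hypothetical names, and it assumes each paragraph object exposes `text` and `runs` (with `bold` and `italic` on every run), as python-docx paragraphs do. It only sees numbers typed as text; numbering applied through Word's list formatting is typically not part of the paragraph text.

```python
import re

# Hypothetical helpers, not part of python-docx.
NUMBERED = re.compile(r"^\s*\d+[.)]\s+")   # "1. " or "2) " typed as text
COMMENT_STARTS = ("#", "//", "/*")          # Python- and C-style comment openers


def paragraph_features(paragraph):
    """Return simple formatting/text cues for one paragraph.

    `paragraph` is assumed to have `.text` and `.runs`; each run is
    assumed to have `.text`, `.bold` and `.italic` (True, False or None,
    where None means the value is inherited from the style).
    """
    # Ignore whitespace-only runs so they do not spoil the "all bold" check.
    runs = [run for run in paragraph.runs if run.text.strip()]
    text = paragraph.text
    return {
        "all_bold": bool(runs) and all(run.bold is True for run in runs),
        "all_italic": bool(runs) and all(run.italic is True for run in runs),
        "comment_start": text.lstrip().startswith(COMMENT_STARTS),
        "numbered": bool(NUMBERED.match(text)),
    }
```

With python-docx, the paragraphs from `Document(infil).paragraphs` could be passed in directly. Since pattern 1 has code in bold as well, these cues are hints to combine with other checks rather than a reliable delimiter on their own.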

For now, I'm extracting the entire unformatted text through python-docx like this:

from docx import Document

# infil and outfil are the input .docx path and the output text path.
doc = Document(infil)

# paragraph.text is already a Unicode str in Python 3, so it can be
# written out directly without an encode/convert round trip.
new_paragraphs = [paragraph.text for paragraph in doc.paragraphs]

with open(outfil, 'w', encoding='utf-8') as f:
    print('\n'.join(new_paragraphs), file=f)

Once the programs are extracted, I'll run them using PyPy's sandboxing feature, which I understand is safe, and then assign points as in a contest.

What I'm completely stuck on is how to detect the start and end of the code programmatically. Most language detection APIs are unnecessary, since I already know the language. The question "How to detect source code in a text?" suggests using linters and syntax highlighters such as Google Code Prettify, but those don't solve the problem of detecting separate programs.
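One cheap variation on the linter idea is to let Python's own parser decide which runs of lines are code. The sketch below is an assumption-laden illustration, not a complete solution: `parses` and `split_code_blocks` are hypothetical names, and the input is assumed to be the extracted text split into lines. From each starting line it takes the longest run that `ast.parse` accepts, so prose that is not valid Python acts as a separator.

```python
import ast


def parses(lines):
    """Return True if the given lines form syntactically valid Python."""
    try:
        ast.parse("\n".join(lines))
    except (SyntaxError, ValueError):  # ValueError: e.g. null bytes in source
        return False
    return True


def split_code_blocks(lines, min_lines=2):
    """Greedily collect maximal runs of lines that parse as Python.

    min_lines filters out short accidental matches, since a single word
    of prose is also a valid Python expression.
    """
    blocks = []
    i, n = 0, len(lines)
    while i < n:
        if not lines[i].strip():
            i += 1
            continue
        best = None
        for j in range(i + 1, n + 1):
            if parses(lines[i:j]):
                best = j  # remember the longest prefix that parses
        if best is not None:
            # Drop trailing blank lines from the candidate block.
            while best > i and not lines[best - 1].strip():
                best -= 1
        if best is not None and best - i >= min_lines:
            blocks.append("\n".join(lines[i:best]))
            i = best
        else:
            i += 1
    return blocks
```

This only separates programs where unparseable text sits between them. Two programs divided by nothing more than a comment or a blank line come out as one block, and a one-word prose line directly above code gets absorbed into it, so the formatting cues would still be needed for those cases.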

A suitable solution, from this programmers.se question, seems to be training Markov chains, but I wanted a second opinion before embarking on such a vast project.

This extraction code will also be provided to all students after evaluation.

I apologize if the question is too broad or the answer too obvious.


Solution

  • Hmm, so you are looking for some kind of formatting pattern? That sounds odd to me. Is there any text or string pattern that you can exploit instead? I'm not sure whether this will help, but the Excel VBA macro below searches through all Word documents in a folder and writes "Yes" in every column whose search term, specified in row 1, occurs in the document. It also puts a hyperlink to each file in column A, so you can click the link to open the file rather than hunting for it.

    Script:

    Sub OpenAndReadWordDoc()
    
        Rows("2:1000000").Select
        Range(Selection, Selection.End(xlDown)).Select
        Selection.ClearContents
        Range("A1").Select
    
        ' assumes that the previous procedure has been executed
        Dim oWordApp As Word.Application
        Dim oWordDoc As Word.Document
        Dim blnStart As Boolean
        Dim r As Long
        Dim sFolder As String
        Dim strFilePattern As String
        Dim strFileName As String
        Dim sFileName As String
        Dim ws As Worksheet
        Dim c As Long
        Dim n As Long
    
        '~~> Establish a Word application object
        On Error Resume Next
        Set oWordApp = GetObject(, "Word.Application")
        If Err() Then
            Set oWordApp = CreateObject("Word.Application")
            ' We started Word for this macro
            blnStart = True
        End If
        On Error GoTo ErrHandler
    
        Set ws = ActiveSheet
        r = 1 ' startrow for the copied text from the Word document
        ' Last column
        n = ws.Range("A1").End(xlToRight).Column
    
        sFolder = "C:\Users\your_path_here\"
    
        '~~> This is the extension you want to go in for
        strFilePattern = "*.doc*"
        '~~> Loop through the folder to get the word files
        strFileName = Dir(sFolder & strFilePattern)
        Do Until strFileName = ""
            sFileName = sFolder & strFileName
    
            '~~> Open the word doc
            Set oWordDoc = oWordApp.Documents.Open(sFileName)
            ' Increase row number
            r = r + 1
            ' Enter file name in column A
            ws.Cells(r, 1).Value = sFileName
    
            ActiveCell.Offset(1, 0).Select
            ActiveSheet.Hyperlinks.Add Anchor:=Sheets("Sheet1").Range("A" & r), Address:=sFileName, _
                SubAddress:="A" & r, TextToDisplay:=sFileName
    
            ' Loop through the columns
            For c = 2 To n
                If oWordDoc.Content.Find.Execute(FindText:=Trim(ws.Cells(1, c).Value), _
                        MatchWholeWord:=True, MatchCase:=False) Then
                    ' If text found, enter Yes in column number c
                    ws.Cells(r, c).Value = "Yes"
                End If
            Next c
            oWordDoc.Close SaveChanges:=False
    
            '~~> Find next file
            strFileName = Dir()
        Loop
    
    ExitHandler:
        On Error Resume Next
        ' close the Word application
        Set oWordDoc = Nothing
        If blnStart Then
            ' We started Word, so we close it
            oWordApp.Quit
        End If
        Set oWordApp = Nothing
        Exit Sub
    
    ErrHandler:
        MsgBox Err.Description, vbExclamation
        Resume ExitHandler
    End Sub
    
    Function GetDirectory(path)
        GetDirectory = Left(path, InStrRev(path, "\"))
    End Function
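
For comparison, the matching step of the macro can be sketched in Python, the language of the question. This is a rough equivalent under stated assumptions, not a port: `keyword_matrix` and its parameters are hypothetical names, and the document text is assumed to have been extracted already (for example with python-docx, as in the question). It mirrors `MatchWholeWord:=True` and `MatchCase:=False` with a whole-word, case-insensitive regular expression.

```python
import re


def keyword_matrix(texts, keywords):
    """Map each document name to {keyword: found?}.

    texts: dict of document name -> extracted plain text (assumed input).
    keywords: iterable of search terms, like row 1 of the worksheet.
    Matching is whole-word and case-insensitive.
    """
    patterns = {}
    for keyword in keywords:
        term = keyword.strip()  # same role as Trim() in the macro
        if not term:
            continue  # skip blank header cells
        # Lookarounds instead of \b, so terms that start or end with
        # punctuation are still matched as whole tokens.
        patterns[keyword] = re.compile(
            r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE
        )
    return {
        name: {kw: bool(pat.search(text)) for kw, pat in patterns.items()}
        for name, text in texts.items()
    }
```

The result is a nested dict rather than a worksheet; writing it out as CSV would give roughly the same grid the macro produces.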