c# pdf itext pdf-parsing

How to check if a checkbox is checked or not on a non-form PDF using C#?


Using C#, I want to determine whether a specific checkbox is checked on a PDF page. The PDF is not a form PDF.

The PDF could look something like this: [image]

A sample file is here: MDS30ResidentP2.pdf (in this sample file, I want to somehow figure out that checkbox "E" in question A1000 is checked. Again: the PDF is not in "form" format!).

PS: none of the following posts solved my problem:


Solution

  • OCR is probably the only way. From the PDF's perspective, there's a rectangle, and some of those rectangles have two lines drawn through them. They're not even images but actual vector drawing commands. You could possibly look for that extra drawing of an "x", but it is unrelated to the text that appears beside it, so you'd have to write some fuzzy logic to estimate which "x" goes with which text, and I think you'd end up with a bunch of false positives. If you've got a lot of these PDFs it might be worth writing something; otherwise, use OCR or manual entry.

    If you want to parse the PDF, you can try something like this, which is a little ugly, but if you're parsing the same PDF over and over again it might work OK. If you want something more generic and reusable, I would check out the post by the creator of iText here. His post is about optional content groups, but it should give you some ideas to start with.
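To make the "fuzzy logic" idea above a bit more concrete, here is a minimal sketch of just the geometric part. It assumes you have already pulled the rectangles, line segments, and text positions out of the page with whatever PDF library you use; that extraction step is not shown. The `Segment`, `Box`, and `Label` types, the method names, and the tolerance values are all hypothetical and only for illustration, and the coordinates in the demo are made up rather than taken from the sample file.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical plain-data types. In practice they would be filled from the
// path and text extraction of your PDF library (not shown here).
public record Segment(double X1, double Y1, double X2, double Y2);
public record Box(double Left, double Bottom, double Right, double Top);
// X is the left edge of the label, Y its vertical center (an assumption).
public record Label(string Text, double X, double Y);

public static class CheckboxHeuristic
{
    // True if both endpoints of the segment lie inside the box, within a tolerance.
    static bool Inside(Box b, Segment s, double tol) =>
        Math.Min(s.X1, s.X2) >= b.Left - tol && Math.Max(s.X1, s.X2) <= b.Right + tol &&
        Math.Min(s.Y1, s.Y2) >= b.Bottom - tol && Math.Max(s.Y1, s.Y2) <= b.Top + tol;

    // A segment counts as diagonal if it is neither near-horizontal nor near-vertical.
    static bool IsDiagonal(Segment s, double tol) =>
        Math.Abs(s.X2 - s.X1) > tol && Math.Abs(s.Y2 - s.Y1) > tol;

    // +1 for a rising diagonal, -1 for a falling one.
    static int SlopeSign(Segment s) => Math.Sign((s.X2 - s.X1) * (s.Y2 - s.Y1));

    // Treat a box as checked if it contains one rising and one falling diagonal,
    // i.e. the two strokes of an "x".
    public static bool IsChecked(Box box, IEnumerable<Segment> segments, double tol = 1.0)
    {
        var diagonals = segments
            .Where(s => Inside(box, s, tol) && IsDiagonal(s, tol))
            .ToList();
        return diagonals.Any(s => SlopeSign(s) > 0) && diagonals.Any(s => SlopeSign(s) < 0);
    }

    // Guess the label for a box: the closest text to its right on roughly the same line.
    // Returns null if nothing qualifies.
    public static Label? NearestLabel(Box box, IEnumerable<Label> labels, double maxDy = 6.0)
    {
        double midY = (box.Bottom + box.Top) / 2;
        return labels
            .Where(l => l.X >= box.Right && Math.Abs(l.Y - midY) <= maxDy)
            .OrderBy(l => l.X - box.Right)
            .FirstOrDefault();
    }
}

public static class Demo
{
    public static void Main()
    {
        // Made-up coordinates: a 10x10 box with an "x" drawn through it.
        var box = new Box(100, 200, 110, 210);
        var segments = new[]
        {
            new Segment(100, 200, 110, 210), // rising stroke
            new Segment(100, 210, 110, 200), // falling stroke
        };
        var labels = new[] { new Label("E", 115, 205) };

        var label = CheckboxHeuristic.NearestLabel(box, labels);
        bool isChecked = CheckboxHeuristic.IsChecked(box, segments);
        Console.WriteLine($"{label?.Text}: {isChecked}"); // prints "E: True"
    }
}
```

The pairing rule (nearest text to the right, on about the same line) is exactly the kind of guess that can produce the false positives mentioned above, so the tolerances would need tuning against the actual layout, and a check mark drawn as a curve, a glyph, or an image would not be caught at all.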