algorithm, user-interface, statistics, survey

How to develop a program to minimize errors in human transcription of handwritten surveys


I need to develop custom software for conducting surveys. Most questions are multiple choice; a very few are free text.

I was asked to design a subsystem that checks for errors in the manual data entry of the multiple choice part. We're trying to speed up the data entry process and to minimize differences between the digital forms and the original questionnaires. The surveys are filled in with handwritten marks and text by human interviewers, so some marks may be hard to read, and the data entry user could also accidentally select the wrong value for a question. We would like to avoid both.

The software must include some automatic control to detect possible typing differences. Each answer of the multiple choice questions has the same probability of being selected.

This question has two parts:

  • The GUI.

The simplest thing I have in mind is to make the question display as usable as possible: large, readable fonts and generous spacing between the choices. Is there something else? For faster input, I would like to use drop-down lists (favoring the keyboard over the mouse). Since the questions are grouped in sections, I would also like to show a summary of the answers selected for the questions of each section, but this could slow down the process. Any other ideas?

  • The error checking subsystem.

What else can I do to minimize or detect human typos in the multiple choice questions? Is this a solvable problem? Is there some statistical methodology to check that the values entered by the users match the hand-filled forms? For example, suppose the survey has 5 questions, each with 4 options, and I have n survey forms filled in on paper by interviewers, ready to be entered into the software. How can I minimize the accidental differences introduced by manually transcribing the n surveys, without having to double check all 5 questions on all n surveys?

My first idea is that, once all the hand-filled forms have been processed, the software could choose some forms at random and have their responses double checked. But on what criteria should I make this selection? Would this validation be enough to cover everything in a statistically significant way?
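
To make the random double check concrete, here is a rough sketch in Python, not a definitive method. It draws a simple random sample of forms to re-check and then turns the number of mismatches found into an approximate 95% range for the per-answer error rate, using a Wilson score interval. The function names, the form ID format and the counts in the usage lines are all made up for illustration. The interval also treats every re-checked answer as independent, which is only an approximation when errors cluster on particular forms or clerks.

```python
import math
import random


def pick_audit_sample(form_ids, sample_size, seed=None):
    """Pick forms to re-check, uniformly at random and without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    ids = list(form_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def wilson_interval(errors, checked, z=1.96):
    """Approximate confidence interval for the true error rate.

    errors  -- number of re-checked answers that did not match the paper form
    checked -- total number of answers that were re-checked
    z       -- normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    if checked <= 0:
        raise ValueError("checked must be positive")
    p = errors / checked
    denom = 1 + z * z / checked
    centre = (p + z * z / (2 * checked)) / denom
    half = z * math.sqrt(p * (1 - p) / checked + z * z / (4 * checked ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical usage: 5000 entered forms, 200 of them re-checked.
all_forms = [f"F{i:05d}" for i in range(5000)]
to_recheck = pick_audit_sample(all_forms, 200, seed=42)

# Suppose the re-check covered 200 forms x 5 questions and found 12 mismatches.
low, high = wilson_interval(errors=12, checked=200 * 5)
print(f"estimated error rate between {low:.2%} and {high:.2%}")
```

Note that this only estimates how many errors there are; it does not correct the answers on the forms that were not sampled.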

The actual survey is national in scope and has 56 pages with over 200 questions in total, so there will be a lot of handwritten pages from many people. The intention is to reduce the likelihood of errors and to optimize the speed of the data entry process. The surveys must be filled in on paper first, given the complications of sending interviewers out with laptops or handhelds.


Solution

  • Call me old-school, but I still think the most pragmatic way to do this is to use double entry. Two data entry clerks enter their surveys, then swap stacks and enter the other clerk's surveys. Whenever your system detects a difference between the two, it throws up a flag - then the two clerks put their heads together and decide on the correct answer (or maybe it gets reviewed by a more senior research staff member, etc.). Combined with some of the other suggestions here (I like mdma's suggestions for the GUI a lot), this would make for a low-error system.

    Yes, this will double your data entry time (maybe) - but it's dead simple and will cut your errors way, way down. The OMR idea is a great one, but it doesn't sound to me like this project (a national, 56-page survey) is the best case for a lone hacker to try to implement that for the first time. What software do you need? What hardware is available to do that? There will still be a lot of human work involved in identifying the goofy stuff where an interviewer marks all four possible answers and then writes a note off to the side - you'll likely want to randomly sample surveys to get a sense of what the machine-read error rate is. Even then you still just have an estimate of the error rate, not corrected data.

    Try a simpler method to give your employer quality results this time - then use those results as a pre-validated data set for experimenting with the OMR stuff for next time.
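
The comparison step of double entry is small enough to sketch. The Python below is only an illustration under assumed data shapes: each clerk's pass is taken to be a dictionary mapping a survey ID to a dictionary of question ID to chosen option, and `find_discrepancies` is a made-up name, not part of any existing system. An answer that one clerk entered and the other skipped is flagged as well.

```python
from typing import Dict, List, Optional, Tuple

# Assumed shape for one data entry pass: {survey_id: {question_id: option}}
Entries = Dict[str, Dict[str, str]]
Flag = Tuple[str, str, Optional[str], Optional[str]]


def find_discrepancies(first_pass: Entries, second_pass: Entries) -> List[Flag]:
    """Return (survey_id, question_id, first_value, second_value) per mismatch.

    A value of None means that pass has no entry for that question.
    """
    flags: List[Flag] = []
    for survey_id in sorted(set(first_pass) | set(second_pass)):
        a = first_pass.get(survey_id, {})
        b = second_pass.get(survey_id, {})
        for question_id in sorted(set(a) | set(b)):
            va, vb = a.get(question_id), b.get(question_id)
            if va != vb:
                flags.append((survey_id, question_id, va, vb))
    return flags


# Hypothetical example: the two clerks disagree on one answer.
clerk_a = {"S001": {"Q1": "A", "Q2": "C"}, "S002": {"Q1": "B", "Q2": "D"}}
clerk_b = {"S001": {"Q1": "A", "Q2": "B"}, "S002": {"Q1": "B", "Q2": "D"}}

for flag in find_discrepancies(clerk_a, clerk_b):
    print(flag)  # prints ('S001', 'Q2', 'C', 'B')
```

Each flagged item would then go back to the paper form for the two clerks, or a senior reviewer, to settle.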