
What is an appropriate string to represent illegible data in a digital humanities transcription?


I am building a digital humanities app that holds a set of digitized historical documents, which students will transcribe. Here is the schema:

  create_table "documents", force: true do |t|
    t.string   "document_name"
    t.date     "date_filed"
    t.string   "grantor"
    t.string   "grantee"
    t.string   "description"
    t.string   "document_file_name"
    t.string   "document_content_type"
    t.integer  "document_file_size"
  end

  create_table "transcriptions", force: true do |t|
    t.text     "content"
    t.integer  "user_id"
    t.integer  "document_id"
  end

  create_table "users", force: true do |t|
    t.string   "email"
    t.string   "password_digest"
    t.string   "role"
  end

The app is pretty straightforward. I am using Paperclip to store the images on S3, and students will create a 'transcription', which is just a text field. We will then make the text searchable.

These are old documents with a lot of illegible text. I want some way for users to mark a word as illegible, in the hope of being able to identify it programmatically later on. One use case: when others (not the original transcriber) are viewing a transcription, they might be able to suggest a reading for (or edit) an illegible word.

As an example, the user might see the sentence "See Jack Rzn" in a document image. If they can't interpret the word, they might enter "See Jack ---" in the text area. Or, if they think they know what the word is but are not sure, they could enter something like "See Jack -! run !-". Later I could look for instances of --- or -! * !- to identify illegible text.
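
To make that later lookup concrete, here is a rough sketch in plain Ruby. The two patterns are only my guesses at how the --- and -! guess !- markers above would be matched, and the constant and method names are made up for illustration.

```ruby
# Hypothetical marker patterns taken from the examples above; not an established standard.
ILLEGIBLE = /(?<!\S)---(?!\S)/   # a standalone run of exactly three dashes
UNCERTAIN = /-!\s*(.+?)\s*!-/    # -! guess !- with the guess captured

# Number of words the transcriber could not read at all.
def illegible_count(text)
  text.scan(ILLEGIBLE).size
end

# The guesses a transcriber wrapped in -! ... !- markers.
def uncertain_guesses(text)
  text.scan(UNCERTAIN).flatten
end
```

One thing this already shows: a bare --- has to be anchored to whitespace, otherwise ordinary dashes or rules in the source text could be counted as illegible words.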

I'm just spitballing, but I'm wondering whether some characters might give me less grief later on when it comes time to do 'other stuff' with these transcriptions.


Solution

  • After some research this week, here are a few thoughts.

    First, the Smithsonian has a crowd-sourced digitizing project and these are the guidelines they recommend:

    If you find a word you can’t quite read
    
    Please make a note using double brackets [[ ]] like this: [[good guess?]] or simply [[?]]. Save your work and you can continue transcribing the rest of the item.
    

    ...more info here: https://transcription.si.edu/instructions

    Second, there are a couple of off-the-shelf options out there. One is Scripto (http://scripto.org/omeka/), which is based on the Omeka DH tool.

    For Rails folks there is fromthepage (https://github.com/benwbrum/fromthepage), a wiki-style app that allows transcribers to collaborate on handwritten documents.
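
If you adopt the Smithsonian double-bracket convention quoted above, finding the uncertain spans later is a single regular expression. This is a minimal sketch in plain Ruby, not something taken from any of the projects mentioned; UNCLEAR and unclear_spans are hypothetical names, and treating a trailing "?" as part of the marker rather than the guess is my own assumption.

```ruby
# Matches [[?]] or [[some guess?]]; the capture is the text between the brackets.
UNCLEAR = /\[\[(.*?)\]\]/

# Returns one hash per marker: the character offset where it starts and the
# transcriber's guess (nil when the marker is just [[?]]).
def unclear_spans(content)
  spans = []
  content.scan(UNCLEAR) do
    match = Regexp.last_match
    guess = match[1].strip.sub(/\?\z/, "").strip
    spans << { offset: match.begin(0), guess: (guess.empty? ? nil : guess) }
  end
  spans
end
```

In the Rails app from the question this could be called on a transcription's content column, for example to list the spans that still need a second reader.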