python text mapping ocr

Mapping OCRed text to numbers


I want to map the document type extracted from invoices to numeric codes. The rule is simple:

00 Απόδειξη Πώλησης Εισιτηρίων
01 Τιμολόγιο Συνδρομών
02 Τιμολόγιο Παροχής Υπηρεσιών
03 ΤΙΜΟΛΟΓΙΟ
04 as None
etc.

The problem is that on some invoices the text is abbreviated (e.g. Τιμολόγιο Συνδ. or Τιμολόγιο Παρ.Υπη.) and may contain OCR errors (e.g. φβχ; Τιμολόγιο Συν6. or Τιμολόγιο Παρ.Ynn. ...δ). I tried using Levenshtein distance to deal with the OCR errors, but the abbreviations break it: even when extracted correctly, Τιμολόγιο Παρ.Υπη. is closer to ΤΙΜΟΛΟΓΙΟ than to Τιμολόγιο Παροχής Υπηρεσιών in terms of character edits, which results in the wrong mapping.
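
A minimal sketch of that failure, assuming the strings are compared after case-folding and accent stripping (the `normalize` and `levenshtein` helpers are illustrative, not taken from any particular library):

```python
import unicodedata


def normalize(text):
    """Strip accents and case-fold so 'Τιμολόγιο' and 'ΤΙΜΟΛΟΓΙΟ' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


extracted = normalize("Τιμολόγιο Παρ.Υπη.")

# Distance to the short label "03": only the trailing abbreviation is deleted.
d_short = levenshtein(extracted, normalize("ΤΙΜΟΛΟΓΙΟ"))
# Distance to the intended label "02": every dropped letter counts as an edit.
d_full = levenshtein(extracted, normalize("Τιμολόγιο Παροχής Υπηρεσιών"))

print(d_short, d_full)
```

The abbreviation is penalized for every letter it omits from the long label, so the short label ends up with the smaller distance even though there is no OCR error at all.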

edit1: made errors bold

What should I do to improve the mapping quality?


Solution

  • Solved using a simple model for sentence classification:

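    from tensorflow.keras import Sequential, layers

    # Values from the original comment: vocabulary of 21 character tokens,
    # 100-dimensional embeddings, inputs padded to 50 characters.
    vocab_size, embedding_dim, maxlen = 21, 100, 50

    model = Sequential()
    model.add(layers.Input(shape=(maxlen,)))                 # padded character ids
    model.add(layers.Embedding(vocab_size, embedding_dim))
    model.add(layers.Conv1D(128, 5, activation='relu'))      # 5-character windows
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dense(16, activation='relu'))
    model.add(layers.Dense(8, activation='softmax'))         # one output per document type
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',           # expects one-hot labels
                  metrics=['accuracy'])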
    

    The embeddings were built from character-level tokenized strings. The network seems to have learned patterns of letters, so the types are classified correctly even when the text contains more OCR mistakes.
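
The answer does not show how the strings were tokenized. One way character-level inputs for such a model could be prepared is sketched below in plain Python. The helper names, the reserved padding id 0, the extra "unknown character" id, and the tiny training list are all assumptions made for illustration, not the author's actual preprocessing.

```python
PAD_ID = 0  # assumed: id 0 is reserved for padding


def build_char_index(texts):
    """Assign an integer id (starting at 1) to every distinct character seen."""
    chars = sorted({ch for text in texts for ch in text.casefold()})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, char_index, maxlen):
    """Turn a string into a fixed-length list of character ids."""
    unk_id = len(char_index) + 1  # assumed: one shared id for unseen characters
    ids = [char_index.get(ch, unk_id) for ch in text.casefold()[:maxlen]]
    return ids + [PAD_ID] * (maxlen - len(ids))


def one_hot(label, num_classes):
    """One-hot vector, the label format categorical_crossentropy expects."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


# Hypothetical training examples: full and abbreviated spellings share a label.
train_texts = ["Τιμολόγιο Συνδρομών", "Τιμολόγιο Συνδ.", "Τιμολόγιο Παρ.Υπη."]
train_labels = [1, 1, 2]

char_index = build_char_index(train_texts)
vocab_size = len(char_index) + 2  # known characters + padding + unknown
X = [encode(t, char_index, maxlen=50) for t in train_texts]
y = [one_hot(label, num_classes=8) for label in train_labels]

# An OCR-damaged string still encodes; the misread '6' becomes the unknown id.
noisy = encode("Τιμολόγιο Συν6.", char_index, maxlen=50)
```

Because abbreviated and misread variants appear in training with the correct label, the classifier can pick up on short letter patterns rather than whole-string similarity, which is consistent with the behaviour described above.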