I want to map the document type extracted from an invoice to a numeric code. The rule is simple:
00 Απόδειξη Πώλησης Εισιτηρίων
01 Τιμολόγιο Συνδρομών
02 Τιμολόγιο Παροχής Υπηρεσιών
03 ΤΙΜΟΛΟΓΙΟ
04 as None
etc.
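For reference, the target mapping could be held as a plain lookup table (codes and labels from the list above; the dict name `DOC_TYPES` is my own):

```python
# Map each known document-type label to its numeric code.
DOC_TYPES = {
    "Απόδειξη Πώλησης Εισιτηρίων": "00",
    "Τιμολόγιο Συνδρομών": "01",
    "Τιμολόγιο Παροχής Υπηρεσιών": "02",
    "ΤΙΜΟΛΟΓΙΟ": "03",
}

# Anything that matches no label falls through to code 04 / None.
code = DOC_TYPES.get("ΤΙΜΟΛΟΓΙΟ", None)
```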
The problem is that in some invoices the text is abbreviated (e.g. Τιμολόγιο Συνδ. or Τιμολόγιο Παρ.Υπη.) and may contain OCR errors (e.g. φβχ; Τιμολόγιο Συν6. or Τιμολόγιο Παρ.Ynn. ...δ).
I tried using Levenshtein distance to handle the OCR errors, but the abbreviated text breaks it: even when extracted correctly, Τιμολόγιο Παρ.Υπη. is closer in edit distance to ΤΙΜΟΛΟΓΙΟ than to Τιμολόγιο Παροχής Υπηρεσιών, because the full label's missing tail counts as deletions, so the query maps to the wrong type.
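One way to work around the abbreviation problem without a model, assuming abbreviations truncate the label left to right, is to compare the query against a same-length prefix of each label so the label's missing tail is not penalised. A minimal sketch using only the stdlib (`LABELS`, `normalize`, and `best_match` are my own names, not from the question):

```python
import unicodedata
from difflib import SequenceMatcher

# Candidate labels from the question's list.
LABELS = {
    "00": "Απόδειξη Πώλησης Εισιτηρίων",
    "01": "Τιμολόγιο Συνδρομών",
    "02": "Τιμολόγιο Παροχής Υπηρεσιών",
    "03": "ΤΙΜΟΛΟΓΙΟ",
}

def normalize(s: str) -> str:
    # Case-fold and strip accents and dots, so casing, tonos marks,
    # and abbreviation periods do not dominate the comparison.
    s = unicodedata.normalize("NFD", s.casefold())
    s = "".join(ch for ch in s if unicodedata.category(ch) != "Mn")
    return s.replace(".", "")

def best_match(query: str) -> str:
    q = normalize(query)

    def score(code: str):
        c = normalize(LABELS[code])
        # Similarity against a same-length prefix of the label, so an
        # abbreviated query is not punished for the label's tail;
        # ties go to the label closest to the query in full length.
        return (SequenceMatcher(None, q, c[: len(q)]).ratio(),
                -abs(len(c) - len(q)))

    return max(LABELS, key=score)
```

With this, the abbreviated Τιμολόγιο Παρ.Υπη. scores highest against Τιμολόγιο Παροχής Υπηρεσιών rather than ΤΙΜΟΛΟΓΙΟ, and the length tie-break keeps a bare ΤΙΜΟΛΟΓΙΟ on its own label.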
What should I do to improve the mapping quality?
Solved using a simple model for sentence classification:
```python
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

model = Sequential()  # 21 100 50
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(8, activation='softmax'))  # 8 document-type classes
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```
Embeddings were learned over character-level tokenized strings. The network seems to have picked up patterns of letters, so even with more OCR mistakes the types are classified correctly.
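The answer does not show the preprocessing step. Character-level tokenization like this is what Keras' `Tokenizer(char_level=True)` does; a minimal hand-rolled equivalent plus padding (all names here are my own sketch, not from the answer) could look like:

```python
def build_char_vocab(texts):
    # Index every character seen in the training strings; 0 is reserved
    # for padding and unseen characters.
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(text, vocab, maxlen):
    # Unseen characters (fresh OCR garbage) map to 0, like padding.
    ids = [vocab.get(ch, 0) for ch in text[:maxlen]]
    return ids + [0] * (maxlen - len(ids))

labels = ["Τιμολόγιο Συνδρομών", "Τιμολόγιο Παροχής Υπηρεσιών", "ΤΙΜΟΛΟΓΙΟ"]
vocab = build_char_vocab(labels)
x = encode("Τιμολόγιο Συνδ.", vocab, maxlen=30)
```

The resulting integer sequences feed the `Embedding` layer above, with `vocab_size = len(vocab) + 1` to account for the padding index.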