I have a txt file that includes ['صفحه رقم ا من ٤'] which is the output of EasyOCR. I want to extract "ا", which is after the substring "صفحه رقم". I am using this code:
# Open the text file for reading
with open(r'file.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Extract the digit after the string "صفحه رقم"
page_num_str = 'صفحه رقم'
start_idx = text.find(page_num_str)
if start_idx != -1:
    substr = text[start_idx + len(page_num_str):]
    page_num = ''.join(filter(str.isdigit, substr.split()[0]))
    print("Page number extracted: ", page_num)
Here is the output:
Page number extracted:
As you can see, nothing is extracted! I don’t know why, but I have tried to output the value of
start_idx = text.find(page_num_str)
The output was 2 instead of 0. What is the problem? Here is the text file uploaded.
Your filter is expecting digits:
page_num = ''.join(filter(str.isdigit, substr.split()[0]))
                              ^^^^^^^
The "ا" appears to be U+0627 ARABIC LETTER ALEF. Unicode classifies this as a letter, not a digit, and str.isdigit follows the Unicode classification, so the filter discards it and leaves an empty string.
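One way to check this yourself is to ask Python how Unicode classifies each character. The sketch below uses the standard unicodedata module; the helper name describe is only for illustration and is not part of your code.

```python
import unicodedata

def describe(ch):
    """Return (Unicode name, general category, str.isdigit result) for one character."""
    return unicodedata.name(ch), unicodedata.category(ch), ch.isdigit()

# U+0627 is what the OCR output contains; U+0661 is the look-alike digit.
for ch in ("\u0627", "\u0661"):
    name, category, is_digit = describe(ch)
    print(f"U+{ord(ch):04X} {name} category={category} isdigit={is_digit}")

# U+0627 ARABIC LETTER ALEF category=Lo isdigit=False
# U+0661 ARABIC-INDIC DIGIT ONE category=Nd isdigit=True
```

Category Lo means "other letter" and Nd means "decimal number", which is why the filter keeps one character and drops the other.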
I don't know Arabic. Are letters also used as digits? If so, then you could probably drop the filter altogether.
Perhaps the optical character recognition made a mistake and should have generated "١" U+0661 ARABIC-INDIC DIGIT ONE instead. They look very similar, so it wouldn't be surprising.
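If the OCR misread is indeed the cause, one possible workaround is to map the letter back to the digit before filtering. This is only a sketch under that assumption: the names MARKER, OCR_FIXES and extract_page_number are made up for the example, and treating every ALEF in the page-number position as a misread "١" is a guess about your data, not something EasyOCR guarantees.

```python
MARKER = "صفحه رقم"

# Assumption: right after the marker, a lone ALEF (U+0627) is a misread
# ARABIC-INDIC DIGIT ONE (U+0661).
OCR_FIXES = str.maketrans({"\u0627": "\u0661"})

def extract_page_number(text, marker=MARKER):
    """Return the number following marker as an int, or None if not found."""
    start = text.find(marker)
    if start == -1:
        return None
    tokens = text[start + len(marker):].split()
    if not tokens:
        return None
    candidate = tokens[0].translate(OCR_FIXES)
    # isdecimal keeps only characters that int() can parse.
    digits = "".join(ch for ch in candidate if ch.isdecimal())
    return int(digits) if digits else None

sample = f"['{MARKER} ا من ٤']"
print(sample.find(MARKER))          # 2
print(extract_page_number(sample))  # 1
```

The start index of 2 is expected, by the way: the file content begins with the two characters [' before the Arabic text, so the marker cannot be at position 0.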