
Extracting Arabic text from a text file


I have a txt file that includes ['صفحه رقم ا من ٤'] which is the output of EasyOCR. I want to extract "ا", which is after the substring "صفحه رقم". I am using this code:

# Open the text file for reading
with open(r'file.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Extract the digit after the string "صفحه رقم"
page_num_str = 'صفحه رقم'
start_idx = text.find(page_num_str)
if start_idx != -1:
    substr = text[start_idx+len(page_num_str):]
    page_num = ''.join(filter(str.isdigit, substr.split()[0]))
    print("Page number extracted: ",page_num)

Here is the output:

Page number extracted:

As you can see, nothing is extracted, and I don't know why. I also tried printing the value of

start_idx = text.find(page_num_str)

The output was 2 instead of 0. What is the problem? I have uploaded the text file here.


Solution

  • Your filter is expecting digits:

    page_num = ''.join(filter(str.isdigit, substr.split()[0]))
                                  ^^^^^^^
    

    The "ا" appears to be U+0627 ARABIC LETTER ALEF. Unicode classifies this as a letter, not a digit, and str.isdigit follows the Unicode classification: it returns True only for characters whose Numeric_Type property is Digit or Decimal.

    I don't know Arabic. Are letters also used as digits? If so, then you could probably drop the filter altogether.

    Perhaps the optical character recognition made a mistake and should have generated "١" U+0661 ARABIC-INDIC DIGIT ONE instead. They look very similar, so it wouldn't be surprising.
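
A minimal sketch of how to check this, assuming the file holds exactly the string shown in the question. The helper `token_after` and the `OCR_FIXES` table are hypothetical names introduced here for illustration, and mapping ALEF to ARABIC-INDIC DIGIT ONE is only a guess about what the OCR meant. The sketch also shows why `find` returns 2: the text starts with the two characters `['` from the list representation.

```python
import unicodedata

marker = 'صفحه رقم'
# Rebuild the file content from the question with explicit code points:
# U+0627 is ARABIC LETTER ALEF, U+0664 is ARABIC-INDIC DIGIT FOUR.
text = "['" + marker + " \u0627 من \u0664']"

def token_after(text, marker):
    """Return the first whitespace-separated token after marker, or None."""
    idx = text.find(marker)
    if idx == -1:
        return None
    parts = text[idx + len(marker):].split()
    return parts[0] if parts else None

token = token_after(text, marker)

print(text.find(marker))        # 2: "[" and "'" occupy indexes 0 and 1
print(unicodedata.name(token))  # ARABIC LETTER ALEF
print(token.isdigit())          # False, so filter(str.isdigit, ...) drops it
print('\u0661'.isdigit())       # True for ARABIC-INDIC DIGIT ONE

# Assumption: a lone ALEF in the page-number slot is a misread digit one.
OCR_FIXES = str.maketrans({'\u0627': '\u0661'})
page_num = int(token.translate(OCR_FIXES))  # int() accepts Unicode decimal digits
print("Page number extracted:", page_num)
```

If the ALEF really is a letter in your data, skip the translation step and use `token` directly instead of converting it with `int`.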