
Finding Substrings in Non-English Strings [Urdu Strings]


I want to find substrings in strings written in Urdu. For example, suppose I have the following string and substrings in Urdu:

fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"

substring1 = "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ..."
substring2 = "Urdu English Translator حاصل کریں - Microsoft Store ur-PK"
substring3 = "ببر شیر - آزاد دائرۃ المعارف، ویکیپیڈیا"
substring4 = "اقوام متحدہ - ویکیپیڈیا"
substring5 = "واقعہ کربلا - آزاد دائرۃ المعارف"
substring6 = "Inaugural Address - Urdu | JFK Library"
substring7 = "دنیا میں امریکہ کے مقام کے بارے میں صدر بائیڈن کا خطاب - United ..."
substring8 = "ایران امریکہ کشیدگی: امریکی صدور اور جنگوں کی مبہم قانونی ..."

The objective is to check, for each substring, whether words from fullstring appear in it, and then select the matching substrings for further processing. In particular, the minimum words that must be present in a selected substring are "آزاد دائرۃ".

In the examples above, substring1, substring3, substring4, and substring5 should be selected (True), whereas the rest should not be selected (False).

I have written the following code to achieve the above given task:

fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"
substring = "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ..."

# extract the part after the "-" part
s = substring.split("-")[1]
# remove any spaces if they are present
s = s.strip()

if s in fullstring:
   print("Found!")
else:
   print("Not found!")

The code prints Not found! for every substring, whereas it should print Found! for substring1, substring3, substring4 and substring5, and Not found! for all the others.

Please help me get the substring search working as described above.


Solution

  • You should try this:

    fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"
    substring = "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ..."
    
    # extract the part after the "-" part
    s = substring.split("-")[1]
    # remove surrounding spaces and the trailing "..." left by truncation
    s = s.strip(". ")
    
    if s in fullstring:
       print("Found!")
    else:
       print("Not found!")
    

    After strip(), s is still آزاد دائرۃ ... (with the trailing dots), but fullstring does not contain ..., so you get Not found!.

    Alternatively, you can use the .find() method, which returns -1 when the substring is absent:

    fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"
    substring = "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ..."
    
    # extract the part after the "-" part
    s = substring.split("-")[1]
    # remove surrounding spaces and the trailing "..." (plain strip() would keep the dots)
    s = s.strip(". ")
    
    if fullstring.find(s)!=-1:
       print("Found!")
    else:
       print("Not found!")
    

    To check all the substrings, you can try this:

    fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"
    
    substring1 = "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ..."
    substring2 = "Urdu English Translator حاصل کریں - Microsoft Store ur-PK"
    substring3 = "ببر شیر - آزاد دائرۃ المعارف، ویکیپیڈیا"
    substring4 = "اقوام متحدہ - ویکیپیڈیا"
    substring5 = "واقعہ کربلا - آزاد دائرۃ المعارف"
    substring6 = "Inaugural Address - Urdu | JFK Library"
    substring7 = "دنیا میں امریکہ کے مقام کے بارے میں صدر بائیڈن کا خطاب - United ..."
    substring8 = "ایران امریکہ کشیدگی: امریکی صدور اور جنگوں کی مبہم قانونی ..."
    allsub = [substring1, substring2, substring3, substring4,
              substring5, substring6, substring7, substring8]
    
    for a in allsub:
        try:
            # take the part after the first "-" and drop spaces and trailing dots
            s = a.split("-")[1].strip(". ")
        except IndexError:
            # no "-" in this string, so fall back to the whole string
            s = a.split("-")[0].strip(". ")
        if fullstring.find(s) != -1:
            print("Found!")
        else:
            print("Not found!")
    

    Output :

    Found!
    Not found!
    Found!
    Found!
    Found!
    Not found!
    Not found!
    Not found!
    

    I put all the substrings in a list, allsub, and applied the same check you were doing to each one. I also added a try-except because some substrings contain no -, so selecting the second element of the split list raises an IndexError. With try-except, the except branch runs instead of the error being thrown, and the whole string is used.
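The same check can also be wrapped in a small function that returns True/False, which is closer to what the question asks for. The following is only a sketch under a few assumptions: the names title_suffix and matches are made up for illustration, str.partition is used instead of split (so everything after the first "-" is kept, which differs slightly from split("-")[1] when a string contains several hyphens), and an empty suffix is treated as "no match", because an empty string is contained in every string.

```python
def title_suffix(title: str) -> str:
    """Hypothetical helper: text after the first '-', or the whole title if
    there is no '-', with surrounding spaces and dots removed."""
    head, sep, tail = title.partition("-")
    part = tail if sep else head
    return part.strip(". ")


def matches(fullstring: str, title: str) -> bool:
    """Hypothetical helper: True if the cleaned suffix occurs in fullstring."""
    suffix = title_suffix(title)
    # Guard against an empty suffix: "" in fullstring would always be True.
    return bool(suffix) and suffix in fullstring


fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"

allsub = [
    "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ...",
    "Urdu English Translator حاصل کریں - Microsoft Store ur-PK",
    "ببر شیر - آزاد دائرۃ المعارف، ویکیپیڈیا",
    "اقوام متحدہ - ویکیپیڈیا",
    "واقعہ کربلا - آزاد دائرۃ المعارف",
    "Inaugural Address - Urdu | JFK Library",
    "دنیا میں امریکہ کے مقام کے بارے میں صدر بائیڈن کا خطاب - United ...",
    "ایران امریکہ کشیدگی: امریکی صدور اور جنگوں کی مبہم قانونی ...",
]

# One boolean per substring, in the same order as allsub.
results = [matches(fullstring, title) for title in allsub]
selected = [title for title, ok in zip(allsub, results) if ok]
print(results)
```

Here selected holds only the substrings that passed the check, so it can be passed on for further processing. Note that nothing in this is specific to Urdu: Python 3 strings are Unicode, so the in operator and .find() behave the same way as for English text.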