
Problem with text find and replace in Python


I have a very specific function. I have two strings: one is a backup of the code's input, and the second has been modified by steps such as replacing spaces and extracting information (not important for this case).

I need to find a match between those strings, even though one of them has been modified. After the match is found, I need to store the match as it appears in the original string (without modification) and remove it from "sub_str"/"modified_sub_str".

import re

def find_and_save(sub_str, main_str):
    # Convert both strings to lowercase and remove spaces, commas, and hyphens for case-insensitive matching
    sub_str_mod = re.escape(sub_str.lower().replace(" ", "").replace(",", "").replace("-", ""))
    main_str_mod = main_str.lower().replace(" ", "").replace(",", "").replace("-", "")

    # Use re.search() to find the substring in the modified main string
    match = re.search(sub_str_mod, main_str_mod)

    if match:
        start = match.start()
        end = match.end()

        count = 0
        original_start = 0
        original_end = 0

        for i, c in enumerate(main_str):
            if c not in [' ', ',', '-']:
                count += 1
            if count == start + 1:
                original_start = i
            if count == end:
                original_end = i + 1
                break

        original_sub_str = main_str[original_start:original_end]

        # If the whole sub_str is matching with some part of main_str, return an empty string as modified_sub_str
        if original_sub_str.lower().replace(" ", "").replace(",", "").replace("-", "") == sub_str_mod:
            modified_sub_str = ""
        else:
            # Remove the matching part from sub_str in a case-insensitive manner
            modified_sub_str = re.sub(re.escape(original_sub_str), '', sub_str, flags=re.IGNORECASE)

        return modified_sub_str, original_sub_str  # Returns the modified sub_str and the matched string in its original form
    else:
        return sub_str, None  # Returns sub_str as it was and None if no match is found

But I have a specific problem with this code. For example, if I have inputs like

sub_str = "internationalworkshopongraphene/ceramiccomposites2016,wgcc2016"

and

main_str = "Roč. 37, č. 12, International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016 (2017), s. 3773-3780 [print, online]" 

This code finds the match and returns "original_sub_str", but it cannot remove the match from "modified_sub_str".

The same problem occurs for these inputs, each listed as "sub_str" followed by "main_str":

"isnnm-2016,internationalsymposiumon"
"Roč. 2017, č. 65, ISNNM-2016, International Symposium on Novel and Nano Materials (2017), s. 76-82 [print, online]"

"fractographyofadvancedceramics5“fractographyfrommacro-tonano-scale”" 
"Roč. 37, č. 14, Fractography of Advanced Ceramics 5 “Fractography from MACRO- to NANO-scale” (2017), s. 4315-4322 [print, online]"

"73.zjazdchemikov,zborníkabstraktov"
"Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]" 

I can't find a solution, even with the help of AI, but I know there's a problem with the replace function, special symbols, and case sensitivity.


Solution

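  • Your sub_str_mod was a regex-escaped string: . is converted to \., so the equality check against the cleaned original_sub_str fails, because that string has no backslash. The code then falls into the else branch, where re.sub looks for original_sub_str (with its spaces and capitals) inside sub_str (which has no spaces) and finds nothing to remove. (Next time, use a debugger.)

    I removed re and do everything with a literal string find.

    I removed the else branch because the if test is always True.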

    def clean_str(s: str) -> str:
        # Lowercase and drop spaces, commas, and hyphens so matching ignores them
        return s.lower().replace(" ", "").replace(",", "").replace("-", "")

    def find_and_save(sub_str, main_str):
        # Normalize both strings the same way for case-insensitive matching
        sub_str_mod = clean_str(sub_str)
        main_str_mod = clean_str(main_str)

        # Find the substring in the modified main string
        start = main_str_mod.find(sub_str_mod)
        if start == -1:
            return sub_str, None  # No match: return sub_str as it was and None

        end = start + len(sub_str_mod)

        # Map the [start, end) positions in main_str_mod back to main_str
        count = 0
        original_start = 0
        original_end = 0

        for i, c in enumerate(main_str):
            if c in (' ', ',', '-'):
                continue  # Removed characters must not move original_start
            count += 1
            if count == start + 1:
                original_start = i
            if count == end:
                original_end = i + 1
                break

        original_sub_str = main_str[original_start:original_end]

        # clean_str(original_sub_str) == sub_str_mod is always True here:
        # the whole sub_str matched, so nothing of it is left over
        modified_sub_str = ""
        return modified_sub_str, original_sub_str  # Empty remainder and the match in its original form
    

    Output of the 4 cases:

    ('', 'International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016')
    ('', 'ISNNM-2016, International Symposium on')
    ('', 'Fractography of Advanced Ceramics 5 “Fractography from MACRO- to NANO-scale”')
    ('', '73. Zjazd chemikov, zborník abstraktov')
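
To see why the original version left sub_str untouched, here is a minimal reproduction using the fourth pair of inputs from the question. It is only a sketch: clean_str is the helper from the answer, and the variable names dot_is_escaped, equality_check and fallback are made up for this illustration. It checks the two steps that go wrong: the comparison against the regex-escaped string, and the re.sub fallback that follows it.

```python
import re


def clean_str(s: str) -> str:
    """Lowercase and drop spaces, commas and hyphens (same rule as the answer)."""
    return s.lower().replace(" ", "").replace(",", "").replace("-", "")


sub_str = "73.zjazdchemikov,zborníkabstraktov"
original_sub_str = "73. Zjazd chemikov, zborník abstraktov"

cleaned = clean_str(sub_str)
escaped = re.escape(cleaned)  # what the question's code stored in sub_str_mod

# re.escape() puts a backslash in front of the dot, so the escaped pattern
# is no longer equal to the cleaned text it was built from.
dot_is_escaped = "\\." in escaped
equality_check = clean_str(original_sub_str) == escaped

# The fallback branch searches for the spaced, capitalised original inside
# the space-free sub_str. That cannot match, so sub_str comes back unchanged.
fallback = re.sub(re.escape(original_sub_str), "", sub_str, flags=re.IGNORECASE)

print(dot_is_escaped)                           # True
print(equality_check)                           # False
print(fallback == sub_str)                      # True
print(clean_str(original_sub_str) == cleaned)   # True
```

Comparing against the unescaped cleaned string, as the last line does, is what makes the test succeed, which is why the rewritten function needs no regex at all.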