
Is there a way to detect words without searching for whitespace or underscores?


I am trying to write a CLI for generating Python classes. Part of this requires validating the identifiers provided in user input, and for Python this means making sure that identifiers conform to the PEP 8 naming conventions: CapWords for classes, all_lowercase_with_underscores for fields, and so on for packages and modules.

# it is easy to correct an identifier that contains
# underscores or whitespace, including when correcting for a class

def package_correct_convention(item):
    return item.strip().lower().replace(" ","").replace("_","")

But when there are no whitespace characters or underscores between tokens, I'm not sure how to correctly capitalize the first letter of each word in an identifier. Is it possible to implement something like that without using AI or something similar?

For example:

import re

# providing "classa" returns "Classa" rather than "ClassA" because there is
# no delimiter between "class" and "a"
def class_correct_convention(item):
    item = item.strip()
    if " " in item or "_" in item:
        # split on runs of spaces or underscores, whichever was used as the
        # word delimiter (this also covers inputs that mix both equally)
        words = [word for word in re.split(r"[ _]+", item) if word]
        return "".join(word.title() for word in words)
    # if there is no delimiter, the best we can do is capitalize the first letter
    return item[:1].upper() + item[1:]
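
For the validation half of the task, here is a minimal sketch of how an identifier could be checked before any correction is attempted. The two regular expressions and the helper name `is_valid_name` are hypothetical and only approximate the PEP 8 styles mentioned above; for instance, they deliberately ignore leading underscores. `str.isidentifier()` and `keyword.iskeyword()` come from the standard library.

```python
import keyword
import re

# Hypothetical patterns that approximate two PEP 8 naming styles.
CAP_WORDS = re.compile(r"[A-Z][A-Za-z0-9]*")
LOWER_WITH_UNDERSCORES = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_valid_name(name, pattern):
    """Return True if name is a legal, non-keyword identifier matching pattern."""
    return (
        name.isidentifier()
        and not keyword.iskeyword(name)
        and pattern.fullmatch(name) is not None
    )
```

With these assumptions, `is_valid_name("ClassA", CAP_WORDS)` is true, while `is_valid_name("class", LOWER_WITH_UNDERSCORES)` is false because `class` is a reserved keyword.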

Solution

  • An AI-based approach would be difficult, imperfect, and a lot of work. If that is not worth it, there may be a simpler approach that performs comparably.

    As I understand it, the worst case is a string like "todelineatewordsinastringlikethat".

    I would recommend downloading a text file of English words, one word per line, and proceeding this way:

    string = "todelineatewordsinastringlikethat"
    
    # with open("mydic.dat", "r") as msg:
    #     lst = msg.read().splitlines()
    
    lst = ['to', 'string', 'in']  # let's say the dictionary contains 3 words
    
    # longest words first, so full words are matched before their substrings
    lst = sorted(lst, key=len, reverse=True)
    
    replaced = []
    
    for elem in lst:
    
        if elem in string:
            replaced_str = " ".join(replaced)  # one string to search instead of a list
            capitalized = elem[0].upper() + elem[1:]  # prepare the capitalized word
    
            if elem not in replaced_str:
                # elem is not a substring of anything replaced so far
                string = string.replace(elem, capitalized)
    
            else:
                # elem is a substring of words already replaced: collect them to protect them
                protect_replaced = [item for item in replaced if elem in item]
    
                # uppercase each protected word so the case-sensitive replace skips it
                for protect in protect_replaced:
                    string = string.replace(protect, protect.upper())
    
                string = string.replace(elem, capitalized)
    
                # deprotect by doing the reverse, full uppercase back to capitalized
                for protect in protect_replaced:
                    string = string.replace(protect.upper(), protect)
    
            replaced.append(capitalized)  # remember the replaced word
    
    print(string)
    

    Output:

    TodelIneatewordsInaStringlikethat
    # "String" has been protected, but "delineate" has not, because it was not in our dictionary.
    

    This is certainly not optimal, but it should perform comparably to AI on a problem that would not be presented to an AI model in this raw form anyway (input preparation is very important in AI).

    Note that it is important to sort the list of words by length in descending order, because you want to detect full words first, not their substrings. In "beforehand", for example, you want the full word, not "before" or "and".
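
As a variation on the same dictionary idea, the string can be split into words first and capitalized afterwards, which avoids capitalizing a short word inside a longer one (the "delIneate" effect in the output above). The sketch below is not the method shown in the answer: it swaps in a small dynamic-programming split, and the names `split_words` and `to_cap_words` are hypothetical. It assumes the vocabulary is a set of lowercase words and falls back to capitalizing only the first letter when no complete split exists.

```python
def split_words(text, vocabulary):
    """Return one split of text into vocabulary words, or None if none exists."""
    longest = max(map(len, vocabulary), default=0)
    # best[i] holds a list of words that exactly covers text[:i], or None
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        # try the earliest start first, so the longest final word is preferred
        for start in range(max(0, end - longest), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break
    return best[len(text)]


def to_cap_words(text, vocabulary):
    """Capitalize each detected word; fall back to the first letter only."""
    words = split_words(text.lower(), vocabulary)
    if words is None:
        return text[:1].upper() + text[1:]
    return "".join(word.capitalize() for word in words)
```

For example, `to_cap_words("classa", {"class", "a"})` returns `"ClassA"`. The result is only as good as the word list: a word missing from the vocabulary makes the whole split fail, and ambiguous strings may split in a way the user did not intend.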