python, list, performance, sublist

How to make this function faster?


I'm new to Python and I would like to make this function faster.

The function takes a string as a parameter and returns a list of SEs (sound elements).

A 'sound element' (SE) is a maximal sequence of 1 or more consonants followed by 1 or more vowels:

  • first all the consonants
  • then all the vowels (aeioujy)
  • all non-alphabetic chars like spaces, numbers, colon, comma etc. must be ignored
  • all accents from accented letters (e.g. è->e) must be removed
  • differences between uppercase and lowercase letters are disregarded

NOTICE: the only exceptions are the first and the last SE of a verse, which may contain only vowels and only consonants, respectively.

Example:

If the verse is "Donàld Duck! wènt, to tHe seas'ìde to swim"

  • the SEs are [ 'do', 'na', 'lddu', 'ckwe', 'ntto', 'the', 'sea', 'si', 'de', 'to', 'swi', 'm' ]

def es_converter(string):
    
    
    vowels, li_es, container = ['a', 'e', 'i', 'o', 'u', 'y', 'j'], [] , ''

    #check for every element in the string
    for i in range(len(string)):
        #i is a vowel?
        if not string[i] in vowels:
            # No, then add it in the variable container
            container += string[i]
            # is it the last character of the string?
            if i == (len(string) - 1):
                # yes, then append container to li_es and reset it to ''
                li_es.append(container)
                container = ''
        # string[i] is a vowel: is the next character also a vowel?
        elif i < (len(string)-1) and string[i+1] in vowels:
            #yes, add in container
            container += string[i]
        else:
            #no, add in container, append container on the list li_es, set container to '' 
            container += string[i]
            li_es.append(container)
            container = ''
    return li_es

Thanks for all the suggestions! (Unfortunately I can't use any imports)


Solution

  • A big source of inefficiency in your current code is that you index into the string throughout the loop. Rather than:

    for i in range(len(data)):
        x = data[i]
        ...
        if data[i] == ...
    

    you should iterate over the characters directly:

    for char in data:
        x = char
        ...
        if char == ...
    

    and if you really need indices at some point, use enumerate:

    for i, char in enumerate(data):
        ...
    

    and only use the indices when really needed.
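As a sketch of how this advice could apply to the function in the question: the loop below iterates over the characters directly and remembers whether the previous character was a vowel, instead of looking ahead with `string[i+1]`. The name `es_converter_iter` is hypothetical, and, like the original function, it assumes the input is already lowercase, unaccented, and free of non-letters. It uses no imports.

```python
def es_converter_iter(string, vowels=frozenset("aeiouyj")):
    """Split an already-cleaned string into sound elements (SEs)."""
    result = []
    current = []          # characters of the SE being built
    in_vowels = False     # was the previous character a vowel?
    for char in string:
        is_vowel = char in vowels
        if in_vowels and not is_vowel:
            # a consonant right after vowels closes the current SE
            result.append("".join(current))
            current = []
        current.append(char)
        in_vowels = is_vowel
    if current:
        # flush the last SE (it may consist of consonants only)
        result.append("".join(current))
    return result


print(es_converter_iter("donaldduckwenttotheseasidetoswim"))
# ['do', 'na', 'lddu', 'ckwe', 'ntto', 'the', 'sea', 'si', 'de', 'to', 'swi', 'm']
```

Collecting characters in a list and joining once avoids repeated string concatenation, and the set lookup for vowels does not depend on the number of vowels.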


    I would rather use a regex here, though. Without sample data I can't time it, but I expect it to be considerably faster than a character-by-character Python loop.

    The process is:

    • remove all non-alphabetic characters
    • make the string lowercase
    • remove the accents, which your current code doesn't do
    • split the string using a regex that describes your conditions.

    So, you could do:

    import re
    import unicodedata
    
    # from https://stackoverflow.com/a/44433664/550094
    def strip_accents(text):
        return unicodedata.normalize('NFD', text)\
               .encode('ascii', 'ignore')\
               .decode('ascii')
    
        
    
    def se(data):
        # keep only alphabetical characters
        # (\W alone would keep digits and underscores, so remove those too)
        data = re.sub(r'[\W\d_]+', '', data)
        # make lowercase
        data = data.casefold()
        # strip accents from the remaining data
        data = strip_accents(data)
    
        # creating the regex: 
        #  - start of the string followed by vowels, or
        #  - consonants followed by vowels, or
        #  - consonants followed by end of string
        # the question counts 'j' as a vowel as well
        vowels = 'aeiouyj'
        se_regex = re.compile(rf'^[{vowels}]+|[^{vowels}]+[{vowels}]+|[^{vowels}]+$')
        
        # return the SEs
        return se_regex.findall(data)
    

    Sample run (I added a vowel at the start of your string to test this case):

    data = "A Donàld Duck! wènt, to tHe seas'ìde to swim"
    print(se(data))
    # ['a', 'do', 'na', 'lddu', 'ckwe', 'ntto', 'the', 'sea', 'si', 'de', 'to', 'swi', 'm']
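Since the question rules out imports, the cleaning steps could also be sketched with built-in string methods only. This is a minimal illustration, not a replacement for `unicodedata`: the names `ACCENT_GROUPS`, `ACCENT_TABLE` and `clean_verse` are hypothetical, and the accent table is an assumption that covers only a handful of common accented vowels. Any letter missing from the table is dropped rather than converted.

```python
# Hypothetical, deliberately small accent table; extend it for the texts you expect.
ACCENT_GROUPS = {
    "a": "àáâäãå",
    "e": "èéêë",
    "i": "ìíîï",
    "o": "òóôöõ",
    "u": "ùúûü",
    "y": "ýÿ",
}
# str.translate accepts a dict mapping code points to replacement strings.
ACCENT_TABLE = {
    ord(accented): base
    for base, group in ACCENT_GROUPS.items()
    for accented in group
}


def clean_verse(verse):
    """Lowercase, replace the accents listed above, and keep only a-z."""
    text = verse.lower().translate(ACCENT_TABLE)
    return "".join(char for char in text if "a" <= char <= "z")


print(clean_verse("Donàld Duck! wènt, to tHe seas'ìde to swim"))
# donaldduckwenttotheseasidetoswim
```

The cleaned string can then be passed to whichever splitting routine is used to build the SEs.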