Tags: python, pandas, stop-words

How can I make this remove_stopwords function faster in Python?


I have a function remove_stopwords like this. How do I make it run faster?

temp.reverse()

def drop_stopwords(text):
    
    for x in temp:
        if len(x.split()) > 1:
            text_list = text.split()  
            for y in range(len(text_list)-len(x.split())):
                if " ".join(text_list[y:y+len(x.split())]) == x:
                    del text_list[y:y+len(x.split())]
                    text = " ".join(text_list)
        
        else:
            text = " ".join(text for text in text.split() if text not in vietnamese)

    return text

Processing a single text from my data takes 14 s. With a trick like the one below, the time drops to 3 s:


temp.reverse()

def drop_stopwords(text):
    
    for x in temp:
        if len(x.split()) >2:
            if x in text:
                text = text.replace(x,'')

        elif len(x.split()) > 1:
            text_list = text.split()  
            for y in range(len(text_list)-len(x.split())):
                if " ".join(text_list[y:y+len(x.split())]) == x:
                    del text_list[y:y+len(x.split())]
                    text = " ".join(text_list)
        
        else:
            text = " ".join(text for text in text.split() if text not in vietnamese)

    return text

But I think this version may give wrong results in some cases in my language. How can I rewrite this function in Python to make it faster? (In C and C++ I could solve this easily with the function above :(( )


Solution

  • Your function repeats the same work many times, in particular splitting and joining the same text once per stop phrase. Doing a single split, operating on the resulting list, and then doing a single join at the end might be faster, and would definitely lead to simpler code. Unfortunately I don't have any of your sample data to test the performance with, but hopefully this gives you something to experiment with:

    temp = ["foo", "baz ola"]
    
    
    def drop_stopwords(text):
        text_list = text.split()
        text_len = len(text_list)
        for word in temp:
            word_list = word.split()
            word_len = len(word_list)
            for i in range(text_len + 1 - word_len):
                if text_list[i:i+word_len] == word_list:
                    text_list[i:i+word_len] = [None] * word_len
        return ' '.join(t for t in text_list if t)
    
    
    print(drop_stopwords("the quick brown foo jumped over the baz ola dog"))
    # the quick brown jumped over the dog
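
If the stop-phrase list is long, the loop over every phrase is likely to be the dominant cost. Here is a minimal sketch of a single-pass variant that looks up token windows in a set instead. It assumes phrases are matched on whitespace-separated tokens and that the longest phrase should win at each position, which is not necessarily the same order the original loop applies. The names `build_index`, `index` and `max_len` are illustrative and do not come from the question:

```python
def build_index(stop_phrases):
    """Return (set of token tuples, length of the longest phrase in tokens)."""
    index = {tuple(p.split()) for p in stop_phrases if p.split()}
    max_len = max((len(t) for t in index), default=0)
    return index, max_len


def drop_stopwords(text, index, max_len):
    tokens = text.split()
    kept = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so "baz ola" is preferred over "baz".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in index:
                i += n  # skip the whole matched phrase
                break
        else:
            kept.append(tokens[i])  # no stop phrase starts at this token
            i += 1
    return " ".join(kept)


index, max_len = build_index(["foo", "baz ola"])
print(drop_stopwords("the quick brown foo jumped over the baz ola dog", index, max_len))
# the quick brown jumped over the dog
```

The index is built once and reused for every text, so the work per text depends on the number of tokens and the longest phrase length rather than on the size of the stop list.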
    

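    You could also just try iteratively doing text.replace in all cases and seeing how that performs compared to your more complex split-based solution. Keep in mind that str.replace matches substrings, so a stop word that also occurs inside a longer word will be cut out of that word as well: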

    temp = ["foo", "baz ola"]
    
    
    def drop_stopwords(text):
        for word in temp:
            text = text.replace(word, '')
        return ' '.join(text.split())
    
    
    print(drop_stopwords("the quick brown foo jumped over the baz ola dog"))
    # the quick brown jumped over the dog
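
If matches inside longer words are a concern with plain text.replace, one option is a single precompiled regular expression that only matches whole words. This is a sketch under assumptions, not a tested solution for your data: `compile_stopwords` and `pattern` are names made up for illustration, and it relies on `\w` to decide what counts as a word character.

```python
import re


def compile_stopwords(stop_phrases):
    # Longest phrases first so the alternation prefers "baz ola" over "baz".
    ordered = sorted((p for p in stop_phrases if p), key=len, reverse=True)
    if not ordered:
        raise ValueError("need at least one non-empty stop phrase")
    alternation = "|".join(re.escape(p) for p in ordered)
    # The lookarounds reject matches that touch a word character on either side.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")


def drop_stopwords(text, pattern):
    # Collapse whitespace first so multi-word phrases match single spaces.
    normalized = " ".join(text.split())
    return " ".join(pattern.sub(" ", normalized).split())


pattern = compile_stopwords(["foo", "baz ola"])
print(drop_stopwords("the quick brown foo jumped over the baz ola dog", pattern))
# the quick brown jumped over the dog
print(drop_stopwords("foobar stays", pattern))
# foobar stays
```

Compile the pattern once and reuse it for every text. If your text uses combining diacritics, normalizing it to NFC with unicodedata.normalize beforehand is probably worthwhile, since `\w` does not match combining marks.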