python web-scraping beautifulsoup scrapy

BeautifulSoup: extract all text between two specified keywords


With Python and BeautifulSoup, I need to extract all the text contained between two specified words, for example:

blabla text i need blibli

I have managed to extract the text inside a DIV or another tag, but not the text between two specific, different keywords.

Thank you for your help


Solution

  • Assuming that you have already extracted all the text inside the relevant tag, you now have a single string whose text is in the same order as it appears on the page.

    Once you have the full text string, you can get the substring between two different words that each occur only once in the text:

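    text = 'blabla text i need blibli'  # sample string; use your extracted text here

    def get_textchunk(word1, word2, text):
        # Keep everything after word1, then split that remainder on word2
        after_word1 = text.split(word1)[1]
        return after_word1.split(word2)

    # Element [0] is the text between word1 and word2
    print(get_textchunk('blabla', 'blibli', text)[0])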
    

    This function splits the text in two steps, first on word1 and then on word2.

    If you want the text between two occurrences of the same word, where that word occurs exactly twice (once at the start of the text and once at the end), use this code:

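    def get_textchunk(word, text):
        # Splitting on a word that occurs twice yields three parts
        return text.split(word)

    # Element [1] is the part between the two occurrences
    print(get_textchunk('word', text)[1])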
    

    This will get you the middle of the text you just split.

    If you want the text between two different words that occur frequently in the body of your text, use this code:

    def get_textchunk(word1, word2, text):
        # Start just after the first occurrence of word1
        start = text.index(word1) + len(word1)
        # End at the first occurrence of word2 that follows word1
        end = text.index(word2, start)
        return text[start:end]
    

    This function may be the most helpful for you.
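
When both keywords occur several times, you may want every chunk rather than only the first one. The sketch below is one possible way to do that with the standard `re` module instead of `split` or `index`. It assumes `text` is the plain string you already pulled out of the page (for example with BeautifulSoup's `get_text()`); the helper name `get_textchunks` and the sample string are made up for illustration.

```python
import re


def get_textchunks(word1, word2, text):
    """Return every chunk of text found between word1 and the next word2."""
    # re.escape makes the keywords match literally, not as regex syntax.
    # (.*?) is non-greedy, so each match stops at the nearest word2.
    # re.DOTALL lets a chunk span several lines.
    pattern = re.escape(word1) + r"(.*?)" + re.escape(word2)
    return [chunk.strip() for chunk in re.findall(pattern, text, flags=re.DOTALL)]


# Hypothetical sample text, modelled on the example in the question.
sample = "blabla text i need blibli filler blabla second\nchunk blibli"
print(get_textchunks("blabla", "blibli", sample))
```

If either keyword is missing, this version returns an empty list instead of raising an error, which may or may not be what you want.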