Tags: python, screen-scraping, fetch

Finding content between two words without RegEx, BeautifulSoup, lxml, etc.


How can I find the content between two words or two sets of random characters?

The scraped page is not guaranteed to be HTML only, and the important data can be inside a JavaScript block, so I can't remove the JavaScript.

Consider this:

<html>
<body>
<div>StartYYYY "Extract HTML", ENDYYYY

</body>

Some Java Scripts code STARTXXXX "Extract JS Code" ENDXXXX.

</html>

As you can see, the HTML markup may not be complete. I can fetch the page, and then, without worrying about the markup, I want to find the content marked "Extract HTML" and "Extract JS Code" above.

What I am looking for in Python is something like this:

data = FindBetweenText(UniqueTextBeforeContent, UniqueTextAfterContent, page)

Here page is the downloaded page, and data would hold the text I am looking for. I would rather stay away from regex, as some of the cases can be too complex for it.


Solution

  • Here's my attempt, which I have tested. Although it is recursive, there should be no unnecessary string duplication; a generator might be more efficient, though.

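    def bracketed_find(s, start, end, startat=0):
        """Return a list of every substring of s found between start and end."""
        startloc = s.find(start, startat)
        if startloc == -1:
            return []
        endloc = s.find(end, startloc + len(start))
        if endloc == -1:
            # No closing marker: return everything after the opening marker.
            return [s[startloc + len(start):]]
        # Keep this match, then continue searching after the closing marker.
        return [s[startloc + len(start):endloc]] + bracketed_find(s, start, end, endloc + len(end))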
    

    And here is a generator version:

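    def bracketed_find(s, start, end, startat=0):
        """Yield every substring of s found between start and end."""
        startloc = s.find(start, startat)
        if startloc == -1:
            return
        endloc = s.find(end, startloc + len(start))
        if endloc == -1:
            # No closing marker: yield everything after the opening marker.
            yield s[startloc + len(start):]
            return
        yield s[startloc + len(start):endloc]

        # Continue searching after the closing marker (Python 3.3+ syntax).
        yield from bracketed_find(s, start, end, endloc + len(end))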
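
Both versions above recurse once per match, so a page with a very large number of matches could hit Python's recursion limit. As a possible alternative, here is a minimal iterative sketch of the same idea. The names `iter_between` and `find_between` are hypothetical (they are not from the answer), and `find_between` only mirrors the argument order of the `FindBetweenText` call in the question by returning the first match. The empty-marker check is an added assumption, since an empty marker would otherwise never advance the search.

```python
def iter_between(text, start, end):
    """Yield each substring of text found between the start and end markers."""
    if not start or not end:
        # Assumption: empty markers are rejected, since they would loop forever.
        raise ValueError("start and end markers must be non-empty")
    pos = 0
    while True:
        start_loc = text.find(start, pos)
        if start_loc == -1:
            return
        content_start = start_loc + len(start)
        end_loc = text.find(end, content_start)
        if end_loc == -1:
            # No closing marker: yield the remainder, as the versions above do.
            yield text[content_start:]
            return
        yield text[content_start:end_loc]
        # Resume the search just past the closing marker.
        pos = end_loc + len(end)


def find_between(start, end, text, default=None):
    """Return the first match, or default when the start marker is absent."""
    return next(iter_between(text, start, end), default)


# The sample page from the question, including its incomplete markup.
page = '''<html>
<body>
<div>StartYYYY "Extract HTML", ENDYYYY

</body>

Some Java Scripts code STARTXXXX "Extract JS Code" ENDXXXX.

</html>'''

html_part = find_between("StartYYYY", "ENDYYYY", page)
js_part = find_between("STARTXXXX", "ENDXXXX", page)
print(repr(html_part))
print(repr(js_part))
```

Note that `str.find` is case-sensitive, so the markers must match the page exactly (`StartYYYY` and `STARTXXXX` in the sample), and the surrounding whitespace is kept in the result; call `.strip()` on it if that is not wanted.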