How do I extract the content between two words or two sets of arbitrary characters?
The scraped page is not guaranteed to be HTML only, and the important data can be inside a JavaScript block, so I can't strip out the JavaScript.
Consider this:
<html>
<body>
<div>StartYYYY "Extract HTML", ENDYYYY
</body>
Some JavaScript code STARTXXXX "Extract JS Code" ENDXXXX.
</html>
As you can see, the HTML markup may not be complete. I can fetch the page; then, without worrying about the markup, I want to extract the content "Extract HTML" and "Extract JS Code".
What I am looking for, in Python, is something like this:
data = FindBetweenText(UniqueTextBeforeContent, UniqueTextAfterContent, page)
Here, page is the downloaded page and data would hold the text I am looking for. I would rather stay away from regular expressions, as some of the cases can be too complex for them.
Here's my attempt; it is tested. Although it is recursive, it should not duplicate the string unnecessarily, though a generator might be more efficient:
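def bracketed_find(s, start, end, startat=0):
    """Return a list of every substring of s found between start and end."""
    startloc = s.find(start, startat)
    if startloc == -1:
        return []  # no further start marker
    endloc = s.find(end, startloc + len(start))
    if endloc == -1:
        # No closing marker: return everything after the start marker.
        return [s[startloc + len(start):]]
    # Keep this match, then search the rest of the string after the end marker.
    return [s[startloc + len(start):endloc]] + bracketed_find(s, start, end, endloc + len(end))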
And here is a generator version:
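def bracketed_find(s, start, end, startat=0):
    """Yield every substring of s found between start and end."""
    startloc = s.find(start, startat)
    if startloc == -1:
        return
    endloc = s.find(end, startloc + len(start))
    if endloc == -1:
        # No closing marker: yield everything after the start marker.
        yield s[startloc + len(start):]
        return
    else:
        yield s[startloc + len(start):endloc]
        # Continue searching after the end marker.
        for found in bracketed_find(s, start, end, endloc + len(end)):
            yield found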
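Both versions recurse once per match, so a page with a very large number of matches could run into Python's recursion limit. As a rough sketch of one way around that, here is the same search written as a loop. The names iter_between and find_between_text are hypothetical (the second follows the call shape asked for in the question), the sample page is modelled on the one in the question, and rejecting empty markers is my own assumption rather than something the original code does.

```python
def iter_between(text, start, end):
    """Yield every substring of text found between the start and end markers.

    Iterative sketch: same results as the recursive versions, including
    yielding the remainder of the text when the end marker is missing.
    """
    if not start or not end:
        # Assumption: empty markers are rejected, since they would loop forever.
        raise ValueError("start and end markers must be non-empty")
    pos = 0
    while True:
        start_loc = text.find(start, pos)
        if start_loc == -1:
            return  # no further start marker
        content_start = start_loc + len(start)
        end_loc = text.find(end, content_start)
        if end_loc == -1:
            yield text[content_start:]  # unterminated: yield the rest
            return
        yield text[content_start:end_loc]
        pos = end_loc + len(end)  # resume after the end marker


def find_between_text(before, after, page):
    """Return the first match, or None if the before marker never occurs."""
    return next(iter_between(page, before, after), None)


# Sample page modelled on the example in the question.
page = '''<html>
<body>
<div>StartYYYY "Extract HTML", ENDYYYY
</body>
Some JavaScript code STARTXXXX "Extract JS Code" ENDXXXX.
</html>'''

print(find_between_text("StartYYYY", "ENDYYYY", page).strip())
print(find_between_text("STARTXXXX", "ENDXXXX", page).strip())
```

The markers are matched case-sensitively, and the returned text keeps any surrounding whitespace, hence the strip() calls in the usage lines.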