regex html-parsing regex-negation regex-lookarounds

RegEx: Get content from multiple concatenated HTML-Files


I have a bunch of HTML files that I concatenate, and I want to extract only the actual contents. However, I'm having trouble finding the correct regex for that. Basically, I'm trying to remove everything before, in between and after certain boundaries. It's somewhat similar to "Regular expression to match a line that doesn't contain a word?", but I feel my case is more complex. I'm having no luck.

Source-Data:

Stuff I dont need before

<div id="start">
blablabla11
blablabla12
<div id="end">

Stuff I dont need in the middle1

<div id="start">
blablabla21
blablabla22
<div id="end">

Stuff I dont need in the middle2

<div id="start">
blablabla31
blablabla32
<div id="end">

Stuff I dont need in the end

Desired result:

<div id="start">
blablabla11
blablabla12
<div id="end">

<div id="start">
blablabla21
blablabla22
<div id="end">

<div id="start">
blablabla31
blablabla32
<div id="end">

Context: I'm working in Sublime (Mac) -> Perl Regex

My current approach is based on inverse matching / regex lookarounds (I know there is a lot of discussion about wording, methods, ugliness etc. around this topic, but I can't afford to care, as I need to get the job done):

Find: (?s)^((?!(<div id="start">)(?s)(.*?)(<div id="end">)).)*$
Replace: $3

I've been testing and playing around with this and many more variants. However, it yields:

blablabla11
blablabla12

<div id="start">

blablabla21
blablabla22

<div id="start">

blablabla31
blablabla32

<div id="start">

Nice, but not there yet. And whatever I try, I stumble into other problems. Noob at work, I guess.

Thanks a gazillion for your help guys!

Chris

EDIT: Thank you for the first answers! However, I must admit that my minimal example is a bit misleading (because it is too easy). In reality I am facing hundreds of complex and diverse HTML files concatenated into one single large file. The only common bits are that the content of every HTML file starts with a known string (here simplified as <div id="start">) and ends with a known string (here simplified as <div id="end">). And the content as such obviously has loads of different tags etc. So just testing for opening and closing tags sadly won't cut it.


Solution

  • You may look for

    (?s).*?(<div id="start">.*?<div id="end">)(?:(?:(?!<div id="start">).)*$)?
    

    and replace with $1\n\n.

    Details

    • (?s) - DOTALL modifier, . now matches any char
    • .*? - any 0+ chars, as few as possible
    • (<div id="start">.*?<div id="end">) - Group 1: <div id="start">, any 0+ chars as few as possible, and <div id="end">
    • (?:(?:(?!<div id="start">).)*$)? - an optional non-capturing group matching 1 or 0 occurrences of the following (it consumes the trailing text after the last block):
      • (?:(?!<div id="start">).)* - any char, 0 or more occurrences, that is not the start of a <div id="start"> char sequence (aka tempered greedy token)
      • $ - end of string.
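
    For anyone who wants to try the pattern outside Sublime, here is a minimal sketch in Python. It is not part of the original answer: the function name keep_blocks and its parameters are made up for illustration, and the two markers are passed through re.escape on the assumption that the real "known strings" are arbitrary literal text, as the EDIT in the question suggests. Python's re module writes the replacement as \1 instead of $1.

```python
import re


def keep_blocks(text, start='<div id="start">', end='<div id="end">'):
    """Keep only the start..end blocks, each followed by a blank line.

    Hypothetical helper: `start` and `end` are literal marker strings,
    so they are escaped before being placed into the pattern.
    """
    s, e = re.escape(start), re.escape(end)
    pattern = re.compile(
        r"(?s)"                  # DOTALL: . also matches newlines
        r".*?"                   # unwanted text before the next block, as short as possible
        rf"({s}.*?{e})"          # group 1: one block, markers included
        rf"(?:(?:(?!{s}).)*$)?"  # trailing text, only if no further block follows
    )
    # Replace each match with the captured block plus an empty line.
    return pattern.sub(r"\1\n\n", text)


sample = (
    "Stuff I dont need before\n\n"
    '<div id="start">\nblablabla11\n<div id="end">\n\n'
    "Stuff I dont need in the middle1\n\n"
    '<div id="start">\nblablabla21\n<div id="end">\n\n'
    "Stuff I dont need in the end\n"
)

print(keep_blocks(sample))
```

    Note that if the input contains no block at all, the pattern never matches and the text comes back unchanged, so an empty result should not be expected in that case.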