Page scraping using ColdFusion


I need to create a new layout dynamically using ColdFusion by scraping the top and bottom of the page and saving them as two separate variables.

The top runs from the start of the page until this.

googleoff: all (This is in an HTML Comment)

The bottom starts at this

googleon: all (This is in an HTML comment)

until the end.

I am thinking that I can use regular expressions to do this.


Solution

  • Assuming that these comments only occur in the positions you have stated, you can easily do this with a regex string split:

    <cfset Sections = String.split( '<!-- google(?:on|off): all -->' ) />
    
    <cfset TopOfPage    = Sections[1] />
    <cfset BottomOfPage = Sections[3] />
    

    An updated regex would be needed if the comments are not fixed - for example, you can replace the spaces with \s* if the whitespace is unpredictable.
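
    As a rough sketch of that whitespace-tolerant variant, assuming the page source is already in a variable named String as in the example above: the arrayLen check and the cfthrow message are illustrative additions, not part of the original answer. Because the underlying Java split drops trailing empty strings, fewer than three pieces means either a marker is missing or nothing follows the googleon comment.

    <!--- \s* allows any amount of whitespace (including none) inside the comment --->
    <cfset Sections = String.split( '<!--\s*google(?:on|off):\s*all\s*-->' ) />

    <!--- Hypothetical guard: only index into Sections if all three pieces exist --->
    <cfif arrayLen( Sections ) GTE 3>
        <cfset TopOfPage    = Sections[1] />
        <cfset BottomOfPage = Sections[3] />
    <cfelse>
        <cfthrow message="Expected googleoff and googleon markers with content after them." />
    </cfif>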


    For comparison, here's a non regex version:

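    <!--- Position of the last character before the googleoff comment --->
    <cfset EndOfTopPos      = find( '<!-- googleoff: all -->' , String ) - 1 />
    <!--- 22 is the length of the googleon comment, so this is the first character after it --->
    <cfset StartOfBottomPos = find( '<!-- googleon: all -->' , String , EndOfTopPos + 1 ) + 22 />
    
    <cfset TopOfPage    = left( String , EndOfTopPos ) />
    <!--- The + 1 keeps the character at StartOfBottomPos itself --->
    <cfset BottomOfPage = right( String , len(String) - StartOfBottomPos + 1 ) />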
    

    Since this works with fixed strings, it is faster - but you would need to repeat this several thousand times before the difference became significant.