javascript, google-apps-script, cheerio

Is there a better way to extract a substring from text located via indexOf?


With my current method I have to add +13 and -1 to the calculation when finding the position of each part of the text (const Before and const After). Is there a more reliable and correct way?

  const PositionBefore = TextScript.indexOf(Before)+13;
  const PositionAfter = TextScript.indexOf(After)-1;

My fear is that the search text will change for some reason, I will forget to update the numbers in the calculation, and the retrieved text will be wrong.

The part of the text I am retrieving is a date and time:

2021-08-31 19:12:08

function Clock() {
  var sheet = SpreadsheetApp.getActive().getSheetByName('Clock');
  var url = 'https://int.soccerway.com/';
  
  const contentText = UrlFetchApp.fetch(url).getContentText();
  const $ = Cheerio.load(contentText);
  
  const Before = '"timestamp":"';
  const After = '});\n    block.registerForCallbacks();';
  
  var ElementSelect = $('script:contains(' + Before + ')');
  var TextScript = ElementSelect.html().replace("\n","");
  
  const PositionBefore = TextScript.indexOf(Before)+13;
  const PositionAfter = TextScript.indexOf(After)-1;
  
  sheet.getRange(1, 1).setValue(TextScript.substring(PositionBefore, PositionAfter));
}

Example of the full text collected in var TextScript:

  (function() {
    var block = new HomeMatchesBlock('block_home_matches_31', 'block_home_matches', {"block_service_id":"home_index_block_homematches","date":"2021-08-31","display":"all","timestamp":"2021-08-31 19:12:08"});
    block.registerForCallbacks();
    
    $('block_home_matches_31_1_1').observe('click', function() { block.filterContent({"display":"all"}); }.bind(block));
$('block_home_matches_31_1_2').observe('click', function() { block.filterContent({"display":"now_playing"}); }.bind(block));


      block.setAttribute('colspan_left', 2);
  block.setAttribute('colspan_right', 2);



    TimestampFormatter.format('block_home_matches_31');
  })();
  

Solution

  • There is no way to eliminate the risk of structural changes to the source content.

    You can take some steps to minimize the likelihood that you forget to change your code - for example, by removing the need for hard-coded +13 and -1. But there can be other reasons for your code to fail, beyond that.
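
    One way to drop the hard-coded offsets is sketched below in plain JavaScript. It assumes the value always ends at the next double quote, so the start offset can come from the marker's own length and no second marker is needed. The helper name extractAfterMarker is hypothetical, and the sample string is copied from the script text in the question.

```javascript
// Hypothetical helper: returns the text between `marker` and the next
// occurrence of `terminator`, or null if either one is missing.
function extractAfterMarker(text, marker, terminator) {
  const markerIndex = text.indexOf(marker);
  if (markerIndex === -1) return null;

  // The offset is derived from the marker itself, so it stays correct
  // if the marker text is edited later.
  const start = markerIndex + marker.length;
  const end = text.indexOf(terminator, start);
  if (end === -1) return null;

  return text.substring(start, end);
}

// Sample taken from the script text shown in the question.
const sample = 'var block = new HomeMatchesBlock(\'block_home_matches_31\', \'block_home_matches\', {"block_service_id":"home_index_block_homematches","date":"2021-08-31","display":"all","timestamp":"2021-08-31 19:12:08"});';

const timestamp = extractAfterMarker(sample, '"timestamp":"', '"');
console.log(timestamp); // 2021-08-31 19:12:08
```

    A null return signals that a marker was not found, which is easier to detect than a silently wrong substring.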

    It's probably more important to make it extremely obvious when your code does fail.

    Consider the following sample (which does not use Cheerio, for simplicity):

    function demoHandler() {
      var url = 'https://int.soccerway.com/';
      const contentText = UrlFetchApp.fetch(url).getContentText();
    
      // match() returns null when nothing matches, so check before reading [0]
      var matched = contentText.match(/{.*?"timestamp".*?}/);
      if ( matched ) {
        try {
          var json = JSON.parse(matched[0]);
          // only reached when the JSON parsed successfully
          console.log( json.timestamp );
        } catch(err) {
          console.log( err ); // "SyntaxError..."
        }
      } else {
        console.log( 'Something went terribly wrong...' );
      }
    
    }
    

    When you run the above function it prints the following to the console:

    2021-08-31 23:18:46
    

    It does this by assuming that the "timestamp" key is part of a JSON string that starts with { and ends with }.

    You can therefore extract this JSON string and convert it to a JavaScript object and then access the timestamp value directly, without needing to handle substrings.

    If the JSON is not valid you will get an explicit error similar to this:

    [SyntaxError: Unexpected token c in JSON at position 0]
    

    Scraping web page data almost always carries this type of risk: your code can be brittle and break easily if the source structure changes without warning. Just try to make such changes as noticeable as possible. In your case, write the errors to your spreadsheet and make them really obvious (red, bold, etc.).

    And make good use of try...catch statements. See: try...catch
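
    As a rough sketch of that advice, the parsing can return either a value or an error message, and a small writer can style failures in bold red. The names parseTimestamp and writeResult are hypothetical. The cell argument is assumed to be an Apps Script Range such as sheet.getRange(1, 1), whose setValue, setFontColor and setFontWeight methods can be chained.

```javascript
// Returns { ok: true, value } or { ok: false, error } instead of throwing,
// so the caller always has something to write to the sheet.
function parseTimestamp(contentText) {
  const matched = contentText.match(/{.*?"timestamp".*?}/);
  if (!matched) {
    return { ok: false, error: 'ERROR: no JSON containing "timestamp" found' };
  }
  try {
    const json = JSON.parse(matched[0]);
    if (typeof json.timestamp !== 'string') {
      return { ok: false, error: 'ERROR: "timestamp" is not a string' };
    }
    return { ok: true, value: json.timestamp };
  } catch (err) {
    return { ok: false, error: 'ERROR: ' + err.message };
  }
}

// Writes the result to a cell; failures are shown in bold red.
// `cell` is assumed to be an Apps Script Range, e.g. sheet.getRange(1, 1).
function writeResult(cell, result) {
  if (result.ok) {
    cell.setValue(result.value).setFontColor('black').setFontWeight('normal');
  } else {
    cell.setValue(result.error).setFontColor('red').setFontWeight('bold');
  }
}

// Usage inside Apps Script (not executed here):
// const cell = SpreadsheetApp.getActive().getSheetByName('Clock').getRange(1, 1);
// writeResult(cell, parseTimestamp(UrlFetchApp.fetch(url).getContentText()));
```

    Keeping the parsing separate from the sheet access means the parsing logic can be checked with plain strings, without fetching the page.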