Search code examples
regexgoogle-sheetsgoogle-sheets-formulare2

Extract matching addresses using Google Sheets Re2 regular expression syntax


I am trying to extract all cell/range addresses appear with in a formula in a Google Sheets cell.

Formulas by nature might be very complicated. I tried many patterns, that works in web testers, but not on google sheets re2.

The following example shows two issues. Maybe I misreading the matching results, but as I understand the are 4 Matches.

Formula (ignore the logic):

=A$13:B4+$BC$12+$DE2+F2:G2

Regex:

((\$?[A-Z]+\$?\d+)(:(\$?[A-Z]+\$?\d+))?)

Expected result:

[A$13:B4,$BC$12,$DE2,F2:G2]

Here (if I am not misreading the results) it looks OK. I am not sure if the group matching displayed are also considered as matches as it is stated "4 matches, 287 steps"

However in google sheets returns all Match 1 results

[A$13:B4,A$13,:B4,B4]    

The other matches are ignored So I guess the question is how to convert the regex to re2 syntax?

Update: Following player0 comments, maybe I was not clear. This is only a simple example, to isolate other issues I have. This one is just a string containing addresses in few relative and absolute formats. However, I am looking for a wider general solution that will fit any possible formulas that might contain formulas and references to other sheets. For example:

=(STDEVA(Sheet1!B2:B5)+sum($A$1:$A$2))*B2

Expected results here is Sheet1!B2:B5,$A$1:$A$2,B2

This formula contains two formulas and reference to another sheet. Still ignoring here from Named Ranges and other formula possible references that I am currently can not think of. Also, the square brackets [] are irrelevant, it was just way to display the results, and actually is copied from Logs as it is all done within a script.


Solution

  • I figured out a better way without splitting, by using the /g flag. However, this works in a script and not by using Sheets internal regex function (ie REGEXEXTRACT), as I couldn't figure out how to format the regex string within a cell that will contain the /g flag and REGEXEXTRACT will accept as a valid regex.

    Here's the code:

    /* Find all predessesor cells of input range 
    */
    function findPredecessor(rng){
     
      var formualaText = rng.getFormula();
      
      /* addMatchesRegex
      * supports all A1Notation addresses 
      * the 2nd regex after the | operator will match all column addresses (A:A, Sheet2!b:B, etc)
      * some NamedRanges with names like NameRange1 
      * Does not support - NamedRange with names including dot, not ending with digits 
      */
      var addMatchesRegex = /(([\w .'!]+)?(\$?[A-Z]+\$?\d+\b)(:(\$?[A-Z]+\$?\d+))?)|([\w .'!]+)?[A-Z]+:[A-Z]+/gi; 
         
      var addMatches = formualaText.match(addMatchesRegex);
      
      Logger.log("%s add matched: %s",addMatches.length,addMatches);
      
      /* fullMatchRegex
      *  modify addMatches to return also strings like
      * 1. SUM, IFERROR, etc - internal sheets functions.
      * 2. NamedRanges
      * 
      */
      var fullMatchRegex = /(([\w .'!]+)?([\$A-Z.\d]*)(:(\$?[A-Z]+\$?\d*))?)/gi; 
      
      // match regex with formula
      var fullMatches =  formualaText.match(fullMatchRegex);
        
      Logger.log("Full matches list: %s",fullMatches);
      
      var namedRangesAdd = analyzeMatch(addMatches,fullMatches);
        
      Logger.log("%s total predecessors: %s",namedRangesAdd.length,namedRangesAdd);
    }
    
    
    
    /* This function accepts the two regex matches list
    *  and returns one unique list of all predecessor addresses
    *  @param {Array} addMatches - All A1 notation addresses 
    *                              plus some of NamedRanges 
    *  @param {Array} fullMatches - All A1 notation addresses,All NamedRanges,
    *                               Other irrelevent matches
    */
    function analyzeMatch(addMatches,fullMatches){
    
      /*Expected 
        First parameter - holds all A1Notation addresses as well as NamedRanges that
        their name in the form of /[A-Z]+/d+
        NamedRange with name including dot(.) or does not contain digits will not
        be on the list
        Second Parameter - contains all first list matches, as well as all NamedRanges
        names and also irrelevant matches to be filtered like function names and empty string 
      */
      
      //Full Matched Addresses to be returned
      var mAddresses = [];
      
      //Remove duplicate addresses
      var uniqueMatches = 
          addMatches.filter((item,index)=>addMatches.indexOf(item)===index); 
      
      //Get all named Ranges in spread sheet
      var nr = SpreadsheetApp.getActive().getNamedRanges();
      
      // Loop Named Ranges arr 
      nr.forEach(function(item){
      
        /* Check if the name of the current Named Range
        * is included in matches
        * 1. first in addMatches list
        * 2. only if not found in the wider list */
        
        var name = item.getName();
        
        //Check if in addmatches array
        var i = uniqueMatches.indexOf(name);
        
        //Build A1Notation address of current NamedRange 
        var rng = item.getRange();
        var add = "'" + rng.getSheet().getName() + "'!" + rng.getA1Notation();    
        
        if (i > -1){
          
          //Add the address of curr NamedRange to final list 
          mAddresses.push(add);
          //Remove curr NamedRange from list
          uniqueMatches.splice(i,1);
          
        }else if (fullMatches.includes(name)){
          // Name found - add the address of the 
          //              Named Range to matched Addresses list
          
          mAddresses.push(add);    
        }
        
      });
      
      //Add all left matched addresses to final list  
      mAddresses.push(...uniqueMatches);
      
      return mAddresses;
       
    }
    

    What makes it a bit complicated are NamedRanges. This code will match and analyze and returns one list with all predecessors addresses including the addresses of the NamedRanges.