xpath, web-scraping, google-sheets, google-sheets-formula

Extract substring using importxml and substring-after


Using IMPORTXML in Google Sheets, I was able to extract the following data from a URL (in cell A2) with:

=IMPORTXML(A2,"//a/@href[substring-after(., 'AGX:')]")

Data:

/vector/AGX:5WH
/vector/AGX:Z74
/vector/AGX:C52
/vector/AGX:A27
/vector/AGX:C6L

But I want to extract only the code after "/vector/AGX:". The code is not always 3 characters long, and the number of rows is not fixed either.

I used =INDEX(SPLIT(AP2,"/,'vector',':'"),1,2), but it applies to only one row of data. I had to copy the INDEX+SPLIT formula down the whole column and insert an additional column to store the codes:

5WH
Z74
C52
A27
C6L

I would like to extract the code(s) after AGX: using IMPORTXML in one go. Is there a way?


Solution

  • Your issue is in how you are using the INDEX formula. Its row argument selects the row (in your case, each element) and its column argument selects the column (in your case, either AGX or the code after it).

    If, instead of targeting a single cell, we apply this formula to a range and leave the row argument empty, the formula returns all the values, which is what you were aiming for. Here is the implementation (where F1:F5 is the range of values the formula is applied to):

    =INDEX(SPLIT(F1:F5,"/,'vector',':'"),,2)
    

    If you are interested in a solution that uses only IMPORTXML and XPath, according to the documentation you could use substring-after as follows:

    =IMPORTXML(A2,"//a/@href[substring-after(.,'AGX:')]")
    

    The drawback is that it returns the full string rather than only what comes after AGX:, because the predicate merely filters which href attributes are selected; it does not transform them. You would therefore still need a Google Sheets formula to split the result. This is the furthest I have got using XPath alone. In XML it would be easier to apply a forEach and select only what comes after the :, but I believe that in Sheets this is more complicated, if not impossible, using XPath alone.
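
    As a possible one-cell workaround, the two ideas above can be combined by wrapping the import in the same INDEX/SPLIT pattern, so no helper column is needed. This is only a sketch: it assumes the URL is in A2 as in the question, that IMPORTXML returns the hrefs as a single column, and that each href contains exactly one colon (true for the sample data, not verified against the real page).

    =INDEX(SPLIT(IMPORTXML(A2,"//a/@href[substring-after(., 'AGX:')]"),":"),,2)

    If the hrefs may contain other colons, a regular expression is an alternative under the same assumptions; REGEXEXTRACT returns only the captured group after AGX: and ARRAYFORMULA applies it to every imported row.

    =ARRAYFORMULA(REGEXEXTRACT(IMPORTXML(A2,"//a/@href[substring-after(., 'AGX:')]"),"AGX:(.+)"))

    For the sample hrefs in the question, either formula should spill the codes (5WH, Z74, ...) down a single column, however many rows the page returns.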