Search code examples
regexgoogle-sheetsextractgoogle-sheets-formulastring-matching

Regex: extract multiple URL strings from a cell of arrays


What is a clean regex pattern for matching URL strings that stops at the first comma? Trying to extract values from an array of arrays in Google Sheets.

Cell A1

{https://www.myshop.com/shop/the_first_shop,Marcus. White's. Shop.,ACTIVE,US};{https://www.myshop.com/shop/a-second-shop,The first! Shop,CLOSED,UK};{EMPTY,ClosedShop,CLOSED,IN}

Desired Output (Cell B1)

https://www.myshop.com/shop/the_first_shop,https://www.myshop.com/shop/a-second-shop

I have figured out how to get a clean array of matching values in my desired output cell using:

=trim(regexreplace(regexreplace(regexreplace(REGEXREPLACE(A2,"/(https?:\/\/[^ ]*)/"," "),";"," "),"}"," "),"{"," "))

But I can't find a regex pattern that stops at a comma. For example, this soution:

"/(https?:\/\/[^ ]*)/" 

matches the first URL, but gives me back:

https://www.myshop.com/shop/the_first_shop,Marcus. White's. Shop.,ACTIVE,US https://www.myshop.com/shop/a-second-shop,The first! Shop,CLOSED,UK EMPTY,ClosedShop,CLOSED,IN

Solution

  • I'd go with REGEXREPLACE and use:

    =REGEXREPLACE(A1,".*?(?:(https.*?)|$)","$1")
    

    Just a trailing comma to deal with...

    =REGEXREPLACE(REGEXREPLACE(A1,".*?(?:(https.*?(,))|$)","$1"),",$","")
    

    A much longer alternative to REGEXREPLACE could be:

    =TEXTJOIN(",",,QUERY(TRANSPOSE(SPLIT(SUBSTITUTE(SUBSTITUTE(A1,"{","}"),"}",","),",")),"Select Col1 where Col1 like 'http%'"))