Search code examples
web-scrapinggoogle-sheetsxpathgoogle-sheets-formula

Trying to use importxml in google sheets to extract a value from a website, but it returns a completely different value than I am selecting


image of my google sheet

I am trying to create a sheet where I want to scrape values from www.tapology.com of individual fighters and fill columns created like fighter's name, his record, where he is from, etc.

I would like to scrape some values of this fighter so I started with name:

https://www.tapology.com/fightcenter/fighters/119825-dominik-mazur

I put together something like this:

=IMPORTXML("https://www.tapology.com/fightcenter/fighters/119825-dominik-mazur", "//div[@id='stats']/ul[1]/li[1]/span[1]")

The XPath should lead to the name of the fighter which is Dominik Mazur, but for some reason it returns a different fighters name at random.

I guess it gets all the names that have the same structure, but I have no idea how to single out the value that I'm looking for.

Generally my goal is to create a google sheet where I paste a link to any fighter's tapology and it fills my sheet with the information I want instead of me manually typing everything.

Unfortunately I am not a programmer or very technical person so I would really appreciate your help. If anything is unclear please let me know.


Solution

  • Solution

    Try this

    =arrayformula(split(transpose(split(scraping("yourURLhere"),"♥")),"♦"))
    

    with custom function

    function scraping(url){
      var source = UrlFetchApp.fetch(url).getContentText().replace(/(\r\n|\n|\r|\t|  )/gm,"")
      var data = source.match(/<div class='details details_two_columns' id='stats'>[\s\S\w]+?<\/div>/g)
      return (data[0].replace(/<li>/g,'♥').replace(/<span>/g,'♦').replace(/<\/li>|<strong>|<\/strong>|<\/span>/g,'').replace(/(<([^>]+)>)/g,''))
    }
    

    Scrapping multiple urls

    function infoTapology(url){
      var source = UrlFetchApp.fetch(url).getContentText().replace(/(\r\n|\n|\r|\t|  )/gm,"")
      var data = source.match(/<div class='details details_two_columns' id='stats'>[\s\S\w]+?<\/div>/g)
      var infos = data[0].replace(/<li>/g,'♥').replace(/(<([^>]+)>)/g,'').replace(/&#39;/g,'’').replace(/&quot;/g,'”').split('♥')
      var result = []
      infos.forEach(function(info){
        var x = info.split(':')[0]
        if (info) {x=info.replace(x+':','')}
        result.push(x)
      })
      return([result])
    }
    

    enter image description here

    A comparison between importxml and custom function:

    https://docs.google.com/spreadsheets/d/1se1lWH4g4OSvAW2NldjGiTDS5XB19_5QwSBpMqVRy84/copy