Search code examples
google-sheets

regexreplace multiple columns for football data


I've been trying without success to import using google sheets to scrape the various team table data from the url's https://www.pro-football-reference.com/years/2024/ and https://www.pro-football-reference.com/years/2024/opp.htm

I found this code from another post.

ImportXML XPath issue using Google Sheets on a simple web scraping query

It is so close to my need but I've been unable adapt it to similar data from the same website.

The query portion pulls the correct unparsed data.

QUERY(IMPORTDATA("https://www.pro-football-reference.com/years/2024/");"select Col1 where Col1 contains 'exp_pts_rush'")

So I believe my misunderstanding of REGEXREPLACE is the issue.

Any help is appreciated

=ARRAYFORMULA(SPLIT(QUERY(ARRAYFORMULA(REGEXREPLACE(ARRAYFORMULA(SUBSTITUTE(QUERY(IMPORTDATA("https://www.pro-football-reference.com/years/2024/");"select Col1 where Col1 contains 'exp_pts_rush'");"></td>";">0</td>"));".+ranker.+?>(.+?)<.+team.+([A-Z]{3}).+g.+?>(.+?)<.+?rush.+?>(.+?)<.+?rush.+?>(.+?)<.+?rush.+?>(.+?)<.+?rush.+?>(.+?)<.+?rush.+?>(.+?)<.+?rush.+?>(.+?)<.+?fum.+?>(.+?)<.+?exp.+";"$1;$2;$3;$4;$5;$6;$7;$8;$9;$10;$11"));"select * WHERE NOT Col1 contains '<'");";"))

I hope to a result like this. I hope to a result like this;


Solution

  • Here's one approach which you may test out:

    =let(Σ,tocol(split(choosecols(importxml("https://www.pro-football-reference.com/years/2024","//*[@id='all_rushing']"),3),char(10))),
     filter(substitute(split(regexreplace(Σ,"([a-z])( )([A-Za-z]+\d+|\d+[a-zA-Z]+|[A-Z])","$1🌀$3")," "),"🌀"," "),regexmatch(Σ,"^\d+ ")))
    

    enter image description here

    semicolon-based

    =let(Σ;tocol(split(choosecols(importxml("https://www.pro-football-reference.com/years/2024";"//*[@id='all_rushing']");3);char(10)));
     filter(substitute(split(regexreplace(Σ;"([a-z])( )([A-Za-z]+\d+|\d+[a-zA-Z]+|[A-Z])";"$1🌀$3");" ");"🌀";" ");regexmatch(Σ;"^\d+ ")))