I've been trying without success to import using google sheets to scrape the various team table data from the url's https://www.pro-football-reference.com/years/2024/ and https://www.pro-football-reference.com/years/2024/opp.htm
I found this code from another post.
ImportXML XPath issue using Google Sheets on a simple web scraping query
It is so close to my need but I've been unable adapt it to similar data from the same website.
The query portion pulls the correct unparsed data.
QUERY(IMPORTDATA("https://www.pro-football-reference.com/years/2024/");"select Col1 where Col1 contains 'exp_pts_rush'")
So I believe my misunderstanding of REGEXREPLACE
is the issue.
Any help is appreciated
=ARRAYFORMULA(SPLIT(QUERY(ARRAYFORMULA(REGEXREPLACE(ARRAYFORMULA(SUBSTITUTE(QUERY(IMPORTDATA("https://www.pro-football-reference.com/years/2024/");"select Col1 where Col1 contains 'exp_pts_rush'");"></td>";">0</td>"));".+ranker.+?>(.+?)<.+team.+([A-Z]{3}).+g.+?>(.+?)<.+?rush.+?>(.+?)<.+?rush.+?>(.+?)<.+?rush.+?>(.+?)<.+?rush.+?>(.+?)<.+?rush.+?>(.+?)<.+?rush.+?>(.+?)<.+?fum.+?>(.+?)<.+?exp.+";"$1;$2;$3;$4;$5;$6;$7;$8;$9;$10;$11"));"select * WHERE NOT Col1 contains '<'");";"))
I hope to a result like this.
Here's one approach which you may test out:
=let(Σ,tocol(split(choosecols(importxml("https://www.pro-football-reference.com/years/2024","//*[@id='all_rushing']"),3),char(10))),
filter(substitute(split(regexreplace(Σ,"([a-z])( )([A-Za-z]+\d+|\d+[a-zA-Z]+|[A-Z])","$1🌀$3")," "),"🌀"," "),regexmatch(Σ,"^\d+ ")))
=let(Σ;tocol(split(choosecols(importxml("https://www.pro-football-reference.com/years/2024";"//*[@id='all_rushing']");3);char(10)));
filter(substitute(split(regexreplace(Σ;"([a-z])( )([A-Za-z]+\d+|\d+[a-zA-Z]+|[A-Z])";"$1🌀$3");" ");"🌀";" ");regexmatch(Σ;"^\d+ ")))