USING: accessors html.parser.analyzer io kernel math namespaces
present regexp sequences ;
IN: all-roads-to-wiki
SYMBOL: G
: match-good-pages ( a -- ?/f )
R/ \/wiki\/[^:]*$/ first-match ;
: filter-urls ( tags -- urls )
find-hrefs [ present ] map
[ match-good-pages ] filter
[ match-good-pages seq>> ] map ;
: findpath ( url -- url )
G get =
[
! false
]
[ scrape-html nip
[
dup "title" find-by-name drop 1 + swap nth
text>> R/ - Wikipedia,/ re-split first print
]
[
"bodyContent" find-by-id-between filter-urls [ findpath ] map
] bi
] if ; inline recursive
: allroads-entry ( -- a )
readln "http://en.wikipedia.org/wiki/" prepend G set-global
"enwp.org/Special:Random" findpath ; inline
The above code will recurse over every link on Wikipedia until it finds the one it's looking for. That's okay, because (hopefully) findpath will eventually "return" (i.e. not call itself again) and leave a huge nested data structure on the stack. But when I try to compile this, I get an unbalanced-recursion error:
The recursive word “findpath” leaves with the stack having the wrong height
unbalanced-recursion: Thrown when stack effect inference determines that an inline recursive word has an incorrect stack effect declaration.
No matter what I do, Factor (understandably) complains about the stack effect not matching up. What do I have to do to get this to recurse properly?
Look closely at the findpath word. I'll add comments so that you can see what is on the stack:
: findpath ( url -- url )
! 1 item: { url }
G
! 2 items: { url G }
get
! 2 items: { url value-of-G }
=
! 1 item: { t/f }
[
! 0 items!!!!
! false
]
[ scrape-html nip
[
dup "title" find-by-name drop 1 + swap nth
text>> R/ - Wikipedia,/ re-split first print
]
[
"bodyContent" find-by-id-between filter-urls
[ findpath ] map
] bi
] if ; inline recursive
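The comments show the problem: by the time the quotations run, nothing is left on the stack. For comparison, here is a minimal sketch (not from the original code) of a balanced use of if, where both quotations push exactly one item so stack effect inference succeeds:

: sign-name ( n -- str )
    ! Both branches push one string, so the effect checker is happy.
    ! If the true branch were empty, as in findpath above, inference
    ! would reject the word with the same kind of error.
    0 > [ "positive" ] [ "non-positive" ] if ;
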
The = word consumed the url, and the if combinator consumes the boolean it left behind, so both branches start with an empty stack: the true branch leaves nothing even though the declared effect promises one output, and the false branch has no url left for scrape-html to consume. This code can't possibly work. Here is working code for the findpath word:
: page-title ( seq -- title )
dup "title" find-by-name drop 1 + swap nth
text>> R/ - Wikipedia,/ re-split first ;
: page-links ( seq -- links )
"bodyContent" find-by-id-between filter-urls ;
: scrape-en-wiki-url ( wiki-url -- seq )
"https://en.wikipedia.org" prepend
dup print flush scrape-html nip ;
: found-url? ( wiki-url -- ? )
G get [ = ] [ drop t ] if* ;
: findpath ( wiki-url -- seq/f )
dup found-url?
[ drop f G set f ] [
scrape-en-wiki-url
[ page-title print flush ] [
page-links [ findpath ] map
] bi
] if ; inline recursive
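For reference, the if* combinator calls the first quotation with the tested value still on the stack when it is not f, and calls the second quotation without it otherwise. A small sketch of how found-url? behaves (hypothetical values, not from the original answer):

! With G set, found-url? compares the url against the target:
"/wiki/A" G set-global
"/wiki/A" found-url? .   ! t
"/wiki/B" found-url? .   ! f

! After findpath finds the target it stores f into G, so G get
! hits the false branch of if*, every url reports "found", and
! the recursion unwinds instead of crawling further.
f G set-global
"/wiki/C" found-url? .   ! t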
Also take a look at the wikipedia vocabulary, which is meant for tasks like these.