I am using the SWI-Prolog library(http/http_open). According to the docs, "After [http_open(Url, Stream, [])] succeeds the data can be read from Stream." Thus, I thought maybe I could rig up a simple, declarative predicate to parse phrases from URLs by using phrase_from_stream/2 in library(pure_input):
    :- use_module(library(http/http_open)).
    :- use_module(library(pure_input)).

    phrase_from_url(Url, Phrase) :-
        http_open(Url, In, []),
        phrase_from_stream(Phrase, In),
        close(In).
But I suspect there is some nuance to the kinds of stream provided by http_open/3; I receive the following error:
ERROR: set_stream_position/2: stream `<stream>(0x7feebbf5c810)' does not exist (Device not configured)
(I have tested the same URL against the example provided in the library(http/http_open) docs, which uses copy_stream_data/2 to pipe the output to user_output, and it works. So I know the URL is not at fault.)
I have learned that I can download the data from the URL into a string, code list, or text file, and then use phrase/n, or a cousin, on that. But I'm hoping someone can help inform me about...
phrase_from_stream/2 on some streams, as one might naively hope.

As it is at the moment, library(pure_input) does not support non-repositioning streams. This is the problem.
One solution is to read everything and then use the normal phrase/2 on it. This, of course, is not the same as the promised "lazy reading".
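A minimal sketch of that read-everything approach might look like the following. The predicate names url_codes/2 and phrase_from_url_eager/2 are made up for illustration; the library calls (http_open/3, read_stream_to_codes/2, setup_call_cleanup/3, phrase/2) are standard SWI-Prolog. Character encoding and HTTP error handling are left out.

```prolog
:- use_module(library(http/http_open)).
:- use_module(library(readutil)).

:- meta_predicate phrase_from_url_eager(+, //).

% url_codes(+Url, -Codes)  (hypothetical helper)
% Reads the whole response body into a list of character codes.
% setup_call_cleanup/3 closes the stream even if reading throws.
url_codes(Url, Codes) :-
    setup_call_cleanup(
        http_open(Url, In, []),
        read_stream_to_codes(In, Codes),
        close(In)).

% phrase_from_url_eager(+Url, :Phrase)  (hypothetical helper)
% Not lazy: the complete body is in memory before parsing starts.
phrase_from_url_eager(Url, Phrase) :-
    url_codes(Url, Codes),
    phrase(Phrase, Codes).
```

Because the stream is closed before phrase/2 runs, backtracking inside the grammar never touches the network stream, which sidesteps the repositioning problem at the cost of memory.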
As for "parsing data from URL", keep in mind that SWI-Prolog has libraries for many things you find on the web: SGML/XML/HTML; JSON; RDF.
For picking out text from an HTML page, see for example this simple scraper. The relevant code is in scrape/3 and its helper predicates. It uses the SWI-Prolog SGML/XML parser and library(xpath).
In the meantime, if you want to use a DCG to parse from a non-repositioning stream, tough luck. library(pure_input) does not even work on the standard input. What you can do, depending on how your data is structured, is either use read_line_to_codes/3 (see the example) if your input is organized line-wise, or read_pending_input/3 if it is not, and read into a buffer.
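For the line-wise case, a rough sketch could look like this. It uses read_line_to_codes/2, the simpler sibling of read_line_to_codes/3 that drops the line terminator, and it assumes a made-up input format of one integer per line. The names parse_lines/2, line_item//1, and parse_lines_from_url/2 are hypothetical.

```prolog
:- use_module(library(http/http_open)).
:- use_module(library(readutil)).
:- use_module(library(dcg/basics)).

% parse_lines(+In, -Items)  (hypothetical helper)
% Reads the stream one line at a time and parses each line with a DCG.
% Only the current line is held in memory, so no repositioning is needed.
% Fails if any line does not match line_item//1.
parse_lines(In, Items) :-
    read_line_to_codes(In, Line),
    parse_lines_(Line, In, Items).

parse_lines_(end_of_file, _, []) :- !.
parse_lines_(Line, In, [Item|Items]) :-
    phrase(line_item(Item), Line),
    parse_lines(In, Items).

% Assumed line format: a single integer (integer//1 is from dcg/basics).
line_item(N) --> integer(N).

% parse_lines_from_url(+Url, -Items)  (hypothetical helper)
parse_lines_from_url(Url, Items) :-
    setup_call_cleanup(
        http_open(Url, In, []),
        parse_lines(In, Items),
        close(In)).
```

The stream-level predicate can be tried without a network connection, for example with `open_string("1\n22\n333\n", In), parse_lines(In, Items), close(In)`, which should give `Items = [1, 22, 333]`.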