Capture matching substrings from a DCG

Using regular expressions makes it quite easy to capture sub strings, e.g. the string "Jaco was an American bassist" matches this regular expression (PCRE2 syntax):

(?sm)^([Jj]aco).+(was|is).+?(American|famous).+(dancer|bassist|singer|programmer|dueller)

and captures these strings

Jaco
was
American
bassist.

Here is a DCG that matches the string as well as generating all the possible strings. But it doesn't capture the specific sub strings.

jaco_bassist --> ("J" ; "j"), "aco", space, ("was" ; "is"), space, ("a" ; "an"), space,
                   ("American" ; "famous"), space,
                   ("dancer" ; "bassist" ; "singer" ; "programmer" ; "dueller").
space --> " ".

What would be the best - or at last a good - way of getting the same captures using Prolog's DCGs. Preferably an approach that also generates the possible strings.

For simple problems like this one can use member/2 to enumerate all the alternatives:

jaco_bassist2([Name,WasIs,Adj,Noun]) -->  who(Name), space, was_is(WasIs), space,
                                          ("a" ; "an"), space, adj(Adj), space,
                                          noun(Noun).

who(Who) --> [Who], {member(Who,["Jaco","jaco"])}.
was_is(WasIs) --> [WasIs], {member(WasIs,["was","is"])}.
adj(Adj) --> [Adj], {member(Adj,["American","famous"])}.
noun(Noun) --> [Noun], {member(Noun,["dancer","bassist","singer","programmer","dueller"])}.

To get the captures:

 % ...
 phrase(jaco_bassist2,[Who,WasIs,Adj,Noun], String)

A major drawback of this approach is that for more complex structures the enumeration can be a little tricky, for example if the name in the subject string instead of "[Jj]aco" would be one of the 48 spellings of my last name (kjellerstrand):

kjellerstrand --> "k", ("je" ; "ä"), "ll", ("" ; "er" ; "ar"), 
                  ("st" ; "b"), ("" ; "r"), "a", (""; "n"), "d".

Please note that I'm looking for "basic" DCG, for example those supported by e.g. B-Prolog (i.e. not requiring SWI-Prolog's fancy DCG stuff).

Solution

Let me re-phrase that: Given a goal phrase(NT__0, Cs0,Cs), capture the sequence described by NT__0.

First of all we need to restrict ourselves to DCGs without semicontext. For a (non-empty) semicontext may be represented with two variables (which in that context do not form a difference) but cannot be captured with a single list.

append(Capture, Cs, Cs0) should be it. At least declaratively when considering only ground terms.

as --> "" | "a", as.

?- Cs = [], phrase(as, Cs0,Cs), append(Capture, Cs, Cs0).
   Cs = [], Cs0 = [], Capture = []
;  Cs = [], Cs0 = "a", Capture = "a"
;  Cs = [], Cs0 = "aa", Capture = "aa"
;  Cs = [], Cs0 = "aaa", Capture = "aaa"
;  Cs = [], Cs0 = "aaaa", Capture = "aaaa"
;  ... .
?- phrase(as, Cs0,Cs), append(Capture, Cs, Cs0).
   Cs0 = Cs, Capture = []
;  Cs0 = [_A|Cs0], Cs = [_A|Cs0], Capture = [_A], unexpected
;  Cs0 = [_A,_B|Cs0], Cs = [_A,_B|Cs0], Capture = [_A,_B], unexpected
;  ... .
?- set_prolog_flag(occurs_check,true).
   true.
?- phrase(as, Cs0,Cs), append(Capture, Cs, Cs0).
   Cs0 = Cs, Capture = []
;  loops, unexpected.

So far, the procedural reality of Prolog is a bit different. append/3 only works for lists but not for partial lists. There infinite, rational trees show up. And the occurs-check does not help that much, it just prevents the display of such answers, but keeps non-termination.

Time for a new version of append/3, append2u/3

?- set_prolog_flag(occurs_check,false).
   true.
?- phrase(as, Cs0,Cs), append2u(Capture, Cs, Cs0).
   Cs0 = Cs, Capture = []
;  Cs0 = [a|Cs0], Cs = [a|Cs0], Capture = [], unexpected
;  Cs0 = [a|Cs], Capture = "a", dif:dif(Cs,[a|Cs])
;  Cs0 = [a,a|Cs0], Cs = [a,a|Cs0], Capture = [], unexpected
;  Cs0 = [a,a|Cs], Capture = "aa", dif:dif(Cs,[a,a|Cs])
;  Cs0 = [a,a,a|Cs0], Cs = [a,a,a|Cs0], Capture = [], unexpected
;  ... .
?- set_prolog_flag(occurs_check,true).
   true.
?- phrase(as, Cs0,Cs), append2u(Capture, Cs, Cs0).
   Cs0 = Cs, Capture = []
;  Cs0 = [a|Cs], Capture = "a"
;  Cs0 = [a,a|Cs], Capture = "aa"
;  Cs0 = [a,a,a|Cs], Capture = "aaa"
;  ... .

So with the help of the occurs-check it is possible to get this right, also for the more general case. A new non-terminal phrase_capture//2 now uses the following internal definition:

phrase_capture(NT__0, Capture, S0,S) :-
   phrase(NT__0, S0,S1),
   append2u(Capture, S1, S0),
   S1 = S.

For systems without a built-in occurs-check like B, rewrite append2u/3 using unify_with_occurs_check/2 explicitly. That is, also for (\=)/2.

Some further optimizations may be done to avoid costs that depend on the size of Cs0+Cs instead of the length of Capture. Like special casing for var(Cs), Cs == [], and partial strings. If Cs is a list constructor, an internal implementation may also just skip through Cs0 to find that very address of Cs first, and only resort to more costly means otherwise. Care must be given to ensure that this is always terminating, thus using mechanisms similar to '$skip_max_list'/4.

Also, what to do if Cs0 and Cs do not fit, that is, if they are not the result of a valid grammar. Such a case may happen with generalizations to explain unexpected failure.

Usage:

jaco_bassist([Name,WasIs,Adj,Noun]) -->
   phrase_capture( (("J" ; "j"), "aco"), Name),
   space,
   phrase_capture( ("was" ; "is"), WasIs),
   space,
   ("a" ; "an"),
   space,
   phrase_capture( ("American" ; "famous"), Adj),
   space,
   phrase_capture( ("dancer" ; "bassist" ; "singer" ; "programmer" ; "dueller"), Noun).

?- phrase(jaco_bassist(D), Ys).
   D = ["Jaco","was","American","dancer"], Ys = "Jaco was a American ..."
;  D = ["Jaco","was","American","bassist"], Ys = "Jaco was a American ..."
;  D = ["Jaco","was","American","singer"], Ys = "Jaco was a American ..."
;  ...
;  D = ["jaco","is","famous","dueller"], Ys = "jaco is an famous d ...".

So this version terminates also when generating strings. And it has the potential to incur costs that are in many cases only depending on the length of the captured string. The original version using append/3 will always visit the entire string.

Lest I forget, there will always be some oddities should you be into infinite lists. Think of:

?- phrase("abc",L0,L0).
   L0 = [a,b,c|L0].
?- phrase("abc",L0,L0), phrase(phrase_capture("abc",Capture),L0,L).
   L0 = [a,b,c|L0], Capture = [], L = [a,b,c|L0], unexpected.
   L0 = [a,b,c|L0], Capture = "abc", L = [a,b,c|L0]. % expected

These are all typical paradoxa that infinite lists ensue. First luring people into them only to smother them.

The following version of phrase_capture//2 does not rely on internal details. It uses the ^s of library(lambda) which are responsible for parameter passing only. (The other lambda-related construct \ is for renaming.)

phrase_capture(NT__0, Capture) -->
   call(S0^S0^true),
   NT__0,
   call(S1^S1^true),
   {append2u(Capture, S1, S0)}.