php xml xpath xml-parsing domxpath

XPath expression for the first sentence in a paragraph


I'm looking for an XPath expression that returns the first sentence in a paragraph.

<p>
A federal agency is recommending that White House adviser Kellyanne Conway be 
removed from federal service saying she violated the Hatch Act on numerous 
occasions. The office is unrelated to Robert Mueller and his investigation.
</p>

The result should be:

A federal agency is recommending that White House adviser Kellyanne Conway be 
removed from federal service saying she violated the Hatch Act on numerous 
occasions.

I've tried a few things to no avail.

$expression = '/html/body/div/div/div/div/p//text()';

Would I need to use something like //p[ends-with(...)], or maybe substring-before()?


Solution

  • You're not going to be able to parse natural language via XPath, but you can get the substring up to and including the first period as follows:

    substring(/p,1,string-length(substring-before(/p,"."))+1)
    

    Note that this may not be the "first sentence" if there are abbreviations or other lexical occurrences of a period before the first sentence ends, or if the first sentence ends with another form of punctuation.


    Alternatively, and more concisely:

    concat(substring-before(/p, "."), ".")
    

    Credit: ThW's clever idea in the comments.
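
Since the question is tagged php, here is a minimal sketch of how the concat() expression above might be run through DOMXPath. The helper name firstSentence and the sample markup are assumptions for illustration, not part of the original question. The sketch also swaps /p for (//p)[1] so it works when the paragraph is nested inside a full HTML page, and wraps it in normalize-space() to collapse the line breaks inside the paragraph. Note that evaluate() is used rather than query(), because the expression returns a string, not a node list.

```php
<?php
// Hypothetical helper: returns the text of the first <p> up to and
// including its first period, using the concat()/substring-before() idea.
function firstSentence(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // normalize-space() trims the text and collapses internal whitespace;
    // (//p)[1] picks the first <p> in document order (an assumption about
    // which paragraph is wanted).
    $expr = 'concat(substring-before(normalize-space((//p)[1]), "."), ".")';

    // evaluate() can return a typed result (here, a string).
    return (string) $xpath->evaluate($expr);
}

// Sample markup based on the paragraph in the question.
$html = '<html><body><p>
A federal agency is recommending that White House adviser Kellyanne Conway be
removed from federal service saying she violated the Hatch Act on numerous
occasions. The office is unrelated to Robert Mueller and his investigation.
</p></body></html>';

echo firstSentence($html), PHP_EOL;
```

As with the bare expressions, this returns just "." when the paragraph contains no period, so you may want to check for that case before using the result.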