I want to split a parent block into segments, capturing the nested tag (if any) along with each segment's text:
(?<tag>.)(?: href="(?<url>.+?)")?>(?<text>.+?)<
It works, but I want "tag" to be empty for segments that aren't wrapped in a tag. With the current regex, those segments capture the preceding segment's closing tag instead.
Live sample: https://regex101.com/r/UEZAaw/3/
This is the result set I would like to obtain. Note that the third and fifth items (indexes 2 and 4, counting from zero) should have null for the tag:
match: "p>The <",
tag: "p",
url: null,
text: "The "
match: "a href=\"https://www.legislation.gov.uk/ukpga/2010/23/contents\">UK Bribery Act<",
tag: "a",
url: "https://www.legislation.gov.uk/ukpga/2010/23/contents",
text: "UK Bribery Act"
match: "/a> (“the Act”) received Royal Assent in April 2010 and came into ... <",
tag: null,
url: null,
text: " (“the Act”) received Royal Assent in April 2010 and came into ... "
match: "a href=\"http://www.oecd.org/daf/anti-bribery/ConvCombatBribery_ENG.pdf\">OECD anti-bribery Convention<",
tag: "a",
url: "http://www.oecd.org/daf/anti-bribery/ConvCombatBribery_ENG.pdf",
text: "OECD anti-bribery Convention"
match: "/a>. The Act outlined four prime offences, including the introduction ... <",
tag: null,
url: null,
text: ". The Act outlined four prime offences, including the introduction ... "
match: "b>rest is history<",
tag: "b",
url: null,
text: "rest is history"
How can I fix this?
I think this works, based on what I see in the MATCH INFORMATION box on regex101:
/(?:(?<tag>(?<!\/).)|(?:\/.))(?: href="(?<url>.+?)")?>(?<text>.+?)</gm
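To double-check this outside regex101, here is a minimal JavaScript sketch. It assumes the ECMAScript flavour and a reasonably recent engine (named groups, lookbehind and `matchAll` are all needed). The `html` string is a shortened, made-up sample rather than the full text from the question, and the names `pattern`, `html` and `segments` are only for illustration. The idea is that the second alternative `\/.` consumes a closing tag such as `/a` without capturing it, so `tag` stays unset for those segments.

```javascript
// Pattern from the answer above; the m flag is dropped here because
// the pattern contains no ^ or $ anchors for it to affect.
const pattern = /(?:(?<tag>(?<!\/).)|(?:\/.))(?: href="(?<url>.+?)")?>(?<text>.+?)</g;

// Hypothetical, shortened sample input (not the full text from the question).
const html =
  '<p>The <a href="https://www.legislation.gov.uk/ukpga/2010/23/contents">UK Bribery Act</a>' +
  ' received Royal Assent in April 2010. <b>rest is history</b></p>';

// JavaScript reports unmatched named groups as undefined,
// so map them to null to mirror the desired result set.
const segments = [...html.matchAll(pattern)].map((m) => ({
  match: m[0],
  tag: m.groups.tag ?? null,
  url: m.groups.url ?? null,
  text: m.groups.text,
}));

console.log(segments);
```

One caveat: `.` captures a single character, so this only handles one-letter tag names like the `p`, `a` and `b` in the question.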