Using the excellent guide at https://www.tachytelic.net/2021/05/power-automate-extract-text-from-word-docx-file/, I was able to set up a Power Automate flow to extract the text content from a DOCX file. After getting the content of the file, I have a Compose action with the following expression:
xpath(xml(outputs('Get_file_content')?['body']), '//*[name()=''w:t'']/text()')
This extracts the content. However, one drawback is that the text is often split over several w:t nodes in the document.xml file, so each fragment of the text is returned as its own item in the Compose output. For example, the output of the Compose action may look like this:
[
"If ",
"not, ",
"this needs to be documented. "
]
However, the above text, as it appears inside the document, should really just be:
[
"If not, this needs to be documented."
]
As this text is contained in a single table cell, is it possible to tweak the XPath expression above so that it concatenates all the text in the w:t nodes (per table cell), so that text values aren't split in this way in the final Compose output?
Since Power Automate only supports XPath 1.0, you are forced to use something outside the scope of XPath, as in the answer from @Skin.
If you want those strings per cell, you could first use this XPath:
//*[name()='w:tc']
and loop over that result with:
join(xpath(xml(variables('XML')), './/*[name()=''w:t'']/text()'), '')
Mind the . at the start of the second XPath, so that it uses the current context (the current w:tc).
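To make the two-step idea concrete outside of Power Automate, here is a minimal Python sketch of the same logic: select every w:tc first, then join the w:t text found under each cell. The sample XML and the function name cell_texts are made up for illustration, and the standard-library ElementTree is used in place of the flow's xpath() and join() functions, so treat it as an illustration of the approach rather than what the flow runs.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical, heavily simplified document.xml with two table cells;
# the first cell has its text split over three w:t nodes.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:tbl><w:tr>'
    '<w:tc><w:p>'
    '<w:r><w:t xml:space="preserve">If </w:t></w:r>'
    '<w:r><w:t xml:space="preserve">not, </w:t></w:r>'
    '<w:r><w:t>this needs to be documented.</w:t></w:r>'
    '</w:p></w:tc>'
    '<w:tc><w:p><w:r><w:t>Second cell</w:t></w:r></w:p></w:tc>'
    '</w:tr></w:tbl></w:body></w:document>'
)


def cell_texts(document_xml):
    """Return one concatenated string per w:tc element."""
    root = ET.fromstring(document_xml)
    # Step 1: select every table cell (the //*[name()='w:tc'] step).
    cells = root.findall(".//w:tc", NS)
    # Step 2: for each cell, search relative to that cell (the leading ".")
    # and join the text of all w:t nodes found below it.
    return [
        "".join(t.text or "" for t in cell.findall(".//w:t", NS))
        for cell in cells
    ]


print(cell_texts(SAMPLE))
```

As with the XPath version, the relative ".//" search is what keeps each result scoped to a single cell. Note that a table nested inside a cell would contribute its text both to the outer cell and to its own cells.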