I want to export HTML content from Confluence pages. Those can contain <img> tags with src attributes that are just usual hyperlinks. Since I want to export those as well, I decided to replace the src content with the corresponding data URLs, so that there is src="…".
This needs fetching of the images via HTTP of course, and this can only be done in an asynchronous manner. Also, it contains lots of "nested" asynchronous calls.
This is my code so far:
/**
 * @param {HTMLTableCellElement | undefined} cell
 */
async #getCellHtml(cell) {
  if (!cell) return undefined;
  const srcMap = {}
  for await (const imgElement of cell.querySelectorAll('img')) {
    if ("attachment" !== imgElement.dataset.linkedResourceType) {
      return;
    }
    const imgUrl =
      new URL(imgElement.src, imgElement.dataset.baseUrl);
    await fetch(imgUrl)
      .then(response => response.blob())
      .then(blob => blob.arrayBuffer())
      .then(arrayBuffer => {
        srcMap[imgElement.src] =
          `data:${imgElement.dataset.linkedResourceContentType};base64,`
          + Buffer.from(arrayBuffer).toString('base64');
      });
  }
  const cellHtml = cell.innerHTML;
  Object.entries(srcMap).forEach(([imgSrc, dataUrl]) => {
    cellHtml.replace(imgSrc, dataUrl)
  })
  return cellHtml;
}
For reference, such HTML looks like the following:
<p style="text-align: left;"><br/></p>
<p style="text-align: left;"><span
class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img
class="confluence-embedded-image" draggable="false" width="639"
src="/confluence/download/attachments/2345432345/image-2024-7-11_16-48-22-1.png?version=1&modificationDate=1720709302000&api=v2"
data-image-src="/confluence/download/attachments/235432345/image-2024-7-11_16-48-22-1.png?version=1&modificationDate=1720709302000&api=v2"
data-unresolved-comment-count="0" data-linked-resource-id="345654345"
data-linked-resource-version="1" data-linked-resource-type="attachment"
data-linked-resource-default-alias="image-2024-7-11_16-48-22-1.png"
data-base-url="https://suite.acme.com/confluence"
data-linked-resource-content-type="image/png"
data-linked-resource-container-id="1491043790"
data-linked-resource-container-version="1" alt=""/></span></p>
<p style="text-align: left;"><br/></p>
<p style="text-align: left;"><br/></p>
My intention is to loop through all <img> elements, find the relevant ones, fetch their image data, and collect the replacements. Afterwards, I'd just replace all findings with their respective data URL.
What I think I would want is something like this:
cell.querySelectorAll('img').map(cell => {
  // return a Promise that combines all the fetching etc.
  // so that it resolves() with returning the base64 string(!).
  return new Promise()…
});
After map()ping this array to Promises, I could call Promise.all() and then do the replacement in the HTML.
I have no idea how to "return" that last promise after all the other ones have already fulfilled. Should my code use await rather than .then() invocations so I don't get into callback context?
A few remarks on your current code
for await (const imgElement of cell.querySelectorAll('img')): querySelectorAll is not async, so you don't need for await (...); a plain for (...) loop is fine.
if ("attachment" !== imgElement.dataset.linkedResourceType) { return; } will exit the method on the first element not meeting this condition and leave all other elements unhandled. Moreover, the images already loaded won't be replaced, because you never reach the code after the loop. Use continue instead of return to skip the current element and continue with the next element in the list.
You shouldn't mix async/await with then/catch unless you know exactly what you are doing, because it causes confusion and can lead to unexpected behaviour.
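To illustrate the second remark, here is a minimal sketch that uses plain objects as hypothetical stand-ins for the <img> elements (the names images, collectWithReturn and collectWithContinue are made up for this example). It only shows how return and continue differ inside a loop; no fetching is involved.

```javascript
// Hypothetical stand-ins for <img> elements; only the dataset field matters here.
const images = [
  { name: 'a.png', dataset: { linkedResourceType: 'attachment' } },
  { name: 'b.png', dataset: { linkedResourceType: 'page' } },
  { name: 'c.png', dataset: { linkedResourceType: 'attachment' } },
];

function collectWithReturn(list) {
  const seen = [];
  for (const img of list) {
    // A bare return leaves the whole function, so the final `return seen` is never reached.
    if ('attachment' !== img.dataset.linkedResourceType) return;
    seen.push(img.name);
  }
  return seen;
}

function collectWithContinue(list) {
  const seen = [];
  for (const img of list) {
    // continue only skips this element; the loop goes on with the next one.
    if ('attachment' !== img.dataset.linkedResourceType) continue;
    seen.push(img.name);
  }
  return seen;
}

console.log(collectWithReturn(images));   // undefined
console.log(collectWithContinue(images)); // [ 'a.png', 'c.png' ]
```

With return, the second element ends the method and even the already collected first image is lost to the caller; with continue, only the non-attachment element is skipped.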
That being said, I'd refactor your code to the following.
As your #getCellHtml(cell) is async, I'd completely switch to await and ditch all .then(...) calls.
Replace your for loop iterating over all elements with a Promise.all(). You don't really need the result of that Promise.all, because if it doesn't throw, you know all promises have resolved successfully. And as each callback sets the respective value in the srcMap object, you know that once the Promise.all() has resolved, all images have been loaded.
...
const srcMap = {};
// querySelectorAll returns a NodeList, which has no map(); convert it to an array first
await Promise.all(Array.from(cell.querySelectorAll('img')).map(async c => {
  if ("attachment" !== c.dataset.linkedResourceType) {
    // ignore wrong resource types and do nothing
    return;
  }
  // for the correct resource type, load the image and update the `srcMap` object
  const
    imgUrl = new URL(c.src, c.dataset.baseUrl),
    resp = await fetch(imgUrl),
    blob = await resp.blob(),
    buff = await blob.arrayBuffer();
  srcMap[c.src] = ...
}));
const cellHtml = cell.innerHTML;
...
Of course, this code has no error handling whatsoever, so if, for instance, one image fails to load, the whole process throws. I'll leave adding that error handling to you as an exercise.
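One possible way to approach that exercise is sketched below. It is not the only option and makes a few assumptions: the helper names toDataUrl, buildSrcMap and applySrcMap are hypothetical, the fetch function is passed in as a parameter so the code can be tried without a network, and Promise.allSettled is swapped in for Promise.all so that one failing image does not abort the others. Like the code in the question, it relies on Buffer, so it assumes Node.js or a bundler that provides it.

```javascript
// Sketch only. `img` needs `src` and `dataset`, like the <img> elements above.
async function toDataUrl(img, fetchFn = fetch) {
  const url = new URL(img.src, img.dataset.baseUrl);
  const resp = await fetchFn(url);
  // fetch does not reject on HTTP errors such as 404, so check explicitly.
  if (!resp.ok) throw new Error(`HTTP ${resp.status} for ${url}`);
  const buff = await resp.arrayBuffer();
  const type = img.dataset.linkedResourceContentType;
  return `data:${type};base64,${Buffer.from(buff).toString('base64')}`;
}

async function buildSrcMap(images, fetchFn = fetch) {
  // Array.from also accepts a NodeList, which has no filter() or map() itself.
  const attachments = Array.from(images)
    .filter(img => 'attachment' === img.dataset.linkedResourceType);
  // allSettled never rejects; every image reports its own outcome.
  const results = await Promise.allSettled(
    attachments.map(img => toDataUrl(img, fetchFn)));
  const srcMap = {};
  const failed = [];
  results.forEach((result, i) => {
    const src = attachments[i].src;
    if (result.status === 'fulfilled') srcMap[src] = result.value;
    else failed.push(src);
  });
  return { srcMap, failed };
}

function applySrcMap(html, srcMap) {
  let out = html;
  for (const [src, dataUrl] of Object.entries(srcMap)) {
    // Strings are immutable: replaceAll returns a new string that must be kept.
    out = out.replaceAll(src, dataUrl);
  }
  return out;
}
```

Inside #getCellHtml this could be used roughly as const { srcMap, failed } = await buildSrcMap(cell.querySelectorAll('img')); followed by return applySrcMap(cell.innerHTML, srcMap);, with failed telling you which images kept their original URL. Note that in a real DOM, img.src yields the resolved absolute URL, which may differ from the text that appears in innerHTML, so matching on the raw attribute value or setting src on a cloned element may turn out to be more robust than string replacement.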