I'm working on a project where I need to process HTML tables generated by the tinytable package. The HTML includes JavaScript that dynamically applies CSS styling to the table cells. My goal is to extract the plain HTML as it stands after all the styling has been applied.
This is an example table that I would like to process:
library(tinytable)
tt(mtcars[1:4, 1:4]) |>
style_tt(j = 1:2, background = "teal", color = "white") |>
save_tt("example.html", overwrite = TRUE)
This code saves an example.html file whose colors are applied by JavaScript. I would like to convert that to plain HTML with the styles already applied.
I am very open to suggestions on alternatives. The one path I tried was to save the HTML to a temporary file, use servr to serve the file, then use chromote to browse the file headlessly and extract the result. However, I keep running into timeout issues.
Again, I'm happy to try a different strategy if you can propose something more effective or direct.
Here's what I tried so far:
library(servr)
library(chromote)
library(tinytable)
serve_and_strip <- function(filename) {
  fn <- file.path(tempdir(), "index.html")
  file.copy(filename, fn, overwrite = TRUE)
  srv <- servr::httd(tempdir())
  url <- file.path(srv$url, "index.html")
  b <- ChromoteSession$new()
  b$Page$navigate(url)
  tab <- b$Runtime$evaluate("document.querySelector('table').outerHTML")$result$value
  sty <- b$Runtime$evaluate("document.querySelector('style').outerHTML")$result$value
  out <- list(tab, sty)
  b$close()
  servr::daemon_stop(srv$daemon)
  return(out)
}
serve_and_strip("example.html")
Edit: If I just scrape the HTML file, the first cell shows up as <td>21.0</td>. However, if you load the page in Firefox or Chrome and right-click to "Inspect" the cell, you'll see that it has become: <td class="tinytable_css_n9oxlmixvkthzx38wcrd">21.0</td>. This is because the JavaScript functions were run by the browser and have added class information to the cell. What I want to retrieve is the HTML and CSS code from the page after the JS functions have been applied. This is why I suggested going through a headless browser.
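One browser-free alternative I can imagine, since the script only adds classes to cells on load: parse the styleCell_*(i, j, 'class') calls out of the file and replay them on the parsed HTML. The following is a minimal sketch under that assumption, using the xml2 package. The helper name apply_style_calls is made up for illustration, it ignores the span-related functions, and it drops every script element from the output.

```r
library(xml2)

# Hypothetical helper: replay styleCell_*(i, j, 'class') calls on the parsed HTML.
# `html` is the whole file as a single string, e.g.
#   html <- paste(readLines("example.html", warn = FALSE), collapse = "\n")
apply_style_calls <- function(html) {
  doc <- read_html(html)
  # Match calls with literal arguments; the function definition
  # styleCell_*(i, j, css_id) has no digits and therefore does not match.
  pattern <- "styleCell_\\w+\\((\\d+),\\s*(\\d+),\\s*'([^']+)'\\)"
  calls <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
  # Like table.rows in the DOM, this counts header and body rows together.
  rows <- xml_find_all(doc, "//table//tr")
  for (call in calls) {
    m <- regmatches(call, regexec(pattern, call, perl = TRUE))[[1]]
    i <- as.integer(m[2]) + 1L  # JavaScript indices are 0-based
    j <- as.integer(m[3]) + 1L
    if (i > length(rows)) next
    cells <- xml_find_all(rows[[i]], "./th | ./td")
    if (j > length(cells)) next
    old <- xml_attr(cells[[j]], "class")
    new <- if (is.na(old)) m[4] else paste(old, m[4])
    xml_set_attr(cells[[j]], "class", new)
  }
  # Assumption: the scripts are no longer needed once the classes are in place.
  xml_remove(xml_find_all(doc, "//script"))
  as.character(doc)
}
```

The style block in the head is left untouched, so the returned string should carry both the class attributes and the CSS rules they refer to. This only covers tables that rely on the styleCell_* calls; merged cells or inserted rows would still need a real browser.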
The pre-transformation HTML code looks like this:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>tinytable_xg1s2bqenh3yuyr9x2mg</title>
<link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet">
<style>
.table td.tinytable_css_u43qs0b5ucz8ik5jm341, .table th.tinytable_css_u43qs0b5ucz8ik5jm341 { border-bottom: solid 0.1em #d3d8dc; }
.table td.tinytable_css_4rjvz3zmw0n1t4i0jlbe, .table th.tinytable_css_4rjvz3zmw0n1t4i0jlbe { color: white; background-color: teal; }
</style>
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<script>
MathJax = {
tex: {
inlineMath: [['$', '$'], ['\\(', '\\)']]
},
svg: {
fontCache: 'global'
}
};
</script>
</head>
<body>
<div class="container">
<table class="table table-borderless" id="tinytable_xg1s2bqenh3yuyr9x2mg" style="width: auto; margin-left: auto; margin-right: auto;" data-quarto-disable-processing='true'>
<thead>
<tr>
<th scope="col">mpg</th>
<th scope="col">cyl</th>
<th scope="col">disp</th>
<th scope="col">hp</th>
</tr>
</thead>
<tbody>
<tr>
<td>21.0</td>
<td>6</td>
<td>160</td>
<td>110</td>
</tr>
<tr>
<td>21.0</td>
<td>6</td>
<td>160</td>
<td>110</td>
</tr>
<tr>
<td>22.8</td>
<td>4</td>
<td>108</td>
<td> 93</td>
</tr>
<tr>
<td>21.4</td>
<td>6</td>
<td>258</td>
<td>110</td>
</tr>
</tbody>
</table>
</div>
<script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/js/bootstrap.bundle.min.js"></script>
<script>
function styleCell_tinytable_pii6zht3qrjzjy0n9jwp(i, j, css_id) {
var table = document.getElementById("tinytable_xg1s2bqenh3yuyr9x2mg");
table.rows[i].cells[j].classList.add(css_id);
}
function insertSpanRow(i, colspan, content) {
var table = document.getElementById('tinytable_xg1s2bqenh3yuyr9x2mg');
var newRow = table.insertRow(i);
var newCell = newRow.insertCell(0);
newCell.setAttribute("colspan", colspan);
// newCell.innerText = content;
// this may be unsafe, but innerText does not interpret <br>
newCell.innerHTML = content;
}
function spanCell_tinytable_pii6zht3qrjzjy0n9jwp(i, j, rowspan, colspan) {
var table = document.getElementById("tinytable_xg1s2bqenh3yuyr9x2mg");
const targetRow = table.rows[i];
const targetCell = targetRow.cells[j];
for (let r = 0; r < rowspan; r++) {
// Only start deleting cells to the right for the first row (r == 0)
if (r === 0) {
// Delete cells to the right of the target cell in the first row
for (let c = colspan - 1; c > 0; c--) {
if (table.rows[i + r].cells[j + c]) {
table.rows[i + r].deleteCell(j + c);
}
}
}
// For rows below the first, delete starting from the target column
if (r > 0) {
for (let c = colspan - 1; c >= 0; c--) {
if (table.rows[i + r] && table.rows[i + r].cells[j]) {
table.rows[i + r].deleteCell(j);
}
}
}
}
// Set rowspan and colspan of the target cell
targetCell.rowSpan = rowspan;
targetCell.colSpan = colspan;
}
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(0, 0, 'tinytable_css_u43qs0b5ucz8ik5jm341') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(0, 1, 'tinytable_css_u43qs0b5ucz8ik5jm341') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(0, 2, 'tinytable_css_u43qs0b5ucz8ik5jm341') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(0, 3, 'tinytable_css_u43qs0b5ucz8ik5jm341') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(0, 0, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(0, 1, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(1, 0, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(1, 1, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(2, 0, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(2, 1, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(3, 0, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(3, 1, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(4, 0, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(4, 1, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
</script>
</body>
</html>
This solution seems much simpler and appears to work:
library(chromote)
url <- "file:/home/username/example.html"
b <- ChromoteSession$new()
b$Page$navigate(url)
b$Page$loadEventFired(wait = FALSE)
body <- b$Runtime$evaluate("document.querySelector('body').outerHTML")$result$value
b$close()
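One caveat, as far as I understand chromote: wait = FALSE partially matches the wait_ argument, so loadEventFired() returns a promise immediately instead of blocking, and the evaluate call could in principle run before the load handlers have added the classes. A more defensive sketch, following the pattern in the chromote README of registering the event before navigating, might look like this. The function name render_body is made up, and the file:// URL construction assumes a Unix-like absolute path.

```r
# Hypothetical helper: load a local HTML file in headless Chrome and
# return the body markup once the page's load event has fired.
render_body <- function(path) {
  # Assumption: Unix-like path, so this yields file:///home/...
  url <- paste0("file://", normalizePath(path, mustWork = TRUE))
  b <- chromote::ChromoteSession$new()
  on.exit(b$close(), add = TRUE)
  # Register the load-event promise first so the event cannot be missed,
  # then navigate without blocking, then wait for the promise to resolve.
  p <- b$Page$loadEventFired(wait_ = FALSE)
  b$Page$navigate(url, wait_ = FALSE)
  b$wait_for(p)
  b$Runtime$evaluate("document.body.outerHTML")$result$value
}
```

Calling render_body("example.html") should then return the body with the tinytable_css_* classes already on the cells, provided Chrome is installed and chromote can find it.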