I am trying to use JavaScript to search through all of the p elements for text that matches a regular expression, but the text that I am looking for may or may not partially exist in an attribute element or be contained within a span. Ultimately, I plan to fix the cross references in the HTML code that were applied in Word to a Word bullet item by adding an attribute element with a reference to an html id that I have previously inserted with JavaScript.
My overall project is to create a Word document and then use Word's Save As function to create a filtered HTML file. I then use JavaScript to insert ids and tags so that I can use a CSS file to standardize the formatting of all my HTML files. Because of this, I have limited control over the initial HTML code.
Thus far I have been able to create a loop through all of the p elements. Within the loop, I am able to test the innerText against the regular expression "/Step (\d+)/" in a conditional statement, since I expect that the text will look something like Step 1, Step 12, or any other number. The code below seems to successfully enter the if statement. I am running into trouble with the replace function for the innerHTML portion because the innerText matches the expression, but the innerHTML contains elements between "Step" and the number that prevent the result that I am looking for. I would like to be able to generically account for any other element such as bold, italics, a, etc. To account for this, I have tried to use multiple if statements to replace various potential HTML conditions.
I am trying to learn this skill by simply applying bold to the text, to make sure that I understand how to complete this particular function. So far all of the searches that I have done have helped get the regular expression to match the innerText, but I can't find a method for ignoring the extraneous html code. I was thinking that it might be possible to store the replaced innerText with the new HTML code and then make that the new innerHTML, but there could be other formatting in the p element that I want to maintain.
With the approach that I am taking, which uses a second regular expression for the innerHTML replace, it seems like the greedy search would catch false results even if the regular expression did match.
HTML
<p id="FirstPara" class=firstpara>This is a header</p>
<p class=firstpara>This is a reference to Step <span lang=HE>‎ </span><b>1</b>.</p>
<p class=firstpara>This is a reference to Step <span lang=HE>‎</span>2.</p>
<p class=firstpara>This is a reference to Step <span lang=HE>‎</span>1 and Step <span lang=HE>‎</span>2.</p>
JavaScript function
function findTheText() {
  regExp1 = /Step (\d)/g;
  for (var i = 0; i < document.getElementsByTagName('p').length; i++) {
    alert(i+" - "+j+" - "+document.getElementsByTagName('p')[i].innerHTML+" - "+results[j]);
    var results = document.getElementsByTagName('p')[i].innerText.match(regExp1);
    if (results !== null) {
      for (var j = 0; j < results.length; j++) {
        var replace = results[j].replace(/Step\s/,"");
        var regExp2 = new RegExp('Step\s'+replace,"i");
        var regExp3 = new RegExp('Step\s.*>'+replace,"i");
        var regExp4 = new RegExp('Step\s.*>.*>'+replace,"i");
        var results2 = document.getElementsByTagName('p')[i].innerText.match(regExp2);
        var results3 = document.getElementsByTagName('p')[i].innerText.match(regExp3);
        var results4 = document.getElementsByTagName('p')[i].innerText.match(regExp4);
        if (results2 !== null) {
          document.getElementsByTagName('p')[i].innerHTML.replace(regExp2, "<b>"+results[j]+"</b>");
        } else if (results3 !== null) {
          document.getElementsByTagName('p')[i].innerHTML.replace(regExp3, "<b>"+results[j]+"</b>");
        } else if (results4 !== null) {
          document.getElementsByTagName('p')[i].innerHTML.replace(regExp4, "<b>"+results[j]+"</b>");
        }
      }
    }
  }
}
As of now the code finds the text that I want, but because the regular expression matches the innerText and not the innerHTML, I am not achieving the bold (or eventually the attributes) on the text.
Expected HTML output
<p class=firstpara>This is a reference to <b>Step 1</b>.</p>
<p class=firstpara>This is a reference to <b>Step 2</b>.</p>
<p class=firstpara>This is a reference to <b>Step 1</b> and <b>Step 2</b>.</p>
You might remove all child spans and then check the textContent to ignore the rest of the markup (like <b>s), capturing the step digit and replacing it with that digit surrounded by <b> and </b>:
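document.querySelectorAll('p').forEach((p) => {
  // Drop every child span, including any text inside it
  p.querySelectorAll('span').forEach(span => span.remove());
  // textContent carries no tags, so "Step" and its number are now adjacent
  p.innerHTML = p.textContent.replace(/Step +(\d+)/g, '<b>Step $1</b>');
});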
<p id="FirstPara" class=firstpara>This is a header</p>
<p class=firstpara>This is a reference to Step <span lang=HE>‎ </span><b>1</b>.</p>
<p class=firstpara>This is a reference to Step <span lang=HE>‎</span>2.</p>
<p class=firstpara>This is a reference to Step <span lang=HE>‎</span>1 and Step <span lang=HE>‎</span>2.</p>
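To see what the replace call itself does, here is a minimal sketch that applies the same pattern to plain strings, with no DOM involved. The name boldSteps is made up for illustration. Note that \d+ (rather than the single \d in the question) lets multi-digit steps such as Step 12 match, and that $1 in the replacement string stands for whatever the capture group matched.

```javascript
// Hypothetical helper: the same replace() call as above, on a plain string.
// textContent contains no tags, so this is the text the pattern actually sees.
function boldSteps(text) {
  // "Step", one or more spaces, then one or more digits captured as $1
  return text.replace(/Step +(\d+)/g, '<b>Step $1</b>');
}

console.log(boldSteps('This is a reference to Step 1 and Step 2.'));
// → This is a reference to <b>Step 1</b> and <b>Step 2</b>.

console.log(boldSteps('See Step 12.'));
// → See <b>Step 12</b>.
```

Because the result is built from textContent, any other inline markup in the paragraph is not carried over when it is assigned back to innerHTML.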
To only remove spans with a lang of HE:
document.querySelectorAll('p').forEach((p) => {
  // Remove only the spans Word marked with lang="HE"; text in other spans stays
  p.querySelectorAll('span[lang="HE"]').forEach(span => span.remove());
  p.innerHTML = p.textContent.replace(/Step +(\d+)/g, '<b>Step $1</b>');
});
<p class=firstpara>This is a <span>reference</span> to Step <span lang=HE>‎ </span><b>1</b>.</p>
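Both snippets above rebuild the paragraph from textContent, so any other inline markup (such as the plain span around "reference") is flattened. If that formatting needs to survive, one possible alternative is to run a more tolerant pattern over the innerHTML string instead. The following is only a rough sketch under stated assumptions: boldStepRefs, GAP and STEP_REF are made-up names, the tags between "Step" and the number are assumed to be ordinary paired inline tags, and attribute values are assumed not to contain ">". Regular expressions over HTML are brittle, so treat this as a starting point rather than a general solution.

```javascript
// Sketch only: a tolerant pattern applied to an innerHTML string.
// GAP is whatever may sit between "Step" and its number: whitespace,
// the left-to-right mark (U+200E) seen in the sample markup, &nbsp;,
// or a complete tag.
const GAP = '(?:\\s|\\u200E|&nbsp;|<[^>]*>)+';
const STEP_REF = new RegExp('\\bStep(' + GAP + ')(\\d+)((?:</[^>]+>)*)', 'g');

// Hypothetical helper, not part of the snippets above.
function boldStepRefs(html) {
  return html.replace(STEP_REF, (match, gap, digits, closers) => {
    // Tags opened inside the gap but not closed there (e.g. the <b> around "1")
    const opened = (gap.match(/<[^\/][^>]*>/g) || []).length;
    const closed = (gap.match(/<\/[^>]+>/g) || []).length;
    const unclosed = Math.max(opened - closed, 0);
    // Drop that many closing tags after the number; keep any others,
    // since they belong to elements opened before "Step"
    const closeTags = closers.match(/<\/[^>]+>/g) || [];
    return '<b>Step ' + digits + '</b>' + closeTags.slice(unclosed).join('');
  });
}

console.log(boldStepRefs(
  'This is a <span>reference</span> to Step <span lang="HE">\u200E </span><b>1</b>.'
));
// → This is a <span>reference</span> to <b>Step 1</b>.

// In a browser, apply it to every paragraph (skipped where no DOM exists)
if (typeof document !== 'undefined') {
  document.querySelectorAll('p').forEach((p) => {
    p.innerHTML = boldStepRefs(p.innerHTML);
  });
}
```

The replacement is done with a function rather than a '$1' string so that the closing tag of an element opened inside the gap can be dropped along with its opening tag, which keeps the resulting markup balanced.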