
Replace pandoc Markdown citeproc references in the HTML DOM with citations and add a bibliography with citeproc-js


Hi all,

I create compliance reports with Node.js and Pug. Historically, we have used Markdown templates with Pandoc citeproc blocks that were processed by pandoc-citeproc, but I am now trying to reduce dependencies and use citeproc-js (or citation.js) instead.

Our document has paragraphs such as:

<p>Cookie consent management is therefore the means through which web service users can <strong>express, reject and withdraw their consent</strong> to accessing or storing information on or related to their device in line with @regulation20181725, art. 37; @eprivacydirective, art. 5(3)</p>
<p>Lorem ipsum [See in this respect @edpb:guidelines012020, paras. 14 and 15] Lorem ipsum [See in this respect @edpb:cookiebannertaskforce2023, paras 2, 23, 24; @edpb:guidelines052020, para. 7]</p>

I have written the following JavaScript based on citeproc-js to:

  1. iterate over all p elements and match all brackets […]
  2. iterate over the bib items in the brackets (one or more)
  3. call citeproc-js and replace the bracket by a number
  4. call citeproc-js to generate a bibliography
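
Steps 1 and 2 can be tried on their own, without a DOM or citeproc-js. The following is only a minimal sketch: parseCitationGroups is a hypothetical helper name, the regular expressions mirror the ones used in this post, and the sample sentence is shortened from the paragraphs above. Prefix and suffix are kept untrimmed, on the assumption that their whitespace matters when they are later glued to the rendered cite.

```javascript
// Hypothetical helper: finds bracketed pandoc citation groups in a string
// and splits each group into cite items ({ id, prefix, suffix }).
const groupPattern = /\[([^\]\n]*@[^\]\n]+)\]/g; // content between [ and ] that contains an @
const itemPattern = /^(?<prefix>[^@]*)@(?<id>[^,]+)(?:,(?<suffix>.*))?$/;

function parseCitationGroups(text) {
  const groups = [];
  for (const match of text.matchAll(groupPattern)) {
    const items = match[1]
      .split(';')
      .map((raw) => raw.trim().match(itemPattern))
      .filter(Boolean) // skip fragments that contain no @id
      .map((m) => ({
        id: m.groups.id,
        prefix: m.groups.prefix || undefined,
        suffix: m.groups.suffix || undefined,
      }));
    groups.push(items);
  }
  return groups;
}

const sample =
  'Lorem ipsum [See in this respect @edpb:cookiebannertaskforce2023, paras 2, 23, 24; @edpb:guidelines052020, para. 7]';
const parsed = parseCitationGroups(sample);
console.log(JSON.stringify(parsed, null, 2));
```

Each inner array corresponds to one bracket and could serve as the citationItems of one citation object.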

Unfortunately, I do not understand citationsPre and citationsPost (described in the citeproc-js docs) well enough to see how to generate the numbers and the bibliography.

My code:

let endnoteIndex = 0;
// const regexpGroup = new RegExp(/\[((;?(?:[^;@\]]*)@[^;@\]]+)+)\]/,"g"); // first capture group is content in brackets
const regexpGroup = new RegExp(/\[([^\]\n]*@[^\]\n]+)\]/,"g"); // first capture group is content in brackets
const regexpRef = new RegExp(/(?<prefix>[^@]*)@(?<id>[^,]+)(,(?<suffix>.*))?/); // capture prefix, id, suffix

const style = document.getElementById('citeproc-csl-style').textContent;
const locale = document.getElementById('citeproc-locale').textContent;
const bibliographyArray = JSON.parse(document.getElementById('citeproc-csl-bib').textContent);
let bibliography = {};
for (const item of bibliographyArray) {
  bibliography[item.id] = item;
}

const citeprocSys = {
  retrieveLocale: function (lang) {
    return locale;
  },
  retrieveItem: function (id) {
    let item = bibliography[id];
    if(!item) throw new Error(`Bibliography item ${id} not found.`);
    return item;
  }
}

let citeproc = new CSL.Engine(citeprocSys, style);

let citationsPre = [];
let citationsPost = [];

// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#specifying_a_function_as_the_replacement
function replacer(match, refsString) {
  console.log(refsString);
  let citationItems = refsString.replaceAll('&nbsp;',' ').split(';').map( refString => {
    let refData = refString.trim().match(regexpRef).groups;
    // https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html#cite-items
    let ref = {
      id: refData.id,
      // CSL Locator list https://docs.citationstyles.org/en/stable/specification.html#locators
      /*
      locator: 12,
      label: "page",
      */
      prefix: refData.prefix,
      suffix: refData.suffix
    }
    return ref;
  });
  
  let citations = {
    citationItems: citationItems,
    properties: {
      // In the properties portion of a citation, the noteIndex value indicates the footnote number in which the citation is located within the document.
      // Citations within the main text of the document have a noteIndex of zero.
      noteIndex: 0
    }
  }
  
  let result;
  try {
    result = citeproc.processCitationCluster(citations, citationsPre, citationsPost);
    citationsPre.push(...[]);
    console.log(result);
  } catch (error) {
    console.warn(error);
    return match;
  }
  
  endnoteIndex += 1;
  // return `<a href="#endnote-${endnoteIndex}">[${endnoteIndex}]</a>`;
  return `<a href="#endnote-${endnoteIndex}">${result[1][0][1]}</a>`;
}

function findReferences(paragraph) {
  paragraph.innerHTML = paragraph.innerHTML.replaceAll(regexpGroup, replacer);
}

// find references in all paragraphs
[].forEach.call(document.getElementsByTagName('p'), findReferences);

let bibResult = citeproc.makeBibliography();
console.log("bibresult", bibResult);
document.getElementById('bibliography').innerHTML = bibResult[0].bibstart+bibResult[1].join('\n')+bibResult[0].bibend;

The resulting bibliography always contains only the last item. What is missing here? Another issue is that URLs are rendered as plain text instead of clickable links.

[Screenshot: bibliography with only one element]


Solution

  • I managed to get it working. For the record, here is my solution.

    A few comments first:

    • I made this work with the OSCOLA CSL style. It produces citations meant to be footnotes and does not assume a bibliography in the document. As HTML has no real footnotes, I opted for endnotes. Those endnotes refer to each other (e.g. "{short title} (n 12)" means the full reference is in endnote 12).
    • The array endnoteArray stores all rendered notes, and I merge in the updates returned by citeproc.processCitationCluster.
    • citeproc.processCitationCluster adds a citationID property to the citation object passed to it.
    • In the citationsPre array, the note numbers must start at 1 and count upwards, because that is the endnote number OSCOLA uses, and OSCOLA assumes numbering starts at 1.
    • I replace each pandoc-formatted citation not with the citation text but with my own counter endnoteIndex, which acts as the footnote/endnote number. Pandoc+biblatex would do this step automatically when configured for footnote citations (\autocite{ref} with autocite=footnote).
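
To illustrate the citationsPre point: in the question, citationsPre.push(...[]) spreads an empty array and therefore adds nothing, so the engine is never told about earlier citations, which is likely why only the last item survived. The sketch below shows only the bookkeeping, with no citeproc-js involved. createCitationTracker and the IDs 'cit-a' to 'cit-c' are made up for illustration; in the real code the ID comes from citation.citationID after processCitationCluster has run.

```javascript
// Hypothetical helper: maintains the [citationID, noteNumber] pairs that are
// passed to processCitationCluster as its citationsPre argument.
function createCitationTracker() {
  const pairs = [];
  return {
    // Copy of the pairs for every citation that precedes the next one.
    preceding() {
      return pairs.map((pair) => [...pair]);
    },
    // Record a processed citation under the next 1-based note number.
    record(citationID) {
      const noteNumber = pairs.length + 1;
      pairs.push([citationID, noteNumber]);
      return noteNumber;
    },
  };
}

const tracker = createCitationTracker();
const seenLengths = [];
for (const id of ['cit-a', 'cit-b', 'cit-c']) { // made-up IDs
  const citationsPre = tracker.preceding();
  seenLengths.push(citationsPre.length);
  // Real code would call the engine here, roughly:
  // result = citeproc.processCitationCluster(citation, citationsPre, []);
  tracker.record(id);
}

console.log(JSON.stringify(seenLengths));         // [0,1,2]
console.log(JSON.stringify(tracker.preceding())); // [["cit-a",1],["cit-b",2],["cit-c",3]]
```

The list grows by one pair per citation, so each call sees all earlier citations in document order.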

    Thanks to LarsW, who pointed me in the right direction.

    let endnoteIndex = 0;
    let endnoteArray = [];
    // const regexpGroup = new RegExp(/\[((;?(?:[^;@\]]*)@[^;@\]]+)+)\]/,"g"); // first capture group is content in brackets
    const regexpGroup = new RegExp(/\[([^\]\n]*@[^\]\n]+)\]/,"g"); // first capture group is content in brackets
    const regexpRef = new RegExp(/(?<prefix>[^@]*)@(?<id>[^,]+)(,(?<suffix>.*))?/); // capture prefix, id, suffix
    
    const style = document.getElementById('citeproc-csl-style').textContent;
    const locale = document.getElementById('citeproc-locale').textContent;
    const locale_fallback = document.getElementById('citeproc-locale-fallback').textContent;
    const bibliographyArray = JSON.parse(document.getElementById('citeproc-csl-bib').textContent);
    let bibliography = {};
    for (const item of bibliographyArray) {
      bibliography[item.id] = item;
    }
    
    const citeprocSys = {
      retrieveLocale: function (lang) {
        if (lang === 'en-US') return locale_fallback;
        return locale;
      },
      retrieveItem: function (id) {
        let item = bibliography[id];
        if(!item) throw new Error(`Bibliography item ${id} not found.`);
        return item;
      }
    }
    
    let citeproc = new CSL.Engine(citeprocSys, style);
    
    let citationsPre = [];
    let citationsPost = [];
    
    // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#specifying_a_function_as_the_replacement
    function replacer(match, refsString) {
      console.log(refsString);
      let citationItems = refsString.replaceAll('&nbsp;',' ').split(';').map( refString => {
        let refData = refString.trim().match(regexpRef).groups;
        // https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html#cite-items
        let ref = {
          id: refData.id,
          // CSL Locator list https://docs.citationstyles.org/en/stable/specification.html#locators
          /*
          locator: 12,
          label: "page",
          */
          prefix: refData.prefix || undefined,
          suffix: refData.suffix
        }
        return ref;
      });
    
      let citation = {
        citationItems: citationItems,
        properties: {
          // In the properties portion of a citation, the noteIndex value indicates the footnote number in which the citation is located within the document.
          // Citations within the main text of the document have a noteIndex of zero.
          noteIndex: 0,
        }
      }
    
      let result;
      try {
        result = citeproc.processCitationCluster(citation, citationsPre, citationsPost);
        // console.log("citationID", citation.citationID);
        citationsPre.push([
          // endnote number starting from 1
          citation.citationID, endnoteIndex+1,
        ]);
        // console.log(JSON.stringify(result[1], null, 2));
      } catch (error) {
        console.warn(error);
        return match;
      }
    
      result[1].forEach(e => {
        endnoteArray[e[0]] = e[1];
      });
      endnoteIndex += 1;
      // citations here will be numbered and linked as endnotes
      return `<sup><a href="#endnote-${endnoteIndex}" id="ref-${endnoteIndex}">(${endnoteIndex})</a></sup>`;
      // normally (check whether instead of [1][0] another item must be chosen)
      // return `<a href="#endnote-${endnoteIndex}">${result[1][0][1]}</a>`;
    }
    
    function findReferences(paragraph) {
      paragraph.innerHTML = paragraph.innerHTML.replaceAll(regexpGroup, replacer);
    }
    
    // find references in all paragraphs
    [].forEach.call(document.getElementsByTagName('p'), findReferences);
    
    let bibResult = citeproc.makeBibliography();
    console.log("bibresult", bibResult);
    // citations are added as endnotes
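    const endnotesHtml = endnoteArray.map((e, i) => {
      const entry = e
        // turn URLs wrapped in escaped angle brackets (&#60;…&#62;) into links
        .replaceAll(/&#60;(.*?)&#62;/mg, (match, url) => `<a href="${url}" target="_blank">${url}</a>`)
        // link cross-references such as "(n 12)" to the endnote they point to
        .replaceAll(/\((n (\d+))\)/mg, (match, note, ref) => `(<a href="#endnote-${ref}">${note}</a>)`);
      return `<div id="endnote-${i+1}" class="csl-entry">${entry} <a href="#ref-${i+1}">↩︎</a></div>`;
    }).join('\n');
    document.getElementById('bibliography').innerHTML = bibResult[0].bibstart + endnotesHtml + bibResult[0].bibend;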
    // normally, bibliography is printed
    // document.getElementById('bibliography').innerHTML = bibResult[0].bibstart+bibResult[1].join('\n')+bibResult[0].bibend;
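
The two replacements applied to each endnote are easier to follow in isolation. The sketch below assumes, as in my output, that URLs arrive wrapped in escaped angle brackets (&#60; and &#62;) and that OSCOLA back-references look like "(n 3)". linkifyEndnote is a hypothetical name and the sample string is invented.

```javascript
// Hypothetical helper: post-processes one rendered endnote string.
function linkifyEndnote(html) {
  return html
    // &#60;URL&#62; becomes a clickable link that opens in a new tab
    .replace(/&#60;(.*?)&#62;/g, (_, url) => `<a href="${url}" target="_blank">${url}</a>`)
    // "(n 3)" becomes a link to the endnote with that number
    .replace(/\((n (\d+))\)/g, (_, note, ref) => `(<a href="#endnote-${ref}">${note}</a>)`);
}

const sample = 'Guidelines 05/2020 &#60;https://example.org/doc&#62; (n 3)';
const linked = linkifyEndnote(sample);
console.log(linked);
```

The URL is inserted into the href unchanged, so this is only suitable when the bibliography data is trusted.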
    

    [Screenshot: endnotes rendered with the OSCOLA CSL style]