
Adapting a function that isn't chainable to return a value


I am trying to get all the pages of a pdf in one object using the pdfreader package. The function originally returns each page (as its own object) when it processes it. My goal is to write a wrapper that returns all pages as an array of page objects. Can someone explain why this didn't work?

I tried:

adding .then and a return condition - because I expected the parseFileItems method to return a value:

let pages = [];

new pdfreader.PdfReader()
  .parseFileItems(pp, function(err, item) {
    {
      if (!item) {
        return pages;
      } else if (item.page) {
        pages.push(lines);
        rows = {};
      } else if (item && item.text) {
        // accumulate text items into rows object, per line
        (rows[item.y] = rows[item.y] || []).push(item.text);
      }
    }
  })
  .then(() => {
    console.log("done" + pages.length);
  });

and got the error

TypeError: Cannot read property 'then' of undefined


The function I'm modifying (from the package documentation):

var pdfreader = require("pdfreader");

var rows = {}; // indexed by y-position

function printRows() {
  Object.keys(rows) // => array of y-positions (type: float)
    .sort((y1, y2) => parseFloat(y1) - parseFloat(y2)) // sort float positions
    .forEach(y => console.log((rows[y] || []).join("")));
}

new pdfreader.PdfReader().parseFileItems("CV_ErhanYasar.pdf", function(
  err,
  item
) {
  if (!item || item.page) {
    // end of file, or page
    printRows();
    console.log("PAGE:", item.page);
    rows = {}; // clear rows for next page
  } else if (item.text) {
    // accumulate text items into rows object, per line
    (rows[item.y] = rows[item.y] || []).push(item.text);
  }
});

Solution

  • There seem to be several issues/misconceptions at once here. Let's look at them one at a time.

    Firstly, you seem to have thought that the outer function will return ("pass on") your callback's return value.

    This is not the case, as you can see in the library source.

    Also, it wouldn't even make sense, because the callback is called once for each item. So, with 10 items, it will be invoked 10 times, and then how would parseFileItems know which of the 10 return values of your callback to pass to the outside?

    It doesn't matter what you return from the callback function, because parseFileItems simply ignores it. Furthermore, parseFileItems itself doesn't return anything either. So new pdfreader.PdfReader().parseFileItems(...) always evaluates to undefined (and undefined has no property then, hence the TypeError).
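
    To make this concrete, here is a minimal sketch with a stand-in function. fakeParseItems is a hypothetical name and is not pdfreader's implementation; it only mimics the shape of a callback-style API (and, for simplicity, calls the callback synchronously). It shows that the callback's return value goes nowhere and that the call itself evaluates to undefined:

    ```javascript
    // Hypothetical stand-in for a callback-style parser such as parseFileItems.
    // It calls `callback(err, item)` once per item and once more without an item
    // at the end. This is an assumption for illustration, not the library's code.
    function fakeParseItems(items, callback) {
      for (const item of items) {
        callback(null, item); // whatever the callback returns is discarded here
      }
      callback(null, undefined); // signal "end of file"
      // no return statement, so the call evaluates to undefined
    }

    const seen = [];
    const result = fakeParseItems([{ page: 1 }, { text: "hello" }], (err, item) => {
      seen.push(item);
      return "ignored"; // nobody receives this value
    });

    console.log(result);      // undefined
    console.log(seen.length); // 3 (two items plus the end-of-file call)
    ```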

    Secondly, you seem to have thought that .then is some sort of universal chaining method for function calls.

    In fact, .then is a way to chain promises, or to react to the fulfillment of a promise. In this case, there are no promises anywhere; in particular, parseFileItems doesn't return a promise (it returns undefined, as described above), so you cannot call .then on its result.
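
    As a small illustration of the difference, the sketch below uses two hypothetical functions (returnsNothing and returnsPromise are made-up names): calling .then on the result of the first throws a TypeError like the one in the question, while the second works because it really returns a promise.

    ```javascript
    // Hypothetical functions, for illustration only.
    function returnsNothing() {} // behaves like a callback-style API: returns undefined
    function returnsPromise() { return Promise.resolve(42); } // a promise-based API

    let threw = false;
    try {
      returnsNothing().then(() => {}); // same mistake as in the question
    } catch (e) {
      threw = e instanceof TypeError;
    }
    console.log(threw); // true

    returnsPromise().then(value => console.log(value)); // 42
    ```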

    According to the docs, you are supposed to react to errors and to the end of the stream yourself. So, your code would work like this:

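    let pages = [];
    let rows = {}; // text items of the current page, indexed by y-position
    
    new pdfreader.PdfReader()
      .parseFileItems(pp, function(err, item) {
        {
          if (!item) {
            // ****** Here we are done! ******
            console.log("done" + pages.length) // The code that was in the `then` goes here instead
          } else if (item.page) {
            rows = {};        // start a fresh rows object for the new page...
            pages.push(rows); // ...and store it right away; the text items below fill it in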
          } else if (item && item.text) {
            // accumulate text items into rows object, per line
            (rows[item.y] = rows[item.y] || []).push(item.text);
          }
        }
      })
    

    However, I agree that it'd be nicer to have a promise wrapper so that you won't have to stuff all the following code inside the callback's if (!item) branch. You could achieve that like this, using new Promise:

    const promisifiedParseFileItems = (pp, itemHandler) => new Promise((resolve, reject) => {
      new pdfreader.PdfReader().parseFileItems(pp, (err, item) => {
        if (err) {
          reject(err)
        } else if (!item) {
          resolve()
        } else {
          itemHandler(item)
        }
      })
    })
    
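    let pages = []
    let rows = {} // text items of the current page, indexed by y-position
    
    promisifiedParseFileItems(pp, item => {
      if (item.page) {
        rows = {}        // start a fresh rows object for the new page...
        pages.push(rows) // ...and store it right away; the text items below fill it in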
      } else if (item && item.text) {
        // accumulate text items into rows object, per line
        (rows[item.y] = rows[item.y] || []).push(item.text)
      }
    }).then(() => {
      console.log("done", pages.length)
    }, e => {
      console.error("error", e)
    })
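
    Since the goal in the question was a wrapper that returns all pages, the promise could also resolve with the pages array itself. The following is only a sketch under some assumptions: collectPages and fakeParser are hypothetical names, and the parser is passed in as a function taking an (err, item) callback so that the example runs without a PDF file. With the real library you would presumably pass something like cb => new pdfreader.PdfReader().parseFileItems(pp, cb).

    ```javascript
    // Hypothetical helper: `parseItems` is any function that accepts a
    // Node-style (err, item) callback and reports items through it.
    function collectPages(parseItems) {
      return new Promise((resolve, reject) => {
        const pages = [];
        let rows = null; // rows of the page currently being read
        parseItems((err, item) => {
          if (err) {
            reject(err);
          } else if (!item) {
            resolve(pages); // end of file: hand the collected pages to the caller
          } else if (item.page) {
            rows = {};        // a page marker starts a new page
            pages.push(rows);
          } else if (item.text && rows) {
            // accumulate text items into the current page, per line
            (rows[item.y] = rows[item.y] || []).push(item.text);
          }
        });
      });
    }

    // Stand-in parser for demonstration only; it replays a fixed item list
    // and then signals the end by calling back without an item.
    function fakeParser(callback) {
      const items = [
        { page: 1 }, { text: "Hello", y: 1 }, { text: " world", y: 1 },
        { page: 2 }, { text: "Bye", y: 3 },
      ];
      items.forEach(item => callback(null, item));
      callback(null, undefined);
    }

    collectPages(fakeParser).then(pages => {
      console.log(pages.length);         // 2
      console.log(pages[0][1].join("")); // Hello world
    });
    ```

    The caller then receives the result as a value (via .then or await) instead of reading a shared variable that is filled in from inside the callback.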
    

    Note: You would get even nicer code with async generators but that is too much to explain here now, because the conversion from a callback to an async generator is less trivial than you may think.
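
    For the curious, one possible shape of that conversion is sketched below. It is a rough illustration, not a recommended implementation: itemsOf and fakeParser are hypothetical names, the parser is again passed in as a function taking an (err, item) callback, and the sketch simply buffers items in a queue (no backpressure, and parsing keeps running even if the consumer stops early).

    ```javascript
    // Hypothetical adapter: turns a callback-style item source into an async generator.
    async function* itemsOf(parseItems) {
      const queue = []; // buffered { err, item } entries not yet consumed
      let wake = null;  // resolver of the consumer currently waiting for data
      parseItems((err, item) => {
        queue.push({ err, item });
        if (wake) {
          wake();
          wake = null;
        }
      });
      while (true) {
        if (queue.length === 0) {
          // nothing buffered yet: sleep until the callback pushes something
          await new Promise(resolve => { wake = resolve; });
        }
        const { err, item } = queue.shift();
        if (err) throw err;
        if (!item) return; // end of file
        yield item;
      }
    }

    // Stand-in parser for demonstration only; it emits one item per timer tick
    // and finally calls back without an item.
    function fakeParser(callback) {
      const items = [{ page: 1 }, { text: "a" }, { text: "b" }];
      let i = 0;
      const step = () => {
        if (i < items.length) {
          callback(null, items[i++]);
          setTimeout(step, 0);
        } else {
          callback(null, undefined); // end of file
        }
      };
      setTimeout(step, 0);
    }

    async function main() {
      const texts = [];
      for await (const item of itemsOf(fakeParser)) {
        if (item.text) texts.push(item.text);
      }
      return texts;
    }

    main().then(texts => console.log(texts.join(","))); // a,b
    ```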