gulp, node-streams

How to parse the entire gulp stream before making changes?


I'm in the process of trying to make a static site using Gulp. I hit an interesting problem: translating a concept I wrote in the previous version into a Gulp implementation.

One of those concepts is that I have files that dynamically include other files:

---
title: Table of Contents
include:
  key: book
  value: book-1
---

Introduction.

Then, other files have that key:

---
title: Chapter 1
book: book-1
---
It was a dark and stormy night...

... and:

---
title: Chapter 2
book: book-1
---

The desired end result is:

---
title: Table of Contents
include:
  key: book
  value: book-1
  files:
    - path: chapters/chapter-01.markdown
      title: Chapter 1
      book: book-1
    - path: chapters/chapter-02.markdown
      title: Chapter 2
      book: book-1
---

Basically, I want to scan through the files and insert the matching files' data elements as a sequence into every page that declares an inclusion. I don't know all of the categories or tags ahead of time (I'm merging 30-40 Git repositories together), so I don't want to create one task per category.

What I'm hoping for is something like:

return gulp.src("src/**/*.markdown")
  .pipe(magicHappens())
  .pipe(gulp.dest("build"));

The problem seems to be how streams work. I can't simply chain two methods together, because each file is passed from one pipe to the next, one at a time. To insert the include.files element, I have to parse all of the input files (they aren't even in subdirectories) to figure out which ones are included before I can finish.

It seems like I have to "split the stream": parse the first pass to gather the data, chain the second to the end of the first, and then use the second to pass the results out of the method. I'm just not entirely sure how to do that and would like some pointers or suggestions. My google-fu didn't come up with good suggestions, or even a hint of how I should reorganize this.
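
The closest thing I can picture is a transform that buffers every file and only emits them from its flush callback, along the lines of the sketch below (bufferAll is just a placeholder name), but I'm not sure whether that's the idiomatic approach:

var through = require("through2");

function bufferAll()
{
    var files = [];

    return through.obj(
        function(file, encoding, callback)
        {
            // Hold every file in memory instead of passing it along.
            files.push(file);
            return callback();
        },
        function(callback)
        {
            // Every input file has been seen at this point, so the
            // metadata could be cross-referenced here before emitting.
            var stream = this;
            files.forEach(function(file)
            {
                stream.push(file);
            });
            return callback();
        });
}

Thank you.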


Solution

  • After a lot of fumbling, I came up with this:

    var through = require("through2");
    var pumpify = require("pumpify");
    
    module.exports = function(params)
    {
        // Set up the scanner as an inner pipe that goes through the files and
        // loads the metadata into memory.
        var scanPipe = through.obj(
            function(file, encoding, callback)
            {
                console.log("SCAN: ", file.path);
                return callback(null, file);
            });
    
        // We have a second pipe that does the actual manipulation to the files
        // before emitting.
        var updatePipe = through.obj(
            {
                // We need a highWaterMark larger than the total files being processed
                // to ensure everything is read into memory first before writing it out.
                // There is no way to disable the buffer entirely, so we just give it
                // the highest integer value.
                highWaterMark: 2147483647
            },
            function(file, encoding, callback)
            {
                console.log("UPDATE: ", file.path);
                return callback(null, file);
            });
    
        // We have to cork() updatePipe. What this does is prevent updatePipe
        // from getting any data until it is uncork()ed, which we won't do, or
        // the scanPipe gets to the end.
        updatePipe.cork();
    
        // We have to combine all of these pipes into a single one because
        // gulp needs a single pipe, but we have to treat these all as a unit.
        return pumpify.obj(scanPipe, updatePipe);
    };
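
    For reference, this is how the module slots into the pipeline from the question (the require path here is made up):

    var gulp = require("gulp");
    var magicHappens = require("./magic-happens");

    gulp.task("build", function()
    {
        return gulp.src("src/**/*.markdown")
            .pipe(magicHappens())
            .pipe(gulp.dest("build"));
    });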
    

    I think the comments are pretty clear, but in short: I had to combine two pipes into a single one (using pumpify), then cork() the second stream so it processes nothing until the first one is done; when scanPipe ends, the pipe machinery calls end() on updatePipe, which automatically uncorks it. Since I had a large number of files, I had to use a much higher highWaterMark so the corked stream could buffer everything without backpressure stalling the first one.
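
    The transform bodies above are just console.log placeholders. Here is a hedged sketch of what they might actually contain, assuming the gray-matter package for front-matter parsing, buffered (non-streaming) vinyl files, and a shared metadataByPath object; all of these names are illustrative, not part of the plugin as posted:

    var matter = require("gray-matter");

    // Shared between the two pipes; filled by scanPipe, read by updatePipe.
    var metadataByPath = {};

    // scanPipe's transform: record each file's front matter by relative path.
    function scan(file, encoding, callback)
    {
        metadataByPath[file.relative] = matter(file.contents.toString()).data;
        return callback(null, file);
    }

    // updatePipe's transform: expand include.files for files that declare one.
    function update(file, encoding, callback)
    {
        var parsed = matter(file.contents.toString());
        var include = parsed.data.include;

        if (include)
        {
            // Every scanned file whose metadata has the requested key/value
            // pair, e.g. book: book-1, becomes an entry in include.files.
            include.files = Object.keys(metadataByPath)
                .filter(function(path)
                {
                    return metadataByPath[path][include.key] === include.value;
                })
                .map(function(path)
                {
                    var entry = { path: path, title: metadataByPath[path].title };
                    entry[include.key] = metadataByPath[path][include.key];
                    return entry;
                });

            // Re-serialize the front matter with the new files list.
            file.contents = new Buffer(matter.stringify(parsed.content, parsed.data));
        }

        return callback(null, file);
    }

    Because updatePipe stays corked until scanPipe has seen every file, metadataByPath is fully populated before the first update() call runs.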