javascript fileapi arraybuffer typed-arrays

Find byte sequence in typed array without writing a Boyer-Moore implementation

I have to find some markers (text sequences) in files loaded by the user. Like 85% of those files will be UTF-8 encoded text, but there will be binary files too. The markers are right now text sequences but in the future regular expressions may be used (probably not for the binary files, but I don't know yet).

I have the file contents as ArrayBuffer arrays, and I have to find the markers within that data. I have the following options:

Use a typed array (UInt8Array) and TextDecoder('ascii'), and use String.indexOf() on the decoded data. I don't like this because at least in principle this can duplicate the memory used by the file contents after TextDecoder.decode(). But it's easy and pretty straightforward. Works with the binary files, too, because the markers will be ASCII bytes within the binary data.
Use a typed array (UInt8Array, again) and write my own version of Boyer-Moore or other fast string searching function, to find the sequence of bytes I need. Fast, memory optimal... but I'd rather not write yet another Boyer-Moore implementation or copy one. And remember, in the future I may want to use regular expressions so...
Read the files as text rather than ArrayBuffer, since like 85% of them will be UTF-8 encoded text, and try to do something ad-hoc with the few real binary files that I'll encounter. That means I can use regular expressions or String.indexOf() from the start, no big deal, but on the other hand handling the binary files after finding the markers can be a problem because the raw data will be converted to text.

Since in the future the use of regular expressions is almost guaranteed, my only clear path is the first one, but I'm quite worried about memory usage, since some of the files can be large (about 100 MB or so), or use the third option and see how to get the raw bytes after they were converted to text...

Any suggestions? Am I missing something obvious?

Thanks in advance!

Solution

You could decode your ArrayBuffer to text as a stream.
There is a TextDecoderStream,, but it is still not supported in FF and it would require to copy your ArrayBuffers in Blobs so they can be streamed.

So instead in your case where you have ArrayBuffers, you can use the stream option of the TextDecoder.decode() method and read your ArrayBuffer by small chunks:

const buffer_1 = new TextEncoder().encode( "AAABCBAB" ).buffer;
// expected indices: [2,6]                    ^   ^
const indices_1 = getAllIndices( buffer_1, "AB", 3 /* 3 bytes at a time */ );
console.log( "buffer 1",
  JSON.stringify( indices_1 ),
  // check they're all "AB"
  JSON.stringify( extractContent( buffer_1, indices_1, 2 ) )
);

// A more complex test, with random binary data (read as UTF-8)
const buffer_2 = Uint32Array.from(
  { length: 1024 * 1024 },
  () => Math.random() * 0xFFFFFFFF
).buffer;
const indices_2 = getAllIndices( buffer_2, "AB" );
console.log( "buffer 2",
  JSON.stringify( indices_2 ),
  // check they're all "AB"
  JSON.stringify( extractContent( buffer_2, indices_2, 2 ) )
);

function getAllIndices( buffer, marker, chunksize = 1024 /* Bytes */ ) {

  if( !marker ) { return null; }
  if( !(marker instanceof RegExp) ) {
    marker = new RegExp( marker, "g" );
  }
  marker.global = true;
  
  // The marker could get split over two chunks.
  // So, at every chunk we prepend the last few characters
  // of the last chunk.
  const marker_length = marker.source.length;
  const positions = [];
  const arr = new Uint8Array( buffer );
  const decoder = new TextDecoder();

  let current_index = 0;
  let full_length = 0;
  let marker_buffer = "";

  while( current_index < arr.length ) {
    const next_index = current_index + chunksize;
    // subarray doesn't copy
    const chunk = arr.subarray( current_index, next_index );
    const decoded = decoder.decode( chunk, { stream: true } );

    const text = marker_buffer + decoded;

    let match;
    let last_index = -1;
    while ((match = marker.exec( text )) !== null) {
      last_index = match.index - marker_buffer.length;
      positions.push( full_length + last_index );
    }

    current_index = next_index;
    full_length += decoded.length;

    // Check that the buffer doesn't itself include the marker
    // this would cause duplicate finds (we could also use a Set to avoid that)
    const marker_index = last_index > -1 ?
      (last_index + marker_length) :
      (decoded.length - marker_length);
    marker_buffer = decoded.slice( marker_index );
  }
  return positions;

}
// for demo only
function extractContent( buffer, indexes, length ) {
  const full_str = new TextDecoder().decode( buffer );
  return indexes.map( (index) => full_str.slice( index, index + length ) );
}

However while this method will work fine with simple markers, complex RegExp may cause some problem. For instance, if your RegExp uses both the line-start and line-end, the chunk could split a value that would normally have matched.