Tags: javascript, typescript, encoding, character-encoding, textdecoder

Is it safe to decode an arbitrary UTF-8 byte chunk to a string?


Is it safe to decode a UTF-8 string that has been split into arbitrary byte chunks to a string, chunk by chunk?

Also, what about an arbitrary encoding?

Context is this method:

async getFileAsync(fileName: string, encoding: string): Promise<string>
{
    const textDecoder = new TextDecoder(encoding);
    const response = await fetch(fileName);
    
    console.log(response.ok);
    console.log(response.status);
    console.log(response.statusText);
    
    // let responseBuffer:ArrayBuffer = await response.arrayBuffer();
    // let text:string = textDecoder.decode(responseBuffer);
    
    // https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream/getReader
    const reader = response.body.getReader();
    let result:ReadableStreamReadResult<Uint8Array>;
    let chunks:Uint8Array[] = [];
    
    // due to done, this is unlike C#:
    // byte[] buffer = new byte[32768];
    // int read;
    // while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    // {
    //     output.Write (buffer, 0, read);
    // }

    do
    {
        result = await reader.read();

        // result.value is undefined on the final read (done === true),
        // so only push real chunks:
        if (result.value)
        {
            chunks.push(result.value);
        }

        // would this be safe ? 
        let partN = textDecoder.decode(result.value);
        // chunks.push(partN);

        console.log("result: ", result.value, partN);
    } while (!result.done);

    let chunkLength:number = chunks.reduce(
        function(a, b)
        {
            return a + (b||[]).length;
        }
        , 0
    );
    
    let mergedArray = new Uint8Array(chunkLength);
    let currentPosition = 0;
    for(let i = 0; i < chunks.length; ++i)
    {
        mergedArray.set(chunks[i],currentPosition);
        currentPosition += (chunks[i]||[]).length;
    } // Next i 

    let file:string = textDecoder.decode(mergedArray);
    
    // let file:string = chunks.join('');
    return file;
} // End Function getFileAsync

Now what I'm wondering is, if it's safe considering an arbitrary encoding, to do this:

result = await reader.read();
// would this be safe ? 
chunks.push(textDecoder.decode(result.value));

And by "safe" I mean will it result in the overall string being correctly decoded?

My guess is that it's not, but I guess I just want somebody to confirm that.

I figured that since I have to wait until the end to merge the chunks anyway, I might as well call

let responseBuffer:ArrayBuffer = await response.arrayBuffer();
let text:string = textDecoder.decode(responseBuffer);

instead.


Solution

  • It depends on what you mean by "safe".

    You know the size of the original string, so you have an upper bound on the size of the decoded string. That alone mitigates some modern DoS (amplification) attacks.

    The algorithms themselves are straightforward, but there are many security implications in how the decoded data is used. UTF-8 can hide unnecessarily long (overlong) sequences: a good decoder should discard them, but some accept them, for example an overlong encoding of U+0000 (overlong encodings keep C strings happy while still allowing every Unicode character, including U+0000, to be represented). You should test this: you do not want the decoded string to contain a 0x00 byte, because then some functions will compute one length and others a different one, opening the door to buffer overflows.

    UCS defines a generalization of UTF-8 that encodes more bits (up to 31) at the cost of more bytes. Some UTF-8 decoders allow that, some do not. In general it should be treated as an error, because many string-manipulation functions do not cope well with code points above the current Unicode limit.

    Normalization has many implications too, e.g. removing unnecessary code points: Unicode (and therefore libraries) may have problems with characters built from too many combining code points (more than 16 or 32; I do not remember the exact minimum requirement).

    Obviously the ordering of code points and composition/decomposition have their own security problems as well, but those seem outside the scope of your question, as does the fact that some characters look like (or are rendered exactly like) others [impersonation/homoglyph attacks].

    Good decoder should detect invalid bytes (0xC0) in UTF8, overlong sequences of UTF-8 (using more byte to get a code point), and codepoints outside Unicode (so more then 4 bytes, allowed by UCS). But some decoder are much more permissive, so programs should be able to handle that. And there are also invalid sequences, but these are not decodeable, so decoder often do the right thing (but some insert an error symbols, some just discard the invalid byte, and try to recover.