javascriptexcelregex

Parse Excel-originated, tab-delimited string in JavaScript


When you copy columns from Excel and paste into a textarea element, you get tab-delimited string. And values of cells of the same row are now separated by tab characters.

I need to get values of all cells in JavaScript. The theoretical approach would be simple:

  1. Split the string into an array of rows, by specifying \n as the separator.
  2. Split each row into an array of columns, by specifying \t as the separator.

But in practice, there is a certain difficulty because a cell, too, can contain one or more tabs. Therefore, directly specifying \t as the separator in the step 2 can yield wrong result.

A solution I came up with is, firstly replacing every tab inside cells with a custom string, such as __TAB__; secondly doing the step 1 and 2 safely; lastly replacing all __TAB__ back to tabs.

Because cell values which contain tabs or multiple lines are automatically wrapped inside two straight double quote (") when being pasted into a textarea element, regular expression is needed to find those strings.

A regular expression I found is this one. However, it cannot handle cell values which contain tab characters. I tried fixing it by modifying it as follows:

var rows = pastedText.replace(/(^|\t)"((?:(?:""|[^"])*(?:\r\n|\n\r|\n|\r|\t))+(?:""|[^"])*)"(\t|$)/mg, function (match, p1, p2, p3) {
    // This function runs for each cell with multi lined text.
    return p1 + p2
        // Replace any double double-quotes with a single
        // double-quote
        .replace(/""/g, '"')
        // Replacing any tab with a custom string
        .replace(/\t/g, '__TAB__') + p3;
})
// Split each line into rows
.split(/\r\n|\n\r|\n|\r/g);

And it works fine, until such pasted text is encountered. The pasted text represents a 20-row, 3-colum data with some rows being empty.

Browsers become irresponsive indefinitely when handling such pasted text (tested in the newest Firefox 120.0.1 and Chrome 120.0.6099.71). The main cause is that the cell of row 6 and column 2 contains a " at its beginning.

The goal is getting the value of each cell, and I don't really need to stick to one type of solution if there is a better one. Any advice is appreciated. Each cell can contain one or more lines, one or more tabs, or even one or more straight double quotes.


Solution

  • The following regex will match quoted cells (possibly multiline, possibly with tabs in it) or simple cells (possibly with double quotes in it):

    /(?<=^|(\t))(?=.)(?:"((?:""|[^"])*)"|[^\t\r\n]*)(?=\t|$)/gm
    

    It has two capture groups to help identify in which case we are.

    Here is a simple demo where you can paste in a textarea, and where the result is displayed in JSON format (be aware that in JSON rendering the double quotes are escaped with backslash):

    function convertTsv(tsv) {
        const regex = /(?<=^|(\t))(?=.)(?:"((?:""|[^"])*)"|[^\t\r\n]*)(?=\t|$)/gm;
        const data = [];
        for (const [all, sep, quoted] of tsv.matchAll(regex)) {
            if (!sep) data.push([]); // New row
            data.at(-1).push(quoted === undefined ? all : quoted.replaceAll('""', '"'));
        }
        return data;
    }
    
    function padColumns(data) {
        const length = Math.max(...data.map(row => row.length));
        return data.map(row => Array.from({length}, (_, i) => row[i] ?? ""));
    }
    
    function refresh() {
        const tsv = document.querySelector("textarea").value;
        const data = padColumns(convertTsv(tsv));
        document.querySelector("pre").textContent = JSON.stringify(data, null, 2);
    }
    
    document.querySelector("textarea").addEventListener("input", refresh);
    refresh();
    textarea { width: 100%; height: 5em }
    <textarea>This is a test    " quote at start
    "Multiline
    text"   "Multiline 
    with tab     in it"
        
    above row is empty  
    Next cell is empty  
        previous cell is empty
    </textarea>
    
    <pre>
    </pre>