RegEx in Javascript: unknown number of groups to save


I have the following TSV file, and I am trying to read it and store each piece of information separately.

Here is an example of two lines from the file:

Extract of the file

13->7   3   270296:[T]1132070:[T]2807979:[T]
12->8   31  73108:[G]119227:[T]210429:[T]237902:[T]490699:[A]588160:[T]730687:[A]863532:[T]953590:[T]1207654:[T]1270425:[C]1315919:[C]1374547:[C]1787551:[C]1872033:[G]1963836:[T]2112830:[A]2183936:[A]2464064:[T]2573449:[T]2594098:[T]2667677:[C]2815676:[T]2926565:[T]3019188:[T]3023991:[A]3097403:[A]3142179:[A]3180137:[C]3254219:[G]3265026:[G]

As you can see, the number of NUM:[LETTER] entries in the last column differs from line to line. I have tried the following code, but it only saves the first group:

Draft of the code:

var x = str.split('\n');
var regex = /([0-9]+)\t([0-9]+)\t(([0-9]+):.([ACGTN]).)+/g;
for (var i=0; i<x.length; i++) {
    line = regex.exec(x[i]);
    console.log(line);
    //Example for the first line
    //line[1] = 7
    //line[2] = 3
    //line[3] = 270296:[T]
    //line[4] = 270296
    //line[5] = T
    //that's it
}

My expected output is that each NUM:[LETTER] entry appears either together in one cell of the array (as in line[3]) or already split, as in line[4] and line[5].

Output draft

Idea 1:

line[3] = 270296:[T]
line[4] = 1132070:[T]
line[5] = 2807979:[T]

Idea 2:

line[3] = 270296
line[4] = T
line[5] = 1132070
line[6] = T
line[7] = 2807979
line[8] = T

Any ideas what I am missing to get this output?


Solution

  • If I were doing this, I would break the regex into two pieces, one for the first two numbers and one for the data, to make it easier to understand later. Something like:

    var line = '8  31  73108:[G]119227:[T]210429:[T]237902:[T]490699:[A]588160:[T]730687:[A]863532:[T]953590:[T]1207654:[T]1270425:[C]1315919:[C]1374547:[C]1787551:[C]1872033:[G]1963836:[T]2112830:[A]2183936:[A]2464064:[T]2573449:[T]2594098:[T]2667677:[C]2815676:[T]2926565:[T]3019188:[T]3023991:[A]3097403:[A]3142179:[A]3180137:[C]3254219:[G]3265026:[G]'
    
    // get the numbers and the rest
    let [num1, num2, data] = line.split(/\s+/g)
    
    // parse the rest to an array
    data = data.match(/([0-9]+:\[[ACGTN]\])/g)
    
    console.log(num1, num2, data)

    From here, if you need further processing, for example turning the data into an array of objects, it should be easy:

    // array of objects like [{'73108': '[G]'}, ...]
    let objArray = data.map(n => {
        let [key, value] = n.split(':')
        return {[key]:value}
    })
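
If the position and the letter are needed separately (the "Idea 2" layout in the question), one possible extension of the approach above is to use String.prototype.matchAll, which returns the capture groups of every match rather than of a single one. This is only a minimal sketch and not part of the original answer: the names parseLine and parseFile and the keys edge, count, position and base are made up for illustration, and it assumes every line has the form FROM->TO, COUNT, then a run of NUM:[LETTER] entries, separated by whitespace.

```javascript
// Hypothetical helper: parse one line of the file into its parts.
// Assumed line format: "FROM->TO <ws> COUNT <ws> NUM:[LETTER]NUM:[LETTER]..."
function parseLine(line) {
  // Split off the first two columns; the third holds all NUM:[LETTER] entries.
  const [edge, count, data = ''] = line.trim().split(/\s+/);

  // matchAll yields one match object per entry, each with its own groups:
  // m[1] is the number, m[2] is the letter.
  const pairs = Array.from(
    data.matchAll(/([0-9]+):\[([ACGTN])\]/g),
    m => ({ position: Number(m[1]), base: m[2] })
  );

  return { edge, count: Number(count), pairs };
}

// Hypothetical helper: parse the whole file content, skipping blank lines.
function parseFile(str) {
  return str
    .split('\n')
    .filter(l => l.trim() !== '')
    .map(parseLine);
}

const sample = '13->7\t3\t270296:[T]1132070:[T]2807979:[T]';
const parsed = parseLine(sample);
console.log(parsed.edge, parsed.count, parsed.pairs);

// Flattened like "Idea 2": number, letter, number, letter, ...
const flat = parsed.pairs.flatMap(p => [String(p.position), p.base]);
console.log(flat);
```

Note that matchAll requires the g flag on the regex and an environment that supports ES2020. Keeping the pairs as objects and flattening only at the end leaves both output layouts from the question available.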