Tags: javascript, regex, regex-lookarounds, abcjs

Javascript Regex: Second occurrence of block: ABC.js music notation


ABC is a music notation; I'm working on patterns to parse it as part of an app.

Sometimes multiple renditions of a tune are in an ABC file, and I need to get just the first rendition -- or in an ideal world any rendition I specify. The beginning of a rendition is signified by the X: string.

It's not possible to know in advance how many renditions are in a file.

In JavaScript, how can I return, for example, the first rendition (from the first X: inclusive up to the beginning of the second) in the examples below, so that it still returns the first rendition when there is no second one, and also when there are more than two?

My work so far yields ([\s\S]*)(?=X:) which succeeds in the two rendition example, but fails with a single rendition or more than two.

Adding an 'OR'd end-of-file condition to the lookahead, e.g. ([\s\S]*)(?=X:|$), lets the single rendition case work, but fails on the two and three rendition cases, because the greedy [\s\S]* then runs all the way to the end of the file.

Any help appreciated ... a good way to parse ABC will be used by many.

A two-rendition example can look like the below -- for a three rendition example just add a line with X: at the end, and for a single chop off everything from the second X:

EDITS: Folks have been kind enough to ask for better examples, and they won't fit in a comment, so here's a few

Broken pledge is interesting because it has more than one ABC and they're not numbered sequentially:

X:56
T:Broken Pledge, The
R:reel
D:De Dannan: Selected Reels and Jigs.
Z:Also played in Edor, see #734
Z:id:hn-reel-56
M:C|
K:Ddor
dcAG ADDB|cAGF ECCE|D2 (3EFG Addc|AcGc Aefe|
dcAG FGAB|c2Bd cAGE|D2 (3EFG AddB|cAGE FDD2:|
|:dcAG Acde|~f3d ecAB|cAGE GAcd|ec~c2 eage|
dcAG Acde|fedf ecAG|~F3G AddB|cAGE FDD2:|
P:"Variations:"
|:dcAG ~A3B|cAGF ECCE|DEFG Addc|(3ABc Gc Aefe|
dcAG FGAB|c2Bd cAGE|DEFG AddB|A2GE FDD2:|
|:dcAG Acde|~f3d ecAB|cAGE GAcd|ec~c2 eage|
dcAG Acde|~f3d ecAG|FEFG AddB|A2GE FDD2:|

X:2
T:Broken Pledge, The
M:C
L:1/8
Q:250
K:D
dcAG A2 dB | cAGF EDC2 | DEFG Ad ~d2 | AcGc Adfe |
dcAG A2 dB | cAGF EDC2 | DEFG Ad ~d2 | AcGc ADD2 :|
|: dcAG A2 de | fedf edAB | cAGE GAcd | ec ~c2 eage |
dcAG A2 de | fedf edcA | F3 E FGAB | cAGE {F}ED D2 :||

Huish the Cat is interesting because it has lots of renditions, all numbered alike. You can see the X:whatever is totally arbitrary:

X:1
T:Huish the Cat
M:6/8
L:1/8
N:”Author and date unknown.”
R:Air
Q:"Quick"
S:Byrne, the harper, 1802
B:Bunting – Ancient Music of Ireland (1840, p. 3)
Z:AK/Fiddler’s Companion
K:C
(G>A).G c2(e|d<).d.A c2z|(G>A).G .c2 d|(ec).A .A2G|
(G>A).G .c2(e|d<).d.A .c2e|(g>f).e .f2d|(ec).A A2G:|
|:(gf).e .f2d|(ed).c .f2d|(gf).e .f2d|(ec).A A2G|
(gf).e .f2d|(ed).c .f2.d|(G>A).G f2d|(ec).A [F2A2]G:|]

X:1
T:Hunt the Cat
M:6/8
L:1/8
R:Jig
Q:”Allegro”
B:William Forde – 300 National Melodies of the British Isles (c. 1841, p.  26, No. 87)
B: https://www.itma.ie/digital-library/text/300-national-melodies-of-the-british-isles.-vol.-3-100.-irish-airs
N:William Forde (c.1795–1850) was a musician, music collector and scholar from County Cork
Z:AK/Fiddler’s Companion
K:D
A>BA d2f|e<eB d3|A>BA d2e|fdB B2A|
A>BA d2f|e<eB d2f|a>gf g2e|fdB B2A:|
|:agf g2e|FED G2E|agf g2e|fdB B2A|
agf g2e|fed g2e|A>BA g2e|fdB B2A:|]

X:1
T:Huish the Cat
M:6/8
L:1/8
R:Jig
Q:"Quick"
B:P.M. Haverty – One Hundred Irish Airs vol. 1 (1858, No. 87, p. 37)
Z:AK/Fiddler’s Companion
K:C
(G>A).G .c2(e|d<).d.A c2z|(G>A).G .c2d|(ec).A .A2G|
(G>A).G .c2(e|d<).d.A .c2|(g>f).e .f2d|(cA).A A2G:|
|:(gf).e .f2d|(ed).c .f2d|(gf).e .f2d|(ec).A A2G|
(gf).e .f2d|(ed).c .f2.d|(G>A).G f2d|(ec).A [F2A2] G:|]

X:1
T:Huish the Cat
M:6/8
L:1/8
R:Single Jig
S:O'Neill - Dance Music of Ireland: 1001 Gems (1907), No. 382
Z:AK/Fiddler's Companion
K:C
G>AG c2e|d<dA c2e|G>AG c2d|ecA A2c|
G>AG c2e|d<dA c2e|g>fe f2d|ecA A2G:|
|:gfe f2d|edc f2d|gfe f2d|ecA A2G|
gfe f2d|edc f2d|G>AG f2d|ecA A2G:||

X:1
T:Hunt the Cat
M:6/8
L:1/8
B:Roche, vol. 3 (1927, p. 114)
K:Ddor
DED D2A|AGE c3|DED D2A|AGE E2D|
DED D2A|AGE c3|ABc d2B AGE E2D:|
|:dcA AGE|AGE c3|dcA AGE|AGE E2D|
dcA AGE|AGE c3|ABc d2c|AGE E2D:||

Lowbacked Car is pretty messy, with percent signs and the like, and with no blank lines between renditions:

X:1
%
T:Lowbacked Car [1], The
M:6/8
L:1/8
R:Air
S:James Goodman (1828─1896) music manuscript collection, 
S:vol. 3, p. 133. Mid-19th century, County Cork
Z:AK/Fiddler’s Companion
K:G
G|G2B B2d|c2A z2F|G2B d2d|d3 z2G|
c2c A2A|B2B G2B|c2A G2F|G3 z2G|
G2c c2e|e2d d2G|G2c c2e|d3 z2G|
G2g !fermata!g2e|e2d dcB|A2G A2B|!fermata!d3 z2A|
GED G2G|G3 z2B|AGE A2A|A3z B/c/|
dcB dcB|gfe !fermata!d2 B/A/|GED G2G|(G3 G2)||
X:1
%
T:Low Backed Car (1)
M:6/8
L:1/8
B:Howe - Musicians's Omnibus No. 2 (p. 107)
Z:AK/Fiddler's Companion
R:G
G|G2B B2d|c3 A2d|G2 B2 d2d|(d3 d2)B|
c2c A2A |B3 G2G A2A F2F|(G3 G2)||d|
d2g g2e|e2d d2B|d2g g2e|(e3 d2)d|
d2g g2e|e2d d2B|BAG A2B|d2c B2A|
.G.E.E .G2G|(G3 G2)B|AGE A2A|A3 ABc|
(.d.c.B) (.d.c.B)|(.a.a.d) .e.d.B|.G.E.D|(G3 G2)|]
X:1
%
T:Low Backed Car [1], The
M:6/8
L:1/8
R:Jig
B:Kerr - Merry Melodies, vol. 2, No. 257  (c. 1880's)
Z:AK/Fiddler's Companion
K:G
D|G2B B2d|d2c A2F|G2B d2d|(d3 d2) B|
cBc A2A|BAB GAB|c2A G2F|(G3 G2):||
B|G2g g2e|e2d d2B|G2g g2e|d3 cBA|
G2g g2e|e2d dcB|A2G A2B|d3 cBA|
GED G2G|(G3 G2)B|AGE A2A|A3 (ABc)|
dcB dcB|Gfe dBA|GED G2G|(G3 G2)||

And Jaunting Car for Six is the modal case, a file with a single tune, which we need to handle as the most common case:

X:1
T:Jaunting Car for Six
M:9/8
L:1/8
R:Slip Jig
S:Kerr - Merry Melodies, vol. 3, No. 233 (c. 1880's)
Z:AK/Fiddler's Companion
K:A
efe c2c c3|efe cde fga|efe c2c c3|BcB B2c def:|
|:e2a agf ecA|e2a agf e3|e2a agf ecA|BcB B2c def:|| 

Solution

  • This is a complete rewrite of the answer, sorry. The following function returns the information you are currently interested in. It can be extended to return more, for example the titles of the renditions as an array sharing indices with the renditions array.

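    function getAbcInfo(abc) {
        // Split before every line that starts with X:<digits>. The prepended '\n'
        // lets an X: at the very start of the file be found like any other.
        const renditions = ('\n' + abc).split(/[\r\n]+(?=[ \t\u00a0]*X[ \t\u00a0]*:[ \t\u00a0]*\d+)/);
        // Strip trailing newlines from the last item and leading ones from the first.
        renditions.push(renditions.pop().replace(/[\r\n]+$/, ''));
        renditions.unshift(renditions.shift().replace(/^[\r\n]+/, ''));
        // x[i] is the n of rendition i; index 0 has no X: line.
        const x = [''];
        // Maps each n to the indices of all renditions whose header is X:n.
        const indicesOfX = {'': [0]};
        for (let i = 1; i < renditions.length; i++) {
            const n = renditions[i].match(/^[ \t\u00a0]*X[ \t\u00a0]*:[ \t\u00a0]*(\d+)/)[1];
            x[i] = n;
            if (n in indicesOfX) {
                indicesOfX[n].push(i);
            } else {
                indicesOfX[n] = [i];
            }
        }
        return {renditions: renditions, x: x, indicesOfX: indicesOfX};
    }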
    
    console.log(JSON.stringify(getAbcInfo(brokenPledge)));
    // {"renditions":["","X:56…","X:2…"],"x":["","56","2"],"indicesOfX":{"2":[2],"56":[1],"":[0]}}
    console.log(JSON.stringify(getAbcInfo(huishTheCat)));
    // {"renditions":["","X:1…","X:1…","X:1…","X:1…","X:1…"],"x":["","1","1","1","1","1"],"indicesOfX":{"1":[1,2,3,4,5],"":[0]}}
    console.log(JSON.stringify(getAbcInfo(lowbackedCar)));
    // {"renditions":["","X:1…","X:1…","X:1…"],"x":["","1","1","1"],"indicesOfX":{"1":[1,2,3],"":[0]}}
    console.log(JSON.stringify(getAbcInfo(commonCase)));
    // {"renditions":["","X:1…"],"x":["","1"],"indicesOfX":{"1":[1],"":[0]}}
    console.log(JSON.stringify(getAbcInfo(brokenPledgeWithoutTheFirstLine)));
    // {"renditions":["T:Broken Pledge…","X:2…"],"x":["","2"],"indicesOfX":{"2":[1],"":[0]}}
    

    The renditions array always contains what precedes the first X: (if any) at index 0. This will normally be the empty string, but it might be a header with fields that the standard allows there, or even a full rendition if its X: line has simply been omitted (against the standard, but humans don't always follow standards).

    From index 1 on, the items of renditions are renditions starting with X: (leading whitespace is allowed; see the regex), with trailing newlines stripped.

    The x array shares indices with the renditions array, giving the n of the X:n line of each rendition. Since the “rendition” at index 0 has no X:n line (it's “unnamed”, or rather, “unnumbered”), the x array will always have the empty string at index 0.

    The indicesOfX object allows you to get the array of indices in renditions given the n of X:n. In other words, it inverts the key-value relation of the x array.
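
    To tie this back to the question, here is a minimal sketch of how the returned object could be queried. The helper names firstRendition and renditionsByX are hypothetical (they are not part of the function above); they only assume the object shape described in the preceding paragraphs.

```javascript
// Hypothetical helpers, not part of the original answer. They rely only on
// the shape of the object returned by getAbcInfo:
// { renditions: string[], x: string[], indicesOfX: { [n]: number[] } }

// First rendition: index 1 is the first block that starts with X:.
// If the file has no X: line at all, fall back to whatever sits at index 0.
function firstRendition(info) {
    return info.renditions.length > 1 ? info.renditions[1] : info.renditions[0];
}

// All renditions whose header reads X:n (several may share the same n).
function renditionsByX(info, n) {
    const key = String(n);
    const hasKey = Object.prototype.hasOwnProperty.call(info.indicesOfX, key);
    const indices = hasKey ? info.indicesOfX[key] : [];
    return indices.map((i) => info.renditions[i]);
}
```

    With these, firstRendition(getAbcInfo(abcText)) would cover the single, double and many-rendition cases alike, and renditionsByX(info, 1) would return every rendition numbered X:1 (all five in the Huish the Cat example).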

    If you want to extend the function to add, say, a titles array to the output, note that you can't simply match T:. First, you have to allow for whitespace: the regexes I used allow spaces, tabs and non-breaking spaces (don't use \s* because that also matches \n). Second, the T: must be preceded by a newline, except in the rendition at index 0, where it can also be at the start of the string. The text of the T: field ends at the next newline ([\r\n]).
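
    As an illustration of those rules, a title extractor could look roughly like the sketch below. The name getTitles is hypothetical, and the function is meant to be called on one item of the renditions array at a time; it is one possible way to do it, not a full ABC field parser.

```javascript
// Hypothetical helper, not part of the original answer.
// Collects the text of every T: field in a single rendition.
function getTitles(rendition) {
    // A T: field must sit at the start of the string or right after a newline.
    // Only spaces, tabs and non-breaking spaces may surround the "T" and ":"
    // (no \s here, because \s would also swallow newlines).
    const titleLine = /(?:^|[\r\n])[ \t\u00a0]*T[ \t\u00a0]*:[ \t\u00a0]*([^\r\n]*)/g;
    const titles = [];
    for (const match of rendition.matchAll(titleLine)) {
        // The capture runs up to (not including) the next \r or \n.
        titles.push(match[1].trimEnd());
    }
    return titles;
}
```

    A titles array sharing indices with renditions could then be built with renditions.map(getTitles); a rendition may carry more than one T: line, so each entry is itself an array.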

    BTW, you might want to “normalize” newlines by replacing all \r with nothing, or, if you fear there could be old Mac Classic files around where newlines are just \r, replacing all \r\n with \n, and then all remaining \r with \n. Once you are sure you don't have \r newlines around, you can match the start of a new line AND the start of the string at the same time by using the ^ and the m (multiline) flag.
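
    The normalization described above can be sketched as follows. Both function names are hypothetical; countRenditions is only there to show the ^ anchor with the m flag doing the work that the explicit [\r\n] did before.

```javascript
// Hypothetical helpers, not part of the original answer.

// Turn Windows (\r\n) and old Mac Classic (\r) newlines into plain \n.
// The order matters: \r\n must be handled first so it is not doubled.
function normalizeNewlines(text) {
    return text.replace(/\r\n/g, '\n').replace(/\r/g, '\n');
}

// After normalizing, ^ with the m (multiline) flag matches both the start
// of the string and the start of every line, so no '\n' needs prepending.
function countRenditions(text) {
    const xLine = /^[ \t\u00a0]*X[ \t\u00a0]*:[ \t\u00a0]*\d+/gm;
    const matches = normalizeNewlines(text).match(xLine);
    return matches === null ? 0 : matches.length;
}
```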