How to define a ctags parser for a custom file format


I regularly use a file format that doesn't have a parser for Ctags. I would like to write a parser for it, but I'm not sure how. The file format doesn't have keywords like a computer language does. Instead, where you are in the file depends on the content of the last 10 columns of each line. (Sorry, the ENDF format was created in the 1960s.)

How can I create a new parser that depends on the contents of a particular column?

Here is an abbreviated example of the file, but it still contains enough information to get the gist of what I'm trying to do:

                                                                  MMMMFFTTT
                               33        856        176          17434 1451
                               34          2        155          17434 1451
                               34         51        115          17434 1451
 0.000000+0 0.000000+0          0          0          0          07434 1  0
 0.000000+0 0.000000+0          0          0          0          07434 0  0
 7.418300+4 1.813790+2          0          0          1          07434 2151
 7.418300+4 1.000000+0          0          0          2          07434 2151
 1.000000-5 5.000000+3          1          7          0          17434 2151
 0.000000+0 0.000000+0          0          3          5          07434 2151
 0.000000+0 0.000000+0          2          0         24          47434 2151
 7.418300+4 1.813790+2          0          0          0          07434 3 28
-7.222000+6-7.222000+6          0          0          1         397434 3 28
         39          2                                            7434 3 28
 7.261820+6 0.000000+0 9.300000+6 0.000000+0 9.600000+6 2.18585-137434 3 28
 1.000000+7 5.01372-13 1.050000+7 1.32071-11 1.100000+7 8.70475-107434 3 28
 0.000000+0 0.000000+0          0          0          0          07434 3  0
 7.418300+4 1.813790+2          0          0          0          07434 3 37
-2.093600+7-2.093600+7          0          0          1         207434 3 37
 2.105140+7 0.000000+0 2.200000+7 7.150990-5 2.400000+7 2.707920-27434 3 37
 1.300000+8 5.411910-2 1.500000+8 3.895580-2                      7434 3 37
 0.000000+0 0.000000+0          0          0          0          07434 3  0
 7.418300+4 1.813790+2          0          0          0          07434 3 41
-1.328500+7-1.328500+7          0          0          1         267434 3 41
         26          2                                            7434 3 41
 1.335820+7 0.000000+0 1.550000+7 0.000000+0 1.600000+7 2.56183-147434 3 41
 1.700000+7 9.60380-12 1.800000+7 3.02742-10 1.900000+7 1.474340-77434 3 41
 1.300000+8 1.582280-2 1.500000+8 1.154350-2                      7434 3 41

I've labeled the columns MMMM, FF, and TT. Whenever one of these values changes, I need a "tag" (using the term loosely) to tell me that it has changed. Note that this is (kind of) nested: there are many TTs in each FF, and many FFs inside each MMMM.

I'm not sure what the tag output should look like. I've never even looked at the tag output; I've always relied on someone else to parse them for me. Please assist this novice as I try to learn.

I wrote a syntax parser for Vim several years ago and was hoping this might be a good addition.


Solution

  • My answer assumes you use Universal Ctags (https://ctags.io).

    I assume you know the basic concepts of ctags tag entries: kinds and fields. If not, see https://docs.ctags.io/en/latest/man/ctags.1.html#tag-entries.

    I also assume you know the output format of ctags. If not, see https://docs.ctags.io/en/latest/man/tags.5.html.

    There are various ways to implement a parser in ctags. In this case, you may want to write the parser in C, in a line-oriented way.

                        33        856        176          17434 1451
                        34          2        155          17434 1451
    ...
    

    You probably want the 7434 on the first line to be tagged as mmmm, but you do not want the 7434 on the second line to be tagged again. The parser must therefore be able to track the state of the input: it should not emit a tag for a name it has already tagged. This means you cannot define the parser for this language in your .ctags with regular expressions. You will probably have to write it in C.
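    The state tracking described above can be sketched independently of the ctags C API. The following Python sketch is only an illustration, not the linked parser: the column offsets are assumed from the MMMMFFTTT header in the question (and may differ from the offsets the linked example uses), the names find_tags and field and the scope strings are hypothetical, and treating a value of 0 as "section ended, emit no tag" is an assumption based on the sample lines.

```python
# Assumed 0-based slices for the MMMM, FF and TTT columns of each line.
MAT = slice(66, 70)
MF = slice(70, 72)
MT = slice(72, 75)


def field(line, cols):
    """Return the integer found in the given columns, or None if absent."""
    try:
        return int(line[cols].strip())
    except ValueError:
        return None


def find_tags(lines):
    """Return (kind, name, line_number, scope) for each newly seen value."""
    tags = []
    seen_mat = seen_mf = seen_mt = None  # state carried from line to line
    for number, line in enumerate(lines, start=1):
        mat, mf, mt = (field(line, cols) for cols in (MAT, MF, MT))
        if None in (mat, mf, mt):
            continue  # header, blank, or too-short line
        if mat != seen_mat:
            # A new material also starts a new file and subdivision.
            seen_mat, seen_mf, seen_mt = mat, None, None
            if mat > 0:
                tags.append(("m", str(mat), number, ""))
        if mf != seen_mf:
            seen_mf, seen_mt = mf, None
            if mat > 0 and mf > 0:
                tags.append(("f", str(mf), number, f"mat:{mat}"))
        if mt != seen_mt:
            seen_mt = mt
            if mat > 0 and mf > 0 and mt > 0:
                tags.append(("t", str(mt), number, f"mf:{mat}.{mf}"))
    return tags
```

    Calling find_tags(open("input.endf")) would yield one entry per change of value rather than one per line, which is the behaviour a regular-expression parser cannot provide.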

    The input is line-oriented, so you can use the readLineFromInputFile function. It is the heart of a line-oriented parser.

    https://github.com/masatake/ctags/commit/e8e0015393ae7a3b447ee886bd0884f45d11ced2 is a runnable example illustrating how to use readLineFromInputFile.

    With the example, ctags emits the following tags output:

    $ ctags --options=NONE --list-kinds=ENDF
    m  materials
    f  material files
    t  material subdivisions
    
    $ ctags --options=NONE  --sort=no -o - input.endf 
    434     input.endf  /^                               33        856        176          17434 1451$/;"   m
    14  input.endf  /^                               33        856        176          17434 1451$/;"   f   mat:434 
    51  input.endf  /^                               33        856        176          17434 1451$/;"   t   mf:434 14
    ...
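    Each entry in that output follows the tab-separated layout described in tags(5): tag name, file name, a search pattern, then the kind and any extra fields. As a rough illustration of how such a line is assembled, here is a minimal Python sketch. The function name format_tag and its parameters are hypothetical, and the escaping shown covers only backslashes and slashes, which is an assumption about what a simple pattern needs.

```python
def format_tag(name, path, source_line, kind, scope=""):
    """Build one tags-file entry from a tag name and the line it came from."""
    # "/" delimits the search pattern, so it and "\" are backslash-escaped.
    pattern = source_line.rstrip("\n")
    pattern = pattern.replace("\\", "\\\\").replace("/", "\\/")
    fields = [name, path, f'/^{pattern}$/;"', kind]
    if scope:
        fields.append(scope)  # optional extra field such as "mat:..."
    return "\t".join(fields)
```

    For example, format_tag("7434", "input.endf", line, "m") builds a material entry for the line held in the variable line, and passing scope="mat:7434" appends the extra field used for nested kinds.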