What's the minimum code required to make a NativeCall to the md_parse function in the md4c library?


Note: This post is similar to, but not quite the same as, a more open-ended question asked on Reddit: https://www.reddit.com/r/rakulang/comments/vvpikh/looking_for_guidance_on_getting_nativecall/

I'm trying to use the md4c C library to process a markdown file with its md_parse function. I'm having no success: the program just quietly dies. I don't think I'm calling it with the right arguments.

Documentation for the function is here: https://github.com/mity/md4c/wiki/Embedding-Parser%3A-Calling-MD4C

I'd like to at least figure out the minimum amount of code needed to do this without error. This is my latest attempt, though I've tried many:

use v6.d;
use NativeCall;

sub md_parse(str, int32, Pointer is rw ) is native('md4c') returns int32 { * }
md_parse('hello', 5, Pointer.new());


say 'hi'; # this never gets printed

Solution

  • md4c is a SAX-like streaming parser: it calls your functions as it encounters markdown elements. If you call it with an uninitialised Pointer or an uninitialised CStruct, the process will segfault (SIGSEGV) when the md4c library tries to call a null function pointer.

    The README says:

    The main provided function is md_parse(). It takes a text in the Markdown syntax and a pointer to a structure which provides pointers to several callback functions.

    As md_parse() processes the input, it calls the callbacks (when entering or leaving any Markdown block or span; and when outputting any textual content of the document), allowing application to convert it into another format or render it onto the screen.

    The function signature of md_parse is:

    int md_parse(const MD_CHAR* text, MD_SIZE size, const MD_PARSER* parser, void* userdata);

    In order for md_parse() to work, you will need to:

    • define a native CStruct that matches the MD_PARSER type definition
    • create an instance of this CStruct
    • initialise all the function pointers with Raku functions that have the right function signature
    • call md_parse() with the initialised CStruct instance as the third parameter

    The 4th parameter to md_parse() is void* userdata, a pointer that you provide and that is passed back to you as the last parameter of each callback function. My guess is that it's optional: if you pass a null value, each callback will simply receive a null userdata parameter.
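
To make the callback contract concrete, here is a minimal C sketch of the pattern md_parse() relies on: a struct of function pointers plus an opaque userdata pointer that is handed back to each callback. The names mini_parser, mini_parse and count_text are hypothetical stand-ins and not part of md4c. The null check in mini_parse is added purely for illustration; per the description above, md4c itself would crash on an unset callback.

```c
#include <stddef.h>

/* Hypothetical stand-in for a callback table such as MD_PARSER. */
typedef struct mini_parser {
    int (*text)(const char *text, unsigned size, void *userdata);
} mini_parser;

/* Hypothetical driver: reports each space-separated word through the
 * text callback, passing userdata back untouched. Returns -1 if the
 * callback is unset, otherwise the first non-zero callback result, or 0. */
int mini_parse(const char *input, unsigned size,
               const mini_parser *parser, void *userdata)
{
    if (parser == NULL || parser->text == NULL)
        return -1;  /* calling a NULL function pointer would crash */

    unsigned start = 0;
    for (unsigned i = 0; i <= size; i++) {
        if (i == size || input[i] == ' ') {
            if (i > start) {
                /* The word is NOT null-terminated: the size says where it ends. */
                int rc = parser->text(input + start, i - start, userdata);
                if (rc != 0)
                    return rc;  /* a non-zero result aborts the parse */
            }
            start = i + 1;
        }
    }
    return 0;
}

/* Example callback: userdata points at an int that counts the calls. */
int count_text(const char *text, unsigned size, void *userdata)
{
    (void)text;
    (void)size;
    if (userdata != NULL)
        ++*(int *)userdata;
    return 0;
}
```

Calling mini_parse("hello big world", 15, &p, &calls) with p.text set to count_text invokes the callback once per word, and each call receives the same &calls pointer back as userdata.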

    Followup

    This turned into an interesting rabbit hole to fall down.

    The code that makes it possible to pass a Raku sub as a callback parameter to a native function is quite complex and relies on MoarVM ops to build and cache the FFI callback trampoline. The trampoline is a piece of code that marshals the C calling convention parameters into a call that MoarVM can dispatch to a Raku sub.

    It will be a sizeable task to implement equivalent functionality to provide some kind of nativecast that will generate the required callback trampoline and return a Pointer that can be assigned into a CStruct.

    But we can cheat

    We can use a simple C function to return the pointer to a generated callback trampoline as if it was for a normal callback sub. We can then store this pointer in our CStruct and our problem is solved. The generated trampoline is specific to the function signature of the Raku sub we want to call, so we need to generate a different NativeCall binding for each function signature we need.

    The C function:

    void* get_pointer(void* p)
    {
        return p;
    }
    

    Binding a NativeCall sub for the function signature we need:

    sub get_enter_leave_fn(&func (uint32, Pointer, Pointer))
      is native('./getpointer') is symbol('get_pointer') returns Pointer { * }
    

    Initialising a CStruct attribute:

    $!enter_block := get_enter_leave_fn(&enter_block);
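
Seen purely from the C side, the trick amounts to something like the following sketch: a function pointer passes through get_pointer() unchanged and is stored in a struct slot, from which it can be called later. The table struct, on_enter and store_and_call are hypothetical names for illustration only. In the real setup it is NativeCall, not C code, that supplies the pointer to the generated trampoline. The sketch also assumes that function pointers survive a round trip through void*, which holds on POSIX platforms but is not guaranteed by ISO C.

```c
/* The identity function from above. */
void *get_pointer(void *p)
{
    return p;
}

/* Hypothetical one-slot callback table, standing in for MD_PARSER. */
typedef struct table {
    void *enter_block;  /* kept as an opaque pointer, like Pointer in Raku */
} table;

/* Hypothetical callback with the enter/leave shape used above. */
int on_enter(unsigned type, void *detail, void *userdata)
{
    (void)detail;
    (void)userdata;
    return (int)type + 1;
}

/* Stores the callback via get_pointer(), then calls it back through the
 * slot, which is what the library does with each struct member. */
int store_and_call(unsigned type)
{
    table t;
    t.enter_block = get_pointer((void *)on_enter);

    int (*fn)(unsigned, void *, void *) =
        (int (*)(unsigned, void *, void *))t.enter_block;
    return fn(type, 0, 0);
}
```

The value that comes back from get_pointer() is the address that went in, so the struct slot ends up holding a callable function address without any further conversion.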
    

    Putting it all together:

    use NativeCall;
    
    enum BlockType < DOC QUOTE UL OL LI HR H CODE HTML P TABLE THEAD TBODY TR TH TD >;
    enum SpanType < EM STRONG A IMG SPAN_CODE DEL SPAN_LATEXMATH LATEXMATH_DISPLAY WIKILINK SPAN_U >;
    enum TextType < NORMAL NULLCHAR BR SOFTBR ENTITY TEXT_CODE TEXT_HTML TEXT_LATEXMATH >;
    
    sub enter_block(uint32 $type, Pointer $detail, Pointer $userdata --> int32) {
        say "enter block { BlockType($type) }";
    }
    
    sub leave_block(uint32 $type, Pointer $detail, Pointer $userdata --> int32) {
        say "leave block { BlockType($type) }";
    }
    
    sub enter_span(uint32 $type, Pointer $detail, Pointer $userdata --> int32) {
        say "enter span { SpanType($type) }";
    }
    
    sub leave_span(uint32 $type, Pointer $detail, Pointer $userdata --> int32) {
        say "leave span { SpanType($type) }";
    }
    
    sub text(uint32 $type, str $text, uint32 $size, Pointer $userdata --> int32) {
        say "text '{$text.substr(0..^$size)}'";
    }
    
    sub debug_log(str $msg, Pointer $userdata --> int32) {
        note $msg;
    }
    
    #
    # Cast functions that are specific to the required function signature.
    #
    # Makes use of a utility C function that returns its `void*` parameter, compiled
    # into a shared library called libgetpointer.dylib (on macOS)
    #
    # gcc -shared -o libgetpointer.dylib get_pointer.c
    #
    # void* get_pointer(void* p)
    # {
    #     return p;
    # }
    #
    # Each cast function uses NativeCall to build an FFI callback trampoline that gets
    # cached in an MVMThreadContext. The generated callback code is specific to the
    # function signature of the Raku function that will be called.
    #
    
    sub get_enter_leave_fn(&func (uint32, Pointer, Pointer))
      is native('./getpointer') is symbol('get_pointer') returns Pointer { * }
    
    sub get_text_fn(&func (uint32, str, uint32, Pointer))
      is native('./getpointer') is symbol('get_pointer') returns Pointer { * }
    
    sub get_debug_fn(&func (str, Pointer))
      is native('./getpointer') is symbol('get_pointer') returns Pointer { * }
    
    class MD_PARSER is repr('CStruct') {
        has uint32                        $!abi_version; # unsigned int abi_version
        has uint32                        $!flags; # unsigned int flags
        has Pointer                       $!enter_block; # F:int ( )* enter_block
        has Pointer                       $!leave_block; # F:int ( )* leave_block
        has Pointer                       $!enter_span; # F:int ( )* enter_span
        has Pointer                       $!leave_span; # F:int ( )* leave_span
        has Pointer                       $!text; # F:int ( )* text
        has Pointer                       $!debug_log; # F:void ( )* debug_log
        has Pointer                       $!syntax; # F:void ( )* syntax
    
        submethod TWEAK() {
            $!abi_version = 0;
            $!flags = 0;
            $!enter_block := get_enter_leave_fn(&enter_block);
            $!leave_block := get_enter_leave_fn(&leave_block);
            $!enter_span := get_enter_leave_fn(&enter_span);
            $!leave_span := get_enter_leave_fn(&leave_span);
            $!text := get_text_fn(&text);
            $!debug_log := get_debug_fn(&debug_log);
        }
    }
    
    # C `int` maps to int32 in NativeCall; Raku's native `int` is wider.
    sub md_parse(str, uint32, MD_PARSER, Pointer is rw) is native('md4c') returns int32 { * }
    
    my $parser = MD_PARSER.new;
    
    my $md = '
    # Heading
    
    ## Sub Heading
    
    hello *world*
    ';
    
    # The size is a byte count, so measure the UTF-8 encoding rather than .chars.
    md_parse($md, $md.encode.bytes, $parser, Pointer.new);
    

    The output:

    $ ./md4c.raku
    enter block DOC
    enter block H
    text 'Heading'
    leave block H
    enter block H
    text 'Sub Heading'
    leave block H
    enter block P
    text 'hello '
    enter span EM
    text 'world'
    leave span EM
    leave block P
    leave block DOC
    

    In summary, it's possible. I'm not sure if I'm proud of this or horrified by it. I think a long-term solution will require refactoring the callback trampoline generator into a separate nqp op that can be exposed to Raku as a nativewrap-style operation.