I've recently discovered the power of the VTD-XML approach to XML parsing, mainly its speed. Just to be specific, I have built the C version 2.10 ( there are Java, C++ and C# implementations too ).
My objective is simple: I want to extract data from XML using VTD-XML for parsing, and use Perl to work with the data. The easy way may be to dump the data with a C program I made and send it through a pipe to the Perl program. Maybe not elegant, but it works.
Another, less easy way, consists of a Perl program that calls the C data collector subroutine using Inline::C.
So I started studying Inline::C and managed to do the basic things I need to pass data back to Perl from C subroutines using Perl C API functions. Problems arise in the compile phase when I write the C collector subroutine in the C source under Inline::C control.
There are symbol conflicts like this: bind() is defined both in socket.h ( Perl ) and in autoPilot.h ( VTD-XML ). Symbol conflicts can be avoided by building VTD-XML as a shared library with an explicit export map ( gcc -Wl,--version-script=foo.map )... Is this the right way to go? Are there better ways?
I did reach my goal by adding a layer of indirection: awful as it seems to me, it works.
First of all, I made a shared library containing the VTD-XML API. While building this shared object, I had to avoid polluting the global scope by exporting only the symbols needed.
Then I built another shared library. This second shared library hides the VTD-XML API and is meant to be used from Perl via Inline::C. In this shared object I wrote a handful of functions that use the partially exposed API of libvtd.so.
The idea looks like this:
Perl -> Inline::C dynamic loader -> wrapper_API.so -> libvtd.so
Major issues came from runtime loading of shared libraries and from symbol collision/resolution.
Here is how I build libvtd.so, making it easy for the so-called wrapper_API.so to use it.
Unfortunately, VTD-XML doesn't build a libvtd.so shared object, so I had to build it myself by linking together several .o object files with gcc:
gcc -shared -fPIC -Wl,-soname,libvtd.so.2.10 -Wl,--version-script=vtd-xml.map \
-o libvtd.so.2.10 libvtd.o arrayList.o fastIntBuffer.o fastLongBuffer.o \
contextBuffer.o vtdNav.o vtdGen.o autoPilot.o XMLChar.o XMLModifier.o intHash.o \
bookMark.o indexHandler.o transcoder.o elementFragmentNs.o
Symbol visibility was tuned with the linker option -Wl,--version-script=vtd-xml.map, where the map file is:
{
global:
the_exception_context;
toString;
getText;
getCurrentIndex;
toNormalizedString;
toElement;
toElement2;
createVTDGen;
setDoc;
parse;
getNav;
freeVTDGen;
freeVTDNav;
getTokenCount;
local:
*;
};
Global ( "exported" ) symbols are listed under the global: section, while the catch-all * under local: says that all other symbols are known only locally.
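For comparison, roughly the same split between exported and local symbols can be expressed in the source itself with GCC visibility attributes instead of a linker map. This is only a sketch of that alternative, not what was done for libvtd.so: the macro names and both functions are hypothetical, and it assumes GCC or Clang on an ELF platform.

```c
/* Sketch: source-level symbol visibility (GCC/Clang, ELF).
 * All names here are hypothetical and not part of VTD-XML. */
#define API_EXPORT __attribute__((visibility("default")))
#define API_LOCAL  __attribute__((visibility("hidden")))

/* Hidden: callable inside the shared object, but not exported from it,
 * so it cannot clash with a same-named symbol in another module. */
API_LOCAL int helper_clamp(int value, int low, int high)
{
    if (value < low)
        return low;
    if (value > high)
        return high;
    return value;
}

/* Exported: the only entry point other modules can link against. */
API_EXPORT int api_clamp_percent(int value)
{
    return helper_clamp(value, 0, 100);
}
```

The drawback compared with a map file is that the library sources have to be annotated, whereas the version script leaves the VTD-XML sources untouched.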
All object modules come from the VTD-XML distribution, with the exception of libvtd.o: this custom object was needed to address issues with the exception handling library cexcept.h. libvtd.c is only two lines of code:
#include "customTypes.h"
struct exception_context the_exception_context[ 1 ];
In the compilation phase I had to adjust CFLAGS to produce Position Independent Code ( gcc -fPIC option ), which is needed to build shared objects.
The readelf tool was useful for checking symbol visibility:
readelf --syms libvtd.so.2.10
Symbol table '.dynsym' contains 35 entries:
Num: Value Size Type Bind Vis Ndx Name
...
280: 000000000000d010 117 FUNC LOCAL DEFAULT 12 writeIndex
281: 000000000003c5d0 154 FUNC LOCAL DEFAULT 12 setCursorPosition
282: 000000000003c1f0 56 FUNC LOCAL DEFAULT 12 resetIntHash
...
331: 0000000000004f50 3545 FUNC GLOBAL DEFAULT 12 toElement
332: 00000000000071e0 224 FUNC GLOBAL DEFAULT 12 getText
333: 000000000000d420 114 FUNC GLOBAL DEFAULT 12 freeVTDGen
...
339: 000000000000b600 731 FUNC GLOBAL DEFAULT 12 toElement2
340: 000000000000e650 120 FUNC GLOBAL DEFAULT 12 getNav
341: 0000000000025750 70567 FUNC GLOBAL DEFAULT 12 parse
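The same kind of check can also be sketched at run time with dlopen() and dlsym(), which is closer to what the Inline::C loader will eventually do. The helper name symbol_is_exported is hypothetical; the code assumes a POSIX system, and on older glibc versions it has to be linked with -ldl.

```c
/* Sketch: run-time check of whether a symbol resolves through a
 * shared object. symbol_is_exported() is a hypothetical helper. */
#include <dlfcn.h>
#include <stddef.h>

/* Returns 1 if `symbol` resolves through `library` (or one of its
 * dependencies), 0 if it does not, and -1 if the library cannot be
 * loaded. A NULL `library` means the running program itself. */
int symbol_is_exported(const char *library, const char *symbol)
{
    void *handle = dlopen(library, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
        return -1;

    dlerror();                        /* clear any stale error state */
    (void)dlsym(handle, symbol);
    int found = (dlerror() == NULL);  /* NULL means the lookup succeeded */

    dlclose(handle);
    return found;
}
```

With the map file above, a call such as symbol_is_exported("./libvtd.so.2.10", "parse") would be expected to succeed, while a lookup of a local symbol such as writeIndex should fail.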
The wrapper_API.so consists of several functions that use the VTD-XML API and its custom types internally, but accept and return only standard C types and/or structs. The wrapper came straight from a former standalone C program.
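To give an idea of the shape of such a wrapper, here is a minimal sketch of the pattern, not the actual wrapper code. Every name in it is hypothetical, and the backend struct merely stands in for the VTD-XML types that the real wrapper keeps private; the "query" counts '<' bytes instead of calling the parser. In a real layout the struct definition would live only in the .c file, with a forward declaration in the header.

```c
/* Sketch of a wrapper layer exposing only standard C types.
 * Everything here is hypothetical; backend_doc stands in for the
 * library-specific types that stay hidden behind the wrapper. */
#include <stdlib.h>
#include <string.h>

/* Private to the wrapper: callers never depend on this layout. */
struct backend_doc {
    char  *buffer;   /* private copy of the document bytes */
    size_t length;
};

/* Callers only ever hold this as an opaque pointer. */
typedef struct backend_doc wrapper_doc;

wrapper_doc *wrapper_open(const char *xml, size_t length)
{
    wrapper_doc *doc = malloc(sizeof *doc);
    if (doc == NULL)
        return NULL;
    doc->buffer = malloc(length + 1);
    if (doc->buffer == NULL) {
        free(doc);
        return NULL;
    }
    memcpy(doc->buffer, xml, length);
    doc->buffer[length] = '\0';
    doc->length = length;
    return doc;
}

/* Stand-in for a real query: counts '<' bytes instead of parsing. */
long wrapper_count_open_brackets(const wrapper_doc *doc)
{
    long count = 0;
    for (size_t i = 0; i < doc->length; i++)
        if (doc->buffer[i] == '<')
            count++;
    return count;
}

void wrapper_close(wrapper_doc *doc)
{
    if (doc != NULL) {
        free(doc->buffer);
        free(doc);
    }
}
```

Because the signatures involve only char pointers, sizes and a long, nothing from the hidden library's headers has to be visible to the code compiled under Inline::C.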