
How to use Jackson to parse XML files with streaming API


I'm looking for a way to parse a large XML file using Kotlin.

My usual JSON parser is Jackson, and I know it can also be used to parse XML.

The source file is too large to be parsed using a DOM approach, so I must use the streaming API instead. I can find several examples of how to use the Jackson streaming API with JSON, but nothing about XML. The documentation at https://github.com/FasterXML/jackson-dataformat-xml says:

Although module implements low-level (JsonFactory / JsonParser / JsonGenerator) abstractions, most usage is through data-binding level. This because a small number of work-arounds have been added at data-binding level, to work around XML peculiarities:

and this made me worried about whether a streaming approach to XML with this library is even possible and/or supported.


Solution

  • Reading a tree structure means processing it from its START token (e.g. <element> for XML, or { / [ for JSON) through to the matching end token. That is why the entire object cannot be available while it is still being processed in a streaming way.

    Consider a root wrapper that holds a big list of cars (for brevity I use Lombok annotations):

    @Getter
    @Setter
    @JacksonXmlRootElement
    @NoArgsConstructor
    @AllArgsConstructor
    static class CarBook {
        @JacksonXmlProperty(isAttribute = true)
        private int version;
        @JacksonXmlElementWrapper(localName = "cars")
        @JacksonXmlProperty(localName = "car")
        private List<Car> cars;
    }
    
    @Getter
    @Setter
    @ToString
    @NoArgsConstructor
    @AllArgsConstructor
    static class Car {
        private String model;
        private String plate;
    }
    

    With this model, you cannot get a CarBook object until the whole list (and possibly other members) has been fully read.

    The usual way, then, is to use an XMLStreamReader and check token by token what you get, but you can also use Jackson to bind entire objects with this XmlMapper method:

    /**
     * Method for reading a single XML value from given XML-specific input
     * source; useful for incremental data-binding, combining traversal using
     * basic Stax {@link XMLStreamReader} with data-binding by Jackson.
     * 
     * @since 2.4
     */
    public <T> T readValue(XMLStreamReader r, Class<T> valueType) throws IOException {
        return readValue(r, _typeFactory.constructType(valueType));
    } 
    

    As an example, take this big (1.2 GB) file:

    <CarBook version="1"><cars>
    <car><model>Alfa Romeo Spider</model><plate>27437</plate></car>
    <car><model>Almera</model><plate>6429</plate></car>
    <car><model>Audi 80 and 90</model><plate>4898</plate></car>
    <car><model>Audi A3</model><plate>21259</plate></car>
    <car><model>Audi A4</model><plate>21056</plate></car>
    <car><model>Audi Coupé</model><plate>5623</plate></car>
    <car><model>Austin Metro</model><plate>26446</plate></car>
    <car><model>BMW 3 Series</model><plate>16338</plate></car>
    <car><model>BMW 5 Series</model><plate>29859</plate></car>
    ...
    

    Then you can read it lazily with:

    public static void main(String... args) throws IOException, XMLStreamException {
        XmlMapper xm = new XmlMapper();
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xr = xif.createXMLStreamReader(new FileInputStream(/* 1,2G file */ "/home/josejuan/tmp/all.cars.xml"));
    
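        // Advance event by event; bind a whole Car whenever a <car> element starts
        while (xr.hasNext()) {
            xr.next();
            // START_ELEMENT is defined in javax.xml.stream.XMLStreamConstants
            if (xr.getEventType() == XMLStreamConstants.START_ELEMENT) {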
                System.out.println(xr.getLocalName());
                if ("car".equals(xr.getLocalName())) {
                    Car car = xm.readValue(xr, Car.class);
                    System.out.println(car);
                    if ("21056".equals(car.getPlate()))
                        break;
                }
            }
        }
    
        System.out.println("== End Of Process ==");
    }
    

    This prints:

    CarBook
    cars
    car
    WithLazyJackson.Car(model=Alfa Romeo Spider, plate=27437)
    car
    WithLazyJackson.Car(model=Almera, plate=6429)
    car
    WithLazyJackson.Car(model=Audi 80 and 90, plate=4898)
    car
    WithLazyJackson.Car(model=Audi A3, plate=21259)
    car
    WithLazyJackson.Car(model=Audi A4, plate=21056)
    == End Of Process ==
    

    so only 5 cars out of 19,800,000 are read.
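
    For comparison, the token-by-token XMLStreamReader approach mentioned earlier can be sketched without Jackson, using only the StAX API that ships with the JDK. This is a minimal sketch, not the answer's method: the class name StaxCarScan, the method readCarsUntil and the "model/plate" string format are hypothetical names chosen for illustration, and it assumes <model> and <plate> contain text only.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Jackson or of the answer above.
class StaxCarScan {

    // Collects "model/plate" strings and stops right after the car whose
    // plate equals stopPlate, so the rest of the input is never parsed.
    static List<String> readCarsUntil(Reader in, String stopPlate) {
        List<String> cars = new ArrayList<>();
        try {
            XMLStreamReader xr = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                String model = null;
                String plate = null;
                while (xr.hasNext()) {
                    int event = xr.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        String name = xr.getLocalName();
                        // getElementText() consumes the text and the matching end tag
                        if ("model".equals(name)) {
                            model = xr.getElementText();
                        } else if ("plate".equals(name)) {
                            plate = xr.getElementText();
                        }
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && "car".equals(xr.getLocalName())) {
                        cars.add(model + "/" + plate);
                        if (stopPlate.equals(plate)) {
                            break;
                        }
                        model = null;
                        plate = null;
                    }
                }
            } finally {
                xr.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so callers do not have to declare a checked exception
            throw new IllegalStateException("Could not parse XML", e);
        }
        return cars;
    }

    public static void main(String[] args) {
        String xml = "<CarBook version=\"1\"><cars>"
                + "<car><model>Alfa Romeo Spider</model><plate>27437</plate></car>"
                + "<car><model>Almera</model><plate>6429</plate></car>"
                + "<car><model>Audi A3</model><plate>21259</plate></car>"
                + "</cars></CarBook>";
        // prints [Alfa Romeo Spider/27437, Almera/6429]
        System.out.println(readCarsUntil(new StringReader(xml), "6429"));
    }
}
```

    Every field has to be picked out by hand here, which is the bookkeeping that xm.readValue(xr, Car.class) takes over in the Jackson version.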