
Process XML files with Spotify Scio (Scala wrapper for Apache Beam)


The Apache Beam Java SDK supports reading large XML input files with org.apache.beam.sdk.io.xml.XmlIO (I looked at version 2.1.0).

Does anyone know whether Scio allows you to do the same, or have an example? I have a set of very large XML files that I'd like to process.


Solution

  • You can do this with Scio by using a custom input transform. Typically, you'll need to do this for any input source that doesn't have a native Scio interface.

    Example:

    import org.apache.beam.sdk.io.xml.XmlIO

    // Record is a JAXB-annotated class that you define for one record element.
    // The explicit type parameter keeps Scala from inferring Nothing for read().
    val xmlInputTransform = XmlIO.read[Record]()
      .from("file or pattern spec")         // TODO: specify file name or Java "glob" file pattern to read multiple XML files
      .withRootElement("root element")      // TODO: specify name of root element
      .withRecordElement("record element")  // TODO: specify name of record element
      .withRecordClass(classOf[Record])     // TODO: define JAXB-annotated Record class

    // sc is the ScioContext; xmls is an SCollection[Record]
    val xmls = sc.customInput("fromXML", xmlInputTransform)
    

    See the XmlIO.Read section in the Apache Beam Java SDK Reference for more details: https://beam.apache.org/documentation/sdks/javadoc/2.2.0/
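
    The snippet above leaves the JAXB-annotated Record class as a TODO. A minimal sketch of what such a class might look like in Scala follows. The element name "record" and the fields id and title are hypothetical placeholders, not taken from any real schema, and the sketch assumes the javax.xml.bind (JAXB 2.x) classes are on the classpath. RecordCheck is an illustrative helper that unmarshals one record without Beam, so the mapping can be checked before running a pipeline.

```scala
import java.io.StringReader
import javax.xml.bind.JAXBContext
import javax.xml.bind.annotation.{XmlAccessType, XmlAccessorType, XmlRootElement}

// Hypothetical record type: the element name "record" and both fields are
// placeholders chosen for illustration only.
@XmlRootElement(name = "record")
@XmlAccessorType(XmlAccessType.FIELD) // bind fields directly, so no bean getters/setters are needed
class Record {
  // JAXB needs a no-arg constructor and mutable fields it can populate.
  var id: Long = 0L
  var title: String = _

  override def toString: String = s"Record($id, $title)"
}

object RecordCheck {
  // Unmarshal a single <record> element with plain JAXB, independent of Beam.
  def parse(xml: String): Record =
    JAXBContext
      .newInstance(classOf[Record])
      .createUnmarshaller()
      .unmarshal(new StringReader(xml))
      .asInstanceOf[Record]
}
```

    Calling RecordCheck.parse on a small sample element is a quick way to confirm that the annotations match your XML before pointing XmlIO at the large files. With field access, child element names are assumed to match the Scala field names.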