java xml xpath xquery sax

Best way to extract big xml block from large xml file


I am extracting large blocks from XML files using XPath. The files are large; they come from PubMed. An example of the kind of file I am working with is:

ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/medline17n0001.xml.gz

So, by using

 Node result = (Node)xPath.evaluate("PubmedArticleSet/PubmedArticle[MedlineCitation/PMID = "+PMIDtoSearch+"]", doc, XPathConstants.NODE);

I get the article with PMIDtoSearch, so the result is exactly what I need. But it takes a long time. I have to do it around 800,000 times, so with this solution it would take more than two months. Some blocks have more than 400 lines, and each XML file has more than 4 million lines.
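For reference, the lookup described above can be written as a small self-contained sketch. The class name `PmidXPathLookup`, the `$pmid` variable, and the `parse` helper are illustrative assumptions, not part of the original code. The sketch compiles the expression once and binds the PMID through an `XPathVariableResolver` instead of concatenating it into the query string. It does not by itself remove the main cost: each call still searches the document for the matching article.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class (assumption), wrapping the XPath query from the question.
class PmidXPathLookup {
    private final Document doc;
    private final XPathExpression expr;
    private String currentPmid; // value supplied for the $pmid variable

    PmidXPathLookup(Document doc) {
        this.doc = doc;
        XPath xPath = XPathFactory.newInstance().newXPath();
        // Set the resolver before compile() so the compiled expression can see $pmid.
        xPath.setXPathVariableResolver(
                name -> "pmid".equals(name.getLocalPart()) ? currentPmid : null);
        try {
            // Same path as in the question, compiled once and reused for every lookup.
            expr = xPath.compile(
                    "PubmedArticleSet/PubmedArticle[MedlineCitation/PMID = $pmid]");
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the matching <PubmedArticle> node, or null if no article has this PMID.
    Node find(String pmid) {
        currentPmid = pmid;
        try {
            return (Node) expr.evaluate(doc, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Convenience parser for small in-memory samples.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With an already parsed `doc`, a lookup would then be `new PmidXPathLookup(doc).find("8")`, reusing the same instance for every subsequent PMID.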

I have also tried a solution based on the getElementsByTagName function, but it takes almost the same time.

Do you know how to improve the solution?

Thanks.


Solution

  • I took your document, loaded it into eXist-db, and then executed your query, essentially this:

    xquery version "3.0";
    let $medline := '/db/Medline/Data'
    let $doc := 'medline17n0001.xml'
    let $PMID := request:get-parameter("PMID", "")
    let $article := doc(concat($medline,'/',$doc))/PubmedArticleSet/PubmedArticle[MedlineCitation/PMID=$PMID]
    return
    $article
    

    The document is returned in 400 milliseconds from a remote server. If I beefed up that server, I would expect less than that, and it could handle multiple concurrent requests. If you had everything local, it would be even faster.

    Try it yourself. I left the data on a test server (and remember that this is querying a remote Amazon micro server in California):

    http://54.241.15.166/get-article2.xq?PMID=8

    http://54.241.15.166/get-article2.xq?PMID=6

    http://54.241.15.166/get-article2.xq?PMID=1

    And of course, that entire document is there. You can just change that query to PMID=667 or 999 or whatever and get the target document fragment back.
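If running a database is not an option, a local alternative is to pay the scanning cost once per file and answer every later lookup from an in-memory map. The following is only a sketch under assumptions: the class `PmidIndex` and its methods are hypothetical names, it assumes the file has already been parsed into a DOM `Document` (as in the question), and it assumes each `PubmedArticle` carries its identifier in a direct `MedlineCitation/PMID` child. It has not been measured against the real PubMed files.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class (assumption), not part of the original answer.
class PmidIndex {

    // Walk the children of the root element once and map each PMID to its article.
    static Map<String, Element> build(Document doc) {
        Map<String, Element> index = new HashMap<>();
        Node n = doc.getDocumentElement().getFirstChild();
        for (; n != null; n = n.getNextSibling()) {
            if (!isElement(n, "PubmedArticle")) continue;
            Element citation = firstChild(n, "MedlineCitation");
            Element pmid = citation == null ? null : firstChild(citation, "PMID");
            if (pmid != null) index.put(pmid.getTextContent().trim(), (Element) n);
        }
        return index;
    }

    // First direct child element with the given name, or null if there is none.
    private static Element firstChild(Node parent, String name) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (isElement(c, name)) return (Element) c;
        }
        return null;
    }

    private static boolean isElement(Node n, String name) {
        return n.getNodeType() == Node.ELEMENT_NODE && name.equals(n.getNodeName());
    }

    // Convenience parser for small in-memory samples.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

After `Map<String, Element> index = PmidIndex.build(doc);`, each of the 800,000 lookups becomes `index.get(PMIDtoSearch)`, a hash lookup instead of a search through the whole document. The trade-off is that the DOM and the map both have to fit in memory.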