regex, xml, notepad++, regex-lookarounds, regex-greedy

RegEx for matching multiple target search


I am using the latest version of Notepad++ and have 1,500 XML files that I want to organize into groups. Can someone please help me develop a regex that lets me search for multiple XML tag types across all 1,500 files?

For example, I want Notepad++ to tell me how many XML files contain both of these tags: <tag1> and <tag2>. The problem is that my search only works when it targets a single tag. I would like to be able to search for 2, 3, or 4 tags at once, which would help me group all 1,500 XML files into different categories.


Solution

  • How reliable do you need it to be? With 1,500 input files, you're not going to be able to check the results by hand. So it takes only one rogue file that does something legitimate but unexpected (for example, writing <tag1 > instead of <tag1>, or having an instance of <tag1> that has been "commented out") to give you bad results that you won't detect. How much does this matter to you?

    This is why the general recommendation is never to process XML with regular expressions, and instead always to use an XML parser and an XML query language such as XPath.

    XSLT 2.0+ and XQuery both give you the possibility to process a collection of XML files. You haven't given a very precise specification of requirements, but here's the kind of thing you can do:

    <xsl:for-each-group select="collection('file:///Users/me/data/')"
                        group-by="my:category(.)">
       <xsl:for-each select="current-group()">
          <xsl:result-document href="{my:output-file-name(current-grouping-key())}">
             <xsl:copy-of select="."/>
          </xsl:result-document>
       </xsl:for-each>
    </xsl:for-each-group>
    

    where my:category() is a user-written function that uses XPath logic to allocate a category to each document, and my:output-file-name() is a user-written function that decides where to place the documents in each category.
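
If XSLT or XQuery is not available, the same parser-based idea can be sketched in Python with the standard-library `xml.etree.ElementTree` parser. This is only a rough illustration, not the method from the answer above. The function names `tags_present` and `group_by_tags` are made up for this example, and the sketch assumes the files sit in one folder, end in `.xml`, and do not use XML namespaces (a namespaced element would appear as `{uri}tag1` rather than `tag1`). Because the files are parsed rather than pattern-matched, `<tag1 >` counts as `tag1`, and a `<tag1>` inside a comment is ignored.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def tags_present(xml_path, wanted):
    """Return a sorted tuple of the names in `wanted` that occur as elements in the file."""
    root = ET.parse(xml_path).getroot()
    # iter() visits the root and every descendant element.
    # The default parser discards comments, so commented-out tags are not seen.
    found = {elem.tag for elem in root.iter()}
    return tuple(sorted(set(wanted) & found))


def group_by_tags(folder, wanted):
    """Map each combination of wanted tags to the list of file names containing exactly that combination."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            key = tags_present(path, wanted)
        except ET.ParseError:
            # Keep malformed files visible in a group of their own instead of hiding them.
            key = ("<unparseable>",)
        groups[key].append(path.name)
    return dict(groups)
```

Calling `group_by_tags("path/to/xml/files", ["tag1", "tag2"])` (the path is a placeholder) returns a dictionary whose keys are tuples such as `("tag1", "tag2")`, so `len(groups.get(("tag1", "tag2"), []))` is the number of files that contain both tags. Files that fail to parse land under their own key, which gives at least a partial check on the results without inspecting all 1,500 files by hand.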