
Filtering svn ls --xml for recently modified files without an XML parser


I am very new to Linux and Bash scripting, and I am struggling to get started. I have a list of entries in XML, and I want to select only some of them: using each entry's last-changed date, I want the names of only those entries that changed in the last 4 months. I am producing the XML with svn ls --xml and trying to pipe it to grep to do this. I can't use an XML parser, because I would have to install it on every system the script runs on. Here are two such XML entries:

<entry
   kind="directory">
<name>foo</name>
<commit
   revision="69">
<author>myself</author>
<date>2016-05-13T00:21:59.396753Z</date>
</commit>
</entry>
<entry
   kind="directory">
<name>bar</name>
<commit
   revision="666">
<author>myself</author>
<date>2013-04-04T01:56:54.484359Z</date>
</commit>
</entry>
</list>
</lists>

Solution

  • Horrible, no-good, very-bad answer you asked for

    Assuming (and it's an assumption absolutely not guaranteed to hold in future releases) that the formatting of this output will remain constant in the future (in ways beyond the well-formedness guarantees provided by the XML specification), and that your filenames will never contain characters that need to be escaped in XML:

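    # These patterns match only svn's current one-element-per-line layout.
    date_re='^<date>(.*)</date>$'
    name_re='^<name>(.*)</name>$'
    end_re='^</entry>$'
    
    # svn reports commit dates in UTC, so compute the cutoff in UTC too (GNU date).
    limit=$(date -u -d 'now - 4 months' '+%Y-%m-%dT%H:%M:%S') || exit
    
    date=; name=
    while read -r line; do
      [[ $line =~ $date_re ]] && date=${BASH_REMATCH[1]}
      [[ $line =~ $name_re ]] && name=${BASH_REMATCH[1]}
      # At the end of an entry, print it if it is newer than the cutoff.
      if [[ $line =~ $end_re ]]; then
        if [[ $date && $name && $date > $limit ]]; then
          printf '%s\t%q\n' "$date" "$name"
        fi
        date=; name=   # reset so a stale value never leaks into the next entry
      fi
    done < <(svn ls --xml) | sort -r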
    

    The output of this will be a stream that looks something like (for your input):

    2016-05-13T00:21:59.396753Z foo
    

    Note that this will behave badly if your filenames are at all interesting. Expect &gt;, &amp;, and similar in your output, whereas the actual filenames contain >, & or the like. It will also cease to work if future versions of SVN add attributes to these XML tags, which they're entirely allowed to do. Don't do this.
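
    The entity problem can be reduced, though not eliminated, by decoding the five predefined XML entities after extraction. The following is only a minimal sketch: xml_unescape is a hypothetical helper name (not part of svn or bash), and it assumes the input contains no numeric character references such as &#38; and no CDATA sections.

```shell
# Hypothetical helper: decode the five predefined XML entities on stdin.
# Assumption: numeric references (&#38;, &#x26;) do not occur in the input.
xml_unescape() {
  # &amp; is decoded last so that "&amp;lt;" becomes "&lt;", not "<".
  sed -e 's/&lt;/</g' \
      -e 's/&gt;/>/g' \
      -e 's/&quot;/"/g' \
      -e "s/&apos;/'/g" \
      -e 's/&amp;/\&/g'
}

# Demo: prints  R&D <draft>
printf '%s\n' 'R&amp;D &lt;draft&gt;' | xml_unescape
```

    Even with this in place, the approach still depends on svn's exact line layout, so it remains a stopgap rather than a substitute for a real parser.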


    The Right Thing

    ...to get the four newest files:

    svn ls --xml \
      | xmlstarlet sel -t -m '//entry' -v './commit/date' -o $'\t' -v './name' -n \
      | sort -r \
      | head -n 4
    

    ...now, this is unambiguous only if we assume that Subversion can't store filenames with literal newlines. Fortunately, this is a rule it enforces in practice; thus, everything past the first tab character in this output stream can be safely interpreted as a filesystem component.
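
    As an illustration of consuming that stream, here is a small sketch that splits each line at the first tab and keeps only the name. names_only is a hypothetical helper name, and the sample line is made up for the demo; it assumes names neither begin nor end with a tab.

```shell
# Hypothetical helper: print only the name column of a date<TAB>name stream.
names_only() {
  tab=$(printf '\t')
  # With IFS set to a tab, spaces inside a name are preserved as-is.
  while IFS=$tab read -r _date name; do
    printf '%s\n' "$name"
  done
}

# Demo: prints  foo bar
printf '2016-05-13T00:21:59.396753Z\tfoo bar\n' | names_only
```

    Inside the loop, "$name" can be handed to any other command instead of printf.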


    The Right Thing, Portably

    The above xmlstarlet command is precisely equivalent to using xsltproc to apply the following template:

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:exslt="http://exslt.org/common" version="1.0" extension-element-prefixes="exslt">
      <xsl:output omit-xml-declaration="yes" indent="no"/>
      <xsl:template match="/">
        <xsl:for-each select="//entry">
          <xsl:call-template name="value-of-template">
            <xsl:with-param name="select" select="./commit/date"/>
          </xsl:call-template>
          <xsl:text>&#9;</xsl:text>
          <xsl:call-template name="value-of-template">
            <xsl:with-param name="select" select="./name"/>
          </xsl:call-template>
          <xsl:value-of select="'&#10;'"/>
        </xsl:for-each>
      </xsl:template>
      <xsl:template name="value-of-template">
        <xsl:param name="select"/>
        <xsl:value-of select="$select"/>
        <xsl:for-each select="exslt:node-set($select)[position()&gt;1]">
          <xsl:value-of select="'&#10;'"/>
          <xsl:value-of select="."/>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>
    

    If this is saved as names-and-dates.xslt, then:

    svn ls --xml | xsltproc names-and-dates.xslt - | sort -r | head -n 4
    

    ...will apply it accordingly.


    Footnote: Applying a date cutoff

    To enforce a date cutoff instead of the last-N approach, replace head in the above with awk -v min_date="$(date -u -d 'now - 4 months' '+%Y-%m-%dT%H:%M:%S')" '($1 < min_date) { exit } { print }'. Because the stream is sorted newest-first, awk can stop at the first entry older than the cutoff; svn timestamps are in UTC, hence the -u.
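
    To see the awk filter in isolation, the sketch below wraps it in a function and feeds it two sample lines with a fixed cutoff. filter_since is a hypothetical name, and the cutoff value is chosen only for the demo.

```shell
# Hypothetical helper: pass lines through until the first one older than $1.
# Assumes input is sorted newest-first, as "sort -r" produces.
filter_since() {
  awk -v min_date="$1" '($1 < min_date) { exit } { print }'
}

# Demo: only the 2016 entry is newer than the cutoff, so only it is printed.
printf '%s\t%s\n' \
  '2016-05-13T00:21:59.396753Z' 'foo' \
  '2013-04-04T01:56:54.484359Z' 'bar' \
  | filter_since '2016-01-13T00:00:00'
```

    The comparison is a plain string comparison, which works because ISO 8601 timestamps in the same timezone sort lexically in chronological order.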

    If you want to take four months relative to the first entry, rather than relative to the current date, you could instead pipe the results through the following:

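    {
       # The first line is the newest entry; its date anchors the cutoff.
       read -r date name
       min_date=$(date -u -d "$date - 4 months" '+%Y-%m-%dT%H:%M:%S')
       printf '%s\t%s\n' "$date" "$name"
       # Input is sorted newest-first, so stop at the first entry past the cutoff.
       while read -r date name; do
         [[ $date > $min_date ]] || break
         printf '%s\t%s\n' "$date" "$name"
       done
    }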
    

    Note that this assumes GNU date; adjusting for portability to non-GNU platforms is left as an exercise to the reader.
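
    As a starting point for that exercise, here is one possible sketch for computing the "now minus 4 months" cutoff on either GNU or BSD/macOS date. cutoff_4_months is a hypothetical helper name; the sketch assumes that only GNU date accepts --version and that any other date is BSD-style with -v support, which will not hold everywhere (BusyBox, for instance).

```shell
# Hypothetical helper: print the UTC timestamp four months before now.
# Assumption: a date that accepts --version is GNU; anything else is BSD-style.
cutoff_4_months() {
  if date --version >/dev/null 2>&1; then
    date -u -d 'now - 4 months' '+%Y-%m-%dT%H:%M:%S'   # GNU coreutils
  else
    date -u -v-4m '+%Y-%m-%dT%H:%M:%S'                 # BSD / macOS
  fi
}

cutoff_4_months
```

    The relative-to-first-entry variant would additionally need to parse the entry's timestamp on BSD, which this sketch does not attempt.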