python, html, regex, xml, pcre

Regex for XML document


I am trying to come up with a regex for an XML document, specifically a DASH MPD file. The document has an AdaptationSet tag, which in turn can contain multiple Representation tags, as shown below. I need to match every Representation tag whose bandwidth attribute is more than the specified input, e.g. 2000000 or 4000000 in the sample below. I came up with the following regex, but it does not handle the case where the attributes span multiple lines, as in the Representation with id=1.

RANGE in the regex can take any value from 1 to 9 and can be assumed to already be an integer, ready to be substituted into the regex. RANGE followed by six digits makes the pattern match a seven-digit bandwidth value starting with that digit (for example 1000000, 2000000 or 3000000 when RANGE is 1, 2 or 3 respectively).

regex:

<[Rr]epresentation.*?[Bb]andwidth="0?[%(RANGE)]\d{6}"[\s\S]*?[Rr]epresentation>

    <AdaptationSet segmentAlignment="true" maxWidth="1280" maxHeight="720" maxFrameRate="24" par="16:9">
     <Representation id="1" 
        mimeType="video/mp4" 
        codecs="avc1.4d401f" 
        width="512" 
        height="288" 
        frameRate="24" 
        sar="1:1" 
        startWithSAP="1" 
        bandwidth="1000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
      </Representation>
      <Representation id="2" mimeType="video/mp4" codecs="avc1.4d401f" width="512" height="288" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="2000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
      </Representation>
      <Representation id="3" mimeType="video/mp4" codecs="avc1.4d401f" width="768" height="432" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="4000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_768_1440K_video_$Number$.mp4" startNumber="1" initialization="BBB_768_1440K_video_init.mp4" />
      </Representation>
    </AdaptationSet>
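
To make the failure concrete, here is a minimal Python sketch that reproduces it. It assumes RANGE is substituted with ordinary % formatting; build_pattern and the trimmed MPD string are hypothetical names used only for illustration. Without a DOTALL-style flag, . does not match a newline, so .*? can never get from the start of a multi-line opening tag to its bandwidth attribute.

```python
import re

# Trimmed copy of the sample above; Representation 1 spreads its
# attributes over several lines, Representation 2 keeps them on one line.
MPD = """\
<AdaptationSet>
  <Representation id="1"
      mimeType="video/mp4"
      bandwidth="1000000">
    <SegmentTemplate timescale="12288" />
  </Representation>
  <Representation id="2" mimeType="video/mp4" bandwidth="2000000">
    <SegmentTemplate timescale="12288" />
  </Representation>
</AdaptationSet>
"""

def build_pattern(range_digit):
    # Hypothetical helper: substitutes RANGE (1-9) into the regex from the question.
    template = r'<[Rr]epresentation.*?[Bb]andwidth="0?[%d]\d{6}"[\s\S]*?[Rr]epresentation>'
    return template % range_digit

# RANGE=1 should find Representation 1, but its attributes span lines.
matches_1 = re.findall(build_pattern(1), MPD)
# RANGE=2 works because Representation 2 is written on a single line.
matches_2 = re.findall(build_pattern(2), MPD)

print(len(matches_1))
print(len(matches_2))
```

The RANGE=1 search comes back empty even though a 1000000 stream is present, which is exactly the multi-line case described above.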

Solution

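  • You can use this regex. It replaces .*? with [^>]*?, which also matches newlines but cannot run past the end of the opening tag, and it uses [2-9] to accept any leading digit from 2 upward: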

    <[Rr]epresentation[^>]*?[Bb]andwidth="0?[2-9]\d{6}"[\s\S]*?[Rr]epresentation>
    

    https://regex101.com/r/MmUkzc/9
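
As a rough check of the regex above from Python, here is a small sketch using the standard re module. The matching_ids helper, its min_digit parameter (standing in for RANGE) and the trimmed MPD string are assumptions made for this example, not part of the original answer.

```python
import re

# Trimmed copy of the sample MPD from the question.
MPD = """\
<AdaptationSet>
  <Representation id="1"
      mimeType="video/mp4"
      bandwidth="1000000">
    <SegmentTemplate timescale="12288" />
  </Representation>
  <Representation id="2" mimeType="video/mp4" bandwidth="2000000">
    <SegmentTemplate timescale="12288" />
  </Representation>
  <Representation id="3" mimeType="video/mp4" bandwidth="4000000">
    <SegmentTemplate timescale="12288" />
  </Representation>
</AdaptationSet>
"""

def matching_ids(xml_text, min_digit):
    # Hypothetical helper: min_digit (1-9) is the lowest accepted leading digit.
    template = r'<[Rr]epresentation[^>]*?[Bb]andwidth="0?[%d-9]\d{6}"[\s\S]*?[Rr]epresentation>'
    pattern = template % min_digit
    # Each match is a whole Representation element; pull out its id attribute.
    return [re.search(r'id="(\d+)"', m).group(1) for m in re.findall(pattern, xml_text)]

ids_from_2 = matching_ids(MPD, 2)
ids_from_1 = matching_ids(MPD, 1)
print(ids_from_2)
print(ids_from_1)
```

Two caveats with this approach: the pattern only accepts exactly seven digits (plus an optional leading 0), so a bandwidth of 10000000 or more is not matched, and an attribute value containing > would cut the [^>]*? scan short.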