
Extract embedded XML data from an audio file in Windows


We have a platform that records our call centre calls and appends some XML to the end of each WAV file; the XML holds important metadata about the call.

I'm trying to read a folder of these WAV files and pull the metadata into a list for a user. Their preference is for the list to be in Excel, but I'm struggling to find a method that would work reliably on a normal Windows computer without installing anything special, such as Python.

Excel has an XML import function, but it fails because the XML is at the end of the file: Excel reads from the start and gets confused by the audio data. I just need to skip down to <recording> and read from there until </recording>.

I've tried the following in PowerShell:

$directory = "C:\test"
$wavFiles = Get-ChildItem -Path $directory -Filter *.wav

foreach ($file in $wavFiles) {
    Write-Host "Processing file: $($file.Name)"
    $content = Get-Content -Path $file.FullName -Raw -Encoding Byte
    $decodedContent = [System.Text.Encoding]::UTF8.GetString($content)
    $match = [regex]::Match($decodedContent, '<recording>.+?</recording>')
    if ($match.Success) {
        $xmlContent = $match.Value
        Write-Host "Found XML in file $($file.Name):"
        Write-Host $xmlContent
    } else {
        Write-Host "No XML found in file $($file.Name)."
    }
}

This correctly locates the files but fails to find the XML, even though the XML is visible when opening a file in a text editor such as Notepad++:

Processing file: 131346032527__8115_02-13-2024-11-12-58.wav

No XML found in file 131346032527__8115_02-13-2024-11-12-58.wav.

Any ideas?


Solution

  • Note:

    • The following assumes that a well-formed <recording> XML element is indeed embedded in your .wav files, and that it is represented as a single, contiguous block of bytes.

    • By default, . in the .NET regex engine matches any character except a newline character (\n). Thus, one possible explanation for not being able to find a match is that your <recording> XML element spans multiple lines.

      • To make . match newlines too, you need to set the Singleline .NET regex option, which you can do in one of the following ways:

        • Use the equivalent s inline option, as part of the regex:

          [regex]::Match($decodedContent, '(?s)<recording>.+?</recording>')
          
        • Use the [regex]::Match() overload that takes a System.Text.RegularExpressions.RegexOptions argument:

          [regex]::Match($decodedContent, '<recording>.+?</recording>', 'Singleline')
          
      • Note:

        • This solution may have to be combined with using a different character encoding - see the next major bullet point.

        • Also, note that .NET's regex engine is case-sensitive by default (unlike the PowerShell functionality that builds on it, notably -match and -replace); to make the above case-insensitive, use (?si) / 'Singleline, IgnoreCase'.

    • Independently, perhaps the embedded XML uses a character encoding other than UTF-8, notably one that isn't compatible with UTF-8 even for ASCII-range characters.

      • E.g., if the actual encoding is UTF-16LE, use the following:

         [System.Text.Encoding]::Unicode.GetString($content)
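
The two points above can be checked without a real recording. The following is a minimal sketch, not a definitive implementation: it builds a synthetic byte buffer (a few non-text bytes followed by a multi-line <recording> element) and shows how the regex option and the chosen encoding each affect the match. The helper name Find-RecordingXml and the child elements agent and duration are hypothetical and only serve as illustration; your files will have their own schema.

```powershell
# Synthetic stand-in for a recording: a few non-text bytes followed by a
# multi-line <recording> element. The child elements (agent, duration) are
# hypothetical; the real files will have their own schema.
$xmlText = "<recording>`n  <agent>8115</agent>`n  <duration>42</duration>`n</recording>"
# An even number of bytes, so that UTF-16 code units stay aligned when the
# whole buffer is decoded in one go.
$audio = [byte[]](0x52, 0x49, 0x46, 0x46, 0x00, 0xFF, 0x10, 0x80)

# Hypothetical helper: decode the bytes with the given encoding and return
# the first <recording> element as text, or $null if none is found.
function Find-RecordingXml {
    param(
        [byte[]] $Bytes,
        [System.Text.Encoding] $Encoding
    )
    $text = $Encoding.GetString($Bytes)
    # (?s) makes . match newlines too; (?i) makes the match case-insensitive.
    $match = [regex]::Match($text, '(?si)<recording>.+?</recording>')
    if ($match.Success) { $match.Value } else { $null }
}

$utf8File  = [byte[]]($audio + [System.Text.Encoding]::UTF8.GetBytes($xmlText))
$utf16File = [byte[]]($audio + [System.Text.Encoding]::Unicode.GetBytes($xmlText))

# Without the Singleline option, . stops at the first newline: no match.
$decoded  = [System.Text.Encoding]::UTF8.GetString($utf8File)
$withoutS = [regex]::Match($decoded, '<recording>.+?</recording>')

# Matching encoding: the element is found. Mismatched encoding: it is not.
$fromUtf8  = Find-RecordingXml -Bytes $utf8File  -Encoding ([System.Text.Encoding]::UTF8)
$fromUtf16 = Find-RecordingXml -Bytes $utf16File -Encoding ([System.Text.Encoding]::Unicode)
$wrongEnc  = Find-RecordingXml -Bytes $utf16File -Encoding ([System.Text.Encoding]::UTF8)

# Once extracted, the text can be cast to [xml] and read by property name.
$doc   = [xml] $fromUtf8
$agent = $doc.recording.agent
```

For real files, the bytes can be read with [System.IO.File]::ReadAllBytes($file.FullName), which works in both Windows PowerShell and PowerShell 7; Get-Content -Encoding Byte is only supported in Windows PowerShell, and PowerShell 7 uses -AsByteStream instead.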