Search code examples
sqlexcelxmlimport

How to combine multiple xml files into sql table?


I have 50K xml files (zml technically, but behave the same), each with metadata for a single file. I need to combine that metadata into a single SQL table so I can manipulate it there. The xml files have varying metadata tags, with some overlapping and some not. How can I get from 50K xml to a single SQL table?

I've looked at Powershell and can get part of the data from a single file parsed out, but haven't figured out writing it into a csv yet.

I've looked at XML data maps in Excel but it only seems to allow the import of a single xml file (1 row of data essentially).

I've also used a co-worker's script to import a single file's worth into SQL, but it puts the whole SQL into a single field, and would then need to be parsed out.

Looking for the best, most efficient path forward, or other ideas.

I have also considered combining all the individual xml files first, but there isn't anything in the file that would identify which file its from - so no way to link it back to the file its describing.


Solution

  • Figured it out! Used powershell to parse out each section.

    ##File with list of paths to parse
    
    $CSV = Import-Csv "C:\Folder\AllZMLFiles.csv"
    
    foreach($LINE in $CSV)
        {
        $Path = $LINE.filpath
    
    
     [xml]$xmlFile = Get-Content -Path $Path
    
     ## For each XML section, list out all objects
    $xmlFile.document.File |  ForEach-Object {
    
         ## output the result object
         [pscustomobject]@{
            Source = $Path
             DateCreated = $_.DateCreated
             DateModified = $_.DateModified
             JustUploaded = $_.JustUploaded
             OrigFilename = $_.OrigFilename
         }
     } | Export-Csv -Path C:\Folder\File.csv -NoTypeInformation -Append ##Append to CSV
    
    
     [xml]$xmlFile = Get-Content -Path $Path
    
    
    $xmlFile.document.Fields |  ForEach-Object {
    
         ## output the result object
         [pscustomobject]@{
            Source = $Path
             Producer = $_.Producer
             ProcessDate = $_.ProcessDate
             Version = $_.Version
             OcrEngine = $_.OcrEngine
             secure = $_.secure
             LABEL = $_.LABEL
             Numero_Contrat = $_.Numero_Contrat
             Date_du_contrat = $_.Date_du_contrat
             Fin_du_contrat = $_.Fin_du_contrat
             Duree = $_.Duree
             Type_Contrat = $_.Type_Contrat
             Artiste = $_.Artiste
             Nom_civil_artiste = $_.Nom_civil_artiste
             Contractant = $_.Contractant
             Type_Document = $_.Type_Document
             Objet = $_.Objet
             Titre_phono_video = $_.Titre_phono_video
             Commentaires = $_.Commentaires
             Statut = $_.Statut
             Conteneur = $_.Conteneur
             Boite = $_.Boite
             Origine = $_.Origine
             OrigFilename = $_.OrigFilename
             Type = $_.Type
         }
     } | Export-Csv -Path C:\Folder\Fields.csv -NoTypeInformation -Append
    
    
     [xml]$xmldtInfos = Get-Content -Path $Path
    
    
    $xmldtInfos.document.dtInfos |  ForEach-Object {
    
         ## output the result object
         [pscustomobject]@{
            Source = $Path
             ID = $_.ID
             TypeID = $_.TypeID
             WordCount = $_.WordCount
          
         }
     } | Export-Csv -Path C:\ACG\Zedoc\dtInfos.csv -NoTypeInformation -Append
    
        }