Search code examples
xmlazureazure-sql-databaseazure-blob-storageazure-data-lake

azure solutions for analysing xml data


we are looking at developing a BI solution in Azure to analyse client airline search requests to our system. The requests are stored as xmls and there are around 50 million generated each day. What azure solutions would you recommend to load these to the cloud, analyse this data...trends across time (so keeping historical data), real time system performance/errors..the data is included in the xmls?

I assume we would also need something like express route to transfer the data to the cloud


Solution

  • You can use Azure Blob Storage to upload that data to Azure. If that XML data is complex to analyze it you may have to transform it to a much simpler format like CSV format.

    You can use Azure Data Factory to transform XML files to CSV files, then you can load all generated CSV files into corresponding tables in Azure SQL Data Warehouse via Polybase and a developed Stored Procedure activity on Azure Data Factory.

    If you want to have a folder on BLOB storage for XML files not processed, another folder for XML files processed and another folder for the CSV files created and you want Azure Data Factory to move files from one folder to another during all the process, then you need to use HDInsight and create a map reduce activity for that purpose.

    PolyBase allows to use T-SQL statements to access data stored in Hadoop or Azure Blob Storage and query it in an adhoc fashion. It also can let you query semi-structured data and join the results with relational data sets stored in SQL Data Warehouse. You should be able to create external tables pointing to data in Blob Storage using PolyBase. Then using a simple stored procedure you can select all the data in the external table and load them into the corresponding tables in SQL Data Warehouse.

    Once all the transformed data is loaded into the corresponding tables, you can write stored procedures to generate the data for aggregation tables which you can then use to feed PowerBI dashboards.

    Hope this helps.