Search code examples
azuredata-scienceazure-data-factory

Extracting/Removing/skipping Free Text from Json Files in Azure Data Factory


I am trying to do some source transformation in ADF where I have alot server logs in the following format:

#PartnerName    QA Server
#ApplicationName    T_GSPClient
#AccountName    DoNotModifyDMS
#SDK    desktop
#ClientVersion  5.1.1894.3
#InputChannel   DesktopMic
#User   JohnDoe
#NmsLogin   JohnDoe
#SessionId  7ba732d6-3445-4b16-b7e8-345fgd4f5g4
#ClientIP   209.122.69.109
#SRTechnology   S2
#SROptions  NoTextBefore
#GeneralLogLevel    Trace
#ModuleLogLevels    
#ServerDateTimeUTC  2023-07-06 15:28:33.105
{"date":"2023-07-06\t15:28:09.653","level":"TRACE","msg":"DMVAServerMessage-Initialize:{"dmhMessage":{"messageHeader":{"messageType":"unsubscribe","sessionId":"00000000-0000-0000-0000-000000000000","messageId":"aeb65e55-a7c0-4e96-a960-dc2a252d6b2c","transactionType":"acknowledgement","clientType":"ttsChannel","version":"1.0","application":"DesktopSDK"},"messageResponse":{"resultCode":"SERVER_ERROR","errorMessage":"The subscription id does not exist, unsubscribe did not remove an entry"}}} ","traceId":"409e0d44-ad50-4f17-84c7-0521e01e11fc","spanId":"e0750e44-ad50-4f17-90a9-3b6940e0294b","resource":{"module":".NET","class":"Nuance.SpeechAnywhere.Internal.DMVA.DMVAServerMessage","function":"Initialize","line":70,"pid":26612,"thread":"[28-27280]"}}
{"date":"2023-07-06\t15:28:09.653","level":"TRACE","msg":"DMVAServerMessage-Initialize:{"dmhMessage":{"messageHeader":{"messageType":"unsubscribe","sessionId":"00000000-0000-0000-0000-000000000000","messageId":"aeb65e55-a7c0-4e96-a960-dc2a252d6b2c","transactionType":"acknowledgement","clientType":"ttsChannel","version":"1.0","application":"DesktopSDK"},"messageResponse":{"resultCode":"SERVER_ERROR","errorMessage":"The subscription id does not exist, unsubscribe did not remove an entry"}}} ","traceId":"409e0d44-ad50-4f17-84c7-0521e01e11fc","spanId":"e0750e44-ad50-4f17-90a9-3b6940e0294b","resource":{"module":".NET","class":"Nuance.SpeechAnywhere.Internal.DMVA.DMVAServerMessage","function":"Initialize","line":70,"pid":26612,"thread":"[28-27280]"}}
{"date":"2023-07-06\t15:28:09.653","level":"TRACE","msg":"DMVAServerMessage-Initialize:{"dmhMessage":{"messageHeader":{"messageType":"unsubscribe","sessionId":"00000000-0000-0000-0000-000000000000","messageId":"aeb65e55-a7c0-4e96-a960-dc2a252d6b2c","transactionType":"acknowledgement","clientType":"ttsChannel","version":"1.0","application":"DesktopSDK"},"messageResponse":{"resultCode":"SERVER_ERROR","errorMessage":"The subscription id does not exist, unsubscribe did not remove an entry"}}} ","traceId":"409e0d44-ad50-4f17-84c7-0521e01e11fc","spanId":"e0750e44-ad50-4f17-90a9-3b6940e0294b","resource":{"module":".NET","class":"Nuance.SpeechAnywhere.Internal.DMVA.DMVAServerMessage","function":"Initialize","line":70,"pid":26612,"thread":"[28-27280]"}}
{"date":"2023-07-06\t15:28:09.653","level":"TRACE","msg":"DMVAServerMessage-Initialize:{"dmhMessage":{"messageHeader":{"messageType":"unsubscribe","sessionId":"00000000-0000-0000-0000-000000000000","messageId":"aeb65e55-a7c0-4e96-a960-dc2a252d6b2c","transactionType":"acknowledgement","clientType":"ttsChannel","version":"1.0","application":"DesktopSDK"},"messageResponse":{"resultCode":"SERVER_ERROR","errorMessage":"The subscription id does not exist, unsubscribe did not remove an entry"}}} ","traceId":"409e0d44-ad50-4f17-84c7-0521e01e11fc","spanId":"e0750e44-ad50-4f17-90a9-3b6940e0294b","resource":{"module":".NET","class":"Nuance.SpeechAnywhere.Internal.DMVA.DMVAServerMessage","function":"Initialize","line":70,"pid":26612,"thread":"[28-27280]"}}

My objective is remove/skip the first free text lines and keep the other json data and then move it to another blob for further transformation. I tried source transformation with derived column but my Data Flow is still showing json format error. I also used Copy Data activity via ForEach with "enableSkipIncompatibleRow": true, it doesn't work. It only works if I just work with single file, not when I try to iterate over many files and skip/remove those lines.


Solution

  • In order to remove the first few lines which are not in Json format using ADF dataflow, follow the below approach.

    • Take the source transformation with the delimitted source dataset . Take any column delimitter which is not part of the text.

    enter image description here

    • Here, I took the tilde ~ as column delimitter. So, the entire row is in the same column.

    enter image description here

    • Then take the filter transformation and give the filter on condition as, substring(Column_name,1,1)=='{'. This removes all the rows which do not begin with {.

    enter image description here

    • Then take the sink transformation and set the filename option to Output to single file and give the filename. In optimize, select single partition.

    enter image description here

    Here, I took sink file as delimited dataset with column delimiter as No delimitter and Quote character as No quote character.

    enter image description here

    Data in Output file

    enter image description here