Search code examples
apache-nifi

how to parse a csv file in nifi whose field contains a delimited value


I have a requirement to interpret and parse a log file whose contents are delimited by comma. The contents of the log file look like below. There is no header column in the file. I am putting it for our convenience.

Event_TimeStamp, Target_IP, UserName, Source_IP, DatabaseName, Activity, QueryText,            Flag
202101xxxx,       10.2.x.x,     u1,   172.5.x.x,    DB1,       Connect,                       ,   0
202101xxxx,       10.2.x.x,     u1,   172.5.x.x,    DB1,       Query,   'select * from table1',   0
202101xxxx,       10.2.x.x,     u1,   172.5.x.x,    DB1,       Query,   'select * from table2',   0
202101xxxx,       10.2.x.x,     u1,   172.5.x.x,    DB1,       Query,   'select colname1,colname2,colname3,colname4 from table1',  0
202101xxxx,       10.2.x.x,     u1,   172.5.x.x,    DB1,       Query,   'select colname1,\'some Stringliteral\' as colname2,colname3,colname4 from table2',  0
202101xxxx,       10.2.x.x,     u1,   172.5.x.x,    DB1,       DisConnect,                    ,   0

I have used the convertRecord and configured CSVRecordReader and CSVSetRecordWriter to read/interpret the data and write the data to flow file respectively. The schema for CSV Read is as below.

{
     "type": "record",
     "namespace": "nifi",
     "name": "db",
     "fields": [
       { "name": "Event_Timestamp", "type": "string" },
       { "name": "SourceIP", "type": "string" },
       { "name": "UserName", "type": "string" },
       { "name": "TargetIP", "type": "string" },
       { "name": "DatabaseName", "type": "string" },
       { "name": "Activity", "type": "string" },
       { "name": "QueryText", "type": "string" },
       { "name": "Flag", "type": "int" },      
     ]
} 

The nifi flow is failing in reading the data because the delimiter configured while setting up CSVRecordReader is ","(comma) and the QueryText also contains comma within the text. If I only put the first 3 lines in the new log file and run the flow, then the job is successful since the Querytext doesn't contain any commas within.

I do not know if the CSVReader is the right processor to use here. Can anyone help me how to read a log file that is comma-separated but, one or a few fields has comma's within the text. Any help is highly appreciated.


Solution

  • QueryText field values are in single quote (') so configure your CSVRecordReader property Quote Character = ' default is "

    Also set property CSV Parser = Jackson CSV