regex, csv, logstash, logstash-grok

Logstash Messy CSV File


I am trying to use Logstash and grok to parse a messy CSV file.

I was using the CSV filter originally, but it meant I had to strip out a bunch of header data in a pre-processing step first.

Ideally I'd like to use the CSV filter again because of its simplicity. I have no control over how the CSV files arrive, so ideally Logstash would handle everything without any pre-processing.

Below is an example of my CSV file:

1,2,3,4,5,6,7
"text"
"text"

"01-Jan-2012"
"0123456789"

0,0,0,0,0,0,0,0,0,0

"col1Header",[...],col17Header"
"col1UoM",[...],col17UoM"

01-Jan-2012 11:00:01,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
01-Jan-2012 11:00:02,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
01-Jan-2012 11:00:03,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
01-Jan-2012 11:00:04,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

This is my Logstash configuration; it produces the error shown in the comments:

input{
    file{
        path => ["/opt/docs/*"]
        type => "log"
        # read files from the beginning and don't persist read positions
        start_position => "beginning"
        sincedb_path => "/dev/null"
        ignore_older => 0
    }
}
filter{
    grok{
        # error being returned here
        # error is: "Expected one of #, {, } at line 27, column 110 (byte 906) after filter{\n\t\n\n\t
        # the following regex is meant to match all of the header data that I don't want
        match => {"header_data" => "(?<header_data>[0-9].*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*"\n)"}
    } # my plan was to then drop the header_data field (not implemented) and pass the remaining data to the csv filter
    csv{
        columns => ["col17Header",[...],"col17Header]
    }
    mutate{
        convert => {"col2" => "float",[...] => "float","col17" => "float"}
    }
    date{
        match => ["col1","dd-MMM-YYYY HH:mm:ss"]
    }
}


output{
    elasticsearch{
        action => "index"
        hosts => ["192.168.1.118:9200"]
        index => "foo-logs"
    }
}

For clarity, here is the full error being produced:

"Expected one of #, {, } at line 27, column 110 (byte 906) after filter{\n\t\n\n\t # the regex following is to match all the header data that I don't want. match => {"header_data" => "(?[0-9].\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.*"\n)"}

I'd like to remove all the data above the four dated lines at the bottom. I wrote (what I assume are inefficient) regex patterns to match the header data and the CSV data.

All I need from the CSV file are those last four lines in my example; everything above them can be discarded.

My feeling is that I'm not going about this the right way, so I'm open to any and all suggestions.


Solution

  • From your example, the lines you want have a unique pattern:

    ^%{MONTHDAY}-%{MONTH}-%{YEAR}
    

    Grok for that pattern. The lines that don't match will be tagged _grokparsefailure, and you can then use the drop{} filter to discard them; a sketch of that configuration follows.
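
    Here is a minimal sketch of that approach. It assumes the file input delivers each raw line in the default message field (true for the file input shown above) and uses placeholder column names, since the real headers are elided in the question:

        filter {
            # keep only lines that start with a date like "01-Jan-2012";
            # header lines, units lines, and blanks all fail this match
            grok {
                match => { "message" => "^%{MONTHDAY}-%{MONTH}-%{YEAR}" }
            }

            # failed matches are tagged _grokparsefailure; drop those events
            if "_grokparsefailure" in [tags] {
                drop { }
            }

            # the surviving lines are plain CSV rows
            csv {
                columns => ["timestamp", "col2", "col3"]  # placeholder names
            }

            mutate {
                convert => { "col2" => "float" }  # repeat per numeric column
            }

            date {
                match => ["timestamp", "dd-MMM-yyyy HH:mm:ss"]
            }
        }

    Incidentally, the "Expected one of #, {, }" error from the original config is most likely caused by the unescaped double quote near the end of the grok pattern (the .*"\n part): inside a double-quoted Logstash string, a literal " has to be escaped as \", or the pattern can be wrapped in single quotes instead. With the approach above, the header-matching regex isn't needed at all.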