I am trying to use Logstash and grok to parse a messy CSV file.
I was using the CSV filter originally, but it meant I had to remove a bunch of header data in a pre-processing step first.
I have no control over how the CSV files arrive, so ideally I'd like Logstash to handle everything without any pre-processing; I'd also like to go back to the CSV filter for its simplicity.
Below is an example of my CSV file:
1,2,3,4,5,6,7
"text"
"text"
"01-Jan-2012"
"0123456789"
0,0,0,0,0,0,0,0,0,0
"col1Header",[...],col17Header"
"col1UoM",[...],col17UoM"
01-Jan-2012 11:00:01,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
01-Jan-2012 11:00:02,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
01-Jan-2012 11:00:03,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
01-Jan-2012 11:00:04,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
This is my Logstash configuration; it produces the error shown in the comments:
input{
    file{
        path => ["/opt/docs/*"]
        type => "log"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        ignore_older => 0
    }
}
filter{
    grok{
        # error being returned here
        # error is: "Expected one of #, {, } at line 27, column 110 (byte 906) after filter{\n\t\n\n\t
        # the regex that follows is meant to match all the header data that I don't want
        match => {"header_data" => "(?<header_data>[0-9].*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*"\n)"}
    } # my plan was to then drop the header_data field (not implemented) and pass the remaining data on to the csv filter
    csv{
        columns => ["col1Header",[...],"col17Header"]
    }
    mutate{
        convert => {"col2" => "float",[...],"col17" => "float"}
    }
    date{
        match => ["col1","dd-MMM-YYYY HH:mm:ss"]
    }
}
output{
    elasticsearch{
        action => "index"
        hosts => ["192.168.1.118:9200"]
        index => "foo-logs"
    }
}
For clarity, here is the full error being produced:
"Expected one of #, {, } at line 27, column 110 (byte 906) after filter{\n\t\n\n\t # the regex following is to match all the header data that I don't want. match => {"header_data" => "(?[0-9].\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.*"\n)"}
I'd like to discard everything above the four dated lines at the bottom; those last four lines are all the data I need from the file.
I made (what I assume are inefficient) regex patterns to find the header and CSV data, but my feeling is that I'm not going about this the right way, so I'm open to any and all suggestions.
From your example, the lines you want have a unique pattern:
^%{MONTHDAY}-%{MONTH}-%{YEAR}
Grok for that pattern. Lines that don't match will be tagged with _grokparsefailure, and you can then use the drop{} filter to ignore them.
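A minimal sketch of that approach (untested; it assumes the raw line arrives in the default message field and that grok's failure tag is left at its default, _grokparsefailure):

filter{
    # keep only lines that start with a timestamp, e.g. 01-Jan-2012 11:00:01
    grok{
        match => { "message" => "^%{MONTHDAY}-%{MONTH}-%{YEAR}" }
    }
    # every header line fails that match and gets tagged; drop those events
    if "_grokparsefailure" in [tags] {
        drop{}
    }
    # the surviving data lines can then flow into your existing csv/mutate/date filters
}

As a bonus, this sidesteps the multiline problem entirely: a plain file input emits one event per line, so a regex spanning the whole header block would never have seen all the header lines at once anyway, while a per-line date check works naturally.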