So currently I'm analysing data from my MySQL subtitles DB and putting it into Elasticsearch 5.2. Anyway, my Logstash config has the following filter:
filter {
  grok {
    match => ["subtitles", "%{TIME:[_subtitles][start]} --> %{TIME:[_subtitles][end]}%{GREEDYDATA:[_subtitles][sentence]}" ]
  }
}
which produces the following:
"_subtitles": {
"sentence": [
"im drinking latte",
"im drinking coffee",
"while eating a missisipi cake"
],
"start": [
"00:00:00.934",
"00:00:01.934",
"00:00:04.902"
],
"end": [
"00:00:02.902",
"00:00:03.902",
"00:00:05.839"
]
}
but what I want is this:
"_subtitles": [
{
"sentence": "im drinking latte",
"start": "00:00:00.934",
"end": "00:00:02.902"
},
{... same structure as above},
{... same structure as above},
]
Bear in mind that _subtitles will be mapped as a nested type by a predefined mapping. The original data looks like this:
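For reference, such a predefined nested mapping might look roughly like this (the index name subtitles and type name doc below are just placeholders, not from the original setup):

```
PUT /subtitles
{
  "mappings": {
    "doc": {
      "properties": {
        "_subtitles": {
          "type": "nested",
          "properties": {
            "start":    { "type": "keyword" },
            "end":      { "type": "keyword" },
            "sentence": { "type": "text" }
          }
        }
      }
    }
  }
}
```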
00:00:00.934 --> 00:00:02.902
im drinking latte
00:00:01.934 --> 00:00:03.902
im drinking coffee
00:00:04.902 --> 00:00:05.839
while eating a missisipi cake
How can I achieve this using Grok's match pattern and placeholders?
I found that the best way to do it is one of the following:

- Leave Logstash aside and write my own script to migrate from MySQL to Elasticsearch, but then I'd have to do all the pattern recognition and replacement myself, which can get somewhat complicated.
- Post-process the fields with a Ruby script/filter.
The solution was as follows:
ruby {
  code => "
    # grok wrote the three parallel arrays under [_subtitles]
    starts    = event.get('[_subtitles][start]')
    ends      = event.get('[_subtitles][end]')
    sentences = event.get('[_subtitles][sentence]')

    # zip them into one array of hashes, one hash per subtitle
    subtitles = starts.each_with_index.map do |start, i|
      {
        'index'    => i,
        'start'    => start,
        'end'      => ends[i],
        'sentence' => sentences[i]
      }
    end

    event.set('subtitles', subtitles)
  "
}
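The zipping logic can be checked outside Logstash with plain Ruby; the sample values below are taken from the data in the question:

```ruby
# Parallel arrays, as grok produces them
starts    = ['00:00:00.934', '00:00:01.934']
ends      = ['00:00:02.902', '00:00:03.902']
sentences = ['im drinking latte', 'im drinking coffee']

# Zip them into one array of hashes, one hash per subtitle
subtitles = starts.each_with_index.map do |start, i|
  { 'index' => i, 'start' => start, 'end' => ends[i], 'sentence' => sentences[i] }
end

puts subtitles.first['sentence']  # => im drinking latte
```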
Hope that helps.
But now I'm trying to improve this, because my Elasticsearch container fails with something like "cannot handle requests" / goes down for a while, just because of indexing (currently around 20k rows from MySQL) into Elasticsearch, with around 40 nested objects per document.
Is there anything I can do to make it faster?
Maybe a way to flag documents so I don't reprocess ones that were already indexed the previous day, or something like that?
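One option for the "don't reprocess" idea, assuming the rows come in through Logstash's jdbc input plugin: let Logstash track the last value it pulled, so only new or changed rows are fetched on each run. The connection string, table name, and updated_at column below are assumptions about the MySQL schema, not from the original setup:

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://localhost:3306/subtitles_db"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_library => "/path/to/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    schedule => "*/5 * * * *"
    # only fetch rows newer than the last recorded value
    statement => "SELECT * FROM subtitles WHERE updated_at > :sql_last_value"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}
```

This way each scheduled run picks up only the delta instead of re-indexing all 20k rows, which should take most of the load off the Elasticsearch container.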
Thanks, Regards.