Tags: elasticsearch, logstash, logstash-grok

Logstash - Separate results into different objects using Grok match pattern


I'm currently analysing data from my MySQL subtitles database and putting it into Elasticsearch 5.2. My Logstash pipeline has the following filter:

filter {
    grok {
        match => ["subtitles", "%{TIME:[_subtitles][start]} --> %{TIME:[_subtitles][end]}%{GREEDYDATA:[_subtitles][sentence]}"]
    }
}

which produces the following:

"_subtitles": {
                  "sentence": [
                     "im drinking latte",
                     "im drinking coffee",
                     "while eating a missisipi cake"
                  ],
                  "start": [
                     "00:00:00.934",
                     "00:00:01.934",
                     "00:00:04.902"
                  ],
                  "end": [
                     "00:00:02.902",
                     "00:00:03.902",
                     "00:00:05.839"
                  ]
               }

but what I want is this:

 "_subtitles": [
                     {
                          "sentence": "im drinking latte",
                          "start": "00:00:00.934",
                          "end": "00:00:02.902"
                       },
                     {... same structure as above},
                     {... same structure as above},
]

Bear in mind that _subtitles will be a nested field, defined in advance in the index mapping.
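
For reference, the kind of mapping I mean looks roughly like this (just a sketch: the type name and the keyword/text choices are placeholders I'm assuming; the important part is "type": "nested" on _subtitles):

{
  "mappings": {
    "subtitle": {
      "properties": {
        "_subtitles": {
          "type": "nested",
          "properties": {
            "start":    { "type": "keyword" },
            "end":      { "type": "keyword" },
            "sentence": { "type": "text" }
          }
        }
      }
    }
  }
}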

And the original data is as follows:

00:00:00.934 --> 00:00:02.902
im drinking latte

00:00:01.934 --> 00:00:03.902
im drinking coffee

00:00:04.902 --> 00:00:05.839
while eating a missisipi cake

How can I achieve this using Grok's match pattern and placeholders?


Solution

  • So after a lot of research and reading, I found the answer.

    I found that the best way to do it is one of the following:

    - leave Logstash and write my own script to migrate from MySQL to Elasticsearch, but then I'd have to do all the pattern recognition and replacement myself, which can get somewhat complicated; or
    - post-process the fields with a Ruby script/filter.

    I went with the Ruby filter; the solution was as follows:

    ruby {
      code => "
        # grok leaves three parallel arrays under [_subtitles];
        # zip them back together into one hash per subtitle line
        subtitles = []
        starts    = event.get('[_subtitles][start]')
        ends      = event.get('[_subtitles][end]')
        sentences = event.get('[_subtitles][sentence]')
        starts.each_with_index do |start, i|
          subtitles << {
            'index'    => i,
            'start'    => start,
            'end'      => ends[i],
            'sentence' => sentences[i]
          }
        end
        # replace the hash of arrays with the array of subtitle objects
        event.set('[_subtitles]', subtitles)
      "
    }
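
    For context, the ruby block just sits after the grok block inside the same filter section, roughly like this (a sketch reusing the pattern from the question):

    filter {
      grok {
        match => ["subtitles", "%{TIME:[_subtitles][start]} --> %{TIME:[_subtitles][end]}%{GREEDYDATA:[_subtitles][sentence]}"]
      }
      ruby {
        code => "... the script above ..."
      }
    }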
    

    Hope that helps.

    But now I'm trying to improve this, because my Elasticsearch container fails with something like "cannot handle requests" and goes offline for a while, just because of the indexing load (currently around 20k rows from MySQL, with around 40 nested objects each).

    Is there anything I can do to make this faster?

    Maybe there's a way to flag documents that were already processed the previous day, so I don't index them again?
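
    One idea I might try (just a sketch, untested; the connection details, table/column names, and index name below are placeholders I'm assuming): let the jdbc input track the last primary key it imported so re-runs only pull new rows, and derive the Elasticsearch document id from the MySQL id so re-indexing a row updates it instead of duplicating it.

    input {
      jdbc {
        jdbc_connection_string => "jdbc:mysql://localhost:3306/subtitles_db"
        jdbc_user => "user"
        jdbc_password => "password"
        jdbc_driver_library => "/path/to/mysql-connector-java.jar"
        jdbc_driver_class => "com.mysql.jdbc.Driver"
        # only fetch rows newer than the last one Logstash has seen
        statement => "SELECT * FROM subtitles WHERE id > :sql_last_value ORDER BY id"
        use_column_value => true
        tracking_column => "id"
        schedule => "*/5 * * * *"
      }
    }
    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "subtitles"
        # reuse the MySQL primary key so re-runs overwrite instead of duplicating
        document_id => "%{id}"
      }
    }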

    Thanks, Regards.