
Logstash uses a line other than the first as the CSV header


I have the following sample data:

employee_name,user_id,O,C,E,A,N
Yvette Vivien Donovan,YVD0093,38,19,29,15,36
Troy Alvin Craig,TAC0118,34,40,24,15,34
Eden Jocelyn Mcclain,EJM0952,20,37,48,35,34
Alexa Emma Wood,AEW0655,25,20,18,40,38
Celeste Maris Griffith,CMG0936,36,13,18,50,29
Tanek Orson Griffin,TOG0025,40,36,24,19,26
Colton James Lowery,CJL0436,39,41,27,25,28
Baxter Flynn Mcknight,BFM0761,42,32,28,17,22
Olivia Calista Hodges,OCH0195,37,36,39,38,32
Price Zachery Maldonado,PZM0602,24,46,30,18,29
Daryl Delilah Atkinson,DDA0185,17,43,33,18,25

And this Logstash config file:

input {
  file {
    path => "/path/psychometric_data.csv"
    start_position => "beginning"
  }
}
filter {
  csv {
    separator => ","
    autodetect_column_names => true
    autogenerate_column_names => true
  }
}
output {
  amazon_es {
    hosts => [ "https://xxx-xxx-es-xxx.xx-xx-1.es.amazonaws.com:443" ]
    ssl => true
    region => "ap-south-1"
    index => "psychometric_data"
  }
}

I expect the first row (i.e. employee_name,user_id,O,C,E,A,N) to be used as the Elasticsearch field names (the header), but instead the third row (i.e. Troy Alvin Craig,TAC0118,34,40,24,15,34) is being used as the header, as follows:

 {
        "_index": "psychometric_data",
        "_type": "_doc",
        "_id": "md4hm3YB8",
        "_score": 1,
        "_source": {
          "15": "21",
          "24": "17",
          "34": "39",
          "40": "37",
          "@version": "1",
          "@timestamp": "2020-12-25T18:20:00.759Z",
          "message": "Ishmael Mannix Velazquez,IMV0086,22,37,17,21,39\r",
          "path": "/path/psychometric_data.csv",
          "Troy Alvin Craig": "Ishmael Mannix Velazquez",
          "host": "xx-ThinkPad-xx",
          "TAC0118": "IMV0086"
        }
 }

What might be the reason for it?


Solution

  • If you set autodetect_column_names to true, the csv filter treats the first line that it sees as the column names. If pipeline.workers is set to more than one, the worker threads race to set the column names first. Since different workers are processing different lines, the line that wins may not be the first line of the file. You must set pipeline.workers to 1.

    In addition, the Java execution engine (enabled by default) does not always preserve the order of events. The pipeline.ordered setting in logstash.yml controls this. In 7.9, it preserves event order if and only if pipeline.workers is set to 1.

    You do not say which version you are running. For any version from 7.0 (when the Java execution engine became the default) to 7.6, the fix is to disable the Java engine, using either pipeline.java_execution: false in logstash.yml or --java_execution false on the command line. For any 7.x release from 7.7 onwards, make sure pipeline.ordered is set to auto or true (auto is the default in 7.x). In future releases (8.x perhaps), pipeline.ordered will default to false.
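
As a minimal sketch, the settings described above might look like this in logstash.yml on a 7.7 or later 7.x release. The values are the ones named in the answer; the commented-out line applies only to 7.0 through 7.6, and the location of logstash.yml depends on how Logstash was installed.

```yaml
# logstash.yml (sketch for 7.7+; adjust to the version you are running)

# Use a single worker so only one thread can set the CSV column names.
pipeline.workers: 1

# "auto" (the 7.x default) preserves event order when pipeline.workers is 1.
pipeline.ordered: auto

# For 7.0 through 7.6 only: disable the Java execution engine instead.
# pipeline.java_execution: false
```

The worker count can also be set for a single run with -w 1 (short for --pipeline.workers 1) on the command line, which is convenient for a one-off CSV import.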