
InfluxDB 1.8 schema design for industrial application?


I have a Node-RED S7 PLC link pushing the following data to InfluxDB on a 1.5-second cycle.

msg.payload = {
    name: 'PLCTEST',
    level1_m: msg.payload.a90,      // value payload from PLC passed to InfluxDB
    power1: msg.payload.a93,
    "valvepos_%": msg.payload.a107,
    temp1: msg.payload.a111,
    washer_acidity: msg.payload.a113
    // ...etc.
};
return msg;

In total there are 130 individual data points, consisting of binary states (alarms, button presses) and measurements (temperature, pressure, flow, ...).

This has been running for a week now as a stress test for DB writes. Writing seems to be fine, but I have noticed that if I switch 10 temperature measurements from a 30-minute query window to a 3-hour window in the Grafana dashboard, the load times start to get annoyingly long. A 12-hour window is a no-go. I assume this is because all my values are pushed as field keys and field values. Without indexes this is straining the database.

The Grafana query inspector gives me 1081 rows per measurement query, so ×10 = 10,810 rows per dashboard query. But the whole pool InfluxDB has to go through is 130 measurements × 1081 = 140,530 rows in the 3-hour window.
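The row arithmetic above is easy to sanity-check (the counts are taken from the query inspector, not measured here):

```python
rows_per_measurement = 1081                 # rows per measurement query, 3 h window
dashboard_rows = 10 * rows_per_measurement  # 10 temperature panels on the dashboard
total_pool = 130 * rows_per_measurement     # all 130 values written each cycle

print(dashboard_rows)  # 10810
print(total_pool)      # 140530
```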

I would like to get a few pointers on how to optimize the schema. I have the following in mind.

DB: Application_nameX

Measurement: Process_metrics
Tags: Temp, press, flow, %, Level, acidity, Power
Tag values: CT-xx1...CT-xxn, CP-xx1...CP-xxn, CF-xx1...CF-xxn, ...
Field key: Value, field value: value

Measurement: Alarms_On
Field key: State, field value: "true", "false"

Measurement: Binary_ON
Field key: State, field value: "true", "false"

In Node-RED this would then look like the following for a few temperatures (I think):

     // [fields, tags] pair for the InfluxDB out node;
     // note: one point can only carry each field key or tag key once
     msg.payload = [{
         // field object: value payload from PLC passed to InfluxDB
         Value: msg.payload.xxx
     },
     {
         // tag object
         Temp: "CT_xx1"
     }];
     return msg;

EDIT: Following Robert's comments.

I read the InfluxDB manuals for a week, along with other samples online, before writing here. Somehow InfluxDB is just different and unique enough from the normal SQL mindset that I find this unusually difficult. But I did have a few moments of clarity over the weekend.

I think the following would be more appropriate.

DB: Station_name
Measurements: Process_metrics, Alarms, Binary
Tag: "SI_metric"
Tag values: "Temperature", "Pressure", etc.
Field key: "process_position" = CT/P/F_xxx
Field values: process_values

This should keep the cardinality from going bonkers compared with my original idea.

I think alarms and binaries can be left as field key/field value only, and separating them into their own measurements should give enough filtering. These are also logged only on state change, so they produce far less input to the database than the analogs at a 1 s cycle.
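The on-state-change logging mentioned above (report by exception) can be sketched as a simple filter. This is a generic Python illustration of the idea, not the actual Node-RED rbe node:

```python
# Report-by-exception: forward a binary/alarm value only when it changes.
last = {}

def on_change(key, value):
    if last.get(key) != value:
        last[key] = value
        return True   # forward this point to InfluxDB
    return False      # drop: unchanged since last write

# Five samples of one alarm; only the three transitions get written.
writes = [on_change('Alarm1', v) for v in [False, False, True, True, False]]
print(writes)  # [True, False, True, False, True]
```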

Following my original Node-RED flow code, this would translate to a batch output function:

     msg.payload = [
     {
        measurement: "Process_metrics",
        fields: {
            CT_xx1: msg.payload.xxx,
            CT_xx2: msg.payload.xxx,
            CT_xx3: msg.payload.xxx
        },
        tags: {
            metric: "temperature"
        }
     },
     {
        measurement: "Process_metrics",
        fields: {
            CP_xx1: msg.payload.xxx,
            CP_xx2: msg.payload.xxx,
            CP_xx3: msg.payload.xxx
        },
        tags: {
            metric: "pressure"
        }
     },
     {
        measurement: "Process_metrics",
        fields: {
            CF_xx1: msg.payload.xxx,
            CF_xx2: msg.payload.xxx,
            CF_xx3: msg.payload.xxx
        },
        tags: {
            metric: "flow"
        }
     },
     {
        measurement: "Process_metrics",
        fields: {
            AP_xx1: msg.payload.xxx,
            AP_xx2: msg.payload.xxx,
            AP_xx3: msg.payload.xxx
        },
        tags: {
            metric: "Pumps"
        }
     },
     {
        measurement: "Binary_states",
        fields: {
            Binary1: msg.payload.xxx,
            Binary2: msg.payload.xxx,
            Binary3: msg.payload.xxx
        }
     },
     {
        measurement: "Alarms",
        fields: {
            Alarm1: msg.payload.xxx,
            Alarm2: msg.payload.xxx,
            Alarm3: msg.payload.xxx
        }
     }
     ];
     return msg;

EDIT 2:

Final thoughts after testing my above idea and refining it further.

My second idea did not work as intended. The final step with Grafana variables failed because the process data had the needed info in fields rather than tags. This made the Grafana side annoying, requiring regex queries to pull the PLC tag names out of fields and into Grafana variable drop-down lists, which again meant running resource-intensive field queries.

I stumbled on a blog post about how to get your mind straight with a TSDB, and the idea above is still too SQL-like an approach to the data. I refined the DB structure some more and seem to have found a compromise between coding time in the different steps (PLC -> Node-RED -> InfluxDB -> Grafana) and query load on the database: from 1 GB of RAM usage when stressing with writes and queries down to 100-300 MB in a normal usage test.

Currently in testing:

A Python script to crunch the PLC-side tags and descriptions from CSV into a copy-pasteable format for Node-RED. Example for extracting the temperature measurements from the CSV and formatting them for Node-RED:

import pandas as pd

# Path to the PLC tag export (truncated here)
file1 = r'C:\\Users\\....pandastestcsv.csv'
df1 = pd.read_csv(file1, sep=';')

# Keep only rows whose process position contains 'CT' (temperature measurements)
dfCT = df1[df1['POS'].str.contains('CT', regex=False, na=False)]

def my_functionCT(x, y):
    # Print one Node-RED point: field value from PLC address x, tagged with process position y
    print('{measurement:"temperature",fields:{value:msg.payload.' + x + ',},tags:{CT:"' + y + '",},},')

result = [my_functionCT(x, y) for x, y in zip(dfCT['ID'], dfCT['POS'])]

The output of this is all the CT temperature measurements from the CSV, e.g. {measurement:"temperature",fields:{value:msg.payload.a1,},tags:{CT:"tag description with process position CT_310",},},

This list can be copy-pasted into the Node-RED payload function feeding InfluxDB.
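A variant of the script above can build the point strings for several measurement prefixes in one pass. The column names ID/POS and the prefix-to-measurement mapping are assumptions based on the snippet above, and the inline DataFrame stands in for the real CSV:

```python
import pandas as pd

# Hypothetical sample of the PLC tag export; real data comes from the CSV.
df = pd.DataFrame({
    'ID':  ['a1', 'a2', 'a7', 'a9'],
    'POS': ['CT_310', 'CT_311', 'CP_120', 'CF_050'],
})

# Prefix of the process position -> InfluxDB measurement name (assumed mapping).
measurements = {'CT': 'temperature', 'CP': 'pressure', 'CF': 'flow'}

def to_point(plc_id, pos):
    # Build one Node-RED point string in the same format as the CT-only script.
    prefix = pos.split('_')[0]
    return ('{measurement:"' + measurements[prefix] + '",fields:{value:msg.payload.'
            + plc_id + ',},tags:{' + prefix + ':"' + pos + '",},},')

lines = [to_point(i, p) for i, p in zip(df['ID'], df['POS'])]
print('\n'.join(lines))
```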

InfluxDB:
database: PLCTEST
Measurements: temperature, pressure, flow, pumps, valves, Alarms, on_off, ...
Tag keys: CT, CP, CF, misc_mes, ...
Tag values: "PLC description of the tag"
Field key: value
Field values: "process measurement value from PLC payload"

This keeps the cardinality per measurement in check within reason, and queries can be targeted at the relevant data without running through the whole DB. RAM and CPU loads are now minor, and jumping from a 1 h to a 12 h query window in Grafana loads in seconds without lock-ups.
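Under this final schema, a single temperature reading serializes to one line-protocol point. Below is a hand-rolled Python sketch of that serialization; real tag values containing spaces or commas (like the PLC descriptions) must be escaped, which this toy helper skips:

```python
def to_line_protocol(measurement, tags, fields, ts=None):
    # InfluxDB 1.x line protocol: measurement,tag=val field=val [timestamp]
    # NOTE: no escaping of spaces/commas in tag values -- illustration only.
    tag_str = ','.join(f'{k}={v}' for k, v in tags.items())
    field_str = ','.join(f'{k}={v}' for k, v in fields.items())
    line = f'{measurement},{tag_str} {field_str}'
    return line if ts is None else f'{line} {ts}'

line = to_line_protocol('temperature', {'CT': 'CT_310'}, {'value': 23.5})
print(line)  # temperature,CT=CT_310 value=23.5
```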


Solution

  • When designing an InfluxDB measurement schema we need to be very careful in selecting the tags and fields.

    Each tag value creates a separate series, and as the number of tag values grows, the memory requirement of the InfluxDB server grows rapidly with it.

    From the description of the measurement given in the question, I can see that you are keeping high-cardinality values like temperature, pressure, etc. as tag values. These should be kept as fields instead.

    By keeping values as tags, InfluxDB indexes them for faster search. A separate series is created for each tag value, so as the number of tag values increases, the number of series increases too, eventually leading to out-of-memory errors.
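The series math behind this is easy to illustrate: per measurement, the series count is roughly the product of the distinct values of each tag key, so a single high-cardinality tag multiplies everything (the numbers below are made up):

```python
from math import prod

# Series cardinality per measurement is roughly the product of the number of
# distinct values per tag key (ignoring tag combinations that never occur).
def estimated_series(distinct_values_per_tag):
    return prod(distinct_values_per_tag)

# A few low-cardinality tags stay manageable...
print(estimated_series([5, 10]))      # 50 series
# ...but tagging a continuous value like a raw measurement explodes:
print(estimated_series([5, 100000]))  # 500000 series
```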

    Quoting from InfluxDB documentation.

    Tags containing highly variable information like UUIDs, hashes, and random strings lead to a large number of series in the database, also known as high series cardinality. High series cardinality is a primary driver of high memory usage for many database workloads.

    Please refer to the InfluxDB documentation on schema design for more details.

    https://docs.influxdata.com/influxdb/v1.8/concepts/schema_and_data_layout/