We noticed the following strange behavior in our Presto cluster (Presto installed on Linux machines).
We have 9 Presto worker machines, and on the Presto dashboard we sometimes see only 7-8 active workers and sometimes all 9.
Is this normal behavior?
In the Presto worker logs I can't see anything unusual, and I'm not sure whether we should be looking for a network problem or some other issue.
Note - when I restart all Presto workers, they appear stable on the dashboard after the restart, but after 5-10 hours the strange behavior returns. We are stuck with this situation.
Note 1 - we checked whether the Presto binaries were restarting accidentally, but that isn't the case; on all workers the Presto process stays up:
./launcher status
Running as 22815
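(For reference, a minimal sketch of how this can be verified on every worker, assuming SSH access and a standard install path; the host names and path are placeholders, not our real ones:)

for h in worker1 worker2 worker3; do
    echo "== $h =="
    ssh "$h" 'cd /opt/presto && ./launcher status'
done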
I should also mention that the Presto dashboard does not show which of the Presto workers went down, so it is very difficult to tell which workers are the "bad" ones.
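For anyone hitting the same problem: one way to see exactly which workers the coordinator currently considers active is the system.runtime.nodes table, queried through the Presto CLI (a sketch; the coordinator address is a placeholder for our real one):

presto --server http://<coordinator-host>:<port> --execute "SELECT node_id, http_uri, state FROM system.runtime.nodes"

Comparing that list against the 9 expected workers whenever the dashboard drops to 7-8 should reveal which hosts are missing.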
*** In the Presto coordinator log we can see messages like the following, but I'm not sure whether they are related to our issue:
WARN http-client-memoryManager-scheduler com.facebook.presto.memory.RemoteNodeMemory Error fetching memory info from http://105.14.25.4:1010/v1/memory: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
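Since the warning names a specific worker endpoint, a quick reachability check from the coordinator host can at least separate a network problem from a stuck worker (a sketch; the 10-second limit mirrors the timeout in the warning):

curl --max-time 10 http://105.14.25.4:1010/v1/memory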
I apologize for the inconvenience regarding my question; this was actually my mistake, and I will explain.
In this Presto cluster we have 9 Presto workers, but I forgot to remove workers with the same host names that were left over from another cluster, so this behavior was caused by 3 duplicate host names (Presto workers).
After removing the duplicate Presto workers, Presto is now very stable.
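In case it helps someone else: the cleanup essentially came down to making sure each worker registers with only one coordinator and has a unique identity. A sketch of the relevant worker settings, assuming a standard Presto layout (all values are placeholders, not our actual configuration):

# etc/config.properties - discovery.uri must point only at this cluster's coordinator
coordinator=false
http-server.http.port=1010
discovery.uri=http://<this-cluster-coordinator>:<port>

# etc/node.properties - node.id must be unique per worker
node.environment=production
node.id=<unique-per-worker-id>
node.data-dir=/var/presto/data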