presto trino

Presto dashboard shows a fluctuating number of active workers


We noticed the following strange behavior in our Presto cluster (Presto is installed on Linux machines).

We have 9 Presto worker machines.

On the Presto dashboard we sometimes see 7-8 active workers and sometimes all 9.

Is this normal behavior?

I can't see anything unusual in the Presto worker logs.

I am not sure whether we should look for a network problem or some other issue.


Note - when I restart all Presto workers, they are stable on the dashboard after the restart, but after 5-10 hours the strange behavior returns. We are helpless in this situation.

Note 1 - we checked whether the Presto worker processes were restarting accidentally, but this isn't the case; all of them are stable:

./launcher status
Running as 22815

I should add that the Presto dashboard does not show which workers were down, so it is very difficult to identify the "bad" workers.
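One way to find out which workers drop out is to compare the expected worker list against what the coordinator reports. The following is only a minimal sketch: it assumes the coordinator serves a JSON node list at `/v1/node` with a `uri` field per node (verify this on your Presto version), and the function names and host names are hypothetical.

```python
import json
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_node_uris(coordinator_url, timeout=10):
    """Return the node URIs reported by the coordinator.

    Assumption: GET <coordinator>/v1/node returns a JSON list of objects
    that each carry a "uri" field. Check this against your Presto version.
    """
    with urlopen(f"{coordinator_url}/v1/node", timeout=timeout) as resp:
        nodes = json.load(resp)
    return [node["uri"] for node in nodes]


def missing_workers(expected_hosts, node_uris):
    """Return the expected hosts that do not appear in the reported URIs."""
    seen = {urlparse(uri).hostname for uri in node_uris}
    return sorted(set(expected_hosts) - seen)
```

Running `missing_workers(expected, fetch_node_uris(coordinator))` from a cron job every minute and logging the result would show which hosts disappear and when.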

In the Presto coordinator log we can see messages like the following, but we are not sure whether they are related to our issue:

WARN    http-client-memoryManager-scheduler     com.facebook.presto.memory.RemoteNodeMemory     Error fetching memory info from http://105.14.25.4:1010/v1/memory: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
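To see whether these warnings point at specific workers, the coordinator log can be grouped by worker address. This is a rough sketch that relies only on the message format shown above; the function name is hypothetical, and the log file location depends on your installation.

```python
import re
from collections import Counter

# Matches the worker base URL in lines such as:
# "... Error fetching memory info from http://host:port/v1/memory: ..."
PATTERN = re.compile(r"Error fetching memory info from (https?://[^/\s]+)")


def timeout_counts(log_lines):
    """Count 'Error fetching memory info' warnings per worker base URL."""
    counts = Counter()
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to `timeout_counts` and printing `most_common()` lists the addresses that time out most often, which may be the workers the coordinator keeps losing.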

Solution

  • I apologize for the inconvenience caused by my question.

    This was actually my mistake, and I will explain.

    This Presto cluster has 9 Presto workers,

    but I forgot to remove workers with the same host names from another cluster.

    So the behavior was caused by 3 duplicate host names (Presto workers).

    After removing the duplicate Presto workers, Presto is now very stable.
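A check like the one below could catch this kind of overlap before it causes trouble. It is a minimal sketch that assumes you keep a list of worker host names per cluster; the function name, cluster names, and host names are hypothetical.

```python
from collections import defaultdict


def shared_hosts(clusters):
    """Return host names that are listed in more than one cluster.

    `clusters` maps a cluster name to its list of worker host names.
    Host names are compared case-insensitively.
    """
    owners = defaultdict(set)
    for cluster, hosts in clusters.items():
        for host in hosts:
            owners[host.strip().lower()].add(cluster)
    return {host: sorted(names) for host, names in owners.items() if len(names) > 1}
```

For example, `shared_hosts({"prod": ["w1", "w2"], "test": ["w2", "w3"]})` reports `w2` as belonging to both clusters.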