
Stuck in "bosh deploy" step of bosh-lite - keytool maxes out at ~99% CPU time on bosh-lite


After following all the prescribed steps, I am stuck at bosh deploy on my bosh-lite Vagrant VM, using the VirtualBox provider:

  Started creating missing vms
  Started creating missing vms > consul_z1/0 (76a9bb67-d3b0-4e0d-be8b-118487e896d8)
  Started creating missing vms > ha_proxy_z1/0 (6c74591f-341f-4eb0-a2d9-709ca306fd75)
  Started creating missing vms > nats_z1/0 (81dd7301-6de8-495d-9c3d-969e853818f2)
  Started creating missing vms > etcd_z1/0 (f82c506d-0c2b-4861-9020-a009ac858a6e)
  Started creating missing vms > blobstore_z1/0 (167d3ef7-e218-4548-8101-e2b25b4aee76)
  Started creating missing vms > postgres_z1/0 (80c4a6b4-a358-46f1-b930-b20d9e078cd1)
  Started creating missing vms > uaa_z1/0 (204cbdf2-d75d-46c9-a028-eefccd1a37d8)
  Started creating missing vms > hm9000_z1/0 (861b7196-a404-4525-b229-fb1e22eead03)
  Started creating missing vms > api_z1/0 (4b92e6f3-063d-444a-b02e-43b39c52339f)
  Started creating missing vms > runner_z1/0 (c0388082-e9bc-46b0-839b-25b2b7abcb7f)
  Started creating missing vms > doppler_z1/0 (772f715e-5a47-4c3e-b3fd-b2c85464c4d4)
  Started creating missing vms > loggregator_trafficcontroller_z1/0 (0503827b-d731-4a43-893b-2a3a2364b3f2)
  Started creating missing vms > router_z1/0 (4e4100a3-aef2-4622-b776-2af616cda481)
     Done creating missing vms > ha_proxy_z1/0 (6c74591f-341f-4eb0-a2d9-709ca306fd75) (00:00:34)
     Done creating missing vms > postgres_z1/0 (80c4a6b4-a358-46f1-b930-b20d9e078cd1) (00:00:34)
     Done creating missing vms > router_z1/0 (4e4100a3-aef2-4622-b776-2af616cda481) (00:00:34)
     Done creating missing vms > loggregator_trafficcontroller_z1/0 (0503827b-d731-4a43-893b-2a3a2364b3f2) (00:00:34)
     Done creating missing vms > nats_z1/0 (81dd7301-6de8-495d-9c3d-969e853818f2) (00:00:35)
     Done creating missing vms > hm9000_z1/0 (861b7196-a404-4525-b229-fb1e22eead03) (00:00:36)
     Done creating missing vms > runner_z1/0 (c0388082-e9bc-46b0-839b-25b2b7abcb7f) (00:00:36)
     Done creating missing vms > consul_z1/0 (76a9bb67-d3b0-4e0d-be8b-118487e896d8) (00:00:37)
     Done creating missing vms > api_z1/0 (4b92e6f3-063d-444a-b02e-43b39c52339f) (00:00:37)
     Done creating missing vms > blobstore_z1/0 (167d3ef7-e218-4548-8101-e2b25b4aee76) (00:00:37)
     Done creating missing vms > etcd_z1/0 (f82c506d-0c2b-4861-9020-a009ac858a6e) (00:00:37)
     Done creating missing vms > doppler_z1/0 (772f715e-5a47-4c3e-b3fd-b2c85464c4d4) (00:00:36)
     Done creating missing vms > uaa_z1/0 (204cbdf2-d75d-46c9-a028-eefccd1a37d8) (00:00:37)
     Done creating missing vms (00:00:37)

  Started updating job consul_z1 > consul_z1/0 (76a9bb67-d3b0-4e0d-be8b-118487e896d8). Done (00:00:40)
  Started updating job ha_proxy_z1 > ha_proxy_z1/0 (6c74591f-341f-4eb0-a2d9-709ca306fd75). Done (00:00:40)
  Started updating job nats_z1 > nats_z1/0 (81dd7301-6de8-495d-9c3d-969e853818f2). Done (00:00:22)
  Started updating job etcd_z1 > etcd_z1/0 (f82c506d-0c2b-4861-9020-a009ac858a6e). Done (00:00:42)
  Started updating job blobstore_z1 > blobstore_z1/0 (167d3ef7-e218-4548-8101-e2b25b4aee76). Done (00:00:47)
  Started updating job postgres_z1 > postgres_z1/0 (80c4a6b4-a358-46f1-b930-b20d9e078cd1). Done (00:00:40)
  Started updating job uaa_z1 > uaa_z1/0 (204cbdf2-d75d-46c9-a028-eefccd1a37d8)

This indicates that the deployment is stuck on the uaa_z1 job. After running vagrant ssh from another window, I see the following:

vagrant@bosh-lite:~/cf-release$ bosh vms
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Acting as user 'admin' on 'Bosh Lite Director'
Deployment `cf-warden'

Director task 34
bosh logs
Task 34 done

+---------------------------------------------------------------------------+---------+-----+-----------+--------------+
| VM                                                                        | State   | AZ  | VM Type   | IPs          |
+---------------------------------------------------------------------------+---------+-----+-----------+--------------+
| api_z1/0 (4b92e6f3-063d-444a-b02e-43b39c52339f)                           | running | n/a | large_z1  | 10.244.0.138 |
| blobstore_z1/0 (167d3ef7-e218-4548-8101-e2b25b4aee76)                     | running | n/a | medium_z1 | 10.244.0.130 |
| consul_z1/0 (76a9bb67-d3b0-4e0d-be8b-118487e896d8)                        | running | n/a | small_z1  | 10.244.0.54  |
| doppler_z1/0 (772f715e-5a47-4c3e-b3fd-b2c85464c4d4)                       | running | n/a | medium_z1 | 10.244.0.146 |
| etcd_z1/0 (f82c506d-0c2b-4861-9020-a009ac858a6e)                          | running | n/a | medium_z1 | 10.244.0.42  |
| ha_proxy_z1/0 (6c74591f-341f-4eb0-a2d9-709ca306fd75)                      | running | n/a | router_z1 | 10.244.0.34  |
| hm9000_z1/0 (861b7196-a404-4525-b229-fb1e22eead03)                        | running | n/a | medium_z1 | 10.244.0.142 |
| loggregator_trafficcontroller_z1/0 (0503827b-d731-4a43-893b-2a3a2364b3f2) | running | n/a | small_z1  | 10.244.0.150 |
| nats_z1/0 (81dd7301-6de8-495d-9c3d-969e853818f2)                          | running | n/a | medium_z1 | 10.244.0.6   |
| postgres_z1/0 (80c4a6b4-a358-46f1-b930-b20d9e078cd1)                      | running | n/a | medium_z1 | 10.244.0.30  |
| router_z1/0 (4e4100a3-aef2-4622-b776-2af616cda481)                        | running | n/a | router_z1 | 10.244.0.22  |
| runner_z1/0 (c0388082-e9bc-46b0-839b-25b2b7abcb7f)                        | running | n/a | runner_z1 | 10.244.0.26  |
| uaa_z1/0 (204cbdf2-d75d-46c9-a028-eefccd1a37d8)                           | stopped | n/a | medium_z1 | 10.244.0.134 |
+---------------------------------------------------------------------------+---------+-----+-----------+--------------+

VMs total: 13

I then tried to look at the logs for the VM that was shown with a status of "stopped". Here is what I got:

vagrant@bosh-lite:~/cf-release$ bosh logs 204cbdf2-d75d-46c9-a028-eefccd1a37d8
wrong number of arguments (1 for 2).

Usage: logs <job> <index_or_id> [--agent] [--job] [--only filter1,filter2,...] [--dir destination_directory] [--all]
vagrant@bosh-lite:~/cf-release$ bosh logs uaa_z1 0
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Acting as user 'admin' on deployment 'cf-warden' on 'Bosh Lite Director'

Director task 35
Error 100: Redis lock lock:deployment:cf-warden is acquired by another thread

Task 35 error
Error retrieving logs
vagrant@bosh-lite:~/cf-release$

Partial top output from my Vagrant host reveals:

top - 15:42:18 up 11:31,  2 users,  load average: 1.46, 1.49, 1.51
Tasks: 421 total,   2 running, 418 sleeping,   0 stopped,   1 zombie
%Cpu(s):  0.5 us, 57.8 sy,  0.0 ni, 40.7 id,  0.3 wa,  0.0 hi,  0.7 si,  0.0 st
KiB Mem:   5079636 total,  4512380 used,   567256 free,   619912 buffers
KiB Swap:  1048572 total,    44192 used,  1004380 free.  2197008 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
25746 root       0 -20       0      0      0 Z 100.0  0.0 422:40.75 keytool
20883 vcap       0 -20 32.467g  21316  12684 S  11.9  0.4   4:10.47 consul
 1297 vcap      10 -10  159160  65176   6288 S   1.3  1.3   8:40.50 ruby
 1369 vcap      10 -10   32252   6668   2608 S   1.0  0.1   2:58.49 nginx
 8589 vagrant   20   0  463856  48804   4280 S   1.0  1.0   3:50.10 bosh
23406 vcap       0 -20  229772  17144  11684 S   1.0  0.3   4:09.67 consul
    7 root      20   0       0      0      0 S   0.7  0.0  10:28.52 rcu_sched
    3 root      20   0       0      0      0 S   0.3  0.0   2:22.93 ksoftirqd/0
  194 root      20   0       0      0      0 S   0.3  0.0   1:01.06 jbd2/dm-0-8
  295 root       0 -20       0      0      0 S   0.3  0.0   0:23.73 kworker/0:1H
 1364 vcap      10 -10   59072   9564   7480 S   0.3  0.2   1:15.73 postgres
14082 vagrant   20   0   25228   3148   2356 R   0.3  0.1   0:00.01 top
19285 root       0 -20   91484   2728   2488 S   0.3  0.1   0:09.72 monit
22377 root       0 -20       0      0      0 S   0.3  0.0   0:46.71 loop1
28019 vagrant   20   0  105644   4628   3644 S   0.3  0.1   0:01.10 sshd
    1 root      20   0   33640   3536   2168 S   0.0  0.1   0:05.79 init
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
    9 root      20   0       0      0      0 S   0.0  0.0   4:15.00 rcuos/0
   10 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/0
   11 root      rt   0       0      0      0 S   0.0  0.0   0:10.96 migration/0
   12 root      rt   0       0      0      0 S   0.0  0.0   0:13.37 watchdog/0
   13 root      rt   0       0      0      0 S   0.0  0.0   0:09.00 watchdog/1
   14 root      rt   0       0      0      0 S   0.0  0.0   0:05.79 migration/1
   15 root      20   0       0      0      0 S   0.0  0.0   1:43.96 ksoftirqd/1
   17 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/1:0H
   18 root      20   0       0      0      0 S   0.0  0.0   3:05.94 rcuos/1
   19 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/1
   20 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 khelper
   21 root      20   0       0      0      0 S   0.0  0.0   0:00.09 kdevtmpfs
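
In the top output above, keytool (PID 25746) is in state Z (zombie) while still being charged 100% CPU. To pick such entries out of a full process list, something like the following could be used. This is only a minimal sketch: the function name list_zombies is hypothetical, and it assumes a procps-style `ps` that supports the `-eo pid,ppid,stat,pcpu,comm` column selection.

```shell
# Hypothetical helper: print PID, PPID, %CPU and command name for every
# process whose state column starts with Z (zombie).
# It reads `ps`-style text on stdin, so it can also be run on saved output.
list_zombies() {
  # NR > 1 skips the header row; $3 is the STAT column.
  awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4, $5 }'
}

# Usage inside the bosh-lite VM (assumes procps `ps`):
#   ps -eo pid,ppid,stat,pcpu,comm | list_zombies
```

The parent PID column is included because a zombie can only be reaped by its parent, so that is the process worth inspecting next.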

Here is the stemcell I am using:

vagrant@bosh-lite:~/cf-release$ bosh deployments
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Acting as user 'admin' on 'Bosh Lite Director'

+-----------+--------------+--------------------------------------------------+--------------+
| Name      | Release(s)   | Stemcell(s)                                      | Cloud Config |
+-----------+--------------+--------------------------------------------------+--------------+
| cf-warden | cf/233+dev.1 | bosh-warden-boshlite-ubuntu-trusty-go_agent/3147 | none         |
+-----------+--------------+--------------------------------------------------+--------------+

Deployments total: 1

And this is my bosh status:

vagrant@bosh-lite:~/cf-release$ bosh status
Config
             /home/vagrant/.bosh_config

Director
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
  Name       Bosh Lite Director
  URL        https://192.168.218.4:25555
  Version    1.3197.1.0 (00000000)
  User       admin
  UUID       44586d7c-bcce-4f5f-ae80-42bb8a1ed08b
  CPI        warden_cpi
  dns        disabled
  compiled_package_cache enabled (provider: local)
  snapshots  disabled

Deployment
  Manifest   /home/vagrant/cf-release/bosh-lite/deployments/cf.yml
vagrant@bosh-lite:~/cf-release$

Can someone help me fix this?
TIA.


Solution

  • Fixed by an answer I received to the same issue on GitHub: https://github.com/cloudfoundry/bosh-lite/issues/349