We've been running an OpenStack environment for the last two and a half years with a few hiccups along the way, but mostly with little downtime. Recently we've been trying to add a new piece of hardware to the stack as a nova-compute node to provide more CPU cores and RAM to our VMs. Unfortunately, the install is not going well.
We're running Xenial/Queens with Juju and MAAS for deployment/provisioning. We were running Xenial/Pike until December, when we upgraded. We're starting to suspect that the upgrade to Queens is causing the trouble, as we were able to add new hardware before the upgrade. We even went as far as removing one of our existing nova-compute nodes and adding it back to the stack, and it now exhibits the same problems as our new hardware.
The root cause seems to lie with the neutron-openvswitch application. When we install the nova-compute charm via Juju, everything goes smoothly up until the (automatic) installation/configuration of the subordinate neutron-openvswitch charm. Watching the logs, we see that at a certain point during the install, connectivity on our OpenStack admin network (10.10.30.0/24 on eno1) is lost. We're able to force the installation to proceed a bit further by adding a second connection on eno2 (a different external network), but the loss of connectivity on eno1 remains and the compute service isn't able to communicate with the rest of the stack.
Compared with our functional compute nodes, it looks like the admin network bridge (br-eno1) is not being created on the new node by the neutron-openvswitch charm. Some part of the process appears to take down eno1 in preparation for creating the bridge but then fails, leaving the machine unable to communicate with the rest of the stack on that interface.
None of our configuration has changed since the upgrade to Queens, but perhaps there is some deprecation or change to the default configuration that came with the Pike -> Queens upgrade that we are unaware of? We've read through the release notes but can't find anything that would explain this behavior.
Any help would be greatly appreciated. I'm including a few segments of log files I think are relevant below but can provide anything else that might be needed. Thanks in advance!
Broken server ifconfig:
eno1 Link encap:Ethernet HWaddr FF:FF:FF:FF:FF:FF (redacted)
inet addr:10.10.30.101 Bcast:10.10.30.255 Mask:255.255.255.0
inet6 addr: fe80::4ed9:8fff:fec5:2e3/64 Scope:Link
UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
RX packets:487314 errors:0 dropped:0 overruns:0 frame:0
TX packets:91955 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:255807482 (255.8 MB) TX bytes:6693026 (6.6 MB)
Interrupt:17
eno2 Link encap:Ethernet HWaddr FF:FF:FF:FF:FF:FF (redacted)
inet addr:10.189.134.103 Bcast:10.189.134.255 Mask:255.255.255.0
inet6 addr: fe80::4ed9:8fff:fec5:2e4/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:195386 errors:0 dropped:0 overruns:0 frame:0
TX packets:89021 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:29175518 (29.1 MB) TX bytes:37673375 (37.6 MB)
Interrupt:18
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:181496 errors:0 dropped:0 overruns:0 frame:0
TX packets:181496 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:22574807 (22.5 MB) TX bytes:22574807 (22.5 MB)
lxdbr0 Link encap:Ethernet HWaddr FF:FF:FF:FF:FF:FF (redacted)
inet6 addr: fe80::1/64 Scope:Link
inet6 addr: fe80::b8c2:36ff:fe60:de08/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:650 (650.0 B)
Broken server ovs-vsctl show:
fc878983-8ae5-479f-999f-d809f5a2ba8f
Manager "ptcp:6640:127.0.0.1"
is_connected: true
Bridge br-data
Port "eno1"
Interface "eno1"
Port br-data
Interface br-data
type: internal
Bridge br-ex
Port br-ex
Interface br-ex
type: internal
Bridge br-int
Controller "tcp:127.0.0.1:6633"
is_connected: true
fail_mode: secure
Port br-int
Interface br-int
type: internal
ovs_version: "2.9.5"
Working server ifconfig:
br-eno1 Link encap:Ethernet HWaddr FF:FF:FF:FF:FF:FF (redacted)
inet addr:10.10.30.117 Bcast:10.10.30.255 Mask:255.255.255.0
inet6 addr: fe80::1a66:daff:fe55:6bdc/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:9552045918 errors:0 dropped:4 overruns:0 frame:0
TX packets:8731602524 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:25169343655058 (25.1 TB) TX bytes:20302362419370 (20.3 TB)
eno1 Link encap:Ethernet HWaddr FF:FF:FF:FF:FF:FF (redacted)
inet6 addr: fe80::1a66:daff:fe55:6bdc/64 Scope:Link
UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
RX packets:27433132917 errors:0 dropped:821138 overruns:0 frame:0
TX packets:25763792601 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:31217303277897 (31.2 TB) TX bytes:26547305328673 (26.5 TB)
Interrupt:18
eno2 Link encap:Ethernet HWaddr FF:FF:FF:FF:FF:FF (redacted)
inet addr:10.189.134.118 Bcast:10.189.134.255 Mask:255.255.255.0
inet6 addr: fe80::1a66:daff:fe55:6bdd/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:23432963 errors:0 dropped:0 overruns:0 frame:0
TX packets:34 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2858920977 (2.8 GB) TX bytes:2404 (2.4 KB)
Interrupt:19
eno3 Link encap:Ethernet HWaddr FF:FF:FF:FF:FF:FF (redacted)
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Interrupt:19
eno4 Link encap:Ethernet HWaddr FF:FF:FF:FF:FF:FF (redacted)
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Interrupt:16
gre_sys Link encap:Ethernet HWaddr FF:FF:FF:FF:FF:FF (redacted)
inet6 addr: fe80::d061:36ff:fecd:3bdf/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65000 Metric:1
RX packets:1247735590 errors:0 dropped:0 overruns:0 frame:0
TX packets:1053172217 errors:0 dropped:8 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:934609315304 (934.6 GB) TX bytes:1138575443474 (1.1 TB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:874404497 errors:0 dropped:0 overruns:0 frame:0
TX packets:874404497 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:1422560696594 (1.4 TB) TX bytes:1422560696594 (1.4 TB)
lxdbr0 Link encap:Ethernet HWaddr FF:FF:FF:FF:FF:FF (redacted)
inet addr:10.0.216.1 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::d83b:4eff:fedb:7be0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:750 (750.0 B)
qbr267cccc8-45 Link encap:Ethernet HWaddr FF:FF:FF:FF:FF:FF (redacted)
UP BROADCAST RUNNING MULTICAST MTU:1458 Metric:1
RX packets:257167 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:8981790 (8.9 MB) TX bytes:0 (0.0 B)
...
tap267cccc8-45 Link encap:Ethernet HWaddr FF:FF:FF:FF:FF:FF (redacted)
inet6 addr: fe80::fc16:3eff:fede:d180/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1458 Metric:1
RX packets:4801309 errors:0 dropped:0 overruns:0 frame:0
TX packets:6300403 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:12100707022 (12.1 GB) TX bytes:3222243030 (3.2 GB)
...
vethWY9OQC Link encap:Ethernet HWaddr FF:FF:FF:FF:FF:FF (redacted)
inet6 addr: fe80::fc50:b6ff:fe7a:2584/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:533168318 errors:0 dropped:0 overruns:0 frame:0
TX packets:468982413 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:191221371188 (191.2 GB) TX bytes:227602758832 (227.6 GB)
Working server ovs-vsctl show:
be5c20fd-46ef-4991-8dc3-3860944308e5
Manager "ptcp:6640:127.0.0.1"
is_connected: true
Bridge br-data
Controller "tcp:127.0.0.1:6633"
is_connected: true
fail_mode: secure
Port "eno1"
Interface "eno1"
error: "could not add network device eno1 to ofproto (Device or resource busy)"
Port "eno2"
Interface "eno2"
Port br-data
Interface br-data
type: internal
Port phy-br-data
Interface phy-br-data
type: patch
options: {peer=int-br-data}
Bridge br-tun
Controller "tcp:127.0.0.1:6633"
is_connected: true
fail_mode: secure
Port patch-int
Interface patch-int
type: patch
options: {peer=patch-tun}
Port "gre-0a0a1e7f"
Interface "gre-0a0a1e7f"
type: gre
options: {df_default="true", in_key=flow, local_ip="10.10.30.117", out_key=flow, remote_ip="10.10.30.127"}
Port "gre-0a0a1e74"
Interface "gre-0a0a1e74"
type: gre
options: {df_default="true", in_key=flow, local_ip="10.10.30.117", out_key=flow, remote_ip="10.10.30.116"}
Port "gre-0a0a1e76"
Interface "gre-0a0a1e76"
type: gre
options: {df_default="true", in_key=flow, local_ip="10.10.30.117", out_key=flow, remote_ip="10.10.30.118"}
Port br-tun
Interface br-tun
type: internal
Bridge br-int
Controller "tcp:127.0.0.1:6633"
is_connected: true
fail_mode: secure
Port "qvo5560dd35-7e"
tag: 2
Interface "qvo5560dd35-7e"
Port patch-tun
Interface patch-tun
type: patch
options: {peer=patch-int}
Port "qvo97c660e7-e3"
tag: 1
Interface "qvo97c660e7-e3"
Port "qvo44aeabe3-de"
tag: 1
Interface "qvo44aeabe3-de"
Port "qvo267cccc8-45"
tag: 1
Interface "qvo267cccc8-45"
Port "qvofdf0ce36-50"
tag: 2
Interface "qvofdf0ce36-50"
Port "qvof193baf6-c0"
tag: 1
Interface "qvof193baf6-c0"
Port "qvod9facd45-41"
tag: 1
Interface "qvod9facd45-41"
Port "qvoeeab657c-df"
tag: 1
Interface "qvoeeab657c-df"
Port "qvodd4b9252-e5"
tag: 1
Interface "qvodd4b9252-e5"
Port br-int
Interface br-int
type: internal
Port "qvoc841a7f1-25"
tag: 2
Interface "qvoc841a7f1-25"
Port "qvod6b38e4c-a1"
tag: 2
Interface "qvod6b38e4c-a1"
Port int-br-data
Interface int-br-data
type: patch
options: {peer=phy-br-data}
Bridge br-ex
Port br-ex
Interface br-ex
type: internal
ovs_version: "2.9.2"
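The two ovs-vsctl show listings can be compared mechanically. Below is a minimal Python sketch, not part of our tooling, that parses such output and reports which bridge each port belongs to and whether OVS recorded an error for it. The function name port_status and the sample strings are illustrative assumptions; the samples are excerpts of the output above with indentation reconstructed. On the broken server eno1 sits in br-data with no error, while on the working server the same attachment is reported as failed with "Device or resource busy".

```python
import re


def port_status(ovs_show_text):
    """Return {port: {"bridge": name, "error": message or None}}.

    Hypothetical helper: it only understands the Bridge / Port / error
    lines of `ovs-vsctl show` and ignores everything else.
    """
    ports = {}
    bridge = None
    port = None
    for raw in ovs_show_text.splitlines():
        line = raw.strip()
        m = re.match(r'^Bridge\s+"?([^"]+)"?$', line)
        if m:
            bridge, port = m.group(1), None
            continue
        m = re.match(r'^Port\s+"?([^"]+)"?$', line)
        if m and bridge is not None:
            port = m.group(1)
            ports[port] = {"bridge": bridge, "error": None}
            continue
        m = re.match(r'^error:\s+"(.*)"$', line)
        if m and port is not None:
            ports[port]["error"] = m.group(1)
    return ports


# Excerpts from the listings above (indentation reconstructed).
BROKEN = '''
Bridge br-data
    Port "eno1"
        Interface "eno1"
    Port br-data
        Interface br-data
            type: internal
'''

WORKING = '''
Bridge br-data
    Port "eno1"
        Interface "eno1"
            error: "could not add network device eno1 to ofproto (Device or resource busy)"
    Port "eno2"
        Interface "eno2"
'''

if __name__ == "__main__":
    for name, text in (("broken", BROKEN), ("working", WORKING)):
        print(name, port_status(text).get("eno1"))
```

In practice the text would come from running ovs-vsctl show on each host; the parsing is kept separate from the command so it can be checked against pasted output.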
Broken server /var/log/juju/unit-neutron-openvswitch.log:
These are the final lines before the machine loses connectivity on the admin network (eno1).
2020-05-26 18:08:02 DEBUG config-changed net.netfilter.nf_conntrack_max = 1000000
2020-05-26 18:08:02 DEBUG config-changed net.ipv4.neigh.default.gc_thresh2 = 28672
2020-05-26 18:08:02 DEBUG config-changed net.ipv6.neigh.default.gc_thresh1 = 128
2020-05-26 18:08:02 DEBUG config-changed net.nf_conntrack_max = 1000000
2020-05-26 18:08:02 DEBUG config-changed sysctl: setting key "net.netfilter.nf_conntrack_buckets"
2020-05-26 18:08:02 DEBUG config-changed net.ipv4.neigh.default.gc_thresh3 = 32768
2020-05-26 18:08:02 DEBUG config-changed net.ipv4.neigh.default.gc_thresh1 = 128
2020-05-26 18:08:02 DEBUG config-changed net.ipv6.neigh.default.gc_thresh2 = 28672
2020-05-26 18:08:02 DEBUG config-changed net.ipv6.neigh.default.gc_thresh3 = 32768
2020-05-26 18:08:02 DEBUG config-changed active
2020-05-26 18:08:03 INFO juju-log Creating bridge br-int
2020-05-26 18:08:03 INFO juju-log Creating bridge br-ex
2020-05-26 18:08:03 WARNING juju-log Support for use of upstream ``apt_pkg`` module in conjunctionwith charm-helpers is deprecated since 2019-06-25
2020-05-26 18:08:03 INFO juju-log Creating bridge br-data
2020-05-26 18:08:03 DEBUG juju-log Interface eno1 is not a Linux bridge
2020-05-26 18:08:03 INFO juju-log Adding port eno1 to bridge br-data
2020-05-26 18:08:03 DEBUG config-changed Failed to restart os-charm-phy-nic-mtu.service: Unit os-charm-phy-nic-mtu.service not found.
Then, we see the following (only accessible on site or by coming in through the eno2 connection):
2020-05-26 18:08:53 ERROR juju.api monitor.go:59 health ping timed out after 30s
2020-05-26 18:08:53 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
2020-05-26 18:08:53 INFO juju-log Loaded template from templates/queens/openvswitch_agent.ini
2020-05-26 18:08:53 INFO juju-log Rendering from template: /etc/neutron/plugins/ml2/openvswitch_agent.ini
2020-05-26 18:08:53 INFO juju-log Wrote template /etc/neutron/plugins/ml2/openvswitch_agent.ini.
2020-05-26 18:08:54 DEBUG juju-log Generating template context for amqp
2020-05-26 18:08:54 DEBUG config-changed Traceback (most recent call last):
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/config-changed", line 266, in <module>
2020-05-26 18:08:54 DEBUG config-changed main()
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/config-changed", line 259, in main
2020-05-26 18:08:54 DEBUG config-changed hooks.execute(sys.argv)
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/core/hookenv.py", line 914, in execute
2020-05-26 18:08:54 DEBUG config-changed self._hooks[hook_name]()
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1568, in wrapped_f
2020-05-26 18:08:54 DEBUG config-changed stopstart, restart_functions)
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/core/host.py", line 741, in restart_on_change_helper
2020-05-26 18:08:54 DEBUG config-changed r = lambda_f()
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1567, in <lambda>
2020-05-26 18:08:54 DEBUG config-changed (lambda: f(*args, **kwargs)), __restart_map_cache['cache'],
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/config-changed", line 150, in config_changed
2020-05-26 18:08:54 DEBUG config-changed CONFIGS.write_all()
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/templating.py", line 334, in write_all
2020-05-26 18:08:54 DEBUG config-changed [self.write(k) for k in six.iterkeys(self.templates)]
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/templating.py", line 334, in <listcomp>
2020-05-26 18:08:54 DEBUG config-changed [self.write(k) for k in six.iterkeys(self.templates)]
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/templating.py", line 321, in write
2020-05-26 18:08:54 DEBUG config-changed _out = self.render(config_file)
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/templating.py", line 281, in render
2020-05-26 18:08:54 DEBUG config-changed ctxt = ostmpl.context()
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/templating.py", line 112, in context
2020-05-26 18:08:54 DEBUG config-changed _ctxt = context()
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/neutron_ovs_context.py", line 633, in __call__
2020-05-26 18:08:54 DEBUG config-changed host_ip = get_relation_ip('neutron-plugin')
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/network/ip.py", line 583, in get_relation_ip
2020-05-26 18:08:54 DEBUG config-changed address = network_get_primary_address(interface)
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/core/hookenv.py", line 1043, in inner_translate_exc2
2020-05-26 18:08:54 DEBUG config-changed return f(*args, **kwargs)
2020-05-26 18:08:54 DEBUG config-changed File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/core/hookenv.py", line 1239, in network_get_primary_address
2020-05-26 18:08:54 DEBUG config-changed stderr=subprocess.STDOUT).decode('UTF-8').strip()
2020-05-26 18:08:54 DEBUG config-changed File "/usr/lib/python3.5/subprocess.py", line 626, in check_output
2020-05-26 18:08:54 DEBUG config-changed **kwargs).stdout
2020-05-26 18:08:54 DEBUG config-changed File "/usr/lib/python3.5/subprocess.py", line 708, in run
2020-05-26 18:08:54 DEBUG config-changed output=stdout, stderr=stderr)
2020-05-26 18:08:54 DEBUG config-changed subprocess.CalledProcessError: Command '['network-get', '--primary-address', 'neutron-plugin']' returned non-zero exit status 1
2020-05-26 18:08:54 ERROR juju.worker.uniter.operation runhook.go:113 hook "config-changed" failed: exit status 1
2020-05-26 18:09:13 INFO juju-log Registered config file: /etc/neutron/neutron.conf
2020-05-26 18:09:13 INFO juju-log Registered config file: /etc/neutron/plugins/ml2/openvswitch_agent.ini
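The traceback ends in a CalledProcessError from the network-get hook tool, and the log only shows the exit status, not what the tool printed. The following Python sketch shows one way to capture that output instead of letting the exception propagate. It mirrors the check_output call with stderr=subprocess.STDOUT seen in the traceback, but the helper name run_and_report is an assumption, and a stand-in failing command is used because network-get is only available inside a Juju hook context.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (exit_code, combined stdout/stderr text).

    Hypothetical helper: unlike the charm code in the traceback, it
    catches CalledProcessError so the tool's own message can be read.
    """
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        return 0, out.decode("UTF-8").strip()
    except subprocess.CalledProcessError as exc:
        return exc.returncode, exc.output.decode("UTF-8").strip()


if __name__ == "__main__":
    # Inside a hook context the command from the traceback would be:
    #   ["network-get", "--primary-address", "neutron-plugin"]
    # A stand-in that fails is used here so the sketch runs anywhere.
    stand_in = [sys.executable, "-c",
                "import sys; print('no address'); sys.exit(1)"]
    code, output = run_and_report(stand_in)
    print(code, output)
```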
SOLVED!
It turns out that after the upgrade to Queens, Juju was handing out a bad network config to this server. In addition, the Open vSwitch install was assigning eno1 to br-data instead of creating br-eno1 as on my other servers. The steps to resolve the problem were:
1. Run ovs-vsctl del-port br-data eno1 to detach eno1 from br-data.
2. Edit the /etc/network/interfaces file and comment out the line that reads the (busted) cloud config file from /etc/network/interfaces.d/50-cloud-init.cfg.
3. Write the interface configuration by hand, using the values shown by ifconfig for the eno1 and eno2 interfaces.
I don't yet know exactly what caused Juju to stop sending a proper network config after the upgrade.
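The edit to /etc/network/interfaces can be sketched in Python as follows. This is only an illustration under assumptions: it presumes the cloud config is pulled in by a source or source-directory line that mentions interfaces.d, which is what I would expect but did not paste above, and the function name is made up. It works on a string, so it can be tried on a copy of the file before touching the real one.

```python
def comment_out_source_lines(interfaces_text, needle="interfaces.d"):
    """Comment out source/source-directory lines that mention `needle`.

    Hypothetical helper; lines that are already comments do not start
    with "source" after stripping, so they are left untouched.
    """
    result = []
    for line in interfaces_text.splitlines():
        stripped = line.lstrip()
        is_source = stripped.startswith(("source ", "source-directory "))
        if is_source and needle in stripped:
            result.append("# " + line)
        else:
            result.append(line)
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    # Assumed example content, not the actual file from the server.
    sample = (
        "auto lo\n"
        "iface lo inet loopback\n"
        "source /etc/network/interfaces.d/*.cfg\n"
    )
    print(comment_out_source_lines(sample), end="")
```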
My final interfaces file looked like this. Anyone else copying this file will of course have to change all of their IPs.
auto lo
iface lo inet loopback
dns-nameservers 10.10.30.99 10.244.0.66 10.244.0.67
dns-search maas
auto eno1
iface eno1 inet manual
mtu 1500
auto eno2
iface eno2 inet static
address 10.189.134.103/24
dns-nameservers 10.189.134.99 10.244.0.66 10.244.0.67
mtu 1500
auto br-eno1
iface br-eno1 inet static
address 10.10.30.101/24
dns-nameservers 10.10.30.99 10.244.0.66 10.244.0.67
gateway 10.10.30.254
bridge_ports eno1
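Anyone adapting this file to their own addresses may want to check that the gateway is reachable from the bridge's subnet before restarting networking. Here is a small Python sketch using the standard ipaddress module; the function name gateway_on_link is an assumption and the values are the ones from my file.

```python
import ipaddress


def gateway_on_link(address_cidr, gateway):
    """Return True if `gateway` lies inside the subnet of `address_cidr`."""
    iface = ipaddress.ip_interface(address_cidr)
    return ipaddress.ip_address(gateway) in iface.network


if __name__ == "__main__":
    # br-eno1 carries the admin address and the default gateway.
    print(gateway_on_link("10.10.30.101/24", "10.10.30.254"))
    # The same gateway is not on the eno2 subnet.
    print(gateway_on_link("10.189.134.103/24", "10.10.30.254"))
```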
I found the following sites helpful when troubleshooting: