I would like to monitor VMs in Azure (and other clouds), both logs and performance metrics, using Google Cloud Logging and Monitoring.
As a proof of concept,
When I check the status of the Ops Agent, I see the following (mildly redacted):
● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2022-02-16 22:39:22 UTC; 1min 5s ago
Process: 2730195 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
Process: 2730208 ExecStart=/opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=${RUNTIME_DIRECTORY}/otel.yaml (code=exited, status=1/FAILURE)
Main PID: 2730208 (code=exited, status=1/FAILURE)
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Scheduled restart job, restart counter is at 5.
Feb 16 22:39:22 HOSTNAME systemd[1]: Stopped Google Cloud Ops Agent - Metrics Agent.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Start request repeated too quickly.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Failed with result 'exit-code'.
Feb 16 22:39:22 HOSTNAME systemd[1]: Failed to start Google Cloud Ops Agent - Metrics Agent.
● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2022-02-16 22:39:22 UTC; 1min 5s ago
Process: 2730194 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIRECTORY} (code=exited, status=0/SUCCESS)
Process: 2730207 ExecStart=/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config ${RUNTIME_DIRECTORY}/fluent_bit_main.conf --parser ${RUNTIME_DIRECTORY}/fluent_bit_parser.conf --log_file ${LOGS_DIRECTORY}/logging-module.log --storage_path ${STATE_DIRECTORY}/buffers (co>
Main PID: 2730207 (code=exited, status=255/EXCEPTION)
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 5.
Feb 16 22:39:22 HOSTNAME systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Start request repeated too quickly.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
Feb 16 22:39:22 HOSTNAME systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.
● google-cloud-ops-agent.service - Google Cloud Ops Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2022-02-16 22:39:21 UTC; 1min 7s ago
Process: 2730090 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS)
Process: 2730102 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
Main PID: 2730102 (code=exited, status=0/SUCCESS)
Feb 16 22:39:21 HOSTNAME systemd[1]: Starting Google Cloud Ops Agent...
Feb 16 22:39:21 HOSTNAME systemd[1]: Finished Google Cloud Ops Agent.
The Ops Agent logs show:
[2022/02/16 22:39:22] [ info] [engine] started (pid=2730207)
[2022/02/16 22:39:22] [ info] [storage] version=1.1.5, initializing...
[2022/02/16 22:39:22] [ info] [storage] root path '/var/lib/google-cloud-ops-agent/fluent-bit/buffers'
[2022/02/16 22:39:22] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2022/02/16 22:39:22] [ info] [storage] backlog input plugin: storage_backlog.2
[2022/02/16 22:39:22] [ info] [cmetrics] version=0.2.2
[2022/02/16 22:39:22] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 47.7M
[2022/02/16 22:39:22] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
[2022/02/16 22:39:22] [ warn] [net] getaddrinfo(host='metadata.google.internal', err=-2): Name or service not known
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] failed to create metadata connection
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] can't fetch token from the metadata server
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] token retrieval failed
[2022/02/16 22:39:22] [ warn] [net] getaddrinfo(host='metadata.google.internal', err=-2): Name or service not known
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] failed to create metadata connection
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] can't fetch project id from the metadata server
[2022/02/16 22:39:22] [error] [output] failed to initialize 'stackdriver' plugin
[2022/02/16 22:39:22] [ info] [input] pausing fluentbit_metrics.0
[2022/02/16 22:39:22] [ info] [input] pausing tail.1
[2022/02/16 22:39:22] [ info] [input] pausing storage_backlog.2
I notice the warning "private_key is not defined, fetching it from metadata server", which suggests that the key file is not being picked up. The agent then fails to resolve metadata.google.internal, a hostname that is only available inside Google Cloud.
The documentation says: "The Ops Agent is the primary agent for collecting telemetry from your Compute Engine instances."
Can the Ops Agent only be run on Compute Engine instances, or is it reasonable to expect that it could be run anywhere if properly configured?
When google-cloud-ops-agent.service is started, it starts google-cloud-ops-agent-fluent-bit.service and google-cloud-ops-agent-opentelemetry-collector.service and then exits. Environment variables added as overrides to google-cloud-ops-agent.service are not passed on to the other two units.
I found that I had to set GOOGLE_APPLICATION_CREDENTIALS on google-cloud-ops-agent-opentelemetry-collector.service and GOOGLE_SERVICE_CREDENTIALS on google-cloud-ops-agent-fluent-bit.service. You can override the systemd units non-interactively:
SYSTEMD_EDITOR=tee systemctl edit google-cloud-ops-agent-fluent-bit.service <<'EOF'
[Service]
Environment='GOOGLE_SERVICE_CREDENTIALS=/etc/google/auth/application_default_credentials.json'
EOF
SYSTEMD_EDITOR=tee systemctl edit google-cloud-ops-agent-opentelemetry-collector.service <<'EOF'
[Service]
Environment='GOOGLE_APPLICATION_CREDENTIALS=/etc/google/auth/application_default_credentials.json'
EOF
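After the overrides are in place, it may help to confirm that the key file is readable and that both sub-agents actually see the variable. The following is only a sketch and was not part of my original fix: check_key_file and restart_ops_agent are hypothetical helper names, the key path is the one used in the overrides above, and I am assuming that restarting google-cloud-ops-agent.service also restarts the two sub-agents.

```shell
#!/bin/sh
# Path taken from the overrides above; adjust if your key lives elsewhere.
KEY_FILE="${KEY_FILE:-/etc/google/auth/application_default_credentials.json}"

# Hypothetical helper: succeed only if the key file exists, is readable
# and is not empty.
check_key_file() {
  if [ -r "$1" ] && [ -s "$1" ]; then
    echo "ok: $1"
  else
    echo "missing or unreadable: $1" >&2
    return 1
  fi
}

# Hypothetical helper: clear the "Start request repeated too quickly" state,
# print the environment systemd will pass to each sub-agent, then restart.
restart_ops_agent() {
  for unit in google-cloud-ops-agent-fluent-bit.service \
              google-cloud-ops-agent-opentelemetry-collector.service; do
    systemctl reset-failed "$unit"
    # The output should include the credentials variable set in the override.
    systemctl show "$unit" --property=Environment
  done
  # Assumption: the sub-agents are restarted together with the main unit.
  systemctl restart google-cloud-ops-agent.service
  systemctl status 'google-cloud-ops-agent*' --no-pager
}

# Usage (as root):
#   check_key_file "$KEY_FILE" && restart_ops_agent
```

If the Environment line printed for a unit is empty, the override for that unit was probably not written or not loaded, and the sub-agent will fall back to the metadata server again.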