Tags: mysql, database, erlang, cpu-usage, ejabberd

Ejabberd is using all available CPU, how to debug


I have problems with my ejabberd installation and I am struggling to figure out what is going on.

After 15-20 minutes of uptime my CPU usage spikes to 100%, for no apparent reason that I can find, and from then on it stays at full CPU. I have tried upgrading the server hardware, but it still cannot handle the load. The server is quite modern: a KVM-virtualized Xeon processor with 8 cores and 32GB RAM, and no other workloads.

I have tried to run etop but that does not work:

root@collaboration:/# ./usr/lib/erlang/lib/observer-2.9.4/priv/bin/etop -node ejabberd@localhost
Erlang/OTP 23 [erts-11.0.3] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]

Eshell V11.0.3 (abort with ^G)
(etop@collaboration)1> {"init terminating in do_boot",{{badmatch,{error,nxdomain}},[{etop_tr,reader,1,[{file,"etop_tr.erl"},{line,62}]},{etop,init_data_handler,1,[{file,"etop.erl"},{line,146}]},{etop,start,1,[{file,"etop.erl"},{line,129}]},{init,start_em,1,[]},{init,do_boot,3,[]}]}}
init terminating in do_boot ({{badmatch,{error,nxdomain}},[{etop_tr,reader,1,[{},{}]},{etop,init_data_handler,1,[{},{}]},{etop,start,1,[{},{}]},{init,start_em,1,[]},{init,do_boot,3,[]}]})

Crash dump is being written to: erl_crash.dump...done

My error log has many entries with strange content. I suspect that my database is not in a healthy state. The DB is 10 years old and has been through many upgrades, so there is a high probability of problems. Downloadable error.log here: https://fil.email/u1U0Y1wu

Pastebin extracts from error.log: https://pastebin.com/umpf51aU

Recently I upgraded to ejabberd 20.07, and I have tried to apply all the MySQL schema updates. This cannot have worked as well as I hoped, because there are traces of problems in the logs. At least this one fails: https://docs.ejabberd.im/admin/upgrade/from_19.05_to_19.08/

root@:~# mysql -u ejabberd ejabberd -p << EOF
ALTER TABLE users MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE last MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE rosterusers MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE rostergroups MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE sr_group MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE sr_user MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE spool MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE archive MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE archive_prefs MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE vcard MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE vcard_search MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE privacy_default_list MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE privacy_list MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE private_storage MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE roster_version MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE muc_room MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE muc_registered MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE muc_online_room MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE muc_online_users MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE motd MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE sm MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE route MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE push_session MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE mix_pam MODIFY server_host varchar(191) NOT NULL;
EOF
Enter password:
ERROR 1054 (42S22) at line 1: Unknown column 'server_host' in 'users'

Since I am a little lost as to why we are having all these CPU issues, I am contemplating dropping the database and importing a backup on a freshly installed server. How would I go about exporting as much healthy data as possible and importing it into a new database? Preferably the export would include users with passwords and rosters as a minimum. There are no MUC rooms or similar. If possible, the SSL certs (ACME) should be migrated too, as Let's Encrypt is not too happy with new certs being requested all the time. If you have any guidance on this issue I would be very happy!

Just an FYI: with the above log and load I have 155 users online and 12500 registered users.


Solution

  • From your logs:

    exception exit: {undef,
                        [{xmpp_stream_out,stop_async,[<0.4108.0>],[]},
    

    Here Erlang reports that the called function is undefined, meaning it does not exist in the code that is actually loaded.

    Looking at the sources, that function was defined in xmpp 1.4.6: https://github.com/processone/xmpp/commit/c23e66ebac8fdec4aa08c8926091b0dcf6dacf22

    Its usage was added in ejabberd 20.04: https://github.com/processone/ejabberd/commit/1bd560f3f25d0a644bac3d06904ca97e20a6f7d9

    So, at first sight, it seems that you are running ejabberd 20.04 or newer, but with a version of the xmpp library older than 1.4.6.
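One way to test that suspicion is to look at which xmpp library directories are installed and compare their versions against 1.4.6. The following is only a rough sketch: the helper names `version_ge`, `xmpp_versions` and `check_xmpp_lib` are made up for this example, and it assumes that the library sits in a directory named `xmpp-<version>` (the usual Erlang library layout) and that your `sort` supports `-V`.

```shell
#!/bin/sh
# Sketch only: the helper names below are made up for this example.

# Succeed if version $1 is greater than or equal to version $2.
# Assumes a `sort` that supports -V (version sort), as in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the version suffix of every xmpp-<version> directory under the
# Erlang library directory given as $1 (prints nothing if none exist).
xmpp_versions() {
    for dir in "$1"/xmpp-*; do
        if [ -d "$dir" ]; then
            basename "$dir" | sed 's/^xmpp-//'
        fi
    done
}

# Report, for each xmpp copy found under $1, whether it is at least 1.4.6,
# the release in which xmpp_stream_out:stop_async was added.
check_xmpp_lib() {
    for vsn in $(xmpp_versions "$1"); do
        if version_ge "$vsn" 1.4.6; then
            echo "xmpp $vsn: 1.4.6 or newer, stop_async should be present"
        else
            echo "xmpp $vsn: older than 1.4.6, stop_async is missing"
        fi
    done
}
```

You would call it as `check_xmpp_lib /usr/lib/erlang/lib`. That path is a guess based on the etop path in your question; where ejabberd actually loads xmpp from depends on how it was installed. If more than one version shows up, an old copy may be shadowing the new one.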