Search code examples
kubernetessystemtap

Why systemTap script report a read fault near operator error?


I'm running SystemTap on CentOS Linux release 7.6.1810. The version of SystemTap is:

$ stap -V
Systemtap translator/driver (version 4.0/0.172/0.176, rpm 4.0-11.el7)
Copyright (C) 2005-2018 Red Hat, Inc. and others
This is free software; see the source for copying conditions.
tested kernel versions: 2.6.18 ... 4.19-rc7
enabled features: AVAHI BOOST_STRING_REF DYNINST BPF JAVA PYTHON2 LIBRPM LIBSQLITE3 LIBVIRT LIBXML2 NLS NSS READLINE


$ uname -rm
3.10.0-957.21.3.el7.x86_64 x86_64

$ rpm -qa | grep kernel-devel
kernel-devel-3.10.0-957.21.3.el7.x86_64

$ rpm -qa | grep kernel-debuginfo
kernel-debuginfo-3.10.0-957.21.3.el7.x86_64
kernel-debuginfo-common-x86_64-3.10.0-957.21.3.el7.x86_64

I have a systemTap script named sg.stp, which use to monitor why k8s pods of a rabbitmq cluster terminated with exit code 137 occasionally:

global target_pid = 32719
probe signal.send{
  if (sig_pid == target_pid) {
    printf("%s(%d) send %s to %s(%d)\n", execname(), pid(), sig_name, pid_name, sig_pid);
    printf("parent of sender: %s(%d)\n", pexecname(), ppid())
    printf("task_ancestry:%s\n", task_ancestry(pid2task(pid()), 1));
  }
}

When I run the script, it reported an error after a while:

$  stap sg.stp
ERROR: read fault [man error::fault] at 0x4a8 near operator '@cast' at /usr/share/systemtap/tapset/linux/task.stpm:2:5
epmd(29073) send SIGCHLD to rabbitmq-server(32719)
parent of sender: rabbitmq-server(32719)
WARNING: Number of errors: 1, skipped probes: 0
WARNING: /usr/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]

Solution

  • pid2task() can return NULL

    Check for pid2task(pid()) or current_task() returning NULL like that:

    task = pid2task(pid());
    if (task) {
      printf("task_ancestry:%s\n", task_ancestry(task, 1));
    } else {
      printf("task_ancestry more available\n");
    }
    

    Note that I am not completely sure about following explanation:

    It can happen that the task_struct is no more available, even when you are in the context of the running pid(), because the process has already died and the task_struct is cleaned up because it is no more needed.

    In that case pid2task() returns NULL. AFAICS this can happen to pid() in following two situations (and perhaps more):

    • Your probe is asynchronous to the running process - in your case with signal probes this seems to be the case here.

    • The .return probe executes too late, perhaps because it was stuck in the kernel too long (like for blocking calls).

    For the latter there seems to be some easy workaround:

    Instead of task_ancestry(current_task()) use @entry(task_ancestry(current_task())). This way the data is gathered at the entry point of the syscall, where it is very likely that the process is still perfectly alive.

    However in your Signal case I do not see such simple workaround, hence you must check for NULL.


    Note that I am not completely sure that this is your problem and that checking for NULL without some page locking is the perfect solution. Because even if you get a pointer to some structure, the pages which contain the structure might go away in the middle of the probe, thanks to SMP. Perhaps stap somehow protects against this. But I doubt. Race conditions like this are really weird to debug and avoid.