Code Review: On Signals

POSIX signals are a topic surrounded by a fair amount of mythos and fear. In this post, I’ll try to dispel some of that fear with examples of real systems and the problems they solve with signals.

Part 1: POSIX signals (on Linux)

POSIX signals come with complicated rules, are often a sign of a bug (the dreaded “Segmentation fault”) and are rarely used as part of core functionality.

This post will attempt to document useful design patterns relying on signals and explain the inner workings of signal delivery. I will particularly focus on Linux as that’s where most of my experience lies but a lot of the mechanisms are similar on other Unix-like systems.

What this post will not contain is a thorough overview of signals - there is already a pretty substantial amount of information out there on general usage and caveats.

I will however point out a couple of common misconceptions in using signals.

Thread-local signal masks and global handlers

There were a handful of bugs in systems I worked on, as well as in open source code, that can be traced to misunderstanding the following distinction: the signal mask (which signals are blocked) is per-thread, while signal dispositions (the handlers installed with sigaction(2)) are shared by the entire process.

This asymmetry is partly an artifact of threads being bolted on after signals were already entrenched, and no meaningful way was added to specify thread-specific signal handlers.

I also blame this partly on the POSIX sigprocmask(3) pages, which contain scary lines like:

The use of the sigprocmask() function is unspecified in a multi-threaded process.

This is technically correct. POSIX only specifies pthread_sigmask(3) as safe in a multi-threaded program.

What’s different on Linux is that both sigprocmask(3) and pthread_sigmask(3) (the libc functions) are implemented via the same sigprocmask(2) syscall, so in practice both only change the calling thread’s mask.
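
To make the distinction concrete, here is a small C sketch (my own illustration, not taken from any particular project): the handler installed with sigaction(2) applies to every thread in the process, while the mask changed with pthread_sigmask(3) only affects the calling thread.

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_sigusr1(int sig) {
  (void)sig;
  // The disposition is process-global: whichever thread has the
  // signal unblocked may end up running this handler.
  const char msg[] = "SIGUSR1 handled\n";
  write(STDOUT_FILENO, msg, sizeof(msg) - 1);  // async-signal-safe
}

static void *worker(void *arg) {
  (void)arg;
  // The mask is thread-local: blocking SIGUSR1 here does not
  // affect the main thread at all.
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGUSR1);
  pthread_sigmask(SIG_BLOCK, &set, NULL);
  sleep(5);
  return NULL;
}

int main(void) {
  // Installing the handler once affects every thread in the process.
  struct sigaction sa = {0};
  sa.sa_handler = on_sigusr1;
  sigaction(SIGUSR1, &sa, NULL);

  pthread_t t;
  pthread_create(&t, NULL, worker, NULL);
  sleep(1);

  // Process-directed signal: it is delivered to a thread that has it
  // unblocked - here, that can only be the main thread.
  kill(getpid(), SIGUSR1);

  pthread_join(t, NULL);
  return 0;
}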

Signals arrive on a random thread

The POSIX specification distinguishes between two types of signal generation:

Signals which are generated by some action attributable to a particular thread, such as a hardware fault, shall be generated for the thread that caused the signal to be generated.

Signals that are generated in association with a process ID or process group ID or an asynchronous event, such as terminal activity, shall be generated for the process.

If the signal you care about is thread-targeted, no other thread can receive it. If you block a fault-generated signal such as SIGSEGV (remember, the signal mask is thread-local), the kernel will simply force the default disposition for it, which usually means killing the process.

If the signal in question is process-directed, any thread that has it unblocked can receive it. However, there’s no magic; the code in the kernel is pretty self-explanatory: the main thread (tid == pid) is tried first, and the remaining threads are tried in round-robin order to load-balance signal delivery.

People have come up with more portable ways to deal with this perceived randomness - from “signal pipes” to a single signal-handling thread, this is something you can architect into your application.
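
The “single signal-handling thread” variant, for instance, can look roughly like this (a sketch of the general pattern, not code from any specific project): block the signals of interest in every thread, then have one dedicated thread pull them off synchronously with sigwait(3).

#include <pthread.h>
#include <signal.h>
#include <stdio.h>

static void *signal_thread(void *arg) {
  sigset_t *set = arg;
  for (;;) {
    int sig;
    // sigwait(3) atomically waits for one of the blocked signals;
    // because every thread keeps them blocked, delivery is fully
    // deterministic - it always happens right here.
    if (sigwait(set, &sig) == 0)
      printf("got signal %d\n", sig);
  }
  return NULL;
}

int main(void) {
  static sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGTERM);
  sigaddset(&set, SIGHUP);

  // Block the signals before creating any threads, so every thread
  // inherits the same mask.
  pthread_sigmask(SIG_BLOCK, &set, NULL);

  pthread_t t;
  pthread_create(&t, NULL, signal_thread, &set);

  // ... the rest of the application runs with these signals blocked ...
  pthread_join(t, NULL);
  return 0;
}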

Lastly, you can send either type of signal yourself via kill(2) (process-directed) and tgkill(2) (thread-directed).

You cannot handle a “fatal” signal

POSIX has only this to say about returning from a signal handler:

The behavior of a process is undefined after it returns normally from a signal-catching function for a SIGBUS, SIGFPE, SIGILL, or SIGSEGV signal that was not generated by kill(), sigqueue(), or raise().

On Linux, the kernel retries the instruction that raised the trap and raises the signal again. Seeing how this is part of the ABI, this behavior is unlikely to ever change.

That said, returning is not the only way to get out of a signal handler. Non-local returns such as setjmp/longjmp or even make/set/getcontext calls are all perfectly fine ways of resuming program execution after doing something interesting in the signal handler. In fact, this was designed into POSIX from the beginning - this is why the signal-mask-preserving sigsetjmp and siglongjmp exist.

One possible pattern is the following (obviously, pseudocode):

static __thread sigjmp_buf jmp_context;
static __thread volatile sig_atomic_t in_dangerous_work;

static void handler(int sig) {
  (void)sig;
  if (in_dangerous_work) {
    in_dangerous_work = 0;
    // jump back to the sigsetjmp below; the signal mask saved there
    // is restored, so the signal we are handling gets unblocked again
    siglongjmp(jmp_context, 1);
  }
  // not in the dangerous section: chain to the previous handler here
}

static void work(void) {
  register_handler_for_dangerous_work(handler);
  // calls sigaction(2) for the signals we care about

  // the second argument (1) tells sigsetjmp to save the current
  // signal mask; sigsetjmp returns 0 on the initial call and the
  // value passed to siglongjmp (1) when we jump back to it
  if (sigsetjmp(jmp_context, 1) == 0) {
    in_dangerous_work = 1;
    do_dangerous_work();
    in_dangerous_work = 0;
  } else {
    // we crashed while doing the dangerous work,
    // do something else.
  }
}

By using thread-local storage and long jumps, we can safely identify the risky section and bail out only within it, without compromising global correctness of the program.

Examples of dangerous work that you may need to sandbox in this way include dereferencing a memory-mapped file that another process may truncate out from under you (which raises SIGBUS), or probing addresses that may not be mapped at all.

You can’t do anything useful from a signal handler

Signal handlers are similar in function to interrupt handlers in the kernel. They have to be able to operate in an arbitrary context, have access to the state of the thing they just interrupted (the SA_SIGINFO flag gives you access to a ucontext, which contains all of the thread’s registers at the time of the signal delivery), and have pretty clear semantics on how they get delivered and handled.

While it does require a lot of care to design the right synchronization primitives, it’s not impossible, as I hope the following projects demonstrate.

Part 2: Interesting uses for signals

The following is a list of interesting uses for signals that I’ve stumbled upon in my work and in Google searches.

VM internals (JavaScriptCore)

When implementing a virtual machine, one of the mechanisms you need to implement is the ability to suspend the execution of a VM thread at desired points in time. Sometimes you need to do that in order to walk the thread’s stack for garbage collection. Other times you need this in order to implement debugger breakpoints.

JavaScriptCore, WebKit’s JavaScript engine, uses signals to implement suspend/resume primitives on Linux. It also uses signals to implement what they call “VM traps” - the ability to attach debugger breakpoints, terminate a thread and more while the thread is running.
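
A stripped-down version of the suspend half of such a primitive could look something like this (my own sketch of the general idea, not JavaScriptCore’s actual code; the signal number and helper names are made up): send a dedicated signal to the target thread and have the handler park on a semaphore until it is told to resume.

#define _GNU_SOURCE
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <ucontext.h>

static sem_t suspended, resume;
static void *volatile suspended_sp;   // stack pointer of the parked thread

static void suspend_handler(int sig, siginfo_t *info, void *ctx) {
  (void)sig; (void)info;
  ucontext_t *uc = ctx;
  // Record where the interrupted thread's stack is, so a GC or a
  // debugger can walk it while the thread is parked (x86-64 only).
  suspended_sp = (void *)uc->uc_mcontext.gregs[REG_RSP];
  sem_post(&suspended);   // async-signal-safe
  // sem_wait(3) is not on the official async-signal-safe list, but on
  // Linux it is a plain futex wait, which is why this trick works.
  sem_wait(&resume);      // park until resumed
}

static void install_suspend_handler(void) {
  sem_init(&suspended, 0, 0);
  sem_init(&resume, 0, 0);
  struct sigaction sa = {0};
  sa.sa_sigaction = suspend_handler;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGUSR2, &sa, NULL);
}

static void suspend_thread(pthread_t target) {
  pthread_kill(target, SIGUSR2);  // thread-directed signal
  sem_wait(&suspended);           // wait until the handler has parked
}

static void resume_thread(void) {
  sem_post(&resume);
}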

Exception handling in VMs (JavaScriptCore, ART, HotSpot)

Continuing the trend of “signal uses in VMs”, one of the interesting uses is allowing incorrect memory access to occur, detecting it, and only then raising the exception.

For example, in Java, the VM is supposed to raise NullPointerException when the program dereferences a null reference.

When the VM compiles the code (e.g., JIT compilation, or AOT compilation in ART’s case), it can elide all the null-checking code under the assumption that NullPointerExceptions are not the common path. To keep the program correct at runtime, it can then use a signal handler to determine whether the fault occurred inside compiled code and raise the NPE from the signal handler.

Examples:

In fact, if you read the HotSpot code around the linked line, they implement a lot of VM functionality using signals - division by zero and stack overflows are right there as well.
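
A very rough approximation of the trick in plain C (the real VMs do this with precise knowledge of their generated code, which this sketch glosses over): let the null dereference fault, check in the handler that the faulting address is in the unmapped page around address zero, and divert execution to the “throw” path.

#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>

static sigjmp_buf throw_npe;

static void segv_handler(int sig, siginfo_t *info, void *ctx) {
  (void)sig; (void)ctx;
  // Only treat faults near address zero as "null dereferences".
  if ((uintptr_t)info->si_addr < 4096) {
    siglongjmp(throw_npe, 1);
  }
  // Anything else is a genuine crash: restore the default disposition
  // and return; the retried instruction faults again and kills us.
  signal(SIGSEGV, SIG_DFL);
}

int main(void) {
  struct sigaction sa = {0};
  sa.sa_sigaction = segv_handler;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);

  int *volatile null_ref = NULL;
  if (sigsetjmp(throw_npe, 1) == 0) {
    // In a VM this would be compiled code with the null check elided.
    printf("%d\n", *null_ref);
  } else {
    // This is where the VM would construct and throw the
    // NullPointerException.
    printf("caught a null dereference, throwing NPE\n");
  }
  return 0;
}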

Userspace page faulting (libsigsegv)

Some interesting applications of SIGSEGV handling come from the GNU libsigsegv project’s main page:

  • memory-mapped access to persistent databases,
  • generational garbage collectors,
  • stack overflow handlers,
  • distributed shared memory.

When dealing with memory-mapped files, a lot of control is taken out of the userspace program and hidden in the kernel. Which pages to prefetch, whether we’re currently doing sequential or random reads, where the information is on disk (is it on disk?), when to evict data from the page cache - these are all decisions the kernel makes on behalf of the userspace program.

By handling SIGSEGV (whether via libsigsegv or standard POSIX calls), we can take execution control when the program accesses a given address range and make our own decisions.
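
Here is a minimal illustration of the idea (a simplified sketch written for this post, not libsigsegv’s implementation): reserve an address range with PROT_NONE and have the SIGSEGV handler make the faulting page accessible and fill it before letting the instruction retry.

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (16 * 4096)

static char *region;
static long page_size;

static void fault_handler(int sig, siginfo_t *info, void *ctx) {
  (void)sig; (void)ctx;
  uintptr_t addr = (uintptr_t)info->si_addr;
  uintptr_t base = (uintptr_t)region;

  // Only handle faults inside our managed region; anything else is a
  // real crash and should get the default disposition.
  if (addr < base || addr >= base + REGION_SIZE) {
    signal(SIGSEGV, SIG_DFL);
    return;  // the retried instruction will fault again and kill us
  }

  // "Page in" the data ourselves: make the page accessible and fill it.
  // mprotect(2) and memset() are not on the async-signal-safe list,
  // but this is the trick such libraries rely on in practice.
  char *page = (char *)(addr & ~((uintptr_t)page_size - 1));
  mprotect(page, page_size, PROT_READ | PROT_WRITE);
  memset(page, 0xAB, page_size);  // stand-in for reading from disk/network
  // Returning restarts the faulting instruction, which now succeeds.
}

int main(void) {
  page_size = sysconf(_SC_PAGESIZE);

  struct sigaction sa = {0};
  sa.sa_sigaction = fault_handler;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);

  // No access allowed: the first touch of every page will fault.
  region = mmap(NULL, REGION_SIZE, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  printf("byte at offset 5000: 0x%x\n", (unsigned char)region[5000]);
  return 0;
}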

The desire to do this is in fact so common that the Linux kernel implemented userfaultfd(2) as an easier way to implement userspace page faulting.

Profiling (gperftools)

A few of the POSIX profiling APIs are based on signals. For example, the POSIX way to measure CPU-time usage is setitimer(3) with ITIMER_PROF. This timer sends a thread-directed SIGPROF to the thread whose execution advanced the process’s CPU clock by the specified amount.

The gperftools CPU profiler, for instance, combines setitimer(3) with a stack-unwinding library to attribute CPU time to stack traces.
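
A toy version of the approach (my own sketch; the real profiler is considerably more careful) looks like this: arm ITIMER_PROF and, in the SIGPROF handler, record the interrupted program counter from the ucontext into a preallocated buffer.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 100000

static void *samples[MAX_SAMPLES];     // preallocated: no malloc in the handler
static volatile sig_atomic_t n_samples;

static void prof_handler(int sig, siginfo_t *info, void *ctx) {
  (void)sig; (void)info;
  ucontext_t *uc = ctx;
  if (n_samples < MAX_SAMPLES) {
    // Record the interrupted instruction pointer (x86-64 specific);
    // a real profiler would unwind the whole stack here.
    samples[n_samples++] = (void *)uc->uc_mcontext.gregs[REG_RIP];
  }
}

int main(void) {
  struct sigaction sa = {0};
  sa.sa_sigaction = prof_handler;
  sa.sa_flags = SA_SIGINFO | SA_RESTART;
  sigaction(SIGPROF, &sa, NULL);

  // Fire SIGPROF for every 10ms of CPU time the process consumes.
  struct itimerval it = { {0, 10000}, {0, 10000} };
  setitimer(ITIMER_PROF, &it, NULL);

  // Burn some CPU so samples accumulate.
  volatile double x = 0;
  for (long i = 0; i < 200000000; i++) x += i;

  printf("collected %d samples, first pc = %p\n",
         (int)n_samples, n_samples ? samples[0] : NULL);
  return 0;
}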

Crash diagnostics (breakpad)

Lastly, we get to one of the most common uses for signal handling - crash reporting. In a way, this is not really handling the signal, just recording its presence.

breakpad implements a crash collection system by handling SIGSEGV, SIGABRT and other fatal signals, and collecting as much information as possible from within the signal handler. It records the state of the registers, captures the stacks of all threads, and unwinds the crashing thread’s stack as far as possible. All of this is achieved with pre-allocated memory and within the “unsafe” context of a signal handler.
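
The skeleton of such a handler, greatly simplified (a sketch of the general shape, not breakpad’s code; the “report format” here is just raw structs), is roughly: install handlers on an alternate stack, and from the handler use only async-signal-safe calls and preallocated memory to write the crash record out.

#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static int crash_fd = -1;
static char altstack[64 * 1024];   // preallocated handler stack

static void crash_handler(int sig, siginfo_t *info, void *ctx) {
  (void)ctx;
  // Only async-signal-safe work from here on: raw write(2) of data we
  // already have. A real reporter would also dump the registers from
  // ctx, every thread's stack and the process's memory mappings.
  write(crash_fd, &sig, sizeof(sig));
  write(crash_fd, info, sizeof(*info));

  // Restore the default disposition and re-raise, so the process still
  // dies with the original signal once the handler returns.
  signal(sig, SIG_DFL);
  raise(sig);
}

static void install_crash_handler(const char *path) {
  // Open the report file up front - open(2) in the handler would be
  // legal too, but failing early is nicer.
  crash_fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);

  // Handle the crash on a separate, preallocated stack so that stack
  // overflows can be reported as well.
  stack_t ss = { .ss_sp = altstack, .ss_size = sizeof(altstack) };
  sigaltstack(&ss, NULL);

  struct sigaction sa = {0};
  sa.sa_sigaction = crash_handler;
  sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
  int fatal[] = { SIGSEGV, SIGBUS, SIGILL, SIGFPE, SIGABRT };
  for (unsigned i = 0; i < sizeof(fatal) / sizeof(fatal[0]); i++)
    sigaction(fatal[i], &sa, NULL);
}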

Conclusion

I hope this article has shed some light on the often-feared world of POSIX signals, and that it inspired you - or at the very least dispelled some of the fear. I know I would have loved to read something like this when I was first starting out.

If you have other interesting uses for signals, leave them in the comments below.

