Dumping core

April 23, 2010

Your program just crashed, and you didn’t have a debugger attached. You can’t reproduce the crash after many attempts. How do you debug the problem?

Well, if your program had left a core dump, you could easily attach a debugger postmortem and get some kind of idea what state the program was in before it died. A core dump is essentially a dump of all memory in your program’s virtual address space: stack, heap, code and everything else.

On most systems, though, you won’t get a core dump when you crash, where a crash can come from a segfault (or any other signal), a call to abort(2) (such as via a failed assertion), a call to terminate() (such as via throwing an uncaught exception), or other similar avenues. Core dumps are rather large (after all, it’s all of the memory from the process) — they can easily be tens or hundreds of megabytes, even for simple programs, due to a large number of shared libraries being loaded. Your hard drive would fill up very quickly if every program that crashed left a core dump.

If you’re just poking around in the shell, you can enable core dumps with ulimit(1) to raise the core dump file size limit from 0 (the default) to something non-zero such as unlimited. This will cause any crashing programs spawned by that shell to leave core dumps. For example:

$ cat crash.c
int main(void)
{
    *(int *)1 = 2;  // cause a segfault
}
$ gcc crash.c -o crash
$ ./crash
Segmentation fault
$ ulimit -c unlimited
$ ./crash
Segmentation fault (core dumped)

Where the core dump ends up depends on your operating system. By default, Linux puts it in a file named core in the current working directory, and Mac OS X puts it in a file named /cores/core.<PID>, where <PID> is the process ID of the process that crashed. The exact name and location may vary by flavor and version of OS. See the core(5) man page for detailed discussion of core files on Linux.
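
If the watchdog approach described later in this post wants to pick that file up, a small helper along these lines could check the two default locations. This is just a sketch: find_core_file is a made-up name, and it assumes the defaults are still in effect (on Linux, /proc/sys/kernel/core_pattern controls the name and location).

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

// Sketch of locating a core dump in the default Linux and Mac OS X locations.
// Assumes the defaults haven't been changed (see core(5) and /proc/sys/kernel/core_pattern).
int find_core_file(pid_t pid, char *path, size_t len)
{
  struct stat st;

  // Linux default: a file named "core" in the current working directory
  snprintf(path, len, "core");
  if(stat(path, &st) == 0)
    return 0;

  // Mac OS X default: /cores/core.<PID>
  snprintf(path, len, "/cores/core.%d", (int)pid);
  if(stat(path, &st) == 0)
    return 0;

  return -1;  // no core dump in the default locations
}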

Ok, so that’s all well and good if someone has the good nature to run ulimit before running your program, but few (if any) people will do so. If you want to say, “No really, I want core dumps!”, you can call setrlimit(2) to set the limit for yourself and any child processes (which is all ulimit really does). Just make sure not to annoy your users by filling up their hard drives with core dumps. Which of course you won’t do because your code is perfect and never crashes anyways.

You’ve gone through the trouble of creating a core dump, but when your program crashes in some far away land, how do you actually get your hands on the core dump? You could ask your users to email it to you, but they’re not going to do that. They’re just going to complain on the Internet that your software sucks and that people shouldn’t use it. Some operating systems have a nice Crash Reporter or Error Reporting Service, but those send crash reports to first parties, something you might not want, and getting the crash data back to you is far from trivial.

One solution is to install your own error handlers in-process to catch things such as segfaults, and instead of letting the operating system handle the error, you handle it yourself: you do your own stack trace, grab important data such as filenames, optionally pop up a UI asking the user if he wants to send an error report and for supplemental information, and send the crash report your way. This is a lot of work, and it’s also dangerous: if your program has crashed, there’s no telling what state it’s in. Trying to do something like sending an email from a signal handler could easily fail — your heap might be corrupted, so you could crash again the moment you do something as mundane as try to allocate some memory. If you decide to go this route, a good place to start would be with signal(2)/sigaction(2) (*nix and Mac OS X) or Structured Exception Handling (Windows).
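
To give a feel for the in-process route, here’s a minimal sketch using sigaction(2). It installs a SIGSEGV handler that writes a fixed marker to a file descriptor opened ahead of time (sticking to async-signal-safe calls inside the handler), then lets the default action run so the process still dies and dumps core. The file name crash_marker.txt is just an illustrative placeholder; a real handler would catch more signals (SIGABRT, SIGBUS, SIGILL, ...) and record more than a fixed string.

#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int crash_fd = -1;  // opened at startup so the handler never needs open() or malloc()

static void crash_handler(int sig)
{
  // Only async-signal-safe calls in here: write() and raise()
  static const char msg[] = "caught a fatal signal\n";
  if(crash_fd >= 0)
    write(crash_fd, msg, sizeof(msg) - 1);

  // SA_RESETHAND already restored the default action, so re-raising the
  // signal terminates the process (and dumps core, if the limits allow it)
  raise(sig);
}

int main(void)
{
  // "crash_marker.txt" is just an illustrative name
  crash_fd = open("crash_marker.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);

  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = crash_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESETHAND;  // reset to SIG_DFL once the handler runs
  sigaction(SIGSEGV, &sa, NULL);

  *(int *)1 = 2;  // segfault to exercise the handler
}

Even this little handler illustrates the danger: anything fancier than a write() to an already-open descriptor starts flirting with undefined behavior inside a corrupted process.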

A solution that I like better is out-of-process. Just let the process crash and dump core as before, but this time we’ll have a watchdog process running. The watchdog just waits for the main process to exit (normally or abnormally); if it sees an abnormal exit and a core dump, then it sends off the crash report into the ether. This is much safer, since you don’t have to worry about things such as a corrupted heap when sending a crash report. The only downside is that you now have twice as many processes running.

Here’s a full example of a watchdog with core dumps. The program forks, with the parent as the watchdog. The child intentionally crashes, and then the parent grabs the core dump if one was made.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  // Try to enable core dumps
  struct rlimit core_limit;
  core_limit.rlim_cur = RLIM_INFINITY;
  core_limit.rlim_max = RLIM_INFINITY;
  
  if(setrlimit(RLIMIT_CORE, &core_limit) < 0)
    fprintf(stderr, "setrlimit: %s\nWarning: core dumps may be truncated or non-existent\n", strerror(errno));


  int status;
  switch(fork())
  {
  case 0:
    // We are the child process -- run the actual program
    *(int *)1 = 2;  // segfault
    break;

  case -1:
    // An error occurred, shouldn't happen
    perror("fork");
    return -1;

  default:
    // We are the parent process -- wait for the child process to exit
    if(wait(&status) < 0)
      perror("wait");
    printf("child exited with status %d\n", status);
    if(WIFSIGNALED(status) && WCOREDUMP(status))
    {
      printf("got a core dump\n");
      // find core dump, email it to your servers, etc.
    }
  }
  
  return 0;
}

If you compile and run this program, you’ll get a core dump from the child process, which the parent process will detect, and it can then do whatever it wants with it. Email it to you, upload it to a server, analyze it and trim it down before doing those, or anything else you can write code to do. All from the safety of an uncrashed process. If you run ulimit -c 0 before running this program, you’ll see the warning about setrlimit failing and you won’t get a core dump. This is because, if you look at the documentation for setrlimit, you’ll see that the soft limit can never exceed the hard limit, and the hard limit can only be decreased by unprivileged processes.
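
To see those limit rules in action, here’s a small sketch that reads the current limits with getrlimit(2) and then raises the soft limit only as far as the hard limit allows; run it normally, and then again after ulimit -c 0, to watch both numbers drop to zero.

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
  struct rlimit core_limit;
  if(getrlimit(RLIMIT_CORE, &core_limit) < 0)
  {
    perror("getrlimit");
    return 1;
  }

  // RLIM_INFINITY prints as a very large number; 0 means no core dumps
  printf("soft limit: %llu\n", (unsigned long long)core_limit.rlim_cur);
  printf("hard limit: %llu\n", (unsigned long long)core_limit.rlim_max);

  // An unprivileged process can raise the soft limit, but only up to the hard limit
  core_limit.rlim_cur = core_limit.rlim_max;
  if(setrlimit(RLIMIT_CORE, &core_limit) < 0)
    perror("setrlimit");

  return 0;
}

Raising the soft limit to the hard limit like this can never fail for permission reasons, though it also can’t re-enable core dumps if the hard limit is already 0.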

So there you have it. You now have a way to have your software dump core when it crashes and send those core dumps back to you without any extra hassle on the user’s part. Though depending on who your users are, it may still be a good idea to ask them if they want to send a crash report before actually doing so, since core dumps can easily contain private information. If you have anything like usernames or passwords in memory anywhere in your process, they’ll be in the core dump. So keep that in mind and take appropriate measures to protect users’ privacy. Encrypt the core dump if necessary. Maybe even attach a cryptographic signature to ensure authenticity.

