Debugging strange errors with strace
Yesterday my collegue came to me with a strange error in our development environment. For some reason the message queue consumers failed fetching new messages from Amazon SQS with a SSL library error. The error message we got was
[Errno 185090050] _ssl.c:340: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib"
which could be interpreted as the process could not load the certificate revocation list properly.
This error appeared after my collegue deployed a very minor and totally unrelated code change. However, we proceeded to deploy the previous version, just to be sure. The error persisted.
After googling a bit I found that the file it was supposed to be looking for was the cacerts.txt
inside boto
. I went ahead and check if that file existed in our virtualenv and it did. Very strange.
That meant I needed to know the absolute path of the file it could not find. For that I needed to see what syscall it failed on, so strace comes to the rescue. Strace is a neat utility to trace all syscalls a process does and the signals it receives. I attached strace to one of our server processes by the following command.
strace -v -p <pid> 2>&1 | grep cacert
The output I got was similar to the following (I didn't save the output at the time)
open("/xxxxxx/1398298388/lib/python2.7/site-packages/boto/cacerts/cacerts.txt", O_RDONLY) = -1
So it was looking for the file I checked for. However, there is one thing here, you might recognize 1398298388
as a unix timestamp. It is the time when our virtual environment was created. We create a new one for each build and then deploy the entire virtual environement to our servers, activate it and restart the processes. In the xxxxxxx
directory we have the latest virtual environements we have deployed.
This specific virtual environment the process was launched from was now missing from our server. It is due to that we only keep the last few environments on the server. What had happend was that our processes had not been restarted properly after the last deploys. We have supervisor monitoring our server processes and we had recently change how processes are started by introducing a wrapper script. This script did not forward the shutdown signals properly. Restarting the processes properly solved the SSL error.
The conclusion that can be drawn is that strace is very useful for debugging errors that have no obvious reason and you need to look closer at the process to see what it actually tries to do.