Mailcow

rspamd always crashes after switching CPU topology

A different number of cores or threads can be enough to trigger this. For me, it happened after moving my Mailcow instance to another server, but rescaling a cloud server or switching it to a different CPU can trigger it as well.

Symptoms

  • Sending mail is not possible, neither from external clients nor from the integrated SOGo
  • The rspamd container starts and then exits immediately, repeating every few seconds

Problem

Rspamd compiles so-called “hyperscan” files, which help it filter mail for spam more efficiently. For efficiency reasons, these files are heavily optimized for a specific CPU configuration, which makes rspamd crash as soon as something about that configuration changes. Even a change in topology, such as assigning more cores of the same CPU to a VM, seems to be enough to make these filters crash.

Unfortunately, these files also seem to be included in the backups made by the official backup script, which causes problems when migrating to a new server.

Solution

Delete the “temporary” files of rspamd. They are stored because they take some time to build (maybe a minute or two, depending on your hardware), but they can be rebuilt without any problems.
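As a rough illustration of that cleanup, here is a minimal sketch of a shell function that removes the compiled cache files from a given directory and leaves everything else alone. The function name purge_hyperscan_cache is made up for this example, and the file suffixes .hs and .hsmp are an assumption about how rspamd names its hyperscan cache; check the directory contents on your own system before deleting anything.

```shell
#!/bin/sh
# Sketch: delete compiled hyperscan cache files from an rspamd data directory.
# ASSUMPTION: the cache files end in .hs or .hsmp. List the directory first
# and adjust the patterns if your files are named differently.
purge_hyperscan_cache() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    # -maxdepth 1 limits the deletion to the top level of the directory;
    # -print lists each file that is handed to rm.
    find "$dir" -maxdepth 1 -type f \( -name '*.hs' -o -name '*.hsmp' \) \
        -print -exec rm -f -- {} +
}
```

With the mailcow stack stopped, the directory could be the host mountpoint that docker volume inspect reports for the rspamd volume, or /var/lib/rspamd inside the container if you can get a shell there. Both locations are assumptions based on a default mailcow setup.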

In my case, I wasn’t able to follow the official guide, which starts the rspamd container and then runs two rm -rf commands inside it: the container crashed faster than I could open a shell in it and execute anything. In the end I just deleted the whole rspamd volume. This is a risk, since I don’t know what else is stored inside that volume, so only do this in production when you have backups. For me it worked out well and fixed the problem. That said, I was using the default configuration of rspamd; if you changed something, it might be gone after deleting the volume, in which case you might want to look for another solution.
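For reference, the volume reset could look roughly like the sequence below. This is a sketch that only prints the commands instead of running them, so you can review them first. It assumes Compose v2 (docker compose), that you run the commands from the mailcow directory, and that the volume is named after the compose project with the suffix _rspamd-vol-1. The function name and the default project name mailcowdockerized are assumptions; docker volume ls shows the real volume name.

```shell
#!/bin/sh
# Sketch: print (not execute) the commands for recreating the rspamd volume.
# ASSUMPTIONS: Compose v2 syntax and default mailcow naming. Verify the
# volume name with `docker volume ls` before running anything.
print_rspamd_volume_reset() {
    project="${1:-mailcowdockerized}"   # compose project name (assumed default)
    volume="${project}_rspamd-vol-1"    # assumed name of the rspamd volume
    # A volume cannot be removed while a container still references it,
    # so the stack is taken down first; `down` leaves the volumes in place.
    echo "docker compose down"
    echo "docker volume rm ${volume}"
    # On the next start the volume is created empty and rspamd rebuilds its files.
    echo "docker compose up -d"
}
```

Calling print_rspamd_volume_reset without arguments prints the three commands for the assumed default project name; pass your own project name as the first argument if it differs.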

The next start of the rspamd container took about one minute (if I recall correctly), which of course depends on the performance of your server, and everything has worked fine since.

Sources