Bug 13764

Summary: Suricata crash leaves machine and network exposed
Product: IPFire
Reporter: Dan Zenchelsky <web>
Component: ---
Assignee: Michael Tremer <michael.tremer>
Status: VERIFIED ---
QA Contact:
Severity: Security
Priority: - Unknown -
CC: adolf.belka, arne.fitzenreiter, michael.tremer, stefan.schantl
Version: 2
Keywords: Security
Hardware: all   
OS: Unspecified   
URL: https://community.ipfire.org/t/scary-bug-suricata-crash-leaves-machine-and-network-exposed/12083/1
Attachments: suricata.rej from failed patching
The suricata patch file
The output from the iptables command in comment 14

Description Dan Zenchelsky 2024-09-09 07:55:29 UTC
I am a long time user and supporter of IPFIRE. When upgrading to CORE 187, I encountered a particularly nasty bug which left my entire network exposed to the internet. Here is what happened:

TLDR: When the system’s out-of-memory killer was triggered (by a bug in Suricata?), my IPFIRE firewall was suddenly left completely open, exposing all ipfire service ports to the internet and disabling all firewall rules across GREEN/ORANGE/BLUE/RED. This happened several hours after upgrading to CORE 187, and I am able to replicate the bug on a fresh install of CORE 187 after enabling only Suricata and the emergingthreats community rules.

This is a particularly scary bug to me, because I would reasonably expect a firewall that runs out of memory to behave in a FAIL-SAFE manner (blocking all connections, for example), which is NOT what happened. Instead, my IPFIRE system (and the networks connected to it) were left exposed to the internet.

Long version:

On the morning of August 9, I upgraded to CORE 187 (from 184), did a quick test where everything seemed fine, and walked away to do other things. Several hours later (while watching TV), I got a strange alert from one of my monitoring systems telling me that two machines that should NOT be able to talk to each other across the firewall (one on BLUE, one on GREEN) were in fact suddenly communicating. Looking into this, I immediately saw that the BLUE to GREEN connections were no longer being blocked by the ipfire firewall.

I then went to grc.com ShieldsUP, where it showed that my IPFIRE machine was now completely exposed to the internet, with all running services exposed (SSH, HTTPS, etc.).

At the time, I did not understand how this happened, so, in a bit of a panic, I quickly shut the machine down and disconnected the internet and then proceeded to rebuild a new firewall machine from scratch, leaving the old one for future forensic analysis.

Today, I had the time to go back and forensically analyze the logs and the system, to try to find out what happened and whether I had been actively hacked.

I found the following two events which seemed correlated:

    I was alerted about the “connectivity issue” at 2024-08-09 18:18:15 -0400, and shortly after that (as described above), I discovered that my firewall had effectively been disabled.

    /var/log/messages shows the oom-killer kicking in about 1 minute earlier:

Aug 9 18:17:03 ipfire kernel: Suricata-Main invoked oom-killer: gfp_mask=0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=0
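
For anyone checking their own system for the same correlation, a minimal sketch of the log search is shown below. The function name oom_events is hypothetical, and the default path /var/log/messages is simply the file quoted above; other setups may log kernel messages elsewhere.

```shell
# Print every line that records an OOM-killer invocation in a
# syslog-style file. Returns non-zero if no such line exists.
# oom_events is a hypothetical helper name, not an IPFire tool.
oom_events() {
    logfile="${1:-/var/log/messages}"
    # -F: match the fixed string exactly as the kernel logs it
    grep -F 'invoked oom-killer' "$logfile"
}
```

Calling oom_events with no argument searches /var/log/messages; the timestamps on the matching lines can then be compared with the time the firewall rules stopped being enforced.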

The fact that they were time-correlated made me suspicious, so I wanted to see if I could reproduce the failure.

I then booted up the old/failed ipfire system in a lab environment to see how it behaved.

The good news is upon booting, everything looked fine and nothing was incorrectly exposed to the internet.

I googled how to simulate an out-of-memory oom-failure and came across the following suggestion, which I tried:

swapoff -a && tail /dev/zero

Upon doing this, I found the ipfire machine was now in the same state that I had seen on August 9. All IPFIRE service ports were EXPOSED on the RED interface (!!)

I then tried re-installing CORE 184 and restoring from a backup. Through limited testing, I was NOT able to reproduce the failure on CORE 184.

I then upgraded the system to CORE 187. Once I did that, I was able to again replicate the failure with the same commands listed above.

I then spent several hours trying to create a simplified and reproducible test case on a fresh install of CORE 187. Here are the steps that I came up with:

    1. Fresh install CORE 187.

    2. Configure all 4 interfaces.

    3. In Intrusion Prevention System, choose Provider “Emergingthreats.net Community Rules” and click ADD

    4. In Intrusion Prevention System, click to enable RED, GREEN, BLUE, and ORANGE as “Monitored Interfaces”

    5. In Intrusion Prevention System, click on “Enable Intrusion Prevention System” and then click SAVE

    6. Verify that it now says “RUNNING”

    7. On a separate linux machine, connected to the RED interface, perform an “nmap -v [address of RED INTERFACE]” - this should show normal operation, all ports inaccessible.

    8. Log in to the IPFIRE console as root.

    9. Type “killall -9 /usr/bin/suricata” to simulate Suricata being killed
    (alternatively, you can try “swapoff -a && tail /dev/zero” to simulate out of memory, but I find this only sometimes results in Suricata being killed, and the bug seems to only appear when suricata actually dies)

    10. On a separate linux machine, connected to the RED interface, perform an “nmap -v [address of RED INTERFACE]” - this should now show that all ports with running services are EXPOSED. This is the bug.
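
The comparison between steps 7 and 10 can also be done mechanically rather than by eye. The following is only a sketch: it assumes both scans were saved in nmap's grepable format (the -oG option), and the helper names open_tcp_ports and newly_open_ports are hypothetical.

```shell
# Extract open TCP port numbers from nmap grepable (-oG) output on
# stdin, one per line, sorted numerically.
open_tcp_ports() {
    grep -oE '[0-9]+/open/tcp' | cut -d/ -f1 | sort -n -u
}

# Print ports that are open in the second scan file but not in the
# first. Exit status 0 means new ports appeared, 1 means none did.
newly_open_ports() {
    before=$(mktemp) || return 2
    open_tcp_ports < "$1" > "$before"
    # -v -x -F -f: keep only whole lines not listed in the "before" set
    open_tcp_ports < "$2" | grep -vxF -f "$before"
    status=$?
    rm -f "$before"
    return "$status"
}
```

On the scanning machine, saving one scan with "nmap -oG before.gnmap [address of RED INTERFACE]" before step 9 and a second one to after.gnmap afterwards, then running "newly_open_ports before.gnmap after.gnmap", would list any ports that became reachable once Suricata was killed. The two file names are placeholders.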
Comment 1 Adolf Belka 2024-09-09 07:58:25 UTC
I have tested this out with a clone of my CU187 vm machine and have been able to reproduce it.
Comment 2 Adolf Belka 2024-09-09 08:49:19 UTC
I was also able to reproduce it on Core Update 184 so it looks like a problem that has been around for a while.
Comment 3 Michael Tremer 2024-09-09 15:49:32 UTC
Hello Dan,

thank you for your detailed report. Adolf has raised awareness of this ticket and we have decided to make it non-public for the time being until we have found a solution.

We have seen this behaviour before, however, we have tested this as successfully fixed. Therefore we now assume that something has changed in the behaviour of Suricata or the kernel.

We will go and investigate. As soon as we have a solution, we will come back here and make the ticket public again.
Comment 4 Michael Tremer 2024-09-10 16:50:04 UTC
Hello,

sorry for updating this so late, but as of yesterday, we have a solution which will prevent this from happening again:

> https://patchwork.ipfire.org/project/ipfire/patch/20240910143748.3469271-2-michael.tremer@ipfire.org/

There is more detail on the mailing list:

> https://lists.ipfire.org/hyperkitty/list/development@lists.ipfire.org/thread/W7EY3BGIA4RBIY42MSQAP5KNIW6FTIGI/

It would be great if you could test this and give us feedback if this solves it for you, too.

I will make the ticket public again.
Comment 5 Dan Zenchelsky 2024-09-10 19:15:32 UTC
Hello,

Thanks for following up.  I'm happy to test, but I need instructions for applying this patch.

For context, I tried to apply the linked patch to my existing CORE 187 machine, which I now have in the lab, but it would not apply cleanly into /etc/init.d -- seemingly getting stuck on the line:

if [ "$zone" == "red" ] && [ "$RED_TYPE" == "PPPOE" ] && [ "$RED_DRIVER" != "qmi_wwan" ]; then

(which didn't match what was in my /etc/init.d/suricata file)

I then tried altering that line to match what WAS in my suricata file.  Once I altered that line, the patch seemed to apply cleanly, but upon reboot, my system just hangs during boot - stopping while trying to bring up dhcpd on RED - and seemingly never continuing.

Obviously I broke something.  What should I be doing instead?

-Dan
Comment 6 Dan Zenchelsky 2024-09-10 19:18:14 UTC
Hmmm... Going back to the console, I can see that it DID eventually boot.  Just took a very long time...  I will continue testing.
Comment 7 Dan Zenchelsky 2024-09-10 20:26:24 UTC
After applying the patch, something is definitely broken with respect to dhcpcd on RED.  Reverting the patch, it consistently takes 40s from the time it says it's bringing up RED to the time it completes that step.  (Tested through 3 reboots).

After applying the patch, it takes 19 MINUTES for that stage during boot (!!)

Is there a different way I should be applying/testing this?
Comment 8 Adolf Belka 2024-09-11 08:22:29 UTC
I had that same problem originally but after Michael and I worked with rustdesk on my vm system it no longer had that problem and the ports were not open after crashing suricata with killall -9 /usr/bin/suricata.

I just tested my vm system that we worked on and it has the fixes and it only waits around 20 to 30 seconds for the dhcpcd to complete.

I just installed the complete patch set from Michael and that also only waited the normal time for the dhcpcd.

On your system after the dhcpcd eventually completes did you try crashing suricata and confirming if the ports now stay closed?

I will try and apply only the patches that Michael provided the link to onto a CU187 vm system and see if I get a similar issue with the line

-----if [ "$zone" == "red" ] && [ "$RED_TYPE" == "PPPOE" ] && [ "$RED_DRIVER" != "qmi_wwan" ]; then-----

as you did.
Comment 9 Adolf Belka 2024-09-11 09:44:37 UTC
Created attachment 1598 [details]
suricata.rej from failed patching

I tried to apply the two patches to a cloned CU187 system.

The firewall patch applied with no problems but the second one failed on the second Hunk. Failure is at a different location to what Dan found. I also can't tell what the problem with the patch is. It looks like it should apply but it refuses to do so.

To apply the patches I split the firewall and the suricata single patch into two as they are designed to apply to the build tree.

I then applied the patches by running the command

patch -b /etc/rc.d/init.d/suricata -i /tmp/suricata.patch

and got the error message

patching file /etc/rc.d/init.d/suricata
Hunk #2 FAILED at 71.
1 out of 2 hunks FAILED -- saving rejects to file /etc/rc.d/init.d/suricata.rej
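
A partially applied patch like this leaves the initscript in a mixed state. One way to avoid that, sketched here under the assumption that GNU patch is in use, is to try the patch with --dry-run first and only apply it if every hunk succeeds. The function name apply_if_clean is hypothetical.

```shell
# Apply a patch to a single file only if all hunks apply cleanly.
# On success the original is kept as <file>.orig (patch -b).
# On failure nothing is modified and no .rej file is written.
apply_if_clean() {
    target="$1"
    patchfile="$2"
    if patch --dry-run -i "$patchfile" "$target" >/dev/null; then
        patch -b -i "$patchfile" "$target"
    else
        echo "patch does not apply cleanly to $target; nothing changed" >&2
        return 1
    fi
}
```

With the paths from this comment, the call would be "apply_if_clean /etc/rc.d/init.d/suricata /tmp/suricata.patch".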

I have attached the suricata.rej file

The suricata.patch file I used I will add as an attachment in the next comment.
Comment 10 Adolf Belka 2024-09-11 09:47:06 UTC
Created attachment 1599 [details]
The suricata patch file

This is the suricata patch file I split out of Michael's 
01-20-suricata-Move-the-IPS-into-the-mangle-table.diff file
Comment 11 Michael Tremer 2024-09-11 09:51:16 UTC
It is probably easiest to just download the files with all the changes:

/etc/init.d/firewall from here: https://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=blob_plain;f=src/initscripts/system/firewall;hb=f700f345381f244de1fb53789dbd12fcbabebdb3

/etc/init.d/suricata from here: https://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=blob_plain;f=src/initscripts/system/suricata;hb=f700f345381f244de1fb53789dbd12fcbabebdb3

/etc/init.d/networking/functions.network from here: https://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=blob_plain;f=src/initscripts/networking/functions.network;hb=f700f345381f244de1fb53789dbd12fcbabebdb3

Make sure the initscripts are executable and then reboot.
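
A cautious way to carry out the replacement is sketched below. It assumes each file has already been downloaded to a temporary location; the function name replace_initscript and the .bak suffix are hypothetical choices, not part of IPFire.

```shell
# Replace an initscript with a downloaded copy. Keeps the previous
# version as <dest>.bak and installs the new one with mode 0755 so it
# is executable. Refuses to install an empty download.
replace_initscript() {
    new="$1"
    dest="$2"
    if [ ! -s "$new" ]; then
        echo "refusing to install empty file $new" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        cp -p "$dest" "$dest.bak" || return 1
    fi
    install -m 0755 "$new" "$dest"
}
```

For example, after saving the suricata script from the link above to /tmp/suricata.new (a placeholder path), "replace_initscript /tmp/suricata.new /etc/init.d/suricata" would install it and leave the old script next to it for an easy rollback.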

The changes in the CGI file don't matter for this ticket.
Comment 12 Adolf Belka 2024-09-11 10:14:56 UTC
(In reply to Adolf Belka from comment #8)
> 
> I just installed the complete patch set from Michael and that also only
> waited the normal time for the dhcpcd.
> 

This bit is not correct as I took a wrong iso from the build I did. I am repeating this install with the correct iso.

After installing the full patch set build iso the dhcpcd took around 20 to 30 seconds.

Then I restored a backup from CU187 and rebooted. This time the dhcpcd took around 50 seconds but it still completed with no problems.

I then tested crashing suricata with killall -9 /usr/bin/suricata and suricata automatically restarted so the suricata-watcher script is working fine and all the ports stayed closed.
So the patch set looks good to me. Also can see that the Intrusion Prevention System tables are now all like the others.

The only thing found is that on the Services page suricata shows up stopped when it is actually running. This was previously fixed. Something to do with the services widget for suricata.

As the full patch set worked, the issue I had with the firewall and suricata files must have been some error that I made, as the same files are in the full patch set.
Comment 13 Adolf Belka 2024-09-11 10:36:01 UTC
(In reply to Michael Tremer from comment #11)
> It is probably easiest to just download the files with all the changes:
> 
> /etc/init.d/firewall from here:
> https://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=blob_plain;f=src/
> initscripts/system/firewall;hb=f700f345381f244de1fb53789dbd12fcbabebdb3
> 
> /etc/init.d/suricata from here:
> https://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=blob_plain;f=src/
> initscripts/system/suricata;hb=f700f345381f244de1fb53789dbd12fcbabebdb3
> 
> /etc/init.d/networking/functions.network from here:
> https://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=blob_plain;f=src/
> initscripts/networking/functions.network;
> hb=f700f345381f244de1fb53789dbd12fcbabebdb3
> 
> Make sure the initscripts are executable and then reboot.
> 
> The changes in the CGI file don't matter for this ticket.

I followed this with my CU187 vm clone.

I then rebooted and the dhcpcd took 20 seconds and came up.

I would suggest that Dan follows the same approach as you defined. It should then hopefully work well also for him in terms of the reboot.

However, with those three files changed, suricata was not running and starting it came back with an OK but checking the status immediately showed it to not be working.

Maybe there is something else from the full patch set needed.
Comment 14 Michael Tremer 2024-09-11 12:35:49 UTC
(In reply to Adolf Belka from comment #12)
> (In reply to Adolf Belka from comment #8)
> > 
> > I just installed the complete patch set from Michael and that also only
> > waited the normal time for the dhcpcd.
> > 
> 
> This bit is not correct as I took a wrong iso from the build I did. I am
> repeating this install with the correct iso.
> 
> After installing the full patch set build iso the dhcpcd took around 20 to
> 30 seconds.
> 
> Then I restored a backup from CU187 and rebooted. This time the dhcpcd took
> around 50 seconds but it still completed with no problems.

This should not happen at all. Could you post the output of "iptables -t mangle -L -nv"?

> The only thing found is that on the Services page suricata shows up stopped
> when it is actually running. This was previously fixed. Something to do with
> the services widget for suricata.

I have a fix for that here:

> https://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=commitdiff;h=d7d0031912ce61292258ebd39efef0d81ca991ba

Suricata does not write its PID when it is not running in daemon mode. Who knew?
Comment 15 Adolf Belka 2024-09-11 12:54:14 UTC
Created attachment 1600 [details]
The output from the iptables command in comment 14

I had already turned off that vm to work on another one.

I just restarted the full patch set vm and this time it completed the dhcpcd in the normal ~20 seconds so maybe my local network was tied up with the other things that were running when I did the reboot after the restore.

I am not sure with the dhcpcd reboot time now being normal whether the iptables command is still worthwhile but the output is attached here in case it is still of use.
Comment 16 Michael Tremer 2024-09-11 13:54:54 UTC
(In reply to Adolf Belka from comment #15)
> I am not sure with the dhcpcd reboot time now being normal whether the
> iptables command is still worthwhile but the output is attached here in case
> it is still of use.

If everything is working fine now, I won't need the output.
Comment 17 Dan Zenchelsky 2024-09-11 15:08:22 UTC
Hi Michael,

After installing the 3 files you linked, the dhcpcd problem disappeared, however suricata did not start, since I was missing suricata-watcher. 

I changed the suricata script to launch suricata directly (as done in the previous version), and rebooted, and this time suricata launched successfully.

I was then able to re-test.

1. ps -aux | grep suricata : shows suricata is running

2. [external] nmap -v ipfire : shows nothing unusual is exposed

3. swapoff -a && tail /dev/zero 

4. ps -aux | grep suricata : shows suricata is NOT running (successfully killed by oom)

5. [external] nmap -v ipfire : STILL shows nothing unusual is exposed (TEST PASSED)

*** Is there a corresponding automated test to catch this if it regresses? ***
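
If steps 1 and 4 were ever scripted, note that "ps -aux | grep suricata" also matches the grep process itself, so it cannot be used directly as a pass/fail check. A minimal sketch of a stricter check follows. The helper name listing_has_command is hypothetical, and it assumes Suricata's command line starts with /usr/bin/suricata, which may not hold when it is launched through suricata-watcher.

```shell
# Succeeds if a process listing on stdin (one command line per row)
# contains a command whose first word is exactly the given path.
listing_has_command() {
    awk -v bin="$1" '$1 == bin { found = 1 } END { exit !found }'
}

# Example, with the argv[0] assumption stated above:
#   ps -eo args= | listing_has_command /usr/bin/suricata && echo running
```

Because the match is on the first word of each command line, a "grep suricata" row in the listing does not count as Suricata running.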
Comment 18 Michael Tremer 2024-09-24 14:11:48 UTC
> https://www.ipfire.org/blog/ipfire-2-29-core-update-189-is-available-for-testing

The changes have been merged into c189 which is available as an update. This might make installation easier.

There is also a new graph and some more fixes for other problems since we last talked on here.

I would like to hear your feedback on those changes so that we can release this update as soon as possible.
Comment 19 Adolf Belka 2024-09-26 21:21:08 UTC
On my vm IPFire with build 0555434e I ran nmap -v ipfire.domain.org and all ports were closed.

Ran killall -9 /usr/bin/suricata

suricata-watcher immediately re-started suricata.

Ran nmap again and all 1000 ports were still closed.

So this verifies on CU189 Testing that this bug has been fixed.
Comment 20 Michael Tremer 2024-09-26 21:28:48 UTC
Thank you for double-checking once again for me. I have not heard much from the latest update apart from graphs, so hopefully we have some more feedback.
Comment 21 Michael Tremer 2024-09-29 12:23:56 UTC
(In reply to Dan Zenchelsky from comment #17)
> *** Is there a corresponding automated test to catch this if it regresses?
> ***

No, sadly we don't have tests like these. I would like to have them, but we simply don't have enough manpower.