Bug 10607

Summary: dnsmasq segfaults after a while
Product: IPFire
Component: ---
Status: CLOSED FIXED
Severity: Major Usability
Priority: Will affect most users
Version: 2
Hardware: x86_64
OS: All
Reporter: Michael Tremer <michael.tremer>
Assignee: Michael Tremer <michael.tremer>
QA Contact: Arne.F <arne.fitzenreiter>
CC: bbitsch, chemobejk, fkienker, samuel, zandhi
Attachments:
Workaround: Bash script to restart dnsmasq only when it crashed
dnsmasq 2.72 for ipfire
dnsmasq 2.72 with patches
dnsmasq 2.73 test version

Description Michael Tremer 2014-08-28 20:04:26 UTC
There have been numerous (sometimes unconfirmed) claims that dnsmasq segfaults after a while. Reasons are unclear and the error is not easily reproducible.

> dnsmasq[21866]: segfault at 0 ip 0805c2a5 sp 5fde30b0 error 4 in dnsmasq[8048000+2f000] gid/egid:0/0

http://forum.ipfire.org/index.php?topic=11401.0
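
For reference, the kernel message quoted above can be read field by field. The following is a minimal POSIX shell sketch, not part of the original report; the helper name decode_segfault is hypothetical. It relies on the x86 page-fault error code being a bit mask (bit 0: protection violation, otherwise page not present; bit 1: write, otherwise read; bit 2: user mode), so "error 4" at address 0 is a user-mode read through a NULL pointer. It also prints the offset of the faulting instruction inside the mapped image (ip minus the mapping base).

```shell
#!/bin/sh
# Sketch: decode a kernel "segfault at ..." message.
# decode_segfault is a hypothetical helper, not part of IPFire or dnsmasq.
decode_segfault() (
    # Extract: fault address, instruction pointer, error code, image, base.
    fields=$(printf '%s\n' "$1" | sed -n \
        's/.*segfault at \([0-9a-f]*\) ip \([0-9a-f]*\) sp [0-9a-f]* error \([0-9]*\) in \([^[]*\)\[\([0-9a-f]*\)+.*/\1 \2 \3 \4 \5/p')
    [ -n "$fields" ] || exit 1        # not a segfault line
    set -- $fields
    addr=$1 ip=$2 err=$3 image=$4 base=$5

    # Interpret the x86 page-fault error code bits.
    cause=not-present; access=read; mode=kernel; null=no
    if [ $(( $err & 1 )) -ne 0 ]; then cause=protection; fi
    if [ $(( $err & 2 )) -ne 0 ]; then access=write; fi
    if [ $(( $err & 4 )) -ne 0 ]; then mode=user; fi
    if [ $(( 0x$addr )) -eq 0 ]; then null=yes; fi

    # Offset of the faulting instruction inside the mapped image.
    printf 'image=%s offset=0x%x cause=%s access=%s mode=%s null=%s\n' \
        "$image" $(( 0x$ip - 0x$base )) "$cause" "$access" "$mode" "$null"
)
```

Fed the line above, this reports a user-mode read of a non-present page at address 0, at offset 0x142a5 into the dnsmasq image.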
Comment 1 Fred Kienker 2014-08-29 16:04:05 UTC
This issue only started after the core 81 update. The system in question has been running IPFire *without issues* for several years. The hardware has been checked for proper operation. The system connects to a US internet provider (Charter Cable) via a pseudo-static connection: it uses DHCP to obtain a "fixed" IP address from the provider. The issue occurs about 48 hours after the last reboot, which more or less coincides with the renewal time of the DHCP lease from the provider. When it occurs there is *no* connectivity to the IPFire box and no access to the web interface or to an SSH session. The box can still be accessed from the console and the system appears to be "live", but all routing and communication via the network interfaces has stopped. Rebooting cures the issue until the end of the next 48-hour period. Attached is the kernel log which shows the crash.

15:16:13	kernel: 	dnsmasq[1333]: segfault at 0 ip 0805c2a5 sp b0ddec10 error 4 in dnsmasq[8048000+2f000]
15:16:13	kernel: 	grsec: Segmentation fault occurred at (nil) in /usr/sbin/dnsmasq[dnsmasq:1333] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
15:16:13	kernel: 	grsec: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes. Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:1333] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Comment 2 Michael Tremer 2014-08-30 05:08:58 UTC
Thanks for the details.

I know of people who are experiencing this problem with static IP assignments or DSL. So I guess we can rule out the DHCP client option being the cause of the segmentation fault. I have a cable connection using DHCP at home as well without problems.

I also got reports from people that the problem *seems* to happen quickly after dnsmasq has been started. If you say that you have this problem after about 48 hours, this would not confirm that first impression.

I *assume* with absolutely no evidence that the crash may be caused by malformed responses from the upstream nameservers. Some servers seem to trigger the problem more often than others.
Comment 3 Samuel 2014-09-07 19:16:13 UTC
I also have some problems with dnsmasq. I did not notice anything before core update 81.

07:02:27	kernel: 	dnsmasq[7873]: segfault at 0 ip 0805c2a5 sp 59e4a000 error 4 in dnsmasq[8048000+2f000]
07:02:27	kernel: 	grsec: Segmentation fault occurred at (nil) in /usr/sbin/dnsmasq[dnsmasq:7873] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
07:02:27	kernel: 	grsec: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes. Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:7873] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0

Occasionally, the dnsmasq process consumes nearly all CPU (95 to 98 %) after crashing.
The DNS log looks unsuspicious:
07:04:00	dnsmasq[8248]: 	started, version 2.71 cachesize 2500
07:04:00	dnsmasq[8248]: 	compile time options: IPv6 GNU-getopt no-DBus no-i18n IDN no-DHCP no-TFTP no-conntrack ipset auth DNSSEC
07:04:00	dnsmasq[8248]: 	DNSSEC validation enabled
07:04:00	dnsmasq[8248]: 	reading /var/ipfire/red/resolv.conf
07:04:00	dnsmasq[8248]: 	using nameserver 195.50.140.248#53
07:04:00	dnsmasq[8248]: 	using nameserver 195.50.140.246#53
07:04:00	dnsmasq[8248]: 	reading /var/state/dhcp/dhcpd.leases
07:04:00	dnsmasq[8248]: 	read /etc/hosts - 3 addresses

Engineer's workaround: execute "/etc/init.d/dnsmasq restart" regularly via cron ...
Comment 4 Fred Kienker 2014-09-08 16:08:35 UTC
Samuel, how often does this cron job to restart run?
Michael, I am going to replace the affected system with new hardware (same brand, type and specs) and see if this provides some relief.
Comment 5 Bernhard Bitsch 2014-09-08 16:22:45 UTC
Just a proposal for investigation:
It is reported that this error is coupled to DHCP renewal on RED. If there is a memory problem, it would be interesting to know the interval until the failure, measured in number of renewals. An elapsed time (e.g. 48 h) alone is not sufficient.
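
One way to measure the interval in renewals rather than hours is to count the dhcpcd lease messages that appear between consecutive dnsmasq segfault messages in /var/log/messages. The following is only a sketch: it assumes syslog lines of the form "dhcpcd[PID]: red0: leased ..." and "kernel: dnsmasq[PID]: segfault ...", and the function name renewals_between_crashes is made up for illustration.

```shell
#!/bin/sh
# Sketch: count DHCP lease messages on RED between dnsmasq segfaults.
# renewals_between_crashes is a hypothetical helper; the log patterns
# below are assumptions about the syslog format.
renewals_between_crashes() {
    awk '
        # Every lease message from dhcpcd on red0 counts as one renewal.
        /dhcpcd\[[0-9]+\]: red0: leased / { renewals++ }

        # On each segfault, report the count since the previous crash
        # (or since the start of the log) and start counting again.
        /kernel: dnsmasq\[[0-9]+\]: segfault / {
            printf "%s %s %s crash after %d renewal(s)\n", $1, $2, $3, renewals
            renewals = 0
        }
    ' "$@"
}
```

It can be run as "renewals_between_crashes /var/log/messages". If the crashes were tied to lease renewal, the reported counts should be roughly constant; widely varying counts would argue against a link.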
Comment 6 Michael Tremer 2014-09-08 20:22:09 UTC
Changing hardware should not change anything with this problem. The only workaround is to disable DNSSEC on affected systems.

(In reply to comment #5)
> Just an proposal for investigation:
> It is reported that this error is coupled to DHCP renewal on RED. If there
> is a memory problem, it would be interesting to know the interval until the
> failure measured in number of renewals. A time ( e.g. 48h ) is not
> sufficient.

The problem happens on systems with static IP address assignments and PPPoE dial-in, too. So I don't see that this is in any way linked to the DHCP client.
Comment 7 Samuel 2014-09-08 20:29:24 UTC
Fred: At the moment, every five minutes. This might seem often, but I want my VoIP phone to work. A cleaner workaround would be a watchdog script that tests whether dnsmasq is still running properly and restarts the service only when needed. Unfortunately, my limited shell scripting knowledge prevents me from implementing this.

Bernhard: I could not see any correlation. Usually, I get a new IP once a day (DSL, PPPoE, provider forces daily disconnect).

I am willing to contribute any log or other information requested.
Hardware: http://fireinfo.ipfire.org/profile/54235a991f7323bd440a4f82ac0c56679dc4b91b

Over the course of the day, there were three more crashes:
08:59:42	kernel: 	dnsmasq[16261]: segfault at 0 ip 0805c2a5 sp 5d265670 error 4 in dnsmasq[8048000+2f000]
08:59:42	kernel: 	grsec: Segmentation fault occurred at (nil) in /usr/sbin/dnsmasq[dnsmasq:16261] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
08:59:42	kernel: 	grsec: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes. Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:16261] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0

10:01:18	kernel: 	dnsmasq[19403]: segfault at 0 ip 0805c2a5 sp 5bed7d50 error 4 in dnsmasq[8048000+2f000]
10:01:18	kernel: 	grsec: Segmentation fault occurred at (nil) in /usr/sbin/dnsmasq[dnsmasq:19403] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
10:01:18	kernel: 	grsec: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes. Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:19403] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0

11:08:46	kernel: 	dnsmasq[22547]: segfault at 0 ip 0805c2a5 sp 5ca9ee70 error 4 in dnsmasq[8048000+2f000]
11:08:46	kernel: 	grsec: Segmentation fault occurred at (nil) in /usr/sbin/dnsmasq[dnsmasq:22547] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
11:08:46	kernel: 	grsec: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes. Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:22547] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Comment 8 Fred Kienker 2014-09-08 21:05:05 UTC
Michael, you are certainly correct that changing hardware should have nothing to do with the problem. But this is a production system and the end users' frustration with the issue has reached a breaking point. I'm trying to eliminate several possibilities at once and potentially cure the issue for those affected. At the *same* time this retains the "problem child" in its entirety for further testing.

This install was done several years ago and has been updated numerous times over the years without issues until core 81. If the accumulation of many updates is the root of the problem, it will be nearly impossible for you to duplicate in a controlled setting. If the problem remains even after the substitution, then I can safely assume it is an ISP or core 81 issue.
Comment 9 Samuel 2014-09-21 23:21:26 UTC
Created attachment 222 [details]
Workaround: Bash script to restart dnsmasq only when it crashed

I put the script in /etc/fcron.minutely/. It has caught three crashes so far, so I suppose it works.

Simple logic: Does dnsmasq have NO PID? => restart
Does it have a PID, but uses more than 90% CPU? => restart

This helps me to keep downtime short and avoids filling the DNS log (compared to an unconditional restart every 5 minutes).
If you test the script with another process, please be aware that it only works for processes with a single PID. httpd, for example, does not work.
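
For readers who cannot open the attachment, the logic described above might look roughly like the following. This is a from-scratch sketch and NOT the attached script: the function names, the RESTART_CMD variable and the use of pidof and ps are assumptions. Note that ps reports CPU usage averaged over the lifetime of the process, so a long-running dnsmasq that only recently started spinning would be detected late.

```shell
#!/bin/sh
# Watchdog sketch: restart dnsmasq when it is gone or spinning.
# This is NOT the attached script; names and structure are illustrative.
CPU_LIMIT=90                                   # percent, as described above
RESTART_CMD=${RESTART_CMD:-/etc/init.d/dnsmasq restart}

# Decide from a PID list and a CPU percentage; prints "restart" or "ok".
needs_restart() (
    pids=$1
    cpu=${2:-0}
    cpu=${cpu%%[.,]*}                          # drop the fractional part
    if [ -z "$pids" ]; then
        echo restart                           # no PID: dnsmasq has crashed
    elif [ "${cpu:-0}" -gt "$CPU_LIMIT" ]; then
        echo restart                           # alive, but using too much CPU
    else
        echo ok
    fi
)

# Probe the running system and restart the service if required.
watchdog() {
    pids=$(pidof dnsmasq || true)
    cpu=0
    if [ -n "$pids" ]; then
        # Only the first PID is inspected, hence the single-PID limitation.
        cpu=$(ps -o %cpu= -p "${pids%% *}" | tr -d ' ')
    fi
    if [ "$(needs_restart "$pids" "$cpu")" = restart ]; then
        $RESTART_CMD
    fi
}
```

To use it from /etc/fcron.minutely/, the script would end with a call to "watchdog". Keeping the decision in needs_restart separate from the probing makes the logic easy to check without touching the running service.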
Comment 10 Michael Tremer 2014-09-26 18:14:51 UTC
Created attachment 224 [details]
dnsmasq 2.72 for ipfire

I built the latest release of dnsmasq for IPFire. I already gave the second release candidate of dnsmasq 2.72 some testing and it still looks good to me, although I am not able to reproduce the issue at all.

Please copy the attached file to /usr/sbin/dnsmasq and restart the service. There are some changes that make me hope that this update may fix our troubles. Please don't forget to send me your feedback!
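
A possible way to do this from the shell is sketched below. The helper name install_test_binary and the backup to a ".orig" file are additions for illustration, not something the instructions above require.

```shell
#!/bin/sh
# Sketch: replace the dnsmasq binary with a test build, keeping a backup.
# install_test_binary is a hypothetical helper; the ".orig" backup is an
# extra safety step, not part of the original instructions.
install_test_binary() (
    new=$1
    target=${2:-/usr/sbin/dnsmasq}
    [ -f "$new" ] || { echo "no such file: $new" >&2; exit 1; }

    # Keep the first original binary so the change can be reverted later.
    if [ -f "$target" ] && [ ! -e "$target.orig" ]; then
        cp -p "$target" "$target.orig"
    fi

    # Copy the new binary into place and make it executable.
    install -m 0755 "$new" "$target"
)
```

Assuming the attachment was saved as ./dnsmasq, this would be run as root with "install_test_binary ./dnsmasq && /etc/init.d/dnsmasq restart".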

The changelog:

version 2.72
            Add ra-advrouter mode, for RFC-3775 mobile IPv6 support.

	    Add support for "ipsets" in *BSD, using pf. Thanks to 
	    Sven Falempim for the patch.

	    Fix race condition which could lock up dnsmasq when an 
	    interface goes down and up rapidly. Thanks to Conrad 
	    Kostecki for helping to chase this down.

	    Add DBus methods SetFilterWin2KOption and SetBogusPrivOption
	    Thanks to the Smoothwall project for the patch.

	    Fix failure to build against Nettle-3.0. Thanks to Steven 
	    Barth for spotting this and finding the fix. 
	    
	    When assigning existing DHCP leases to intefaces by comparing 
	    networks, handle the case that two or more interfaces have the
	    same network part, but different prefix lengths (favour the
	    longer prefix length.) Thanks to Lung-Pin Chang for the 
	    patch.
	    
	    Add a mode which detects and removes DNS forwarding loops, ie 
	    a query sent to an upstream server returns as a new query to 
	    dnsmasq, and would therefore be forwarded again, resulting in 
	    a query which loops many times before being dropped. Upstream
	    servers which loop back are disabled and this event is logged.
	    Thanks to Smoothwall for their sponsorship of this feature.

	    Extend --conf-dir to allow filtering of files. So
	    --conf-dir=/etc/dnsmasq.d,\*.conf
	    will load all the files in /etc/dnsmasq.d which end in .conf
 
            Fix bug when resulted in NXDOMAIN answers instead of NODATA in
            some circumstances.

	    Fix bug which caused dnsmasq to become unresponsive if it 
	    failed to send packets due to a network interface disappearing.
	    Thanks to Niels Peen for spotting this.
	    	    
            Fix problem with --local-service option on big-endian platforms
	    Thanks to Richard Genoud for the patch.
Comment 11 Stefan Becker 2014-09-27 08:57:46 UTC
I can also confirm the same segfault, although it happens rarely: I had only 2 crashes. It also started for me after updating to core 81.

My IPFire box is connected to a bridge port on the FTTH terminal and uses DHCP on RED. Although the IP is dynamic, it rarely changes. As far as I can tell, the dnsmasq crashes are not related to DHCP lease renewal.


# ls -lht  /var/log/pakfire/ | head
total 3.8M
-rw-r--r-- 1 root root 5.1K 2014-09-23 21:53 update-core-upgrade-82.log
-rw-r--r-- 1 root root 2.4K 2014-08-11 16:56 update-core-upgrade-81.log
-rw-r--r-- 1 root root 8.1K 2014-08-05 13:58 update-core-upgrade-80.log


# head /var/log/messages
Aug  3 01:01:02 ipfire syslogd 1.5.0: restart.


# fgrep -e segfault -e grsec -e leased /var/log/messages
....
Sep  3 05:01:44 ipfire dhcpcd[1840]: red0: leased 83.148.237.234 for 28800 seconds
Sep  3 06:55:57 ipfire kernel: dnsmasq[1538]: segfault at 0 ip 0805c2a5 sp 5ce68040 error 4 in dnsmasq[8048000+2f000]
Sep  3 06:55:57 ipfire kernel: grsec: Segmentation fault occurred at    (nil) in /usr/sbin/dnsmasq[dnsmasq:1538] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Sep  3 06:55:58 ipfire kernel: grsec: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes.  Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:1538] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Sep  3 09:01:44 ipfire dhcpcd[1840]: red0: leased 83.148.237.234 for 28800 seconds
....
Sep 26 23:33:25 ipfire dhcpcd[1840]: red0: leased 83.148.237.234 for 7200 seconds
Sep 27 00:16:27 ipfire kernel: dnsmasq[10290]: segfault at 0 ip 0805c2a5 sp 59aa2ca0 error 4 in dnsmasq[8048000+2f000]
Sep 27 00:16:27 ipfire kernel: grsec: From 192.168.2.2: Segmentation fault occurred at    (nil) in /usr/sbin/dnsmasq[dnsmasq:10290] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Sep 27 00:16:27 ipfire kernel: grsec: From 192.168.2.2: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes.  Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:10290] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Sep 27 00:33:25 ipfire dhcpcd[1840]: red0: leased 83.148.237.234 for 7200 seconds


I'll update to the new binary now.
Comment 12 Michael Tremer 2015-01-02 12:25:30 UTC
Created attachment 243 [details]
dnsmasq 2.72 with patches

Please test the attached updated version of dnsmasq. It includes some patches from Simon Kelley which *might* fix the crashes with DNSSEC.

http://git.ipfire.org/?p=ipfire-2.x.git;a=commitdiff;h=b56472d49b2e50d6c8f84023b80c3ee43114bfe1
Comment 13 Michael Tremer 2015-02-02 00:32:10 UTC
Created attachment 254 [details]
dnsmasq 2.73 test version

This is an updated version taken from the git snapshot as of today.
Comment 14 Fred Kienker 2015-02-04 16:01:37 UTC
I have put the updated program into production on three systems. So far there have been no issues with the new program. I will report back after a few days as to whether there are any. The difficulty is that the issue was isolated and infrequent, so it may take a while to confirm.
Comment 15 Zandhi 2015-03-02 11:53:01 UTC
Hello,
I had problems with this, which forced me to reboot my IPFire every 2 days; otherwise memory would fill up and excessive swapping made the Internet unreachable.
The top command showed the main culprit for memory to be dnsmasq so my search landed here.
I tested the command "/etc/init.d/dnsmasq restart" a few times and found that the triggered restart consistently solved my memory consumption issue, so I went ahead and installed the "dnsmasq 2.73 test version" provided here and have had no issues since then (2.5 weeks ago).

Happy that this problem seems to be solved with that. Thanks for the documentation, guys :)

I will now install core update 87 and watch closely how memory behaves and replace with the dnsmasq 2.73 test version again if needed.

thnx n take care
Comment 16 Michael Tremer 2015-04-29 20:34:14 UTC
Alright. This seems to be solved with the update of dnsmasq that was shipped with Core Update 89.