Summary: | dnsmasq segfaults after a while | ||
---|---|---|---|
Product: | IPFire | Reporter: | Michael Tremer <michael.tremer> |
Component: | --- | Assignee: | Michael Tremer <michael.tremer> |
Status: | CLOSED FIXED | QA Contact: | Arne.F <arne.fitzenreiter> |
Severity: | Major Usability | ||
Priority: | Will affect most users | CC: | bbitsch, chemobejk, fkienker, samuel, zandhi |
Version: | 2 | ||
Hardware: | x86_64 | ||
OS: | All | ||
Attachments: |
Workaround: Bash script to restart dnsmasq only when it crashed
dnsmasq 2.72 for ipfire
dnsmasq 2.72 with patches
dnsmasq 2.73 test version |
Description
Michael Tremer
2014-08-28 20:04:26 UTC
This issue only started after the core 81 update. The system in question has been running IPFire *without issues* for several years. The hardware has been checked for proper operation. The system connects to a US internet provider (Charter Cable) via a pseudo-static connection: it uses DHCP to obtain a "fixed" IP address from the provider. The issue occurs about 48 hours after the last reboot, which more or less coincides with the DHCP renewal time with the provider. When it occurs there is *no* connectivity to the IPFire box and no access to the web page or to an SSH session. It can be accessed from the console and the system appears to be "live", but all routing and communication via the network interfaces have stopped. Rebooting cures the issue until the end of the next 48 hour period. Attached is the kernel log which shows the crash.
15:16:13 kernel: dnsmasq[1333]: segfault at 0 ip 0805c2a5 sp b0ddec10 error 4 in dnsmasq[8048000+2f000]
15:16:13 kernel: grsec: Segmentation fault occurred at (nil) in /usr/sbin/dnsmasq[dnsmasq:1333] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
15:16:13 kernel: grsec: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes. Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:1333] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0

Thanks for the details. I know of people who are experiencing this problem with static IP assignments or DSL. So I guess we can rule out the DHCP client being the cause of the segmentation fault. I have a cable connection using DHCP at home as well, without problems. I also got reports from people that the problem *seems* to happen quickly after dnsmasq has been started. If you say that you have this problem after about 48 hours, this would not confirm that first impression.
I *assume* with absolutely no evidence that the crash may be caused by malformed responses from the upstream nameservers. Some servers seem to trigger the problem more often than others.

I also have some problems with dnsmasq. I did not notice anything before core update 81.
07:02:27 kernel: dnsmasq[7873]: segfault at 0 ip 0805c2a5 sp 59e4a000 error 4 in dnsmasq[8048000+2f000]
07:02:27 kernel: grsec: Segmentation fault occurred at (nil) in /usr/sbin/dnsmasq[dnsmasq:7873] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
07:02:27 kernel: grsec: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes. Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:7873] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Only sometimes does the dnsmasq process consume all CPU (95 to 98 %) after crashing. The DNS log looks unremarkable:
07:04:00 dnsmasq[8248]: started, version 2.71 cachesize 2500
07:04:00 dnsmasq[8248]: compile time options: IPv6 GNU-getopt no-DBus no-i18n IDN no-DHCP no-TFTP no-conntrack ipset auth DNSSEC
07:04:00 dnsmasq[8248]: DNSSEC validation enabled
07:04:00 dnsmasq[8248]: reading /var/ipfire/red/resolv.conf
07:04:00 dnsmasq[8248]: using nameserver 195.50.140.248#53
07:04:00 dnsmasq[8248]: using nameserver 195.50.140.246#53
07:04:00 dnsmasq[8248]: reading /var/state/dhcp/dhcpd.leases
07:04:00 dnsmasq[8248]: read /etc/hosts - 3 addresses
Engineer's workaround: execute "/etc/init.d/dnsmasq restart" regularly via cron ...

Samuel, how often does this cron job run the restart? Michael, I am going to replace the affected system with new hardware (same brand, type and specs) and see if this provides some relief.

Just a proposal for investigation: It is reported that this error is coupled to DHCP renewal on RED.
If there is a memory problem, it would be interesting to know the interval until the failure measured in number of renewals. A time (e.g. 48h) is not sufficient.

Changing hardware should not change anything with this problem. The only workaround is to disable DNSSEC on affected systems.

(In reply to comment #5)
> Just a proposal for investigation:
> It is reported that this error is coupled to DHCP renewal on RED. If there
> is a memory problem, it would be interesting to know the interval until the
> failure measured in number of renewals. A time (e.g. 48h) is not
> sufficient.
The problem happens on systems with static IP address assignments and PPPoE dial-in, too. So I don't see that this is in any way linked to the DHCP client.

Fred: At the moment, every five minutes. This might seem often, but I want my VoIP phone to work. A cleaner workaround would be a watchdog script that tests whether dnsmasq is still running properly and restarts the service only on demand. Unfortunately, my limited shell script knowledge prevents me from implementing this.
Bernhard: I could not see any correlation. Usually, I get a new IP once a day (DSL, PPPoE, provider forces daily disconnect).
I am willing to contribute any log or other information requested.
Hardware: http://fireinfo.ipfire.org/profile/54235a991f7323bd440a4f82ac0c56679dc4b91b

Over the course of the day, there were three more crashes:
08:59:42 kernel: dnsmasq[16261]: segfault at 0 ip 0805c2a5 sp 5d265670 error 4 in dnsmasq[8048000+2f000]
08:59:42 kernel: grsec: Segmentation fault occurred at (nil) in /usr/sbin/dnsmasq[dnsmasq:16261] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
08:59:42 kernel: grsec: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes.
Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:16261] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
10:01:18 kernel: dnsmasq[19403]: segfault at 0 ip 0805c2a5 sp 5bed7d50 error 4 in dnsmasq[8048000+2f000]
10:01:18 kernel: grsec: Segmentation fault occurred at (nil) in /usr/sbin/dnsmasq[dnsmasq:19403] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
10:01:18 kernel: grsec: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes. Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:19403] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
11:08:46 kernel: dnsmasq[22547]: segfault at 0 ip 0805c2a5 sp 5ca9ee70 error 4 in dnsmasq[8048000+2f000]
11:08:46 kernel: grsec: Segmentation fault occurred at (nil) in /usr/sbin/dnsmasq[dnsmasq:22547] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
11:08:46 kernel: grsec: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes. Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:22547] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0

Michael, you are certainly correct that changing hardware should have nothing to do with the problem. But this is a production system and the end users' frustration with the issues has reached a breaking point. I'm trying to eliminate several possibilities at once and potentially cure the issue for those affected. At the *same* time this retains the "problem child" in its entirety for further testing. This install was done several years ago and has been updated numerous times over the years without issues until v81.
If the accumulation of many updates is the root of the problem, this will be nearly impossible for you to duplicate in a controlled setting. If the problem remains even after the substitution, then I can safely assume it is an ISP or v81 issue.

Created attachment 222 [details]
Workaround: Bash script to restart dnsmasq only when it crashed
I put the script in /etc/fcron.minutely/. It has caught three crashes so far, so I suppose it works.
Simple logic: Does dnsmasq have NO PID? => restart
Does it have a PID, but use more than 90% CPU? => restart
This helps me to keep downtime short and avoids filling the DNS log (compared to an unconditional restart every 5 minutes).
If you test the script with another process, please be aware that it only handles processes with a single PID. httpd, for example, does not work ...
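The two rules described above can be sketched roughly as follows. This is not the content of attachment 222, which is not reproduced in this report; it is a minimal illustration, and the function names `needs_restart` and `check_dnsmasq` as well as the `CPU_LIMIT` variable are made up for the example. Note also that the `%cpu` value reported by `ps` is averaged over the lifetime of the process, so a real watchdog may want a different measurement.

```shell
#!/bin/bash
# Hypothetical watchdog sketch; NOT the script from attachment 222.

CPU_LIMIT=90   # percent; threshold taken from the comment above

# Decide whether a restart is needed.
# $1 = PID of dnsmasq (empty if it is not running)
# $2 = CPU usage in percent, integer part only
needs_restart() {
    local pid="$1" cpu="${2:-0}"
    [ -z "$pid" ] && return 0                   # no PID: the process is gone
    [ "$cpu" -gt "$CPU_LIMIT" ] && return 0     # PID exists but CPU is too high
    return 1                                    # looks healthy
}

# Collect the current state and restart the service if necessary.
check_dnsmasq() {
    local pid cpu=0
    pid="$(pidof -s dnsmasq)"                   # -s: return a single PID only
    if [ -n "$pid" ]; then
        # ps prints %CPU with decimals; keep only the integer part
        cpu="$(ps -o %cpu= -p "$pid" | awk '{print int($1)}')"
    fi
    if needs_restart "$pid" "$cpu"; then
        /etc/init.d/dnsmasq restart
    fi
}
```

To use something like this from /etc/fcron.minutely/, the script would end with a call to `check_dnsmasq`. Keeping the decision in `needs_restart` separate from the data collection makes the rule easy to try out by hand, for example `needs_restart "" 0 && echo restart`.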
Created attachment 224 [details]
dnsmasq 2.72 for ipfire
I built the latest release of dnsmasq for IPFire. I already gave the second release candidate of dnsmasq 2.72 some testing and it still looks good to me, although I am not able to reproduce the issue at all.
Please copy the attached file to /usr/sbin/dnsmasq and restart the service. There are some changes that let me hope that this update may fix our troubles. Please don't forget to send me your feedback!
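For anyone who wants to keep the old binary around while testing, the copy step could look roughly like the sketch below. The helper name `install_binary` and the download location in the example are assumptions for illustration only; the only paths taken from this report are /usr/sbin/dnsmasq and the init script.

```shell
#!/bin/bash
# Sketch: replace a binary while keeping a backup of the previous one.
# $1 = downloaded file (location is an assumption), $2 = target path
install_binary() {
    local src="$1" dst="$2"
    [ -f "$src" ] || { echo "missing: $src" >&2; return 1; }
    # Keep the previous binary so the change can be rolled back
    if [ -f "$dst" ]; then
        cp -p "$dst" "$dst.bak" || return 1
    fi
    # Copy the new file into place and make it executable
    install -m 0755 "$src" "$dst"
}

# Example (the /tmp path is hypothetical):
#   install_binary /tmp/dnsmasq /usr/sbin/dnsmasq && /etc/init.d/dnsmasq restart
```

If the new version misbehaves, copying /usr/sbin/dnsmasq.bak back and restarting the service restores the previous state.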
The changelog:
version 2.72

Add ra-advrouter mode, for RFC-3775 mobile IPv6 support.

Add support for "ipsets" in *BSD, using pf. Thanks to
Sven Falempim for the patch.

Fix race condition which could lock up dnsmasq when an
interface goes down and up rapidly. Thanks to Conrad
Kostecki for helping to chase this down.

Add DBus methods SetFilterWin2KOption and SetBogusPrivOption
Thanks to the Smoothwall project for the patch.

Fix failure to build against Nettle-3.0. Thanks to Steven
Barth for spotting this and finding the fix.

When assigning existing DHCP leases to intefaces by comparing
networks, handle the case that two or more interfaces have the
same network part, but different prefix lengths (favour the
longer prefix length.) Thanks to Lung-Pin Chang for the
patch.

Add a mode which detects and removes DNS forwarding loops, ie
a query sent to an upstream server returns as a new query to
dnsmasq, and would therefore be forwarded again, resulting in
a query which loops many times before being dropped. Upstream
servers which loop back are disabled and this event is logged.
Thanks to Smoothwall for their sponsorship of this feature.

Extend --conf-dir to allow filtering of files. So
--conf-dir=/etc/dnsmasq.d,\*.conf
will load all the files in /etc/dnsmasq.d which end in .conf

Fix bug when resulted in NXDOMAIN answers instead of NODATA in
some circumstances.

Fix bug which caused dnsmasq to become unresponsive if it
failed to send packets due to a network interface disappearing.
Thanks to Niels Peen for spotting this.

Fix problem with --local-service option on big-endian platforms
Thanks to Richard Genoud for the patch.

I can also confirm the same segfault, although it happens rarely: I have had only 2 crashes. It also started for me after updating to core 81. My IPFire box is connected to a bridge port on the FTTH terminal and uses DHCP on RED. Although the IP is dynamic, it rarely changes. As far as I can tell, the dnsmasq crashes are not related to DHCP lease renewal.
# ls -lht /var/log/pakfire/ | head
total 3.8M
-rw-r--r-- 1 root root 5.1K 2014-09-23 21:53 update-core-upgrade-82.log
-rw-r--r-- 1 root root 2.4K 2014-08-11 16:56 update-core-upgrade-81.log
-rw-r--r-- 1 root root 8.1K 2014-08-05 13:58 update-core-upgrade-80.log
# head /var/log/messages
Aug 3 01:01:02 ipfire syslogd 1.5.0: restart.
# fgrep -e segfault -e grsec -e leased /var/log/messages
....
Sep 3 05:01:44 ipfire dhcpcd[1840]: red0: leased 83.148.237.234 for 28800 seconds
Sep 3 06:55:57 ipfire kernel: dnsmasq[1538]: segfault at 0 ip 0805c2a5 sp 5ce68040 error 4 in dnsmasq[8048000+2f000]
Sep 3 06:55:57 ipfire kernel: grsec: Segmentation fault occurred at (nil) in /usr/sbin/dnsmasq[dnsmasq:1538] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Sep 3 06:55:58 ipfire kernel: grsec: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes. Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:1538] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Sep 3 09:01:44 ipfire dhcpcd[1840]: red0: leased 83.148.237.234 for 28800 seconds
....
Sep 26 23:33:25 ipfire dhcpcd[1840]: red0: leased 83.148.237.234 for 7200 seconds
Sep 27 00:16:27 ipfire kernel: dnsmasq[10290]: segfault at 0 ip 0805c2a5 sp 59aa2ca0 error 4 in dnsmasq[8048000+2f000]
Sep 27 00:16:27 ipfire kernel: grsec: From 192.168.2.2: Segmentation fault occurred at (nil) in /usr/sbin/dnsmasq[dnsmasq:10290] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Sep 27 00:16:27 ipfire kernel: grsec: From 192.168.2.2: bruteforce prevention initiated due to crash of /usr/sbin/dnsmasq against uid 99, banning suid/sgid execs for 15 minutes. Please investigate the crash report for /usr/sbin/dnsmasq[dnsmasq:10290] uid/euid:99/99 gid/egid:40/40, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Sep 27 00:33:25 ipfire dhcpcd[1840]: red0: leased 83.148.237.234 for 7200 seconds
I'll update to the new binary now.

Created attachment 243 [details]
dnsmasq 2.72 with patches
Please test the attached updated version of dnsmasq. It includes some patches from Simon Kelley which *might* fix the crashes with DNSSEC.
http://git.ipfire.org/?p=ipfire-2.x.git;a=commitdiff;h=b56472d49b2e50d6c8f84023b80c3ee43114bfe1

Created attachment 254 [details]
dnsmasq 2.73 test version
This is an updated version taken from the git snapshot as of today.

I have put the updated program into production on three systems. So far there have been no issues with the new program. I will report back after a few days as to whether there are any. The big problem is that the issue was isolated and infrequent.

Hello, I had problems with this too: I had to reboot my IPFire every 2 days, or memory would fill up and excessive swapping made the Internet unreachable. The top command showed dnsmasq to be the main culprit for the memory use, so my search landed here. I ran "/etc/init.d/dnsmasq restart" a few times and found that the triggered restart consistently solved my memory consumption issue, so I went ahead and installed the "dnsmasq 2.73 test version" provided here and have had no issues since then (2.5 weeks ago). Happy that this problem seems to be solved with that, thanks for the documentation guys :) I will now install core update 87, watch closely how memory behaves, and replace dnsmasq with the 2.73 test version again if needed. Thanks and take care.

Alright. This seems to be solved with the update of dnsmasq that was shipped with Core Update 89. |