Summary: | DHCP hosts not reliably propagated to DNS | ||
---|---|---|---|
Product: | IPFire | Reporter: | Jon <jon.murphy> |
Component: | --- | Assignee: | Aqua Lung <ajrh> |
Status: | CLOSED FIXED | QA Contact: | |
Severity: | Minor Usability | ||
Priority: | - Unknown - | CC: | adolf.belka, ajrh, brad-morgan, horace.michael, kairis, mbatranch, michael.tremer, peter.mueller, s.sheep |
Version: | 2 | ||
Hardware: | x86_64 | ||
OS: | Unspecified | ||
URL: | https://community.ipfire.org/t/dhcp-hosts-not-reliably-propagated-to-dns/3431 | ||
Attachments: |
attachment-2824-0.html
Patch to add more inotify events Log from Added MacMini to green network workaround for python inotify bug patch for the issue |
Description
Jon
2021-09-09 17:08:22 UTC
I am having the exact same problem. When I executed /etc/init.d/dhcp stop I received: Stopping DHCP Server... [ OK ] Stopping Unbound DHCP Leases Bridge... Not running. [ WARN ] If I manually run: /usr/sbin/unbound-dhcp-leases-bridge -vv it does not exit. Created attachment 947 [details] attachment-2824-0.html you need -d to start as a daemon. /usr/sbin/unbound-dhcp-leases-bridge -d -vv On Tue, Oct 19, 2021 at 7:49 AM IPFire Bugzilla <bugzilla@ipfire.org> wrote: > *Comment # 2 <https://bugzilla.ipfire.org/show_bug.cgi?id=12694#c2> on bug > 12694 <https://bugzilla.ipfire.org/show_bug.cgi?id=12694> from Brad Morgan > <brad-morgan@comcast.net> * > > If I manually run: > > /usr/sbin/unbound-dhcp-leases-bridge -vv > > it does not exit. > > ------------------------------ > You are receiving this mail because: > > - You are on the CC list for the bug. > > I ran it without the -d to see if any error messages were printed to the terminal and there were none. I also ran it with the -d and the daemon did start as expected. It also did not immediately exit. I therefore suspect that the process of starting the bridge in /etc/init.d/dhcp is failing to start the daemon. I have been unable to find a log file (or log entry) that would indicate what the problem is. If I add a new host via the hosts.cgi WebGUI page I see the /etc/unbound/dhcp-leases.conf file get updated instantly. Which works as expected. If I change a DHCP leased address to a DHCP fixed lease I see the /etc/unbound/dhcp-leases.conf file get updated instantly. Which works as expected. But if a new device is added to the network, the device receives a DHCP IP address. But there is no update the /etc/unbound/dhcp-leases.conf file. So it appear to be an issue with a new DHCP leased address only… Just to add more info: DHCP hosts not reliably propagated to DNS. Any new device added to the green or blue networks are NOT available to DNS. To force the new device hostname to propagate the user must: • restart the DHCP service • add a new host via the hosts.cgi WebGUI page • change a DHCP leased address to a DHCP fixed lease hope this helps! Oops! strike this like --> "the add a new host via the hosts.cgi WebGUI page". I was wrong about this... Here are other posts that _may_ be related: 01 May 2020 post: https://community.ipfire.org/t/issues-with-hosts-while-upgrading-from-139-to-144/2133 07 May 2020 - not sure about this one... https://community.ipfire.org/t/hostnames-on-green-network-not-recognised/2202 15 July 2021 post: https://community.ipfire.org/t/not-all-local-hosts-resolve/5822 I believe this may be the solution to this problem. https://community.ipfire.org/t/unbound-dhcp-leases-bridge-not-firing-correctly/7004 The /usr/sbin/unbound-dhcp-leases-bridge script monitors a few files and fires an update on changes. It's using a python library inotify to do this, (which is a simply wrapper for kernel inotify). I believe the problem is it's only looking for an update event to the files, and that event isn't always fired depending on how the source executable is modifying the files. At least for my system, triggering in events close_write and delete_self seems to have fixed the issue. So changing line 103 in file /usr/sbin/unbound-dhcp-leases-bridge from:- if "IN_MODIFY" in type_names: to:- if "IN_MODIFY" in type_names or "IN_CLOSE_WRITE" in type_names or "IN_DELETE_SELF" in type_names: Solved the issue for me. In file context :- # Update leases after leases file has been modified #if "IN_MODIFY" in type_names: #(This is original line) if "IN_MODIFY" in type_names or "IN_CLOSE_WRITE" in type_names or "IN_DELETE_SELF" in type_names: # Reload hosts if watch_path == self.hosts_file: self.hosts = self.read_static_hosts() self.update_dhcp_leases() # If the file is deleted, we re-add the watcher if "IN_IGNORED" in type_names: i.add_watch(watch_path) Created attachment 972 [details] Patch to add more inotify events Details in comment #9 https://bugzilla.ipfire.org/show_bug.cgi?id=12694 FYI & probably irrelevant to above issue... https://github.com/ipfire/ipfire-2.x/blob/master/config/unbound/unbound-dhcp-leases-bridge#L143-L144 `if lease.expired: log.debug(" Expired")` I've had double verbose `-vv` on since Sep and there have been no `Expired` entries in the message log. Michael - I made a temporary code change to help debug this issue. I added this `log.info` line to unbound-dhcp-leases-bridge at line 93: `log.info("UDLBevent: " + str(event))` https://github.com/ipfire/ipfire-2.x/blob/master/config/unbound/unbound-dhcp-leases-bridge#L92-L100 Then I add a new device to the green (or blue) network. I can see changes to the '/var/state/dhcp/dhcpd.leases'. But whenever the `/var/state/dhcp/dhcpd.leases` file changes there is no proper INOTIFY_EVENT. I usually see events like this: (_INOTIFY_EVENT(wd=1, mask=32, cookie=0, len=0), ['IN_OPEN'], '/var/state/dhcp/dhcpd.leases', '') (_INOTIFY_EVENT(wd=1, mask=1, cookie=0, len=0), ['IN_ACCESS'], '/var/state/dhcp/dhcpd.leases', '') (_INOTIFY_EVENT(wd=1, mask=16, cookie=0, len=0), ['IN_CLOSE_NOWRITE'], '/var/state/dhcp/dhcpd.leases', '') ... but no other INOTIFY_EVENT signifying a change or modification else. Created attachment 978 [details]
Log from Added MacMini to green network
The enclosed log shows there is no reaction by inotify within the unbound-dhcp-leases-bridge python3 program.
===
Attached is a pdf of the IPFire message log when a new device is added to the network.
This uses to `log.info` mentioned in the last comment.
The blue text within this file is from a separate log to help debug.
The MacMini has not been on my network for greater than 1 year.
Created attachment 982 [details]
workaround for python inotify bug
Feedback welcome on attached horrific patch. It looks like the python inotify module may never have been tested with single file monitoring. If so, proper fix is in upstream python, but first let's see: does this hack solve the problem for everybody?
I made that change in the bridge code, brought in a new laptop to the network and it found it. I will do more testing to make sure it works for all cases. Thanks. When the laptop disconnects and the lease expires, unbound still remembers it. Thanks Paul... OK, it seems like this is an improvement then, but I'm pretty sure it's not right yet. Hence a few notes on what I think is happening, since I'm old and will forget if I don't write it down. So this python bridge script is heroically monitoring all the dhcpd database files and transforming them for use by unbound. It looks like it would work perfectly... if only inotify would cooperate. Inotify is inode-based, so we need to consider changes in the file contents but also the file itself changing. The basic structure of the bridge script is: while (1) { if inotify-event = modified: send the info to unbound if inotify-event = vanished: recreate the inotify watch } And that's very reasonable, other than the python inotify adapter deliberately ignores the request for recreation. Sadly, the accompanying comment is word salad. def add_watch(self, path_unicode, mask=inotify.constants.IN_ALL_EVENTS): _LOGGER.debug("Adding watch: [%s]", path_unicode) # Because there might be race-conditions in the recursive handling (see # the notes in the documentation), we recommend to add watches using # data from a secondary channel, if possible, which means that we might # then be adding it, yet again, if we then receive it in the normal # fashion afterward. if path_unicode in self.__watches: _LOGGER.warning("Path already being watched: [%s]", path_unicode) return One might hope that then that our bridge code could explicitly remove the dead watch first, but no dice, that causes an OS exception because the old watch really has gone, it's just python's book keeping that is wrong. Hence my patch above pokes into the private variables of the inotify class, cleans up the mess, and allows the watch to be successfully re-added. The watch stays alive, and thus the dhcp leases have a chance to eventually reach unbound. Eventually... because there's another problem. Deletion of the leases file is rather brutal, and not really dhcpd's style. The way the leases file is updated is classic old-school Unix: - the new data is written out to a temporary file - the existing file is linked to an extra backup path ("dhcp.leases~") - the new data is moved on top of the dhcp.leases path The effect of this is that rather like Mission Impossible or Ocean's 11 or something, the entire switch happens quietly without really triggering inotify at all, and our bridge code is left blissfully unaware and now monitoring the backup file dhcp.leases~ instead. I suspect it would sort itself out in the end, because the leases file will eventually get written out a second time, the old backup file will be replaced and thus deleted, and so the watch recreation will happen, leading to permadeath before the previous patch but now retracking of the correct file. But yuck what a nasty bug now, in that new leases will appear in unbound but only after the delay of one other unrelated lease renewal. So umm does someone own this bridge script and prefer to fix it themselves? I can have a crack at it otherwise. Created attachment 995 [details]
patch for the issue
OK, attached hopefully a proper fix. The simplest resolution is to monitor the containing directories instead of the files, which is most robust way of using inotify anyway as per above. With this approach the python bug is no longer triggered because there is never a need to replace the inotify watch, so the previous patch is superseded.
Feedback welcome. I've now assigned the bug to myself assuming that is appropriate protocol.
Hi @aqualung, Changing the assignee to yourself (picking this bug up), is perfectly fine to do. Once you have a patch then the process for that is to provide it to the IPFire development mailing list formatted in the correct way. The dev mailing list is used to review and decide on all patches and when correctly formatted it also means that the patches are automatically picked up by the IPFire Patchwork system. Wiki links to help you. See Step 5 in this link https://wiki.ipfire.org/devel/submit-patches link for commit and patch format. https://wiki.ipfire.org/devel/git/commit-messages Hi @aqualung, do you need any help for submitting your patch to the mailinglist? Thanks for the prompt Peter, I've just submitted it to the development list now; hope the format etc is okay. Hi, thank you for submitting this. The format is almost fine, the patches are just missing a "Signed-off-by" tag, which you can add by conducting "git commit -vs" instead of just "git commit -v". See also these wiki pages for further information: - https://wiki.ipfire.org/devel/git/commit-messages - https://wiki.ipfire.org/devel/git/tags I did not review the content of your patches, yet, since I would like to leave this job to Michael, as he is the one being most familiar with Python, and usually has an opinion on changes to Python scripts. He is pretty busy, so please do not get discouraged, should it take a bit more time to have your patches reviewed. Feel free to poke me every now and then to keep things moving. To trace things, I usually add the corresponding Patchwork and Git URLs to a bug, so it is transparent which patches and commits are related to it. In your case, it's just patchwork for the moment, hence: https://patchwork.ipfire.org/project/ipfire/list/?series=2678 All the best, Peter Müller Great, thank you Peter. I'll wait for Michael's feedback therefore, and I can submit revised patches as he needs, will include the signed-off tag then. https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=29a36036100f354a700aa30ad7a682b6488d87f9 https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=6d6e7949cfef37ee719e535ac21ddc7d175caf31 https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=091cb62ada59e4eae22d371c224e3b5cfaf60a0d https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=f961135513cdd6f5e64d91bc6477fcf8a6df941a https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=1918174dbb004e2e32429f5cfaae4a1240b885bc cu 167 Testing … feedback A few tests indicate that the fix is working (thank you Anthony). when a new laptop comes in, it can be pinged by name when the laptop leaves and dhcp lease expires, the record in DNS is removed. Follow up question: I start the dhcp-leases-bridge with -d -vv /usr/bin/python3 /usr/sbin/unbound-dhcp-leases-bridge -d -vv The text “Wakeup of main loop” shows in /var/log/messages at uneven times, sometimes 30s, other times 18s, or 1m 30s. There does not seem to be a pattern. Here’s some log … Apr 12 05:46:44 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:47:50 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:48:14 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:48:32 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:50:02 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:50:20 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:51:51 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:53:21 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:54:21 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:55:03 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:55:09 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:55:51 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:55:57 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:56:02 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:57:32 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:58:14 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:58:20 zotac dhcp[16447]: Wakeup of main loop Apr 12 05:58:50 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:00:14 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:00:32 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:00:38 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:00:43 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:01:25 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:01:31 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:01:37 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:03:13 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:03:31 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:03:49 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:05:19 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:06:20 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:06:26 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:06:32 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:06:50 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:06:56 zotac dhcp[16447]: Wakeup of main loop Apr 12 06:07:02 zotac dhcp[16447]: Wakeup of main loop Yes, this is supposed to happen if you run it with debugging on (-vv). *** Bug 12275 has been marked as a duplicate of this bug. *** Moved to verified as @Paul has confirmed the fix (In reply to Adolf Belka from comment #30) > Moved to verified as @Paul has confirmed the fix I'll add a "thumbs up" also! Michael - I think Paul was referring to the inconsistent timing between the loops. I see the same thing. Some loops happen after 5 seconds. some happen after 90 seconds. To all of those that were involved in this: THANK YOU! I appreciate your efforts. You are all THE BEST! Jon (In reply to Jon from comment #31) > Michael - I think Paul was referring to the inconsistent timing between the > loops. I see the same thing. Some loops happen after 5 seconds. some > happen after 90 seconds. Ah, well the loop just wakes whenever any of the dhcp-related files change, which most commonly is dhcpd logging a lease renewal. So yes, it's going to look a bit random, it could be any time above 5 seconds; that's just how it works. I think I had a log line which printed the triggering file event instead of just "Waking", but regardless the -vv output probably isn't meant to make sense without reading the code. (In reply to Aqua Lung from comment #33) > (In reply to Jon from comment #31) > > Michael - I think Paul was referring to the inconsistent timing between the > > loops. I see the same thing. Some loops happen after 5 seconds. some > > happen after 90 seconds. > > Ah, well the loop just wakes whenever any of the dhcp-related files change, > which most commonly is dhcpd logging a lease renewal. So yes, it's going to > look a bit random, it could be any time above 5 seconds; that's just how it > works. I think I had a log line which printed the triggering file event > instead of just "Waking", but regardless the -vv output probably isn't meant > to make sense without reading the code. Interesting. This was supposed to be very consistent because the loop checks whether the process should terminate itself or not. I felt it should not take longer than 5-ish seconds for that. Maybe we should use the timeout_s parameter in event_gen() so that we won't get stuck there for forever. We cannot rely on any of the files being changed at all. But if it was that, I would miss some more logging output when the daemon actually does something. Oh, I follow now... sorry I imagined we must have been talking about an outer loop. That's very odd. The python inotify library should block in epoll for 1 second by default and after that yield 'None'; that plus the 5 second sleep makes for a wakeup every six seconds, and indeed that's what I'm seeing on my box just now: Apr 13 07:31:01 ipfire dhcp[11625]: Wakeup of main loop Apr 13 07:31:07 ipfire dhcp[11625]: Wakeup of main loop Apr 13 07:31:13 ipfire dhcp[11625]: Wakeup of main loop Apr 13 07:31:19 ipfire dhcp[11625]: Wakeup of main loop Apr 13 07:31:25 ipfire dhcp[11625]: Wakeup of main loop Is the random lost time definitely in the epoll/notify area, or is it possible instead one of the event handlers could be taking that long to return to the main loop? Hmm, I don't remember seeing anything likely so time-consuming... (In reply to Aqua Lung from comment #35) > Oh, I follow now... sorry I imagined we must have been talking about an > outer loop. That's very odd. The python inotify library should block in > epoll for 1 second by default and after that yield 'None'; that plus the 5 > second sleep makes for a wakeup every six seconds, and indeed that's what > I'm seeing on my box just now: > > Apr 13 07:31:01 ipfire dhcp[11625]: Wakeup of main loop > Apr 13 07:31:07 ipfire dhcp[11625]: Wakeup of main loop > Apr 13 07:31:13 ipfire dhcp[11625]: Wakeup of main loop > Apr 13 07:31:19 ipfire dhcp[11625]: Wakeup of main loop > Apr 13 07:31:25 ipfire dhcp[11625]: Wakeup of main loop That is what I observed, too. And that is how it should be. > Is the random lost time definitely in the epoll/notify area, or is it > possible instead one of the event handlers could be taking that long to > return to the main loop? Hmm, I don't remember seeing anything likely so > time-consuming... Well, that depends on how long the lists are that are being parsed. We could add some extra debug logging to find out. Maybe a call to unbound-control blocks? |