Bug 12694

Summary: DHCP hosts not reliably propagated to DNS
Product: IPFire Reporter: Jon <jon.murphy>
Component: ---Assignee: Aqua Lung <ajrh>
Status: CLOSED FIXED QA Contact:
Severity: Minor Usability    
Priority: - Unknown - CC: adolf.belka, ajrh, brad-morgan, horace.michael, kairis, mbatranch, michael.tremer, peter.mueller, s.sheep
Version: 2   
Hardware: x86_64   
OS: Unspecified   
URL: https://community.ipfire.org/t/dhcp-hosts-not-reliably-propagated-to-dns/3431
Attachments: attachment-2824-0.html
Patch to add more inotify events
Log from Added MacMini to green network
workaround for python inotify bug
patch for the issue

Description Jon 2021-09-09 17:08:22 UTC
I’ve been monitoring the files below and I can see changes in /var/state/dhcp/dhcpd.leases (like when you bring a laptop into the green network). But there is no corresponding entry in /etc/unbound/dhcp-leases.conf.

[root@ipfire pmacct]# ls -al {/var/state/dhcp/dhcpd.leases,/var/ipfire/dhcp/fixleases,/etc/unbound/dhcp-leases.conf}
-rw-r--r-- 1 root   root    4833 Sep  6 13:35 /etc/unbound/dhcp-leases.conf
-rw-r--r-- 1 nobody nobody   878 Sep  3 14:18 /var/ipfire/dhcp/fixleases
-rw-r--r-- 1 root   root   12050 Sep  9 11:24 /var/state/dhcp/dhcpd.leases
[root@ipfire pmacct]# 

I think that is created by unbound-dhcp-leases-bridge. But this is beyond me.


see:
https://community.ipfire.org/t/dhcp-hosts-not-reliably-propagated-to-dns/3431
Comment 1 Brad Morgan 2021-10-19 13:11:14 UTC
I am having the exact same problem. When I executed /etc/init.d/dhcp stop I received:

Stopping DHCP Server...                                            [  OK  ]
Stopping Unbound DHCP Leases Bridge...    Not running.             [ WARN ]
Comment 2 Brad Morgan 2021-10-19 14:49:23 UTC
If I manually run:

/usr/sbin/unbound-dhcp-leases-bridge -vv

it does not exit.
Comment 3 paul 2021-10-19 14:51:07 UTC
Created attachment 947 [details]
attachment-2824-0.html

you need -d to start as a daemon.

/usr/sbin/unbound-dhcp-leases-bridge -d -vv


On Tue, Oct 19, 2021 at 7:49 AM IPFire Bugzilla <bugzilla@ipfire.org> wrote:

> *Comment # 2 <https://bugzilla.ipfire.org/show_bug.cgi?id=12694#c2> on bug
> 12694 <https://bugzilla.ipfire.org/show_bug.cgi?id=12694> from Brad Morgan
> <brad-morgan@comcast.net> *
>
> If I manually run:
>
> /usr/sbin/unbound-dhcp-leases-bridge -vv
>
> it does not exit.
>
> ------------------------------
> You are receiving this mail because:
>
>    - You are on the CC list for the bug.
>
>
Comment 4 Brad Morgan 2021-10-19 15:45:17 UTC
I ran it without the -d to see if any error messages were printed to the terminal and there were none.

I also ran it with the -d and the daemon did start as expected. It also did not immediately exit.

I therefore suspect that the process of starting the bridge in /etc/init.d/dhcp is failing to start the daemon. I have been unable to find a log file (or log entry) that would indicate what the problem is.
Comment 5 Jon 2021-10-22 02:28:30 UTC
If I add a new host via the hosts.cgi WebGUI page I see the /etc/unbound/dhcp-leases.conf file get updated instantly.  Which works as expected.

If I change a DHCP leased address to a DHCP fixed lease I see the /etc/unbound/dhcp-leases.conf file get updated instantly.  Which works as expected.

But if a new device is added to the network, the device receives a DHCP IP address.  But there is no update the /etc/unbound/dhcp-leases.conf file.

So it appear to be an issue with a new DHCP leased address only…
Comment 6 Jon 2021-10-22 02:50:08 UTC
Just to add more info:

DHCP hosts not reliably propagated to DNS.  Any new device added to the green or blue networks are NOT available to DNS.

To force the new device hostname to propagate the user must:
• restart the DHCP service
• add a new host via the hosts.cgi WebGUI page
• change a DHCP leased address to a DHCP fixed lease

hope this helps!
Comment 7 Jon 2021-10-23 02:49:49 UTC
Oops!  strike this like  --> "the add a new host via the hosts.cgi WebGUI page".

I was wrong about this...
Comment 9 mut ley 2022-01-11 22:02:44 UTC
I believe this may be the solution to this problem.
https://community.ipfire.org/t/unbound-dhcp-leases-bridge-not-firing-correctly/7004

The /usr/sbin/unbound-dhcp-leases-bridge script monitors a few files and fires an update on changes.  It's using a python library inotify to do this, (which is a simply wrapper for kernel inotify).  I believe the problem is it's only looking for an update event to the files, and that event isn't always fired depending on how the source executable is modifying the files.
At least for my system, triggering in events close_write and delete_self seems to have fixed the issue.

So changing line 103 in file /usr/sbin/unbound-dhcp-leases-bridge
from:-
if "IN_MODIFY" in type_names:
to:-
if "IN_MODIFY" in type_names or "IN_CLOSE_WRITE" in type_names or "IN_DELETE_SELF" in type_names:

Solved the issue for me.

In file context :-
  # Update leases after leases file has been modified
  #if "IN_MODIFY" in type_names: #(This is original line)
  if "IN_MODIFY" in type_names or "IN_CLOSE_WRITE" in type_names or "IN_DELETE_SELF" in type_names:
    # Reload hosts
    if watch_path == self.hosts_file:
       self.hosts = self.read_static_hosts()

       self.update_dhcp_leases()

  # If the file is deleted, we re-add the watcher
  if "IN_IGNORED" in type_names:
    i.add_watch(watch_path)
Comment 10 mut ley 2022-01-12 14:37:50 UTC
Created attachment 972 [details]
Patch to add more inotify events

Details in comment #9
https://bugzilla.ipfire.org/show_bug.cgi?id=12694
Comment 11 Jon 2022-01-19 23:40:37 UTC
FYI & probably irrelevant to above issue...

https://github.com/ipfire/ipfire-2.x/blob/master/config/unbound/unbound-dhcp-leases-bridge#L143-L144

`if lease.expired:
log.debug(" Expired")`

I've had double verbose `-vv` on since Sep and there have been no `Expired` entries in the message log.
Comment 12 Jon 2022-02-05 17:49:04 UTC
Michael - I made a temporary code change to help debug this issue.

I added this `log.info` line to unbound-dhcp-leases-bridge at line 93:
	`log.info("UDLBevent: " + str(event))`

https://github.com/ipfire/ipfire-2.x/blob/master/config/unbound/unbound-dhcp-leases-bridge#L92-L100


Then I add a new device to the green (or blue) network.  I can see changes to the '/var/state/dhcp/dhcpd.leases'.  But whenever the `/var/state/dhcp/dhcpd.leases` file changes there is no proper INOTIFY_EVENT.

I usually see events like this:
(_INOTIFY_EVENT(wd=1, mask=32, cookie=0, len=0), ['IN_OPEN'], '/var/state/dhcp/dhcpd.leases', '')
(_INOTIFY_EVENT(wd=1, mask=1, cookie=0, len=0), ['IN_ACCESS'], '/var/state/dhcp/dhcpd.leases', '')
(_INOTIFY_EVENT(wd=1, mask=16, cookie=0, len=0), ['IN_CLOSE_NOWRITE'], '/var/state/dhcp/dhcpd.leases', '')

... but no other INOTIFY_EVENT signifying a change or modification else.
Comment 13 Jon 2022-02-05 19:31:02 UTC
Created attachment 978 [details]
Log from Added MacMini to green network

The enclosed log shows there is no reaction by inotify within the unbound-dhcp-leases-bridge python3 program.

===

Attached is a pdf of the IPFire message log when a new device is added to the network.

This uses to `log.info` mentioned in the last comment.

The blue text within this file is from a separate log to help debug.

The MacMini has not been on my network for greater than 1 year.
Comment 14 Aqua Lung 2022-02-22 03:02:00 UTC
Created attachment 982 [details]
workaround for python inotify bug

Feedback welcome on attached horrific patch. It looks like the python inotify module may never have been tested with single file monitoring.  If so,  proper fix is in upstream python,  but first let's see: does this hack solve the problem for everybody?
Comment 15 paul 2022-02-25 19:25:17 UTC
I made that change in the bridge code, brought in a new laptop to the network and it found it. I will do more testing to make sure it works for all cases. Thanks.
Comment 16 paul 2022-02-25 20:38:07 UTC
When the laptop disconnects and the lease expires, unbound still remembers it.
Comment 17 Aqua Lung 2022-02-26 23:18:52 UTC
Thanks Paul...  OK, it seems like this is an improvement then,  but I'm pretty sure it's not right yet.   Hence a few notes on what I think is happening,  since I'm old and will forget if I don't write it down.

So this python bridge script is heroically monitoring all the dhcpd database files and transforming them for use by unbound.   It looks like it would work perfectly... if only inotify would cooperate.

Inotify is inode-based,  so we need to consider changes in the file contents but also the file itself changing.

The basic structure of the bridge script is:

    while (1) {
        if inotify-event = modified:  send the info to unbound
        if inotify-event = vanished:  recreate the inotify watch
    }

And that's very reasonable, other than the python inotify adapter deliberately ignores the request for recreation.  Sadly, the accompanying comment is word salad.

    def add_watch(self, path_unicode, mask=inotify.constants.IN_ALL_EVENTS):
        _LOGGER.debug("Adding watch: [%s]", path_unicode)

        # Because there might be race-conditions in the recursive handling (see
        # the notes in the documentation), we recommend to add watches using
        # data from a secondary channel, if possible, which means that we might
        # then be adding it, yet again, if we then receive it in the normal
        # fashion afterward.
        if path_unicode in self.__watches:
            _LOGGER.warning("Path already being watched: [%s]", path_unicode)
            return

One might hope that then that our bridge code could explicitly remove the dead watch first,  but no dice,  that causes an OS exception because the old watch really has gone,  it's just python's book keeping that is wrong.

Hence my patch above pokes into the private variables of the inotify class, cleans up the mess, and allows the watch to be successfully re-added.   The watch stays alive, and thus the dhcp leases have a chance to eventually reach unbound.

Eventually...  because there's another problem.   Deletion of the leases file is rather brutal,  and not really dhcpd's style.   The way the leases file is updated is classic old-school Unix:
    -  the new data is written out to a temporary file
    -  the existing file is linked to an extra backup path  ("dhcp.leases~")
    -  the new data is moved on top of the dhcp.leases path

The effect of this is that rather like Mission Impossible or Ocean's 11 or something,  the entire switch happens quietly without really triggering inotify at all,  and our bridge code is left blissfully unaware and now monitoring the backup file dhcp.leases~ instead.

I suspect it would sort itself out in the end,  because the leases file will eventually get written out a second time,  the old backup file will be replaced and thus deleted, and so the watch recreation will happen,  leading to permadeath before the previous patch but now retracking of the correct file.

But yuck what a nasty bug now,  in that new leases will appear in unbound but only after the delay of one other unrelated lease renewal.

So umm does someone own this bridge script and prefer to fix it themselves?   I can have a crack at it otherwise.
Comment 18 Aqua Lung 2022-03-01 18:57:43 UTC
Created attachment 995 [details]
patch for the issue

OK, attached hopefully a proper fix.  The simplest resolution is to monitor the containing directories instead of the files, which is most robust way of using inotify anyway as per above.   With this approach the python bug is no longer triggered because there is never a need to replace the inotify watch, so the previous patch is superseded.

Feedback welcome.  I've now assigned the bug to myself assuming that is appropriate protocol.
Comment 19 Adolf Belka 2022-03-02 09:26:02 UTC
Hi @aqualung,

Changing the assignee to yourself (picking this bug up), is perfectly fine to do.

Once you have a patch then the process for that is to provide it to the IPFire development mailing list formatted in the correct way.

The dev mailing list is used to review and decide on all patches and when correctly formatted it also means that the patches are automatically picked up by the IPFire Patchwork system.

Wiki links to help you.

See Step 5 in this link https://wiki.ipfire.org/devel/submit-patches

link for commit and patch format.
https://wiki.ipfire.org/devel/git/commit-messages
Comment 20 Peter Müller 2022-03-21 22:00:07 UTC
Hi @aqualung,

do you need any help for submitting your patch to the mailinglist?
Comment 21 Aqua Lung 2022-03-22 03:53:26 UTC
Thanks for the prompt Peter,  I've just submitted it to the development list now;  hope the format etc is okay.
Comment 22 Peter Müller 2022-03-24 10:05:43 UTC
Hi,

thank you for submitting this.

The format is almost fine, the patches are just missing a "Signed-off-by" tag, which you can add by conducting "git commit -vs" instead of just "git commit -v".

See also these wiki pages for further information:
- https://wiki.ipfire.org/devel/git/commit-messages
- https://wiki.ipfire.org/devel/git/tags

I did not review the content of your patches, yet, since I would like to leave this job to Michael, as he is the one being most familiar with Python, and usually has an opinion on changes to Python scripts.

He is pretty busy, so please do not get discouraged, should it take a bit more time to have your patches reviewed. Feel free to poke me every now and then to keep things moving.

To trace things, I usually add the corresponding Patchwork and Git URLs to a bug, so it is transparent which patches and commits are related to it.

In your case, it's just patchwork for the moment, hence:

https://patchwork.ipfire.org/project/ipfire/list/?series=2678 

All the best,
Peter Müller
Comment 23 Aqua Lung 2022-03-26 03:52:57 UTC
Great,  thank you Peter.  I'll wait for Michael's feedback therefore, and I can submit revised patches as he needs, will include the signed-off tag then.
Comment 27 paul 2022-04-12 14:18:30 UTC
cu 167 Testing … feedback

A few tests indicate that the fix is working (thank you Anthony).

    when a new laptop comes in, it can be pinged by name
    when the laptop leaves and dhcp lease expires, the record in DNS is removed.


Follow up question: I start the dhcp-leases-bridge with -d -vv
/usr/bin/python3 /usr/sbin/unbound-dhcp-leases-bridge -d -vv

The text “Wakeup of main loop” shows in /var/log/messages at uneven times, sometimes 30s, other times 18s, or 1m 30s. There does not seem to be a pattern. Here’s some log …

Apr 12 05:46:44 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:47:50 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:48:14 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:48:32 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:50:02 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:50:20 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:51:51 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:53:21 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:54:21 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:55:03 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:55:09 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:55:51 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:55:57 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:56:02 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:57:32 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:58:14 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:58:20 zotac dhcp[16447]: Wakeup of main loop
Apr 12 05:58:50 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:00:14 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:00:32 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:00:38 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:00:43 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:01:25 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:01:31 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:01:37 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:03:13 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:03:31 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:03:49 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:05:19 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:06:20 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:06:26 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:06:32 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:06:50 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:06:56 zotac dhcp[16447]: Wakeup of main loop
Apr 12 06:07:02 zotac dhcp[16447]: Wakeup of main loop
Comment 28 Michael Tremer 2022-04-12 14:19:16 UTC
Yes, this is supposed to happen if you run it with debugging on (-vv).
Comment 29 Peter Müller 2022-04-12 15:13:47 UTC
*** Bug 12275 has been marked as a duplicate of this bug. ***
Comment 30 Adolf Belka 2022-04-12 17:42:54 UTC
Moved to verified as @Paul has confirmed the fix
Comment 31 Jon 2022-04-12 20:42:30 UTC
(In reply to Adolf Belka from comment #30)
> Moved to verified as @Paul has confirmed the fix

I'll add a "thumbs up" also!

Michael - I think Paul was referring to the inconsistent timing between the loops.  I see the same thing.  Some loops happen after 5 seconds.  some happen after 90 seconds.
Comment 32 Jon 2022-04-12 20:46:49 UTC
To all of those that were involved in this:  THANK YOU!  

I appreciate your efforts.  You are all THE BEST!

Jon
Comment 33 Aqua Lung 2022-04-12 21:55:21 UTC
(In reply to Jon from comment #31)
> Michael - I think Paul was referring to the inconsistent timing between the
> loops.  I see the same thing.  Some loops happen after 5 seconds.  some
> happen after 90 seconds.

Ah, well the loop just wakes whenever any of the dhcp-related files change, which most commonly is dhcpd logging a lease renewal.  So yes, it's going to look a bit random, it could be any time above 5 seconds; that's just how it works.   I think I had a log line which printed the triggering file event instead of just "Waking", but regardless the -vv output probably isn't meant to make sense without reading the code.
Comment 34 Michael Tremer 2022-04-13 07:58:28 UTC
(In reply to Aqua Lung from comment #33)
> (In reply to Jon from comment #31)
> > Michael - I think Paul was referring to the inconsistent timing between the
> > loops.  I see the same thing.  Some loops happen after 5 seconds.  some
> > happen after 90 seconds.
> 
> Ah, well the loop just wakes whenever any of the dhcp-related files change,
> which most commonly is dhcpd logging a lease renewal.  So yes, it's going to
> look a bit random, it could be any time above 5 seconds; that's just how it
> works.   I think I had a log line which printed the triggering file event
> instead of just "Waking", but regardless the -vv output probably isn't meant
> to make sense without reading the code.

Interesting. This was supposed to be very consistent because the loop checks whether the process should terminate itself or not. I felt it should not take longer than 5-ish seconds for that.

Maybe we should use the timeout_s parameter in event_gen() so that we won't get stuck there for forever. We cannot rely on any of the files being changed at all.

But if it was that, I would miss some more logging output when the daemon actually does something.
Comment 35 Aqua Lung 2022-04-13 11:41:45 UTC
Oh, I follow now...  sorry I imagined we must have been talking about an outer loop.  That's very odd. The python inotify library should block in epoll for 1 second by default and after that yield 'None';  that plus the 5 second sleep makes for a wakeup every six seconds,  and indeed that's what I'm seeing on my box just now:

Apr 13 07:31:01 ipfire dhcp[11625]: Wakeup of main loop
Apr 13 07:31:07 ipfire dhcp[11625]: Wakeup of main loop
Apr 13 07:31:13 ipfire dhcp[11625]: Wakeup of main loop
Apr 13 07:31:19 ipfire dhcp[11625]: Wakeup of main loop
Apr 13 07:31:25 ipfire dhcp[11625]: Wakeup of main loop

Is the random lost time definitely in the epoll/notify area, or is it possible instead one of the event handlers could be taking that long to return to the main loop?   Hmm, I don't remember seeing anything likely so time-consuming...
Comment 36 Michael Tremer 2022-04-13 14:11:33 UTC
(In reply to Aqua Lung from comment #35)
> Oh, I follow now...  sorry I imagined we must have been talking about an
> outer loop.  That's very odd. The python inotify library should block in
> epoll for 1 second by default and after that yield 'None';  that plus the 5
> second sleep makes for a wakeup every six seconds,  and indeed that's what
> I'm seeing on my box just now:
> 
> Apr 13 07:31:01 ipfire dhcp[11625]: Wakeup of main loop
> Apr 13 07:31:07 ipfire dhcp[11625]: Wakeup of main loop
> Apr 13 07:31:13 ipfire dhcp[11625]: Wakeup of main loop
> Apr 13 07:31:19 ipfire dhcp[11625]: Wakeup of main loop
> Apr 13 07:31:25 ipfire dhcp[11625]: Wakeup of main loop

That is what I observed, too. And that is how it should be.

> Is the random lost time definitely in the epoll/notify area, or is it
> possible instead one of the event handlers could be taking that long to
> return to the main loop?   Hmm, I don't remember seeing anything likely so
> time-consuming...

Well, that depends on how long the lists are that are being parsed. We could add some extra debug logging to find out. Maybe a call to unbound-control blocks?