13254 – CU 177 - LOTS of unbound restarts (1000s per week)

Bug 13254 - CU 177 - LOTS of unbound restarts (1000s per week)

Summary: CU 177 - LOTS of unbound restarts (1000s per week)

Status:	MODIFIED

Alias:	None

Product:	IPFire
Classification:	Unclassified
Component:	--- (show other bugs)
Version:	2
Hardware:	x86_64 Unspecified

Importance:	- Unknown - Minor Usability
Assignee:	Michael Tremer
QA Contact:

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-08-26 20:16 UTC by Jon
Modified:	2024-10-16 08:57 UTC (History)
CC List:	6 users (show)

See Also:

Attachments
pcap file (2.28 KB, application/vnd.tcpdump.pcap) 2023-11-21 23:31 UTC, Jon	Details
pcap file for collegebrowser.com (1.76 KB, application/vnd.tcpdump.pcap) 2023-11-22 16:54 UTC, Jon	Details
comparison script for testing (2.09 KB, text/plain) 2023-12-05 17:09 UTC, Jon	Details
readme (3.88 KB, text/plain) 2024-03-29 02:28 UTC, Jon	Details
dhcpScript (6.88 KB, text/plain) 2024-03-29 02:29 UTC, Jon	Details
dhcpd.conf.local config file (1.28 KB, text/plain) 2024-03-29 02:32 UTC, Jon	Details
timing (451.10 KB, image/png) 2024-04-04 18:18 UTC, Jon	Details
dhcpEvent updated with speed ups and updating of dhcp-leases.conf removed (6.33 KB, application/x-shellscript) 2024-04-05 11:13 UTC, Nick Howitt	Details
dhcpEvent - v30 (7.62 KB, text/plain) 2024-04-05 15:07 UTC, Jon	Details
Script to generate /etc/unbound/dhcp-leases.conf (12.84 KB, text/x-python3) 2024-04-06 13:48 UTC, Nick Howitt	Details
Zip file for dhcpEvent (3.40 KB, application/zip) 2024-04-15 22:21 UTC, Jon	Details
dhcp settings (110.01 KB, image/png) 2024-05-12 02:51 UTC, Jon	Details
overly simple flow (254.97 KB, image/png) 2024-05-15 15:27 UTC, Jon	Details
compare (518.97 KB, image/png) 2024-05-15 15:57 UTC, Jon	Details
better flow chart (577.12 KB, image/png) 2024-05-15 19:31 UTC, Jon	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jon 2023-08-26 20:16:22 UTC

I went digging through the messages log (old a new) and normally I see a few "unbound: [31414:0] info: start of service" messages every week.  Less than 10 per week.  

Aug  2 13:09:29 ipfire unbound: [31414:0] info: start of service (unbound 1.17.1).

But after upgrading from cu 177 to CU 178 I see thousands per week.

```
-rw-rw-r-- 1 root syslogd 2508718 Aug  6 00:01 /var/log/messages.3.gz

Aug  5 20:35:02 ipfire pakfire: CORE UPGR: Upgrading from release 176 to 177

Aug  2 13:09:29 ipfire unbound: [31414:0] info: start of service (unbound 1.17.1).
Aug  2 13:14:05 ipfire unbound: [32097:0] info: start of service (unbound 1.17.1).
Aug  2 13:46:10 ipfire unbound: [2004:0] info: start of service (unbound 1.17.1).
Aug  2 13:51:27 ipfire unbound: [2880:0] info: start of service (unbound 1.17.1).
Aug  2 17:10:31 ipfire unbound: [1923:0] info: start of service (unbound 1.17.1).
Aug  2 17:10:58 ipfire unbound: [1923:0] info: start of service (unbound 1.17.1).
Aug  5 20:42:07 ipfire unbound: [1923:0] info: start of service (unbound 1.17.1).
Aug  5 20:52:42 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1).
Aug  5 20:53:06 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1).
Aug  5 20:53:42 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1).
Aug  5 20:53:59 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1).
Aug  5 20:54:17 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1).
Aug  5 21:02:28 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1).
```

Here is the count of "start of service" messages.  I just from a few to over 3000 per week:
 
```
-rw-rw-r-- 1 root syslogd 2211885 Jul 23 00:01 /var/log/messages.5.gz
0

-rw-rw-r-- 1 root syslogd 2318649 Jul 30 00:01 /var/log/messages.4.gz
4

-rw-rw-r-- 1 root syslogd 2508718 Aug  6 00:01 /var/log/messages.3.gz
Aug  5 20:35:02 ipfire pakfire: CORE UPGR: Upgrading from release 176 to 177
73

-rw-rw-r-- 1 root syslogd 4108718 Aug 13 00:01 /var/log/messages.2.gz
3559

-rw-rw-r-- 1 root syslogd 3927388 Aug 20 00:01 /var/log/messages.1.gz
Aug 18 18:52:51 ipfire pakfire: CORE UPGR: Upgrading from release 177 to 178
3773

-rw-rw-r-- 1 root syslogd 35421724 Aug 26 15:01 /var/log/messages
3512
[root@ipfire ~] # 
```

Tested on:
APU4D4
IPFire 2.27 (x86_64) - Core-Update 178 (stable)

Comment 1 Jon 2023-08-26 20:19:56 UTC

my bad - this should state:

But after upgrading from cu 176 to CU 177 I see thousands per week.

Comment 2 Jon 2023-08-26 20:23:59 UTC

maybe related to:

https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=f20ca78eff6e8baeb86361f55adf52819d1bae1f

Comment 3 Adolf Belka 2023-08-28 16:06:23 UTC

I have just checked my logs and I see some but not very many most of the weeks and there are some weeks with zero.

/var/log/messages.9.gz     15
/var/log/messages.8.gz     0
/var/log/messages.7.gz     2
/var/log/messages.6.gz     0
/var/log/messages.5.gz     12
/var/log/messages.4.gz     9
/var/log/messages.3.gz     0
/var/log/messages.2.gz     31
/var/log/messages.1.gz     0
/var/log/messages          0

Comment 4 Jon 2023-08-28 18:00:24 UTC

Ugh!  that is not good (for me!).  I see this on my production and high number on my test APU also.


/var/log/messages.9.gz
2

/var/log/messages.8.gz
0

/var/log/messages.7.gz
Jul 13 20:09:56 ipfire pakfire: CORE UPGR: Upgrading from release 175 to 176
2

/var/log/messages.6.gz
0

/var/log/messages.5.gz
4

/var/log/messages.4.gz
Aug  5 20:35:02 ipfire pakfire: CORE UPGR: Upgrading from release 176 to 177
73

/var/log/messages.3.gz
3559

/var/log/messages.2.gz
Aug 18 18:52:51 ipfire pakfire: CORE UPGR: Upgrading from release 177 to 178
3773

/var/log/messages.1.gz
3718

/var/log/messages
850

Comment 5 Jon 2023-08-28 18:16:26 UTC

here is my test box

APU4D4
IPFire 2.27 (x86_64) - Core-Update 179 Development Build: master/b5d85855

(just formatted differently)


/var/log/messages.9.gz 	0

/var/log/messages.8.gz 	2

/var/log/messages.7.gz 	2
Jul 11 15:42:55 ipfireAPU pakfire: CORE UPGR: Upgrading from release 175 to 176

/var/log/messages.6.gz 	2

/var/log/messages.5.gz 	0

/var/log/messages.4.gz 	484
Jul 30 13:09:21 ipfireAPU pakfire: CORE UPGR: Upgrading from release 175 to 177

/var/log/messages.3.gz 	417

/var/log/messages.2.gz 	493
Aug 14 21:44:42 ipfireAPU pakfire: CORE UPGR: Upgrading from release 176 to 178

/var/log/messages.1.gz 	856
Aug 21 11:43:32 ipfireAPU pakfire: CORE UPGR: Upgrading from release 177 to 179

/var/log/messages 	117

Comment 6 Jon 2023-08-28 19:49:01 UTC

Adolf & all,

I found a ipfire box that has CU 174 on it and not too many unbound stop/start messages.

[root@ipfireAPU2 ~] # find /var/log -iname "messages*" | sort -Vr | tail -10 | xargs -I % bash -c 'echo ; printf "% \t" % ; grep --color -c "start of service" <(zcat -f %) ; grep --color "CORE UPGR:" <(zcat -f %)'

/var/log/messages.8.gz 	2
/var/log/messages.7.gz 	4
/var/log/messages.6.gz 	4
/var/log/messages.5.gz 	10
/var/log/messages.4.gz 	14
/var/log/messages.3.gz 	2
/var/log/messages.2.gz 	2
/var/log/messages.1.gz 	0
/var/log/messages 	2
[root@ipfireAPU2 ~] # 

I want to walk the Updates from CU 174 to CU 176 and let it sit for a day.  Then walk the Update from CU 176 to CU 177.  And let it sit again to check for unbound stop/start messages.

Do you know a way to do this?

Comment 7 Jon 2023-08-28 23:22:03 UTC

Same issue with a 3rd ipfire device.  Today I Updated from CU 174 to CU 178. The update happened only ~2 hours so not too many issues.

Interesting item - the stopping / start seem to happen right after a DHCPREQUEST and DHCPACK.

=====

```
[root@ipfireAPU2 ~] # find /var/log -iname "messages*" | sort -Vr | tail -1 | xargs -I % bash -c 'echo ; printf "% \n" % ; grep --color -e dhcp -e "CORE UPGR:" -e "start of service" -e reboot -e restart -e "service stopped" <(zcat -f %)'

/var/log/messages 
. . .
Aug 28 16:03:20 ipfireAPU2 pakfire: CORE UPGR: Upgrading from release 174 to 178

. . .

Aug 28 16:23:33 ipfireAPU2 shutdown[23875]: shutting down for system reboot

. . .

Aug 28 16:26:04 ipfireAPU2 syslogd 1.5.1: restart (remote reception).
Aug 28 16:26:04 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1).
Aug 28 16:27:15 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1).
Aug 28 16:27:15 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1).
Aug 28 16:27:43 ipfireAPU2 dhcpd: Wrote 5 leases to leases file.
Aug 28 16:27:43 ipfireAPU2 dhcpd: Server starting service.
Aug 28 16:27:44 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1).
Aug 28 16:27:45 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1).

Aug 28 16:27:51 ipfireAPU2 dhcpd: DHCPDISCOVER from d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 16:27:51 ipfireAPU2 dhcpd: DHCPOFFER on 192.168.7.100 to d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 16:27:52 ipfireAPU2 dhcpd: DHCPREQUEST for 192.168.7.100 (192.168.7.1) from d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 16:27:52 ipfireAPU2 dhcpd: DHCPACK on 192.168.7.100 to d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 16:27:55 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1).
Aug 28 16:27:56 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1).

Aug 28 16:57:52 ipfireAPU2 dhcpd: DHCPREQUEST for 192.168.7.100 from d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 16:57:52 ipfireAPU2 dhcpd: DHCPACK on 192.168.7.100 to d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 16:57:56 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1).
Aug 28 16:57:57 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1).

Aug 28 17:57:52 ipfireAPU2 dhcpd: DHCPREQUEST for 192.168.7.100 from d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 17:57:52 ipfireAPU2 dhcpd: Wrote 5 leases to leases file.
Aug 28 17:57:52 ipfireAPU2 dhcpd: DHCPACK on 192.168.7.100 to d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 17:57:53 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1).
Aug 28 17:57:53 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1).
[root@ipfireAPU2 ~] # 
```

Comment 8 Adolf Belka 2023-09-04 09:59:23 UTC

I just ran your command on my min appliance system and got the following output.


/var/log/messages.19.gz 	2
/var/log/messages.18.gz 	0
/var/log/messages.17.gz 	0
/var/log/messages.16.gz 	0
/var/log/messages.15.gz 	7
/var/log/messages.14.gz 	2
/var/log/messages.13.gz 	0
/var/log/messages.12.gz 	5
Jun 13 11:51:09 iapetus pakfire: CORE UPGR: Upgrading from release 174 to 175
/var/log/messages.11.gz 	9
/var/log/messages.10.gz 	15
/var/log/messages.9.gz 	        0
/var/log/messages.8.gz 	        2
Jul 11 17:54:08 iapetus pakfire: CORE UPGR: Upgrading from release 175 to 176
/var/log/messages.7.gz 	        0
/var/log/messages.6.gz 	        12
/var/log/messages.5.gz 	        9
Aug  5 22:37:50 iapetus pakfire: CORE UPGR: Upgrading from release 176 to 177
/var/log/messages.4.gz 	        0
/var/log/messages.3.gz 	        31
Aug 18 12:30:42 iapetus pakfire: CORE UPGR: Upgrading from release 177 to 178
/var/log/messages.2.gz 	        0
/var/log/messages.1.gz 	        1
/var/log/messages 	        0


So I am not seeing what you are seeing.
A possible reason for that is your comment

"Interesting item - the stopping / start seem to happen right after a DHCPREQUEST and DHCPACK."

Are your clients running with dynamic IP assignments? All my clients have fixed leases on my mini system.

Comment 9 Jon 2023-09-04 15:30:53 UTC

(In reply to Adolf Belka from comment #8)
> So I am not seeing what you are seeing.
> A possible reason for that is your comment
> 
> "Interesting item - the stopping / start seem to happen right after a
> DHCPREQUEST and DHCPACK."
> 
> Are your clients running with dynamic IP assignments? All my clients have
> fixed leases on my mini system.

Yes.  Most everything in the BLUE is dynamic.  And only a few device in the GREEN are dynamic.

Comment 10 Jon 2023-09-20 19:24:23 UTC

Alexander @bmcs-alex has lots of unbound restarts also...

See:
https://community.ipfire.org/t/squid-stops-working-every-hour-at-x48/10364/5?u=jon

Comment 11 Jon 2023-10-05 16:16:12 UTC

Also see:

https://community.ipfire.org/t/unbound-starts-and-stops-a-lot-for-unknown-reason/10425/1

Comment 12 Michael Tremer 2023-10-30 10:19:47 UTC

Hello everyone,

we recently added some changes which make unbound reload its configuration when we need to update the internal local DNS configuration (i.e. a new DHCP lease was issued or has expired).

Technically unbound does not restart, as we only ask it to reload:

> https://git.ipfire.org/?p=ipfire-2.x.git;a=blob;f=config/unbound/unbound-dhcp-leases-bridge;hb=a85e9d4eb1c5a9d5b9bac3d90f59a321e4151871#l524

Internally, it probably makes some sense for unbound to halt the resolver, reload the configuration and then start a new resolver with the new configuration. The log message would then make sense.

However, are you experiencing any problems with DNS resolution or is this just an aesthetic issue with lots of those log messages being shown?

Comment 13 Jon 2023-11-14 00:16:50 UTC

I've been experimenting with DNS and I experience a delay when going to various websites.

Nov 13 17:31:05 ipfire unbound: [31555:0] notice: Restart of unbound 1.18.0.
Nov 13 17:31:05 ipfire unbound: [31555:0] notice: init module 0: respip
Nov 13 17:31:05 ipfire unbound: [31555:0] notice: init module 1: validator
Nov 13 17:31:05 ipfire unbound: [31555:0] notice: init module 2: iterator
Nov 13 17:31:17 ipfire unbound: [31555:0] info: start of service (unbound 1.18.0).

It seems like the delay, in obtaining a response from DNS, coincides with a ~12 second delay between "notice: init module 2: iterator" and "info: start of service (unbound 1.18.0)."

I happened to be monitoring the messages log with:
```
tail -f /var/log/messages | grep -e unbound -e dhcp
``` 
And I was searching for something in a web browser.  the delay on the webboorwser coincided with the unbound restart.

I had been seeing this browser delay for a few weeks but it did not make sense until today...

I'll do some more testing.

Comment 14 Michael Tremer 2023-11-20 15:47:13 UTC

(In reply to Jon from comment #13)
> I've been experimenting with DNS and I experience a delay when going to
> various websites.

Hmm, that would be a problem. I tested these changed on a network that has between 1200 and 1600 clients daily, but it is also running on an Enterprise appliance, so maybe the reload is slightly faster there? Can we rule out it isn't anything else?

Could you maybe run tcpdump and check if there are any DNS queries sent to IPFire which are not being answered until unbound is done reloading?

Comment 15 Jon 2023-11-20 18:03:58 UTC

Yes, I can do that (at least I think I can!) If you have a suggestion for the tcpdump command that would help me since I haven't used tcpdump since the DNS redirect days.


Via a browser I can see delays and I can see an occasional "drop".

The delays happen at every unbound restart.  The drops happen once in a while, maybe once every week or so.

Comment 16 Jon 2023-11-21 23:30:36 UTC

I wrote a script to pick 500 different (random) websites out of 1 million.  I did this because I wanted to avoid the dns caches.

I had to run the `unbound-control reload_keep_cache` to force the reload.  Otherwise this test is like watching paint dry...


I ran this tcpdump command:
```
tcpdump -i green0 host 192.168.60.212 and port 53 -w green_h60.212_p53.pcap
```


Each website is checked this command:

```
time dig +tries=3 +time=5 +stats @192.168.60.1 furchguitars.com
```

And I got this response every once in awhile:

furchguitars.com                                 FAIL: 9      15014

allResults =  
; <<>> DiG 9.10.6 <<>> dig +tries=3 +time=5 +stats @192.168.60.1 furchguitars.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

real	0m15.014s
user	0m0.002s
sys	0m0.003s



and I dug through the tcpdump list and found that `furchguitars.com` started at the time 16:25:46:
```
16:25:46.625255	192.168.60.212	192.168.60.1	DNS	87	Standard query 0x67a1 A furchguitars.com OPT
16:25:51.626709	192.168.60.212	192.168.60.1	DNS	87	Standard query 0x67a1 A furchguitars.com OPT
16:25:56.627352	192.168.60.212	192.168.60.1	DNS	87	Standard query 0x67a1 A furchguitars.com OPT
. . .
16:26:02.287736	192.168.60.1	192.168.60.212	DNS	103	Standard query response 0x67a1 A furchguitars.com A 89.221.213.64 OPT
16:26:02.287818	192.168.60.1	192.168.60.212	DNS	103	Standard query response 0x67a1 A furchguitars.com A 89.221.213.64 OPT
```


The first three commands are the three dig tries (+tries=3).  Each is 5 sec apart (+time=5).

So from the first dig send to the first response is 16 seconds.


I did a ping to make sure it was a real server and that worked:
```
iMac3:dns_speed_test jcmurphy$ ping furchguitars.com
PING furchguitars.com (89.221.213.64): 56 data bytes
64 bytes from 89.221.213.64: icmp_seq=0 ttl=45 time=131.229 ms
64 bytes from 89.221.213.64: icmp_seq=1 ttl=45 time=126.059 ms
64 bytes from 89.221.213.64: icmp_seq=2 ttl=45 time=134.259 ms
64 bytes from 89.221.213.64: icmp_seq=3 ttl=45 time=125.657 ms
^C
--- furchguitars.com ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 125.657/129.301/134.259/3.609 ms
```


and ran dig to make sure it really worked:
```
iMac3:dns_speed_test jcmurphy$ dig furchguitars.com

; <<>> DiG 9.10.6 <<>> furchguitars.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37336
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;furchguitars.com.		IN	A

;; ANSWER SECTION:
furchguitars.com.	581	IN	A	89.221.213.64

;; Query time: 37 msec
;; SERVER: 192.168.60.2#53(192.168.60.2)
;; WHEN: Tue Nov 21 16:46:21 CST 2023
;; MSG SIZE  rcvd: 61

```

Comment 17 Jon 2023-11-21 23:31:50 UTC

Created attachment 1394 [details]
pcap file

Here are the tcpdump lines related to furchguitars

Comment 18 Jon 2023-11-21 23:38:04 UTC

I forgot the unbound log.  

See the restart (service stopped  / start of service) that occurred during the same time as the bad dig command.

```
Nov 21 16:25:50 ipfire unbound: [29467:0] info: service stopped (unbound 1.18.0).
Nov 21 16:25:50 ipfire unbound: [29467:0] info: server stats for thread 0: 12 queries, 6 answers from cache, 6 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Nov 21 16:25:50 ipfire unbound: [29467:0] info: server stats for thread 0: requestlist max 4 avg 1.66667 exceeded 0 jostled 0
Nov 21 16:25:50 ipfire unbound: [29467:0] info: average recursion processing time 0.102877 sec
Nov 21 16:25:50 ipfire unbound: [29467:0] info: histogram of recursion processing times
Nov 21 16:25:50 ipfire unbound: [29467:0] info: [25%]=0.0709973 median[50%]=0.098304 [75%]=0.125611
Nov 21 16:25:50 ipfire unbound: [29467:0] info: lower(secs) upper(secs) recursions
Nov 21 16:25:50 ipfire unbound: [29467:0] info:    0.032768    0.065536 1
Nov 21 16:25:50 ipfire unbound: [29467:0] info:    0.065536    0.131072 3
Nov 21 16:25:50 ipfire unbound: [29467:0] info:    0.131072    0.262144 1
Nov 21 16:25:50 ipfire unbound: [29467:0] notice: Restart of unbound 1.18.0.
Nov 21 16:25:50 ipfire unbound: [29467:0] notice: init module 0: respip
Nov 21 16:25:50 ipfire unbound: [29467:0] notice: init module 1: validator
Nov 21 16:25:50 ipfire unbound: [29467:0] notice: init module 2: iterator
Nov 21 16:26:01 ipfire unbound: [29467:0] info: start of service (unbound 1.18.0).
Nov 21 16:26:01 ipfire unbound: [29467:0] info: generate keytag query _ta-4a5c-4f66. NULL IN
```

Comment 19 Jon 2023-11-22 03:30:41 UTC

dig has +tries and +time (e.g., dig +tries=3 +time=5)

I picked these because of the original "12 second delay" I saw early on.  I was thinking this "might" be similar to unbound.  But I thinking now it may be masking the unbound reset question.

Is there an equivalent setting with unbound?

Comment 20 Jon 2023-11-22 16:53:09 UTC

And another test...

Each website is checked with this (slightly different than last time) command:
```
time dig +stats @192.168.60.1 collegebrowser.com


I got this response:
```
collegebrowser.com                                  FAIL: 9      15010
allResults =  
; <<>> DiG 9.10.6 <<>> +stats @192.168.60.1 collegebrowser.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

real	0m15.010s
user	0m0.002s
sys	0m0.003s


I can ping and dig to make sure the site is OK:
```
[root@ipfire ~] # ping collegebrowser.com
PING collegebrowser.com (72.167.58.94) 56(84) bytes of data.
64 bytes from 94.58.167.72.host.secureserver.net (72.167.58.94): icmp_seq=1 ttl=51 time=67.3 ms
64 bytes from 94.58.167.72.host.secureserver.net (72.167.58.94): icmp_seq=2 ttl=51 time=64.0 ms
64 bytes from 94.58.167.72.host.secureserver.net (72.167.58.94): icmp_seq=3 ttl=51 time=65.0 ms
64 bytes from 94.58.167.72.host.secureserver.net (72.167.58.94): icmp_seq=4 ttl=51 time=63.5 ms
^C
--- collegebrowser.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3002ms
rtt min/avg/max/mdev = 63.546/64.944/67.298/1.454 ms
[root@ipfire ~] # 

[root@ipfire ~] # dig collegebrowser.com

; <<>> DiG 9.16.44 <<>> collegebrowser.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51445
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;collegebrowser.com.		IN	A

;; ANSWER SECTION:
collegebrowser.com.	590	IN	A	72.167.58.94

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Nov 22 10:22:25 CST 2023
;; MSG SIZE  rcvd: 63


Digging through the tcpdump list and found that `collegebrowser.com ` appeared at 10:01:09:
```
312	2023-11-22 10:01:09.294139	192.168.60.212	192.168.60.1	DNS	89	Standard query 0x3ec2 A collegebrowser.com OPT
313	2023-11-22 10:01:14.294642	192.168.60.212	192.168.60.1	DNS	89	Standard query 0x3ec2 A collegebrowser.com OPT
314	2023-11-22 10:01:19.296406	192.168.60.212	192.168.60.1	DNS	89	Standard query 0x3ec2 A collegebrowser.com OPT
. . .
323	2023-11-22 10:01:26.983005	192.168.60.1	192.168.60.212	DNS	105	Standard query response 0x3ec2 A collegebrowser.com A 72.167.58.94 OPT
324	2023-11-22 10:01:26.983077	192.168.60.1	192.168.60.212	DNS	105	Standard query response 0x3ec2 A collegebrowser.com A 72.167.58.94 OPT 


The unbound service stopped  / start of service occurs at the same time as the dig command:

```
Nov 22 10:01:11 ipfire unbound: [29467:0] info: service stopped (unbound 1.18.0).
Nov 22 10:01:11 ipfire unbound: [29467:0] info: server stats for thread 0: 25 queries, 4 answers from cache, 21 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Nov 22 10:01:11 ipfire unbound: [29467:0] info: server stats for thread 0: requestlist max 2 avg 0.857143 exceeded 0 jostled 0
Nov 22 10:01:11 ipfire unbound: [29467:0] info: average recursion processing time 0.519950 sec
Nov 22 10:01:11 ipfire unbound: [29467:0] info: histogram of recursion processing times
Nov 22 10:01:11 ipfire unbound: [29467:0] info: [25%]=0.049152 median[50%]=0.098304 [75%]=0.32768
Nov 22 10:01:11 ipfire unbound: [29467:0] info: lower(secs) upper(secs) recursions
Nov 22 10:01:11 ipfire unbound: [29467:0] info:    0.016384    0.032768 1
Nov 22 10:01:11 ipfire unbound: [29467:0] info:    0.032768    0.065536 8
Nov 22 10:01:11 ipfire unbound: [29467:0] info:    0.065536    0.131072 2
Nov 22 10:01:11 ipfire unbound: [29467:0] info:    0.131072    0.262144 3
Nov 22 10:01:11 ipfire unbound: [29467:0] info:    0.262144    0.524288 4
Nov 22 10:01:11 ipfire unbound: [29467:0] info:    2.000000    4.000000 2
Nov 22 10:01:11 ipfire unbound: [29467:0] notice: Restart of unbound 1.18.0.
Nov 22 10:01:11 ipfire unbound: [29467:0] notice: init module 0: respip
Nov 22 10:01:11 ipfire unbound: [29467:0] notice: init module 1: validator
Nov 22 10:01:11 ipfire unbound: [29467:0] notice: init module 2: iterator
Nov 22 10:01:26 ipfire unbound: [29467:0] info: start of service (unbound 1.18.0).
```

Comment 21 Jon 2023-11-22 16:54:48 UTC

Created attachment 1395 [details]
pcap file for collegebrowser.com

Here is the pcap file

Comment 22 Michael Tremer 2023-11-23 15:57:33 UTC

Hmm this is interesting. Because you are getting the responses from unbound - just quite late. About 15 seconds, which is definitely noticeable.

Should we take this upstream? Or what other ideas do we have?

Comment 23 Jon 2023-11-23 16:37:33 UTC

The two examples above (the pcap files) are fails.  So yes, there is a response but something breaks during the "reload"

I get this in my log:
```
; <<>> DiG 9.10.6 <<>> +stats @192.168.60.1 collegebrowser.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
```

I cannot answer about reporting this to the unbound team.

I am worried this may be caused by this change:
https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=f20ca78eff6e8baeb86361f55adf52819d1bae1f

and how "unbound-dhcp-leases-bridge" interface is written around the reload.

Is there an IPFire python developer that can help??

Comment 24 Michael Tremer 2023-11-23 16:39:18 UTC

(In reply to Jon from comment #23)
> I am worried this may be caused by this change:
> https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;
> h=f20ca78eff6e8baeb86361f55adf52819d1bae1f

It very likely is. However, I am not sure if rolling this back will help us, because that would mean major rework of the leases bridge. So, what is easier to solve now?

Comment 25 Jon 2023-11-23 20:20:32 UTC

Is it more than just replacing the new with the old? (I thought it would be easy to go back one step - shows you how little I know!)

I don't mind going forward, but I don't speak python.  So we'd need someone fluid in pythonese.

Comment 26 Jon 2023-11-25 04:22:35 UTC

for what it is worth...  All of the fixed leases work A-OK with no unbound reloads.

I'm still checking but it looks good so far!

Comment 27 Michael Tremer 2023-11-25 14:40:44 UTC

The problem before was that the bridge was not very good at figuring out which leases needed to be loaded into unbound. It kind of worked okay in some cases, but on larger networks, some random leases were dropped from DNS. That problem is fixed by having unbound reload the entire file every time something changes.

Comment 28 Jon 2023-11-25 16:40:42 UTC

I have to image that someone has solved this problem before. 

We cannot be the only group that uses ISC-DHCP and unbound together.

I'd hope either the unbound group or the dhcp group would have written a nice bridge between the two.  I've been looking but have not been lucky in my searches.

I've read through LOTS of other posts about others having similar issues.  It looks like opensense _may_ have solved it.  but pf sense has similar issues.

Comment 29 Jon 2023-11-30 16:03:27 UTC

I added a note the the Dev mailing list about On Commit/Release/Expiry (CRE) that is already included into ISC DHCP.

Hope it helps!

Comment 30 Jon 2023-12-02 19:33:41 UTC

and I am back to bugzilla...

I wrote a small command to grab the dhcp-leases.conf file and grab the unbound cache and compare them.

```
#!/bin/bash

#	need HOSTNAME and DOMAINNAME value
eval $(/usr/local/bin/readhash /var/ipfire/main/settings)

dhcp_leases=$( sed 's/\"//g ; s/local-data: //g' /etc/unbound/dhcp-leases.conf | sort --version-sort --ignore-case )

uc_list=$( unbound-control list_local_data | grep "${DOMAINNAME}" | sed "/${HOSTNAME}\.${DOMAINNAME}/d" | sort --version-sort --ignore-case  )

diff --side-by-side --width=150 --minimal --ignore-case --ignore-all-space <(echo "$dhcp_leases") <(echo "$uc_list")

```

===

One thing I _think_ I noticed (I need to keep testing!) is sometimes it takes time for unbound to place devices into the unbound cache.  So they appear instantly in dhcp-leases.conf but take a minute or three to appear in the unbound cache.

On one of the networks you Admin, maybe you can run the above to check for differences?


If nothing else, the above code could be added in the bridge (as a "hack"!),  would detect differences, then do a launch of the "reload".  Like I said, a hack!

Comment 31 Jon 2023-12-02 23:19:30 UTC

I dropped a line for the last post.  It adds a period after the FQDNs.


```
dhcp_leases=$( sed .....
dhcp_leases=$( echo "${dhcp_leases}" | sed -E 's/(\.in-addr\.arpa)[[:blank:]]+/\1. /g ; s/(\.localdomain)+/\1./g' )
```

Comment 32 Michael Tremer 2023-12-04 17:37:53 UTC

(In reply to Jon from comment #30)
> If nothing else, the above code could be added in the bridge (as a "hack"!),
> would detect differences, then do a launch of the "reload".  Like I said, a
> hack!

If we go down this path, we should avoid calling any reloads, because:

* We should load the current set of data into Unbound when it starts. We need to do that by parsing the leases file and writing an unbound configuration file with all entries as we do now. More on that below.
* We will then keep unbound up to data by using the DHCP events.

The problem is that we picked Python for this job because we need to parse a rather complicated file and then need to check to which subnet a lease belongs. That is something that is very painful to do in shell. I played around with awk, which might be a middle ground, but as mentioned in a previous email, this is not a great approach. I just want to avoid that it takes a minute to start Unbound if we have to process a thousand leases in shell...

Comment 33 Jon 2023-12-05 01:11:02 UTC

(In reply to Michael Tremer from comment #32)
> (In reply to Jon from comment #30)
> * We should load the current set of data into Unbound when it starts. We
> need to do that by parsing the leases file and writing an unbound
> configuration file with all entries as we do now. More on that below.

I don't think this needs to be done!  Please hear me out.  (Sorry if I am explaining things already known.)

in the /etc/unbound/unbound.conf there are includes files.  And the `/etc/unbound/dhcp-leases.conf` include file is "imported" into unbound (and the unbound cache) every time there is a restart (unbound restart or ipfire reboot).  So `/etc/unbound/dhcp-leases.conf` are already populated on restart and don't need to be re-populated.  We don't need to parse the DHCP leases file.

To test, comment out the `unbound-dhcp-leases-bridge` lines ipfire system and do a restart, you'll see this is correct.  So why parse the lease and re-populate the unbound dhcp-leases.conf file?

Now to get new stuff into the unbound cache (and the /etc/unbound/dhcp-leases.conf file) this is where "unbound-control load-data" comes in.

So if we begin the process with DHCP On Commit/Release/Expire, add A & PTR records to `/etc/unbound/dhcp-leases.conf`, and add A & PTR records to unbound cache via "unbound-control load-data", I think we have a nice system.

With the above there are no periodic unbound reloads/restarts.

Again, this is all a proof-of concept.

Make sense?

Comment 34 Jon 2023-12-05 17:09:06 UTC

Created attachment 1406 [details]
comparison script for testing

FYI - Enclosed is a script to compare the unbound `dhcp-leases.conf` file (generated by Bridge) to the unbound cache (see attached compareMe_v3.sh file).

It does not handle static hosts so those will appear as non-matching.

Jon

Comment 35 Jon 2023-12-06 02:19:41 UTC

Michael - with the current "unbound-dhcp-leases-bridge" code.  How do fixed leases make it into unbound DNS?

I am stuck on this at the moment since the fixed leases appear in the `/etc/unbound/dhcp-leases.conf`file but are missing from the unbound cache.  

And the fix leases are missing from the `/var/state/dhcp/dhcpd.leases`.

I see the right items in `/var/ipfire/dhcp/fixleases`

I did a unbound init restart and a dhcp init restart.

(there is always a really good chance that I screwed up something!)

Comment 36 Jon 2023-12-06 15:45:36 UTC

Ignore Comment 35.  It is me - I figured it out.  

Sorry for the noise.

Comment 37 Michael Tremer 2023-12-07 16:44:59 UTC

(In reply to Jon from comment #35)
> Michael - with the current "unbound-dhcp-leases-bridge" code.  How do fixed
> leases make it into unbound DNS?

We read the configuration that the user manages through the web UI and create entries in unbound. This is not really correct, because they are a static lease and that really does not mean that this should show up in DNS. But users seem to want this.

> I am stuck on this at the moment since the fixed leases appear in the
> `/etc/unbound/dhcp-leases.conf`file but are missing from the unbound cache.  

Here we parse the file and add them as a lease that never expires:

> https://git.ipfire.org/?p=ipfire-2.x.git;a=blob;f=config/unbound/unbound-dhcp-leases-bridge;h=e9f022affa98a2483ca1f41d267fe443b0f048cf;hb=HEAD#l161

> And the fix leases are missing from the `/var/state/dhcp/dhcpd.leases`.

Sadly, there is no way to convince the DHCP server to write the fixed leases to the file. That said, I doubt that we will see events for them.

Comment 38 Jon 2024-02-08 20:30:21 UTC

With the update of unbound 1.19.0 this may have been fixed.  I'll test and report back

Comment 39 Jon 2024-02-08 23:05:29 UTC

Still does restarts.  About 100 restarts per 60 minutes.

Comment 40 Nick Howitt 2024-03-20 18:29:01 UTC

I am new to IPFire and have also noticed this.

If the file /etc/unbound/dhcp-leases.conf is written in a deterministic way, can it initially be written somewhere else like /tmp or /var/tmp or to /etc/unbound/ with a slightly different name. Then compare the new file with the original and only if they are different, copy the file over /etc/unbound/dhcp-leases.conf and restart or reload unbound? Most DHCP actions are renewals which will not change the file.

Comment 41 Nick Howitt 2024-03-20 18:47:39 UTC

I don't speak python but it looks like there is something in /usr/sbin/unbound-dhcp-leases-bridge, write_dhcp_leases(self, leases), which creates a temporary file then copies it over the original. It is called from class UnboundConfigWriter(object) which then goes on to self._control("reload_keep_cache").

What some genius could do is diff the temporary file and original file in write_dhcp_leases(self, leases) and set a flag if they are different before the temp file is copied over the original.

Then in UnboundConfigWriter(object), only call self._control("reload_keep_cache") if the flag is set. (Remember to clear the flag after .....).

Comment 42 Nick Howitt 2024-03-24 20:16:21 UTC

I've proposed a patch to only restart Unbound when the /etc/unbound/dhcp-leases.conf file changes - https://lists.ipfire.org/hyperkitty/list/development@lists.ipfire.org/thread/4RK4PR7M57QLHJXHYFWQ32QUGPPW5PRQ/. It is a minor rework of a couple of functions in the current program. It still requires restarts when the leases change, but not often.

It is nowhere near as cute as Jon's proposal which avoids restarts of unbound altogether, but it completely ditches the current methodology.

Comment 43 Nick Howitt 2024-03-24 20:18:14 UTC

Better link to the patch submission - https://patchwork.ipfire.org/project/ipfire/patch/ec4d6d18-41d3-4ba0-9713-e6a388a3119f@howitts.co.uk/

Comment 44 Jon 2024-03-25 17:51:46 UTC

(In reply to Nick Howitt from comment #42)
> It is nowhere near as cute as Jon's proposal ...
cute?!?  cute???  Nobody wants cute code!!  That is like the kiss of death to have "cute" code!

Comment 45 Nick Howitt 2024-03-25 18:12:45 UTC

Not cute code, I won't judge that (but it it prettier than my coding). It is a cute approach which avoids any restarts and hooks directly into the API's. I think that, as a concept, it is a great way to go. My patch is only a sticking plaster on an approach way inferior to yours.

If my patch gets accepted, I'll move on to your code. From my patch submission emails, I think Michael Tremer just has not had the time to go through your approach

Comment 46 Michael Tremer 2024-03-26 11:00:54 UTC

(In reply to Jon from comment #44)
> (In reply to Nick Howitt from comment #42)
> > It is nowhere near as cute as Jon's proposal ...
> cute?!?  cute???  Nobody wants cute code!!  That is like the kiss of death
> to have "cute" code!

It can't hurt. Well-readable should be a must. Cute is probably a more subjective feeling.

(In reply to Nick Howitt from comment #45)
> Not cute code, I won't judge that (but it it prettier than my coding). It is
> a cute approach which avoids any restarts and hooks directly into the API's.

This is what the bridge used to do, and it became a huge problem if there were thousands of leases going around. There have been systems where the bridge processed used of 20% of a single Intel Xeon core. That is why we changed it.

> https://git.ipfire.org/?p=ipfire-2.x.git;a=commitdiff;h=f20ca78eff6e8baeb86361f55adf52819d1bae1f

> I think that, as a concept, it is a great way to go. My patch is only a
> sticking plaster on an approach way inferior to yours.

That is due to be tested. We right now have the option to update multiple leases at the same time whereas the shell approach would run an update command for each event. That is a lot of forked shell processes and we need to create some database that will store the current state or import it from the DHCP leases file that after a restart we won't have a blank page.

I don't know what the overhead of that will be, but I don't think it will be that much cheaper (if at all) than the bridge.

Comment 47 Jon 2024-03-26 14:33:03 UTC

Using the dhcp Event approach (the Proof-of-Concept) how can I check if multiple request can happen at one time?

The commit/release/expiry comes from the DHCP server.  And the `execute` command below launches a script (the Proof-of-Concept script).

```
#
#	commit
#
on commit {
set ClientIP = binary-to-ascii(10, 8, ".", leased-address);
set NoName = "";
set ClientName = pick-first-value(option host-name, config-option-host-name, client-name, NoName);
log(concat("CRE-Commit:  ClientIP: ", ClientIP, "  Name: ", ClientName));

execute("/root/dhcpdconf/dhcpEvent.sh", "commit", ClientIP, ClientName); }
```

Is there a way to add a different execute command to monitor for multiple requests?

Comment 48 Michael Tremer 2024-03-26 14:53:35 UTC

(In reply to Jon from comment #47)
> Using the dhcp Event approach (the Proof-of-Concept) how can I check if
> multiple request can happen at one time?

The script will only be called once at a time. This is because of dhcpd being single-threaded. That has a lot of implications that we need to consider:

* The process will not be able to do anything else until the script is done. That means that the script needs to be *very* fast. If the script blocks, dhcpd will not be able to even read incoming packets.
* We cannot cover up any performance problems here with running multiple scripts at the same time.
* Client won't wait very for long for a response from the DHCP server. They will retry, but that might make the situation a lot worse on our side since we are then running the script multiple times for the client and no reason.

The code for dhcpd is here:

> https://git.ipfire.org/?p=thirdparty/dhcp.git;a=blob;f=common/execute.c;hb=572032cb0e514606559de3784e3f7ca8e1539d17#l1

> Is there a way to add a different execute command to monitor for multiple
> requests?

I don't know what you are asking...

Comment 49 Jon 2024-03-29 02:23:29 UTC

(In reply to Michael Tremer from comment #48)
> (In reply to Jon from comment #47)
> > Using the dhcp Event approach (the Proof-of-Concept) how can I check if
> > multiple request can happen at one time?
> 
> The script will only be called once at a time. This is because of dhcpd
> being single-threaded. That has a lot of implications that we need to
> consider:
> 
> * The process will not be able to do anything else until the script is done.
> That means that the script needs to be *very* fast. If the script blocks,
> dhcpd will not be able to even read incoming packets.
> * We cannot cover up any performance problems here with running multiple
> scripts at the same time.
> * Client won't wait very for long for a response from the DHCP server. They
> will retry, but that might make the situation a lot worse on our side since
> we are then running the script multiple times for the client and no reason.

Knowing DHCP is single threaded helps!

The Proof-of-Concept script runs in 0.5 seconds.  But it is not elegant code.

I added logger code to get an idea of time and if it blocks DHCP.  Here is a typical log entry:

```
Mar 28 18:15:34 ipfire dhcpd: DHCPDISCOVER from 28:c2:dd:af:c7:33 via blue0
Mar 28 18:15:35 ipfire dhcpd: DHCPOFFER on 192.168.65.114 to 28:c2:dd:af:c7:33 via blue0
Mar 28 18:15:35 ipfire dhcpd: CRE-Commit:  ClientIP: 192.168.65.114  Name: Onelink   begin
Mar 28 18:15:35 ipfire dhcpd: execute_statement argv[0] = /usr/sbin/dhcpEvent
Mar 28 18:15:35 ipfire dhcpd: execute_statement argv[1] = commit
Mar 28 18:15:35 ipfire dhcpd: execute_statement argv[2] = 192.168.65.114
Mar 28 18:15:35 ipfire dhcpd: execute_statement argv[3] = Onelink
Mar 28 18:15:36 ipfire dhcpEv28: DNS Update (RFC2136) is enabled
Mar 28 18:15:36 ipfire dhcpEv28: commit: added A & PTR records to unbound
Mar 28 18:15:36 ipfire dhcpEv28: CRE script: complete - exiting
Mar 28 18:15:36 ipfire dhcpd: CRE-Commit:  ClientIP: 192.168.65.114  Name: Onelink   finish
Mar 28 18:15:36 ipfire dhcpd: DHCPREQUEST for 192.168.65.114 (192.168.65.1) from 28:c2:dd:af:c7:33 via blue0
Mar 28 18:15:36 ipfire dhcpd: DHCPACK on 192.168.65.114 to 28:c2:dd:af:c7:33 (Onelink) via blue0
```

`dhcpd: CRE-Commit . . . begin` is the start of the `dhcp.conf` "on commit" log entry.

`dhcpd: execute_statement argv` is passing info from DHCP to the Proof-of-Concept script.

`dhcpEv28` is the Proof-of-Concept script.  

`dhcpd: CRE-Commit . . . finish` is the end of the DHCP execute.

The total time for the "block" is 2-3 seconds.

I measured using `time dhcpEvent commit <ip addr> <hostname>` and got 0.5 seconds

How fast is "*very* fast"??  Is that OK?

Comment 50 Jon 2024-03-29 02:28:52 UTC

Created attachment 1493 [details]
readme

Comment 51 Jon 2024-03-29 02:29:47 UTC

Created attachment 1494 [details]
dhcpScript

Comment 52 Jon 2024-03-29 02:32:43 UTC

Created attachment 1495 [details]
dhcpd.conf.local config file

Added three files: README.md, dhcpEvent and dhcpd.conf.local.  This is the current PoC version.

Comment 53 Nick Howitt 2024-03-29 09:18:48 UTC

If HOSTNAME is available in scripts, is extraction of the domain using DOMAINNAME=${HOSTNAME#*.*} considered dangerous as it is not reading from file? If it is OK, it could be faster than and eval.

Comment 54 Jon 2024-03-29 16:01:36 UTC

(In reply to Nick Howitt from comment #53)
> If HOSTNAME is available in scripts, is extraction of the domain using
> DOMAINNAME=${HOSTNAME#*.*} considered dangerous as it is not reading from
> file? If it is OK, it could be faster than and eval.

Yes, the HOSTNAME is available.  It is sent from DHCP.  But it arrives as like this:
```
Mar 29 10:53:48 ipfire dhcpd: execute_statement argv[3] = MM2
```

and does not have the DOMAINNAME attached.  The only way I knew of to grab the "official" DOMAINNAME was via the: `eval $(/usr/local/bin/readhash /var/ipfire/main/settings)`

Comment 55 Jon 2024-03-29 16:07:07 UTC

I changed one line to grep-ing a fixed string so that should make this a bit faster.

#	get existing records from unbound-control list_local_data
ucList=$( /usr/sbin/unbound-control list_local_data | 
	/bin/grep --ignore-case --fixed-strings ".${DOMAINNAME}." | 
	expand -t1 )

Comment 56 Nick Howitt 2024-03-29 16:09:26 UTC

AFAIK HOSTNAME is a standard environment variable in many Linux distros:

[root@gateway ~]# echo $HOSTNAME
gateway.howitts.co.uk

On my IPFire it contains the FQDN as above.

Comment 57 Michael Tremer 2024-03-30 13:04:03 UTC

(In reply to Jon from comment #49)
> The total time for the "block" is 2-3 seconds.
> 
> I measured using `time dhcpEvent commit <ip addr> <hostname>` and got 0.5
> seconds
> 
> How fast is "*very* fast"??  Is that OK?

I think we are looking at an executing time of around 100ms. The reason is as follows:

Consider a group of people (let's say 60 for simple maths) enter a room with a wireless access point all having a phone in their pockets that connect simultaneously to the WiFi and therefore will request a DHCP Lease at exactly the same time.

dhcpd will then queue all requests and process them one at a time. If the script would take one second, that would initially mean that the last person is waiting around 60s to get an IP address offered by dhcpd. That is already disruptive.

In practise clients don't wait that long. Since DHCP is UDP-based they would assume that the request was lost and send another request. Considering that dhcpd processes all requests one after the other ordered by when they were received, it is suddenly processing requests that no client is waiting for any more. Whatever the timeout is depends on the client, but dhclient implements three seconds.

That means that if initially all clients send one request, dhcpd will process the first client and respond after 1 second. The second and third client will also get a response. The fourth client will already have timed out and therefore send another request that is going to be appended to the queue. The maths is now becoming huge.

If the queue initially has 60 requests and processes three before all clients time out and send another request, the queue will have grown to 117 requests. After another three seconds we will have responded to another three clients and the remaining 54 have timed out again adding a further 54 requests totalling 171... This will continue until all clients have received a response by what time the queue will have 570 requests queued.

If a new client now connects to the network, it will have to wait almost ten minutes for the DHCP server to respond. The client will long have gone away as most of the others in the first example.

I don't have a good name for this, but I think we can consider this an avalanche effect. Once the script cannot keep up, the situation is running out of control and no client will be able to connect to the network.

This is the huge risk I see with this approach. It will work in small test environments, but IPFire does not only run there. Therefore we need to solve this problem and have a script that is *fast*. With running so many shell commands this might be very difficult.

---

And yes, you can use $HOSTNAME.

Comment 58 Jon 2024-03-30 14:29:43 UTC

excellent.  all of this helps!

So now looking at the overall concept (after all, this is a proof-of-concept): 
- is getting info from DHCP using a Commit/Release/Expiry event   <--this is new
- and processing it with a script (a very very fast script!) 
- and sending it to unbound as  
    • unbound-control local_datas <data>                          <--this is new
    • and an update to the `/etc/unbound/dhcp-leases.conf` file


an OK way to go forward?

Comment 59 Nick Howitt 2024-03-30 16:38:27 UTC

How about this. What is the point of maintaining the /etc/unbound/dhcp-leases.conf file? This can be done initially as a pre-start task of dhcpd virtually using /usr/sbin/unbound-dhcp-leases-bridge, but as it is done as a pre-start task, unbound does not have to be restarted and no listener would be needed for any DNS events? /usr/sbin/unbound-dhcp-leases-bridge already gets run straight after unbound starts. This should be moved to before and then simplified so it is a one-shot task.

Once that happens none of the calls are then needed to maintain the on-the-fly. It is only ever read when dhcpd is started or reloaded so jt only ever needs to be valid at the point that unbound starts. Then Jon can remove all those bits of the his script and just do the calls.

If Jon strips out all his maintenance of the /etc/unbound/dhcp-leases.conf file and just leaves in the calls to the unbound cache, how quick does the processing become?

/usr/sbin/unbound-dhcp-leases-bridge would also then be massively simplified just to write the /etc/unbound/dhcp-leases.conf and exit.

Comment 60 Jon 2024-03-30 17:55:36 UTC

(In reply to Nick Howitt from comment #59)
> How about this. What is the point of maintaining the
> /etc/unbound/dhcp-leases.conf file? This can be done initially as a

It primary purpose is to load unbound DNS on a reboot/restart/power-up.  Otherwise it would have no DNS entries.

does that help?

Comment 61 Nick Howitt 2024-03-30 18:38:23 UTC

So just generate it at startup with a modified /usr/sbin/unbound-dhcp-leases-bridge, run just before the unbound start, rather than after it as it does at the moment. Then pull all the /etc/unbound/dhcp-leases.conf file maintenance from your scripts just leaving the parts related to the unbound cache calls. Then how fast does your script become? It cuts right down on the file i/o.

As a PoC, don't even bother generating the /etc/unbound/dhcp-leases.conf and just test your modified script on its own.

Comment 62 Jon 2024-03-30 19:34:56 UTC

(In reply to Nick Howitt from comment #61)
> So just generate it at startup with a modified
> /usr/sbin/unbound-dhcp-leases-bridge, run just before the unbound start,
> rather than after it as it does at the moment. Then pull all the
> /etc/unbound/dhcp-leases.conf file maintenance from your scripts just
> leaving the parts related to the unbound cache calls. 

I am not sure I understand your proposal.

Since the bash needs to go from more than 500ms to less than 100ms I am not sure the file i/o will make huge difference.  

It would be better to write as perl or python if @ms agrees the overall concept is correct.

Comment 63 Jon 2024-03-30 20:21:36 UTC

(In reply to Nick Howitt from comment #61)
> ...   Then pull all the
> /etc/unbound/dhcp-leases.conf file maintenance from your scripts just
> leaving the parts related to the unbound cache calls. Then how fast does
> your script become? It cuts right down on the file i/o.
> 
> As a PoC, don't even bother generating the /etc/unbound/dhcp-leases.conf and
> just test your modified script on its own.

I commented out the reads, writes, sed's, etc /etc/unbound/dhcp-leases.conf and it saves about 10ms.

This is on a APU4D4.

Comment 64 Nick Howitt 2024-03-30 22:46:01 UTC

Perhaps it is worth timing just the unbound cache commands. That should give you the absolute shortest time possible. If that is too long, then perhaps we are dead.

If those times are OK, then we need to speed up your stuff. You end up with small scripts so could they be rune directly from dhcpd.conf.local rather than calling /usr/sbin/dhcpEvent? Just ideas.

Comment 65 Michael Tremer 2024-04-01 11:29:10 UTC

(In reply to Nick Howitt from comment #64)
> Perhaps it is worth timing just the unbound cache commands. That should give
> you the absolute shortest time possible. If that is too long, then perhaps
> we are dead.

I think that is a good start. We also need to take into consideration that there are people with lots of leases. So the loop that is in the script rewriting the file might take a short moment in testing, but might take several seconds if there are 1000 leases to process.

Option C could be to have the most minimal script that just talks to a daemon that can hold and cache state. For example the bridge.

Comment 66 Nick Howitt 2024-04-01 11:59:34 UTC

I've been trying to do some basic testing and research. Can some things be spead up by avoiding external calls, for example, to get a reverse IP, from https://unix.stackexchange.com/questions/132779/how-to-read-an-ip-address-backwards you can do:

IFS=. read w x y z <<<'192.168.1.1'
printf '%d.%d.%d.%d.in-addr.arpa.' "$z" "$y" "$x" "$w"

or:

echo '192.168.1.1' | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; }

to avoid calling awk. Would that be faster.

Also, from empirical testing, there seems little point in trapping errors before   calling unbound-control. For example, in delete_unbound, perhaps just deleting the IP and reverse IP may be quicker than testing if it is there.

Would it just be faster to delete any A and PTR records before adding them rather than test if they exist

Comment 67 Nick Howitt 2024-04-02 09:36:19 UTC

There is a bug with the this new approach. It also exists in the current approach. The cache auto-expires all entries when it hits the TTL so this should be read from /etc/dhcp/dhcpd.conf every time a lease is renewed or, if it appears, from the renewal event. I also hope that a renewal event kicks off a commit event.

Currently it is hard-coded to 60min in both Jon's update and /usr/sbin/unbound-dhcp-leases-bridge. The issue probably does not hit in /usr/sbin/unbound-dhcp-leases-bridge because it regenerates and restarts so often but I think would come more apparent if someone increased the DHCP TTL. I need to check out my own updates to /etc/unbound/dhcp-leases.conf for the same issue.

Comment 68 Jon 2024-04-02 16:36:02 UTC

(In reply to Nick Howitt from comment #67)
> There is a bug with the this new approach. It also exists in the current
> approach. The cache auto-expires all entries when it hits the TTL so this
> should be read from /etc/dhcp/dhcpd.conf every time a lease is renewed or,
> if it appears, from the renewal event. I also hope that a renewal event
> kicks off a commit event.

I do not see a TTL in the dhcp.conf.  Are you referring to default-lease-time or max-lease-time?

> 
> Currently it is hard-coded to 60min in both Jon's update and
> /usr/sbin/unbound-dhcp-leases-bridge. The issue probably does not hit in
> /usr/sbin/unbound-dhcp-leases-bridge because it regenerates and restarts so
> often but I think would come more apparent if someone increased the DHCP
> TTL. I need to check out my own updates to /etc/unbound/dhcp-leases.conf for
> the same issue.

Are the the DHCP lease times and the unbound DNS TTL times really connected or related??

Comment 69 Nick Howitt 2024-04-02 17:51:00 UTC

I tried looking at the ISC docs and it really isn't clear to me. I was wondering if I could experiment. Both values are used in the UI. The IPF documentation says the default-lease-time and does not mention the max-lease-time.

https://dhcp-users.isc.narkive.com/jeVKUrHW/what-is-the-subtil-difference-between-max-lease-time-and-default-lease-time indicates the default is the one handed out if the client does not request a specific lease time and the max is the maximum DHCPD would allow if a client requested a specific one.

I tried looking at the on-commit and it does not seem to be passed as a parameter, unless I missed it. Then the only option that I can see is to calculate it from /var/state/dhcp/dhcpd.leases but this risks slowing down the script.

Comment 70 Jon 2024-04-02 18:15:55 UTC

(In reply to Nick Howitt from comment #69)
> I tried looking at the ISC docs and it really isn't clear to me. I was
> wondering if I could experiment. 

Of course!

> https://dhcp-users.isc.narkive.com/jeVKUrHW/what-is-the-subtil-difference-
> between-max-lease-time-and-default-lease-time indicates the default is the
> one handed out if the client does not request a specific lease time and the
> max is the maximum DHCPD would allow if a client requested a specific one.

Max is the expiry time.  When I leave home, DHCP sends an "on-expiry" after 2 hours.  My dhcp is set to: "Max lease time (mins): 120"


> I tried looking at the on-commit and it does not seem to be passed as a
> parameter, unless I missed it.

There is not one that I found.  I only found status (like "commit") IP address & hostname.

Comment 71 Jon 2024-04-02 18:20:37 UTC

And a MAC address!

Comment 72 Nick Howitt 2024-04-02 19:05:49 UTC

(In reply to Jon from comment #70)
> (In reply to Nick Howitt from comment #69)
> > I tried looking at the ISC docs and it really isn't clear to me. I was
> > wondering if I could experiment. 
> 
> Of course!
> 
> > https://dhcp-users.isc.narkive.com/jeVKUrHW/what-is-the-subtil-difference-
> > between-max-lease-time-and-default-lease-time indicates the default is the
> > one handed out if the client does not request a specific lease time and the
> > max is the maximum DHCPD would allow if a client requested a specific one.
> 
> Max is the expiry time.  When I leave home, DHCP sends an "on-expiry" after
> 2 hours.  My dhcp is set to: "Max lease time (mins): 120"

That isn't my reading. My reading is that many clients don't request a specific lease time. In that case they get the default lease time. However if the client explicitly requested 10 minutes, it would get 10 minutes and if it requested 10h, it would get max-lease-time if it is less than 10h.

https://dhcp-users.isc.narkive.com/jeVKUrHW/what-is-the-subtil-difference-between-max-lease-time-and-default-lease-time

> 
> 
> > I tried looking at the on-commit and it does not seem to be passed as a
> > parameter, unless I missed it.
> 
> There is not one that I found.  I only found status (like "commit") IP
> address & hostname.

Comment 73 Jon 2024-04-02 19:26:35 UTC

This feels like we may be going off topic...

How is this related to the multiple unbound restarts?

Comment 74 Nick Howitt 2024-04-02 20:00:39 UTC

Multiple restarts are caused by /usr/sbin/unbound-dhcp-leases-bridge updating, effectively, the hosts in in Unbound. I have tried to work around it by updating /usr/sbin/unbound-dhcp-leases-bridge, but you have a better solution (IMHO) of taking the output directly from dhcpd and populating the Unbound cache.

In testing, I have noticed that my records added to the Unbound cache disappear after the the TTL specified. e.g., with my testing simplification of your script:

IP=$1
PTR=$(echo $IP | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; })
DOMAIN=${HOSTNAME#*.*}

echo -e "$2.$DOMAIN"'\n'"$PTR" | /usr/sbin/unbound-control local_datas_remove

echo -e "$2.$DOMAIN 60 IN A $IP"'\n'"$PTR 60 IN PTR $2.$DOMAIN" | /usr/sbin/unbound-control local_datas

Anything I add disappears in 60 minutes. Now, If I had a lease time of 3h and this triggered the add to cache, for 60 mins the A record would appear, but then it would disappear until the lease were renewed. To me, that is a bug.

I am trying to work round it.

Now, if the default-lease-time is set to 60 mins, the bug won't generally be hit as releases are renewed from half the granted lease time onwards. But if the default were set to 8h (the IPF default is very low), then the bug would be continually hit as you will never get a message to renew the hosts entry in time. Similarly, if a client requested a long lease, you'd hit the same problem.

The current methodology works with a 60 minute default works because the hosts file gets regenerated so often but it is not good. It would also suffer if there were very few leases and a long lease time.

Why do you think it is not related to multiple unbound restarts and fixing the issue?

Comment 75 Jon 2024-04-02 20:23:05 UTC

(In reply to Nick Howitt from comment #74)
> Why do you think it is not related to multiple unbound restarts and fixing
> the issue?

Mostly because it was more related to DHCP...

If I understanding what you are saying:  Changing the DHCP "max lease time" from default of 120min (2h) to 3h, the Unbound cache record A (but not record PTR?) disappears from the unbound cache after 60 minutes (the TTL setting in dhcpEvent script)?

I am trying to understand your scenario so I can test also...

Comment 76 Nick Howitt 2024-04-02 20:35:04 UTC

The PTR also disappears.

The 60min TTL is used in both dhcpEvent and /usr/sbin/unbound-dhcp-leases-bridge.

It may be tempting to stick the TTL very high and rely on the dhcpd expiry event to remove the record from Unbound, but it won't be a safe approach for downstream which may have cached the record until the original TTL expires. Downstream won't then re-query Unbound, but Unbound's record may have changed.

Comment 77 Jon 2024-04-02 21:25:43 UTC

I think this is the lease time that could be grabbed with on-commit:
https://github.com/isc-projects/dhcp/blob/master/common/print.c#L979-L981

Comment 78 Nick Howitt 2024-04-03 09:53:06 UTC

Looks like you are right - https://linux.die.net/man/5/dhcp-eval

Comment 79 Nick Howitt 2024-04-03 17:11:00 UTC

Do you know how to use lease-time? I have tried all sorts just outputting to a log file. Most things end up outputting nothing and killing the log lone, or dhcpd refuses to start. The most progress I made was with:

set LeaseTime = encode-int(lease-time, 8);
set PTR = concat (binary-to-ascii (10, 8, ".", reverse (1, leased-address)), ".in-addr.arpa.");
log(concat("CRE-Commit:  ClientIP: ", ClientIP, "  Name: ", ClientName, "  PTR: ", PTR, "  Lease time: ", LeaseTime )); 

But this list gives:
Apr  3 17:57:10 gateway dhcpd: CRE-Commit:  ClientIP: 172.17.2.95  Name: LAPTOP-Rachel  PTR: 95.2.17.172.in-addr.arpa.  Lease time: ^P

The additional paramater of 10 or more stops dhcpd from starting. ChatGPT isn't helping as it suggests:

set lease_time_seconds = option lease-time; but lease-time is not an option. If I correct it, it switches to packet-lease-time and my log line fails.

Comment 80 Nick Howitt 2024-04-03 17:59:56 UTC

Just to add, if you change the encode-int second parameter to 16 you get ^N^P and if you change it to 32 you get a blank.

Comment 81 Jon 2024-04-03 18:11:51 UTC

having similar issues.

just found this thread:
https://lists.isc.org/mailman/htdig/dhcp-users/2011-December/014521.html

and "maybe" this solution:
https://lists.isc.org/mailman/htdig/dhcp-users/2012-January/014593.html

Comment 82 Jon 2024-04-03 18:29:14 UTC

(In reply to Nick Howitt from comment #80)
> Just to add, if you change the encode-int second parameter to 16 you get
> ^N^P and if you change it to 32 you get a blank.

^N^P = 0E10x = 3600

what it your lease time setting?

Comment 83 Jon 2024-04-03 18:40:51 UTC

mine is: either blank or ^Ah

^Ah = 0168 hex = 360 dec = Default lease time (mins): 5

So my number is correct (as a hex) for my 5 minute default time.

But I dont understand why some are blank...

Comment 84 Jon 2024-04-03 18:41:36 UTC

oops not quite correct...

Comment 85 Jon 2024-04-03 19:12:00 UTC

maybe this...

```
set LeaseTime = encode-int(lease-time, 16);
set LeaseTime2 = binary-to-ascii(10, 16, "", LeaseTime);
```

Comment 86 Nick Howitt 2024-04-03 19:30:23 UTC

Well done. I've tried:
set LeaseTime = binary-to-ascii(10, 16, "", encode-int(lease-time, 16));

and it works. I was trying ,, instead of ,"", when you posted.

I am also confused. All the documentation I read indicates the TTL is in seconds, which means the setting used currently in /etc/unbound/dhcp-leases.conf and in your replacement method may be wrong? I'm checking.

Comment 87 Nick Howitt 2024-04-03 19:41:45 UTC

It also means that my premise that the DNS record was expiring on the TTL is incorrect but something is expiring it.

Comment 88 Jon 2024-04-03 19:45:53 UTC

This was the part I did not understand...

I could not re-create what you saw.  What error or issue did you see that hinted this was true?

Comment 89 Nick Howitt 2024-04-03 19:54:57 UTC

I was adding a pair of DNS records with:
IP=$1
PTR=$(echo $IP | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; })
DOMAIN=${HOSTNAME#*.*}
echo -e "$2.$DOMAIN"'\n'"$PTR" | /usr/sbin/unbound-control local_datas_remove
echo -e "$2.$DOMAIN 60 IN A $IP"'\n'"$PTR 60 IN PTR $2.$DOMAIN" | /usr/sbin/unbound-control local_datas

so:
./test-add2 192.168.1.2 xxx

giving:
[root@gateway ~]# /usr/sbin/unbound-control list_local_data | grep xxx
xxx.howitts.co.uk.      60      IN      A       192.168.1.2
2.1.192.168.in-addr.arpa.       60      IN      PTR     xxx.howitts.co.uk.

Then at some point in the future they disappeared. I'll shoot myself quietly if it was an unbound restart that was causing it, which I guess it could have been. I am testing now. I've added a record and I'm watching it in a loop. If it disappears I'll see when, check for a restart and check back.

I am running my updated script so it may not restart....

Comment 90 Nick Howitt 2024-04-03 21:09:48 UTC

Still working. I'm thinking I've led you up the garden path. Apologies. I'll keep the loop running overnight.

Comment 91 Nick Howitt 2024-04-04 09:02:48 UTC

The manually cached entry was still there this morning, so I have led you up the garden path with the TTL issue.

When I restarted Unbound, the cache was flushed and the entry disappeared as I expected. It did not disappear when I restarted dhcp. To me, this then means the /etc/unbound/dhcp-leases.conf only needs to be written as a pre-start task of unbound. It can be maintained on-the-fly by dhcpScript but this puts an unnecessary overhead on dhcp, especially if the emphasis is speed so as to not block dhcpd from issuing leases.

What do others think?

Another question, which is quicker for generating the PTR? it can be done in the  dhcpd.conf.local with:

set PTR = concat (binary-to-ascii (10, 8, ".", reverse (1, leased-address)), ".in-addr.arpa.");

Or in dhcpScript as it is done now calling awk or directly in bash with:

PTR=$(echo $IP | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; })

I suspect the awk method is the slowest. I can try to set up a timing loop for the two methods in dhcpScript later today.

Comment 92 Nick Howitt 2024-04-04 13:31:55 UTC

Here are some timings for calculating the reverse IP over 256 IP's:

IP="192.168.1."
time for i in {0..255} ; do echo ${IP}$i | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; }; done
<snip>
real    0m0.166s
user    0m0.156s
sys     0m0.109s

time for i in {0..255} ; do echo "${IP}$i" | /usr/bin/awk -F. '{print $4 "." $3 "." $2 "." $1 ".in-addr.arpa." }'; done
<snip>
real    0m0.878s
user    0m0.617s
sys     0m0.368s

It is a saving of 0.003s per iteration. Perhaps not too significant, but a small reduction.

I don't know how to time the same thing in dhcpd.conf.local with:

set PTR = concat (binary-to-ascii (10, 8, ".", reverse (1, leased-address)), ".in-addr.arpa.");

Comment 93 Nick Howitt 2024-04-04 14:32:06 UTC

But for extracting the valid part of a hostname, these are near as dammit the same:

test="abcABD-_.khaoih"
time for i in {0..255} ; do expr "$i$test" : '\([A-Za-z0-9.-]*\)'; done
time for i in {0..255} ; do echo "$i$test" | /bin/sed 's/[^A-Za-z0-9.-]/_/g' | /usr/bin/awk -F_ '{print $1}'; done

But this is a touch faster:
time for i in {0..255} ; do echo "$i$test" | /bin/sed 's/[^A-Za-z0-9.-].*//'; done

but not significantly - about 0.0005s per iteration

Comment 94 Nick Howitt 2024-04-04 16:37:27 UTC

if you're allowed to us it for the domain name:
time for i in {0..255} ; do DOMAINNAME=${HOSTNAME#*.*}; echo $DOMAINNAME ; done
<snip>
real    0m0.010s
user    0m0.005s
sys     0m0.004s

vs
time for i in {0..255} ; do eval "$(/usr/local/bin/readhash /var/ipfire/main/settings)"; echo $DOMAINNAME; done
<snip>
real    0m2.485s
user    0m1.700s
sys     0m0.817s

Comment 95 Nick Howitt 2024-04-04 16:43:23 UTC

Do you need the "expand":

[root@gateway ~]# time for i in {0..255} ; do ucList=$( /usr/sbin/unbound-control list_local_data | /bin/grep -i "${DOMAINNAME}" | expand -t1 ) ; done

real    0m1.869s
user    0m1.785s
sys     0m1.141s
[root@gateway ~]# time for i in {0..255} ; do ucList=$( /usr/sbin/unbound-control list_local_data | /bin/grep -i "${DOMAINNAME}") ; done

real    0m1.577s
user    0m1.214s
sys     0m1.062s

Comment 96 Jon 2024-04-04 18:18:45 UTC

Created attachment 1513 [details]
timing

Hey Nick!  Sorry I haven't replied recently.  Just been busy with the weekend holiday and now Tax Day filings!  (ugh!).

I've been doing the same for each command and running time trials (see attached image).  For most everyone I can save near 5-20 ms for a command.  But I have not been able to find "the smoking gun" that is using a big chunk of time.  So as mentioned in Comment 62, I am not sure we can get things down from 500ms to 100 ms without perl (or something faster than bash).

I should be able to get back to this sometime soon!  

Let me know if you find the smoking gun!

Comment 97 Jon 2024-04-04 19:10:46 UTC

(In reply to Nick Howitt from comment #92)
> I don't know how to time the same thing in dhcpd.conf.local with:
> 
> set PTR = concat (binary-to-ascii (10, 8, ".", reverse (1, leased-address)),
> ".in-addr.arpa.");

This could be done.  I just don't know how to measure the execution time if PTR is moved to dhcpd.conf.   If we can find a way to measure (beside the time stamp on a log entry) that would be great!

Comment 98 Nick Howitt 2024-04-04 19:22:54 UTC

I have no idea there.

For IP validation over 500 iterations:
time for i in {0..499} ; do if [[ "$IP$i" =~ [0-9]{1,3}\.[0-9]{1,3}\[0-9]{1,3}\[0-9]{1,3} ]] ; then echo "good" ; else echo "bad"; fi ; done
real    0m0.079s
user    0m0.062s
sys     0m0.018s

vs

time for i in {0..499} ; do if ! /bin/grep --quiet --extended-regexp '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' <<< "${IP}${i}" ; then echo "bad"; else echo "good" ;fi ; done
real    0m1.842s
user    0m1.200s
sys     0m0.672s

Note the original test is negative. My other one is positive but can be negated. Again no smoking gun.

I guess I may have to load in your your code fully rather than do synthetic testing.

How did you do your timings?

Comment 99 Jon 2024-04-04 19:26:26 UTC

(In reply to Nick Howitt from comment #95)
> Do you need the "expand":

Yes, I just dont remember the "why".  It probably had to do the one of the greps on the ucList. I'll review

Comment 100 Jon 2024-04-04 19:35:29 UTC

(In reply to Nick Howitt from comment #98)
> How did you do your timings?


For command line: similar to yours except for 1000:
`time for i in {0..999} ; do DOMAINNAME=${HOSTNAME#*.*}; done`


for dhcpEvent just one time:
`time dhcpEvent commit <ip addr> <hostname>`


For multiple IPs and I time this script as a whole:
```
#!/bin/bash
for y in `seq 101 104`					#	4 x 250 = 1000 lines			
do
	for z in `seq 3 252`				#	250 lines total
	do
		IP="192.168.${y}.${z}"
		/root/dhcpdconf/dhcpEvent_v29b.sh commit "${IP}" "testme${y}${z}"
	done
done
```

Comment 101 Jon 2024-04-04 19:57:54 UTC

One more item:  The last command (the for loop) stresses everything.  Dont run on production.  It will add to unbound cache and /etc/unbound/dhcp-leases.conf

Comment 102 Nick Howitt 2024-04-04 20:00:44 UTC

I made a mistake in my testing and piped some rubbish into /bin/grep. I had to spin up a VM to grab a copy of the file to fix the firewall. I can try to copy over data to test there.

Comment 103 Jon 2024-04-04 20:53:39 UTC

Hmmm.  I may have made an error also.  The smoking gun may this:
```
eval $(/usr/local/bin/readhash /var/ipfire/dhcp/settings)

# Is "DNS Update > Enable DNS Update (RFC2136)" enabled or disabled?
if [[ "${DNS_UPDATE_ENABLED}" != on ]]; then
	/usr/bin/logger --tag dhcpEvent "DNS Update (RFC2136) is NOT enabled.  Bridge may be running"
else
	/usr/bin/logger --tag dhcpEvent "DNS Update (RFC2136) is enabled"
fi

pgrep -f unbound-dhcp-leases-bridge &&
	kill -s SIGTERM $(pgrep -d',' -f unbound-dhcp-leases-bridge) &&
	/usr/bin/logger --tag dhcpEvent "terminated unbound-dhcp-leases-bridge"
```

And secondly this:
```
#	send message to message log
msg_log () {
	if tty --silent ; then
 		echo "${tagName}:" "$*"
	else
		/usr/bin/logger --tag "${tagName}" "$*"
	fi
}
```

time to test!

Comment 104 Jon 2024-04-04 21:24:04 UTC

[root@ipfire ~] # time /root/dhcpdconf/dhcpEvent_v30.sh commit 192.168.200.16 testME20016

real	0m0.404s
user	0m0.270s
sys	0m0.202s


the above got rid of ~100ms.  300ms more to go!

Comment 105 Nick Howitt 2024-04-04 21:36:23 UTC

I had not decoded that part. I am not sure what you see wrong in the first bit or what to expect from DNS_UPDATE_ENABLED, but I don't think Unbound can handle RFC2136.

A killall unbound-dhcp-leases-bridge may be a bit better than a kill with a pgrep.

Does all that execute every time the app runs?

I know nothing about tty so skipped that bit of code, but too much debugging output will slow things.

Comment 106 Jon 2024-04-04 21:53:34 UTC

It does for the current dhcpEvent, but all of this will be removed if the idea is acceptable to the Core Devs.

I was just being paranoid and I wanted to make sure `unbound-dhcp-leases-bridge` was not running at the same time.  And yes, it did run and check every time.  So I've changed the test so that does not happen

It is not worth spending time on to make faster since it will disappear one day.  

But it is noteworthy since it is/was included the the original 500ms.  And this needs to be commented out for time tests:
```
eval $(/usr/local/bin/readhash /var/ipfire/dhcp/settings)

# Is "DNS Update > Enable DNS Update (RFC2136)" enabled or disabled?
if [[ "${DNS_UPDATE_ENABLED}" != on ]]; then
	/usr/bin/logger --tag dhcpEvent "DNS Update (RFC2136) is NOT enabled.  Bridge may be running"

	pgrep -f unbound-dhcp-leases-bridge &&
		kill -s SIGTERM $(pgrep -d',' -f unbound-dhcp-leases-bridge) &&
		/usr/bin/logger --tag dhcpEvent "terminated unbound-dhcp-leases-bridge"
else
 	/usr/bin/logger --tag dhcpEvent "DNS Update (RFC2136) is enabled"
fi
```

Comment 107 Nick Howitt 2024-04-05 11:13:40 UTC

Created attachment 1514 [details]
dhcpEvent updated with speed ups and updating of dhcp-leases.conf removed

I've commented out the "Kill the unbound-dhcp-leases-bridge process" section and added all the speed ups I mentioned. Note the RE I originally proposed to validate the IP address was buggy. THis has the correct version.

I have also removed the lines updating the dhcp-leases.conf file. This should be removed to the Unbound start up as a pre-start task. It hen becomes a one off task rather than a perpetual maintenance task. It is only used to load the cache at the beginning and, thereafter, the cache is maintained by this script.

Now the release is slower than the commit!

[root@ipfire ~]# time /root/dhcpEvent.njh commit 192.168.1.1 xxxx
added 2 datas
dhcpEv28: commit: added A & PTR records to unbound

real    0m0.067s
user    0m0.034s
sys     0m0.040s
[root@ipfire ~]# /usr/sbin/unbound-control list_local_data | grep xxx
1.1.192.168.in-addr.arpa.       60      IN      PTR     xxxx.localdomain.
xxxx.localdomain.       60      IN      A       192.168.1.1
[root@ipfire ~]# time /root/dhcpEvent.njh release 192.168.1.1
removed 2 datas
dhcpEv28: release: references to xxxx.localdomain. removed from unbound

real    0m0.075s
user    0m0.050s
sys     0m0.045s
[root@ipfire ~]# /usr/sbin/unbound-control list_local_data | grep xxx
[root@ipfire ~]# 

This was tested on a 2-core VM with 2GB RAM

Comment 108 Nick Howitt 2024-04-05 11:21:18 UTC

I have a feeling this can leave orphaned PTR records. If you get a lease on your phone, for example, then change the phone's name and refresh the lease on the phone (WiFi off then on, for example), the old A record will get overwritten, but I think the PTR will remain?

Comment 109 Nick Howitt 2024-04-05 14:31:12 UTC

The "revIP" line is not needed in delete_unbound.

unbound-control can take a -q which may have a benefit.

There are a couple of orphaned parameters in the script now e.g unboundDHCPleases and in delete_unbound, revIP, if the unnecessary line is removed.

Comment 110 Jon 2024-04-05 15:07:31 UTC

Created attachment 1515 [details]
dhcpEvent - v30

Here is the current dhcpEvent (v30).  It includes the changes up to April 4.  This is the 300ms version (after msg_log commands commented out).

Comment 111 Jon 2024-04-05 15:13:02 UTC

(In reply to Nick Howitt from comment #109)
> The "revIP" line is not needed in delete_unbound.

How are PTR / `.in-addr.arpa` lines deleted without reverse IP?

> 
> unbound-control can take a -q which may have a benefit.

Got it - I will give it a try!

> There are a couple of orphaned parameters in the script now e.g
> unboundDHCPleases and in delete_unbound, revIP, if the unnecessary line is
> removed.

I do not understand how I can get rid of code related to unbound DHCP leases.  How does `/etc/unbound/dhcp-leases.conf` get updated?

Comment 112 Nick Howitt 2024-04-05 15:22:15 UTC

(In reply to Jon from comment #111)
> (In reply to Nick Howitt from comment #109)
> > The "revIP" line is not needed in delete_unbound.
> 
> How are PTR / `.in-addr.arpa` lines deleted without reverse IP?
> 
Have a look at the code. It is not used. The reverse IP is derived from grepping the ucList for the FQDN
> > 
> > unbound-control can take a -q which may have a benefit.
> 
> Got it - I will give it a try!
> 
I don't think it makes much difference from my observations
> > There are a couple of orphaned parameters in the script now e.g
> > unboundDHCPleases and in delete_unbound, revIP, if the unnecessary line is
> > removed.
> 
> I do not understand how I can get rid of code related to unbound DHCP
> leases.  How does `/etc/unbound/dhcp-leases.conf` get updated?
`/etc/unbound/dhcp-leases.conf` only gets read during the start or reload of Unbound. While unbound is running, changing the file does nothing. As a logical extension, why, then, bother to maintain it while unbound is running? Its generation could then move to the Unbound start routine, /etc/init.d/unbound to run prior to unbound starting. There is a huge amount of pre-processing done in init file, so why not add the file generation there, where it is not so time critical and will rarely get called.

Comment 113 Jon 2024-04-05 16:52:51 UTC

(In reply to Nick Howitt from comment #112)
> (In reply to Jon from comment #111)
> > (In reply to Nick Howitt from comment #109)
> > > The "revIP" line is not needed in delete_unbound.
> > 
> > How are PTR / `.in-addr.arpa` lines deleted without reverse IP?
> > 
> Have a look at the code. It is not used. The reverse IP is derived from
> grepping the ucList for the FQDN
removed - thank you!

> > > unbound-control can take a -q which may have a benefit.
> > 
> > Got it - I will give it a try!
> > 
> I don't think it makes much difference from my observations
added -q and removed `>/dev/null`

> `/etc/unbound/dhcp-leases.conf` only gets read during the start or reload of
> Unbound. While unbound is running, changing the file does nothing. As a
> logical extension, why, then, bother to maintain it while unbound is
> running? Its generation could then move to the Unbound start routine,
> /etc/init.d/unbound to run prior to unbound starting. There is a huge amount
> of pre-processing done in init file, so why not add the file generation
> there, where it is not so time critical and will rarely get called.

You have to show me the code for this one.  I dont think this will work.  Records A & PTR cannot be created on the fly.  That is why `/etc/unbound/dhcp-leases.conf` exists.

Comment 114 Nick Howitt 2024-04-05 17:16:20 UTC

You've missed what I am trying to say. Another script would be needed. It could be based on a cut down /usr/sbin/unbound-dhcp-leases-bridge. It queries the leases file (/var/state/dhcp/dhcpd.leases) and static leases, eliminates anything expired, anything covered by the hosts file and anything without a name on the lease and then generates the /etc/unbound/dhcp-leases.conf. After that it reloads unbound (which we would not need). Problem is I don't speak python. The same thing could probably be scripted in bash but I don't know how to query the information out of the dhcpd.leases file. This should be run during the unbound start up.

So the code does not exist, but generating the /etc/unbound/dhcp-leases.conf is better done as part of the unbound start (IMHO).

Also generating the file during the unbound start removes the risk of the file diverging over time as it gives a full reset.

Comment 115 Bernhard Bitsch 2024-04-05 18:14:39 UTC

Just decided to join this discussion. Didn't read it as a whole, yet.

Therefore my motivation only.
I am examining filtering using the RPZ mechanism of unbound. The block lists are updated by the SOA functionality. I noticed that lists with a longer update period ( usually 1 day ) are not checked. I suppose, the reload of unbound resets the associated timers. This results in deactivation of the update mechanism.
Don't know, whether other SOA based lists lack actuality also.

Jon's concept of syncing DHCP and unbound by the CRE mechanism and unbound-control load-local-data seems a way out of this problem.

Comment 116 Jon 2024-04-06 01:05:06 UTC

Nick - Now I understand better.  But I am not going to touch this aspect of the current unbound bridge.  It is way over my head!

Hey Bernhard!

(In reply to Nick Howitt from comment #108)
> I have a feeling this can leave orphaned PTR records. If you get a lease on
> your phone, for example, then change the phone's name and refresh the lease
> on the phone (WiFi off then on, for example), the old A record will get
> overwritten, but I think the PTR will remain?

I do not see stay PTRs.  Is this with your code from Comment 107 or my code?  Does this still happen?

Comment 117 Nick Howitt 2024-04-06 08:47:44 UTC

I may be able to have a look at the unbound bridge, but I'll be slow!

The thing about orphaned PTR's is hypothetical and I have not tried to demonstrate. It goes roughly like this:

Obtain a lease with a name like Pixel7a e.g. for 192.168.1.10.
You get an A record Pixel7a.domain pointing to 192.168.1.10.
You get a PTR record 10.1.168.192.in-addr.arpa pointing to Pixel7a.domain
Cut the device off and rename it to Dad.
Get a new lease. As the MAC address has not changed, it will get the same IP
You get an A record Dad.domain pointing to 192.168.1.10. This appears to overwrite the current A record.
You get a PTR record, 10.1.168.192.in-addr.arpa pointing to Dad.domain. This does not overwrite the previous PTR record.

BUT I have just tested and I cannot reproduce it so possibly the code handles it. Originally I came across it when I mocked up a few lines of code, but not with this full routine.

Comment 118 Nick Howitt 2024-04-06 13:48:21 UTC

Created attachment 1516 [details]
Script to generate /etc/unbound/dhcp-leases.conf

This is a major hack job on the original /usr/sbin/unbound-dhcp-leases-bridge (not my proposed modified version as my modifications are redundant with this approach). It creates a new /etc/unbound/dhcp-leases.conf and is designed to run during the unbound startup.

I have removed the inotify bits (file monitoring), the perpetual loop and daemonizing elements. It also drops the restart. I have tried to cut out all the bits I can but may have left in some redundant "imports" or other bits so feel free to make any changes.

If you want to test and not overwrite your current /etc/unbound/dhcp-leases.conf, comment out line 487 and uncomment line 488. Modify line 488 to suit.

This script should be run from /etc/init.d/unbound after the write_forward_conf line (574).

Comment 119 Jon 2024-04-06 16:30:19 UTC

I'll keep an eye out for stray PTR also.  So far I have not seen any.  

(In reply to Nick Howitt from comment #117)
> Cut the device off and rename it to Dad.

The code in dhcpEvent looks at a hostname name change.
```
UC_FQDN=$( /bin/grep --word-regexp --ignore-case "^${clientHostname}\.${DOMAINNAME}" <<< "${ucList}" )
if [[ -z "${UC_FQDN}" ]] ; then
#	No, hostname.  So the hostname changed.
	#	delete A & PTR records from unbound
	delete_unbound "${clientIP}"

	#	add new to unbound!
	write-unbound "${clientIP}" "${clientHostname}"
fi
```

so hopefully this handles things correctly!


(In reply to Nick Howitt from comment #118)
> Created attachment 1516 [details]
> Script to generate /etc/unbound/dhcp-leases.conf
> 
> This is a major hack job on the original
> /usr/sbin/unbound-dhcp-leases-bridge (not my proposed modified version as my
> modifications are redundant with this approach). It creates a new
> /etc/unbound/dhcp-leases.conf and is designed to run during the unbound
> startup.

One of the reasons I wrote dhcpEvent was to find a way around parsing the file generated by DHCP @ `/var/state/dhcp/dhcpd.leases`.  So my preference is to stick with `dhcp-leases.conf` being populated with dhcpEvent.

Comment 120 Nick Howitt 2024-04-06 16:59:12 UTC

(In reply to Jon from comment #119)
> I'll keep an eye out for stray PTR also.  So far I have not seen any.  
> 
> (In reply to Nick Howitt from comment #117)
> > Cut the device off and rename it to Dad.
> 
> The code in dhcpEvent looks at a hostname name change.
> ```
> UC_FQDN=$( /bin/grep --word-regexp --ignore-case
> "^${clientHostname}\.${DOMAINNAME}" <<< "${ucList}" )
> if [[ -z "${UC_FQDN}" ]] ; then
> #	No, hostname.  So the hostname changed.
> 	#	delete A & PTR records from unbound
> 	delete_unbound "${clientIP}"
> 
> 	#	add new to unbound!
> 	write-unbound "${clientIP}" "${clientHostname}"
> fi
> ```
> 
> so hopefully this handles things correctly!
> 
> 

I'd missed the hostname change code. Looks like it it is handling it.

> (In reply to Nick Howitt from comment #118)
> > Created attachment 1516 [details]
> > Script to generate /etc/unbound/dhcp-leases.conf
> > 
> > This is a major hack job on the original
> > /usr/sbin/unbound-dhcp-leases-bridge (not my proposed modified version as my
> > modifications are redundant with this approach). It creates a new
> > /etc/unbound/dhcp-leases.conf and is designed to run during the unbound
> > startup.
> 
> One of the reasons I wrote dhcpEvent was to find a way around parsing the
> file generated by DHCP @ `/var/state/dhcp/dhcpd.leases`.  So my preference
> is to stick with `dhcp-leases.conf` being populated with dhcpEvent.
The code already existed to parse the /var/state/dhcp/dhcpd.leases - /usr/sbin/unbound-dhcp-leases-bridge. All I've done, more or less, is remove the daemonising elements so it runs as a one-shot.

By using this you can remove your lines which maintain /etc/unbound/dhcp-leases.conf and gain valuable time in your code execution and execution time is at a premium.

It also stops any possibility of /etc/unbound/dhcp-leases.conf deviating from where it should be, for example on a crash or termination of a routine. The file only ever needs to be correct when unbound starts and it will be with this script. This is the safer and faster option.

Comment 121 Nick Howitt 2024-04-06 17:36:18 UTC

(In reply to Nick Howitt from comment #118)
> Created attachment 1516 [details]
> Script to generate /etc/unbound/dhcp-leases.conf
> 
> This is a major hack job on the original
> /usr/sbin/unbound-dhcp-leases-bridge (not my proposed modified version as my
> modifications are redundant with this approach). It creates a new
> /etc/unbound/dhcp-leases.conf and is designed to run during the unbound
> startup.
> 
> I have removed the inotify bits (file monitoring), the perpetual loop and
> daemonizing elements. It also drops the restart. I have tried to cut out all
> the bits I can but may have left in some redundant "imports" or other bits
> so feel free to make any changes.
> 
> If you want to test and not overwrite your current
> /etc/unbound/dhcp-leases.conf, comment out line 487 and uncomment line 488.
> Modify line 488 to suit.
> 
> This script should be run from /etc/init.d/unbound after the
> write_forward_conf line (574).

The code should also be run on a reload, so after write_hosts_conf on line 598.

Comment 122 Jon 2024-04-15 22:21:29 UTC

Created attachment 1524 [details]
Zip file for dhcpEvent

Update dhcpd.conf.local
- Added commit check for a hostname.  If there is no commit hostname, then do not execute dhcpEvent script.  (small speed improvement)
- General clean-up of config file.

Update dhcpEvent
- removed hostname check for commit.  Now done as part of "on commit" in dhcpd.conf (small speed improvement)
- General clean up.

Comment 123 Jon 2024-04-16 19:17:46 UTC

Michael - I need your input.

I'd like to keep moving forward.  And I am thinking of re-writing the bash script into perl.  And I don't want to go forward without your approval.

So, do you agree with the proof-of-concept and using the DHCP on-commit / on-release / on-expire functions to send IP address and Hostname to Unbound DNS?

If yes, are you OK with re-writing the dchpEvent using perl?  (I have not written anything using perl so this will be a bit of a learning curve)

Comment 124 Nick Howitt 2024-04-16 19:35:16 UTC

I think we also need input on 2 other issues:

1 - can we use `DOMAINNAME=${HOSTNAME#*.*}` using the built-in HOSTNAME do derive the DOMAINNAME or do we have to do `eval "$(/usr/local/bin/readhash /var/ipfire/main/settings)"` which is much slower (in bash), so I assume similar in perl.

2 - does the dhcp-leases file have to be maintained during cache updates? As far as I can see, the file is only ever read on startup of unbound so a cut down version of /usr/sbin/unbound-dhcp-leases-bridge could be run as a pre-start and pre-reload task in init.d/unbound to generate the dhcp-leases file. All the script needs to do is regenerate dhcp-leases with out any restarts. I have attached a working cut down version of the script which removes all deamonising elements and restarts. It also removes the need for my mods which I submitted a couple of weeks ago. It runs very quickly on my machine and makes no noticeable difference to the start up time of unbound. It also, then, removes any requirement to maintain dhcp-leases when doing cache updates, speeding them up.

Comment 125 Jon 2024-04-22 16:21:27 UTC

(In reply to Jon from comment #123)
> Michael - I need your input.
> 
> I'd like to keep moving forward.  And I am thinking of re-writing the bash
> script into perl.  And I don't want to go forward without your approval.
> 
> So, do you agree with the proof-of-concept and using the DHCP on-commit /
> on-release / on-expire functions to send IP address and Hostname to Unbound
> DNS?
> 
> If yes, are you OK with re-writing the dchpEvent using perl?  (I have not
> written anything using perl so this will be a bit of a learning curve)


Michael - are you OK with me going forward with a perl script to do what dhcpEvent.sh does?

Comment 126 Michael Tremer 2024-04-23 11:28:24 UTC

(In reply to Jon from comment #125)
> Michael - are you OK with me going forward with a perl script to do what
> dhcpEvent.sh does?

Erm, why Perl?

Comment 127 Jon 2024-04-23 13:42:07 UTC

It seems to be MUCH faster than bash and had the possibility of getting from 300 ms (w/ bash) down to 100 ms.  And I did not want to learn C.

Comment 128 Jon 2024-04-23 13:43:46 UTC

and I have someone to help me write perl!

Comment 129 Nick Howitt 2024-04-23 13:51:17 UTC

When I ran your dhcpEvent in bash but removed the updates to the /etc/unbound/dhcp-leases.conf and all your pre-amble at the top to kill the unbound-dhcp-leases-bridge process and used my alt method for deriving the domain name, my dhcpEvent took about 70 ms to execute.

Comment 130 Bernhard Bitsch 2024-04-23 16:03:24 UTC

I suppose Nick's changes have found their way into CU185.
The restarts are really minimised, but the transfer of new DHCP info needs a reload nevertheless.
This resets the update timers for RPZ list, for example.
Jon's approach with unbound-control avoids this.
But it is necessary to separate the DHCP events handling from the update process:
- the event writes the parameters to a pipe
- the bridge process works as a server reading this pipe

My idea.

Comment 131 Nick Howitt 2024-04-23 16:08:56 UTC

Hmm. I never received any feedback for my patch. I was going to chase it again on Thursday, 28 days after I submitted it and I have not updated to CU185 to see if it is there.

However, Jon's approach is way better and mine should only be considered an interim fix. I'll update my machine to see if my fix has been incorporated.

Comment 132 Jon 2024-04-23 22:11:51 UTC

I did not locate any recent changes to the bridge.  The last change I found is July 2023.

https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=f20ca78eff6e8baeb86361f55adf52819d1bae1f


But I do see an update to Unbound:

https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=4e57a4f31ac0b3bebac4fd43ad1dd282a4ded352

Maybe that changed the "restarts"

Comment 133 Nick Howitt 2024-04-24 09:44:10 UTC

The only way that could affect the restarts is if it no longer wrote DHCP leases to the cache at /var/state/dhcp/dhcpd.leases

I can't test for the moment as my test system is bust and won't do DNS so I can't update it and I've had to decommission my production system to test a work related issue on another firewall.

Comment 134 Michael Tremer 2024-05-10 16:59:00 UTC

So I have implemented the version that I had in mind where a simple script (as simple as possible really) communities with the bridge over a UNIX domain socket:

> https://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=shortlog;h=refs/heads/unbound-socket

The client currently waits for a response from the bridge which takes around 100 ms. Without the wait it finishes in less than 10 ms on an IPFire Mini Appliance. That should pretty much be the goal.

The bridge will then take all the events and put them into a queue so that it is ready to read the event as soon as possible. The queue is being processed by a worker thread which will update all existing leases and apply any changes to Unbound if necessary.

Please give this a test and let me know if this is all working for you, too.

Comment 135 Jon 2024-05-10 23:50:57 UTC

I’d be happy to test but I really don’t know how to install this.  Sorry for being stupid.

I saw ~22 patches.  Is this part of CU 186 or CU 187?

Comment 136 Jon 2024-05-11 00:32:04 UTC

Michael,
I started looking through the patches (they are not installed).

I came across the "Add event handlers" which looks like the code that goes into a "dhcpd.conf.local" file for me:

https://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=blob;f=html/cgi-bin/dhcp.cgi;h=83cd809656ad0076fd5d4646b8ce64d09aff3a55;hb=83cd809656ad0076fd5d4646b8ce64d09aff3a55#l1377

If it is the same code for "dhcpd.conf.local", then there is a parameter missing.

Bernhard noted the dhcp.cgi WebGUI page "Domain name suffix" should change the domain name.  So this was added to the dhcpd.conf code:

set NoDomain = "localdomain";
set ClientDomain = pick-first-value(config-option domain-name, NoDomain);
. . .
execute("/usr/sbin/dhcpEvent", "commit", ClientIP, ClientName, ClientDomain);

Comment 137 Michael Tremer 2024-05-11 10:20:44 UTC

(In reply to Jon from comment #135)
> I’d be happy to test but I really don’t know how to install this.  Sorry for
> being stupid.
> 
> I saw ~22 patches.  Is this part of CU 186 or CU 187?

You just need to copy the three files that I have touched: dhcp.cgi and the bridge and the new client. The scripts go to /usr/sbin. Make sure they are executable. Then hit the Save button on the DHCP page to rewrite the configuration file and everything should be up and running.

This is not yet scheduled for any of the updates.

(In reply to Jon from comment #136)
> Michael,
> I started looking through the patches (they are not installed).
> 
> I came across the "Add event handlers" which looks like the code that goes
> into a "dhcpd.conf.local" file for me:
> 
> https://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=blob;f=html/cgi-bin/
> dhcp.cgi;h=83cd809656ad0076fd5d4646b8ce64d09aff3a55;
> hb=83cd809656ad0076fd5d4646b8ce64d09aff3a55#l1377
> 
> If it is the same code for "dhcpd.conf.local", then there is a parameter
> missing.

Yeah, I took this from the file that you uploaded in the ZIP container in an earlier comment. Maybe I messed this up when copy-and-pasting it?

> Bernhard noted the dhcp.cgi WebGUI page "Domain name suffix" should change
> the domain name.  So this was added to the dhcpd.conf code:
> 
> set NoDomain = "localdomain";
> set ClientDomain = pick-first-value(config-option domain-name, NoDomain);
> . . .
> execute("/usr/sbin/dhcpEvent", "commit", ClientIP, ClientName, ClientDomain);

I did experiment with that, because we currently have a rather complicated solution to determine the full FQDN. However, on my test system the domain-name field was always empty. The domain-name is also empty in the leases file. Not sure if this requires some more changes elsewhere in the configuration?

The protocol that the bridge is talking over the socket should be easy to extend without even touching the client. It just pipes the event into the socket and waits for a response :)

Comment 138 Nick Howitt 2024-05-11 11:09:39 UTC

I have to admit I have not read the code in detail, but @Jon, didn't you move the domain check to your dhcpd.conf.local file because it was faster there than in the dhcpScript file? Would it now be faster in the new dhcp-unbound-leases-bridge/script routine as things like the DOMAINNAME only need to be determined once and, because the data has been piped at this point, the dhcpd.conf.local will return quicker?

Comment 139 Nick Howitt 2024-05-11 11:59:47 UTC

@Jon, the latest dhcpd.conf.local is only on your github repo. It did not make it to this bug which is probably why @Michael does not know about it.

Comment 140 Jon 2024-05-12 02:51:11 UTC

Created attachment 1547 [details]
dhcp settings

(In reply to Michael Tremer from comment #137)
> (In reply to Jon from comment #135)
> > If it is the same code for "dhcpd.conf.local", then there is a parameter
> > missing.
> 
> Yeah, I took this from the file that you uploaded in the ZIP container in an
> earlier comment. Maybe I messed this up when copy-and-pasting it?
> 
not at all.  it is a recent change.


> I did experiment with that, because we currently have a rather complicated
> solution to determine the full FQDN. However, on my test system the
> domain-name field was always empty. The domain-name is also empty in the
> leases file. Not sure if this requires some more changes elsewhere in the
> configuration?

No additional changes.  Just adding three lines to dhcpd.conf:
```
set NoDomain = "localdomain";
set ClientDomain = pick-first-value(config-option domain-name, NoDomain);
. . .
execute("/usr/sbin/dhcpEvent", "commit", ClientIP, ClientName, ClientDomain);
```

maybe a dash is missing from "config-option . . ." or a misspelled name??  It should never be blank since it should grab the variable "NoDomain"

This is the current dhcpd.conf:
```
#	This file CAN be modified by the user.
#	  If this file is not needed, just delete the text.
#	  But, DO NOT completely delete this file! 

#
#	commit
#
on commit {
	set ClientIP = binary-to-ascii(10, 8, ".", leased-address);

	set NoName = "";
	set ClientName = pick-first-value(option host-name, config-option-host-name, client-name, NoName);

	set NoDomain = "localdomain";
	set ClientDomain = pick-first-value(config-option domain-name, NoDomain);

 	#log(concat("info: commit-CRE: ClientIP: ", ClientIP, "  Name: ", ClientName, "  Domain: ", ClientDomain));

	if ClientName != "" { 
		execute("/usr/sbin/dhcpEvent", "commit", ClientIP, ClientName, ClientDomain);
	} else {
		log(concat("info: commit-CRE: missing client hostname from ", ClientIP, ". Nothing to send to unbound"));
	}
}

#
#	release
#
on release {
	set ClientIP = binary-to-ascii(10, 8, ".", leased-address);
	
	set NoName = "";
	set ClientName = pick-first-value(option host-name, config-option-host-name, client-name, NoName);
	
	#log(concat("info: release-CRE: ClientIP: ", ClientIP, "  Name: ", ClientName));
	execute("/usr/sbin/dhcpEvent", "release", ClientIP, ClientName);
}

#
#	expiry
#
on expiry {
	set ClientIP = binary-to-ascii(10, 8, ".", leased-address);
	
	#log(concat("info: expiry-CRE: ClientIP: ", ClientIP));
	execute("/usr/sbin/dhcpEvent", "expiry", ClientIP, "");
}
```

do the domain names appear in the dhcpd.conf file?


> The protocol that the bridge is talking over the socket should be easy to
> extend without even touching the client. It just pipes the event into the
> socket and waits for a response :)

curious - How does the unbound-dhcp-leases-bridge know when to grab something from the client / socket?

Comment 141 Jon 2024-05-12 19:55:25 UTC

(In reply to Michael Tremer from comment #137)
> (In reply to Jon from comment #135)
> The domain-name is also empty in the
> leases file. Not sure if this requires some more changes elsewhere in the
> configuration?

In the dhcpd.leases file, do you see the "on-expiry/on-release" at the bottom of some leases?  Or do you also have the "set VariAble" lines?

```
lease 192.168.65.106 {
  starts 0 2024/05/12 15:30:20;
  ends 0 2024/05/12 17:30:20;
  cltt 0 2024/05/12 15:30:20;
  binding state active;
  next binding state free;
  rewind binding state free;
  hardware ethernet 40:c7:00:01:02:03;
  uid "\001@\307\000\001\002\003";
  set ClientDomain = "localdomain";
  set NoDomain = "localdomain";
  set ClientName = "SiPhone12";
  set ClientIP = "192.168.65.106";
  client-hostname "SiPhone12";
  on expiry {
    set ClientIP = 
       binary-to-ascii (10, 8, ".", leased-address) ;
    execute ("/usr/sbin/dhcpEvent", "expiry", ClientIP, "");
  }
  on release {
    set ClientIP = 
       binary-to-ascii (10, 8, ".", leased-address) ;
    set NoName = "" ;
    set ClientName = 
       pick-first-value (option host-name, config-option-host-name, client-name, NoName) ;
    execute ("/usr/sbin/dhcpEvent", "release", ClientIP, ClientName);
  }
}

Comment 142 Jon 2024-05-12 20:09:57 UTC

also, I get this error when running `unbound-dhcp-leases-client`

[root@ipfireAPU2 ~]# /usr/sbin/unbound-dhcp-leases-client release 192.168.7.250
/usr/sbin/unbound-dhcp-leases-client: line 41: nc: command not found

Comment 143 Jon 2024-05-12 20:18:32 UTC

looks like the client is dependent on `ncat`.  I added nat via pakfire,  but I get "ERROR":

```
[root@ipfireAPU2 ~]# /usr/sbin/unbound-dhcp-leases-client release 192.168.7.250 
ERROR
```

but I am not sure where it comes from...  May be the unbound bridge??

Comment 144 Jon 2024-05-12 20:34:29 UTC

I added this to the `dhcp.cgi`:

```
set NoDomain = "localdomain";
set ClientDomain = pick-first-value(config-option domain-name, NoDomain);
log(concat("info: commit-CRE: ClientIP: ", ClientIP, "  Name: ", ClientName, "  Domain: ", ClientDomain));

if (ClientName != "") {
    execute("/usr/sbin/unbound-dhcp-leases-client", "commit", ClientAddress, ClientName, ClientDomain);
}
```


and it worked for getting the domain.  It arrives as `execute_statement argv[4]`:
```
May 12 15:25:03 ipfireAPU2 dhcpd: execute_statement argv[4] = localdomain
```


BUT, the get it to "take" I had to reload the dhcp.cgi WebGUI page _AND_ do a `/etc/init.d/dhcp restart`.  Otherwise it did not work.


And I get these errors:
````
May 12 15:25:03 ipfireAPU2 dhcp[18015]: Could not handle message: No value given
May 12 15:25:03 ipfireAPU2 dhcpd: execute: /usr/sbin/unbound-dhcp-leases-client exit status 256

```

here is the section of the message log:


```
May 12 15:25:01 ipfireAPU2 dhcpd: DHCPDISCOVER from 02:d2:ef:b1:e2:5d via green0
May 12 15:25:02 ipfireAPU2 dhcpd: DHCPOFFER on 192.168.7.104 to 02:d2:ef:b1:e2:5d via green0
May 12 15:25:03 ipfireAPU2 dhcpd: execute_statement argv[0] = /usr/sbin/unbound-dhcp-leases-client
May 12 15:25:03 ipfireAPU2 dhcpd: execute_statement argv[1] = commit
May 12 15:25:03 ipfireAPU2 dhcpd: execute_statement argv[2] = ADDRESS=192.168.7.104
May 12 15:25:03 ipfireAPU2 dhcpd: execute_statement argv[3] = NAME=
May 12 15:25:03 ipfireAPU2 dhcpd: execute_statement argv[4] = localdomain
May 12 15:25:03 ipfireAPU2 dhcp[18015]: Could not handle message: No value given
May 12 15:25:03 ipfireAPU2 dhcpd: execute: /usr/sbin/unbound-dhcp-leases-client exit status 256
May 12 15:25:03 ipfireAPU2 dhcpd: DHCPREQUEST for 192.168.7.104 (192.168.7.1) from 02:d2:ef:b1:e2:5d via green0
May 12 15:25:03 ipfireAPU2 dhcpd: DHCPACK on 192.168.7.104 to 02:d2:ef:b1:e2:5d via green0
```


maybe related to adding an extra argument?!?

Comment 145 Michael Tremer 2024-05-13 10:35:39 UTC

(In reply to Nick Howitt from comment #139)
> @Jon, the latest dhcpd.conf.local is only on your github repo. It did not
> make it to this bug which is probably why @Michael does not know about it.

I didn't even know there was a GitHub repository. It has not been linked on here.

Do you not have one on our infrastructure?

(In reply to Jon from comment #140)
> do the domain names appear in the dhcpd.conf file?

Yes, I had domain names configured.

> > The protocol that the bridge is talking over the socket should be easy to
> > extend without even touching the client. It just pipes the event into the
> > socket and waits for a response :)
> 
> curious - How does the unbound-dhcp-leases-bridge know when to grab
> something from the client / socket?

It just listens all the time. When a connection comes in, it reads exactly one message from it, processes it and then sends a response. The "processing" part is simply to check if the message is roughly valid. Properly handling happens later and the client will never know if it has been properly processed. But that is in line with dhpcd's fire and forget approach.

(In reply to Jon from comment #141)
> (In reply to Michael Tremer from comment #137)
> > (In reply to Jon from comment #135)
> > The domain-name is also empty in the
> > leases file. Not sure if this requires some more changes elsewhere in the
> > configuration?
> 
> In the dhcpd.leases file, do you see the "on-expiry/on-release" at the
> bottom of some leases?  Or do you also have the "set VariAble" lines?

Yes, my leases have this set. Weird.

(In reply to Jon from comment #144)
> maybe related to adding an extra argument?!?

You seem to have two problems here. First, NAME is empty. That makes the bridge consider that this is an invalid message. Maybe it should not do that.

Then you added the domain, but it is missing its key. It should be an argument of the format KEY=VALUE, or in this case DOMAIN=localdomain.

Comment 146 Jon 2024-05-13 17:01:30 UTC

(In reply to Michael Tremer from comment #145)
> I didn't even know there was a GitHub repository. It has not been linked 
> on here.  Do you not have one on our infrastructure?

The GitHub was private and planned to move it all once you had blessed the proof-of-concept.

> (In reply to Jon from comment #144)
> > maybe related to adding an extra argument?!?
> 
> You seem to have two problems here. First, NAME is empty. That makes the
> bridge consider that this is an invalid message. Maybe it should not do that.
> 
> Then you added the domain, but it is missing its key. It should be an
> argument of the format KEY=VALUE, or in this case DOMAIN=localdomain.

Got it.  I'll change my testing to KEY=VALUE.

Is DOMAIN=VALUE in the bridge code? Or is it always set to "localdomain?

Comment 147 Michael Tremer 2024-05-13 17:04:06 UTC

(In reply to Jon from comment #146)
> (In reply to Michael Tremer from comment #145)
> > I didn't even know there was a GitHub repository. It has not been linked 
> > on here.  Do you not have one on our infrastructure?
> 
> The GitHub was private and planned to move it all once you had blessed the
> proof-of-concept.

If you don't have a repository on it already, we also have a Git server :)

> > (In reply to Jon from comment #144)
> > > maybe related to adding an extra argument?!?
> > 
> > You seem to have two problems here. First, NAME is empty. That makes the
> > bridge consider that this is an invalid message. Maybe it should not do that.
> > 
> > Then you added the domain, but it is missing its key. It should be an
> > argument of the format KEY=VALUE, or in this case DOMAIN=localdomain.
> 
> Got it.  I'll change my testing to KEY=VALUE.
> 
> Is DOMAIN=VALUE in the bridge code? Or is it always set to "localdomain?

Currently the bridge doesn't know about a DOMAIN key. But you can still add more data to the message that the bridge receives and nothing will break. So that should help us to verify whether this works and we can add the part in the bridge after that.

Comment 148 Jon 2024-05-13 17:13:54 UTC

got it!

I did added/changed this in the dhcp.cgi for DOMAIN:

```
set ClientDomain  = concat(
	"DOMAIN=",
	pick-first-value(config-option domain-name, "localdomain");

if (ClientName != "") {
	execute("/usr/sbin/unbound-dhcp-leases-client", "commit", ClientAddress, ClientName, ClientDomain);
}

Comment 149 Jon 2024-05-14 21:30:45 UTC

When an item is released, should it be deleted from unbound cache AND dhcp-leases.conf?

I see items gone from the unbound cache (as they should), but they still exist in the /etc/unbound/dhcp-leases.conf file.


I am using these commands to check:
```
unbound-control list_local_data | grep <HOSTNAME>

cat /etc/unbound/dhcp-leases.conf | grep <HOSTNAME>
```

Comment 150 Nick Howitt 2024-05-14 21:48:04 UTC

It is not working like that. The new programs do not maintain dhcp-leases.conf dynamically. It works more like I proposed, but a little differently. When dhcp starts, it launches the bridge routine. In its first pass it generates a new dhcp-leases.conf file then unbound is reloaded. There is then no further attempt to maintain the dhcp-leases.conf until dhcp is next started.

Thinking about it, I am not sure it is particularly safe this way as, if unbound restarts, it will revert to the earlier dhcp-leases.conf which is no longer valid. Possibly my approach was safer, running a separate script on dhcp reload or start.

Perhaps Michael can comment?

Comment 151 Michael Tremer 2024-05-15 11:04:38 UTC

(In reply to Nick Howitt from comment #150)
> It is not working like that. The new programs do not maintain
> dhcp-leases.conf dynamically. It works more like I proposed, but a little
> differently. When dhcp starts, it launches the bridge routine. In its first
> pass it generates a new dhcp-leases.conf file then unbound is reloaded.
> There is then no further attempt to maintain the dhcp-leases.conf until dhcp
> is next started.

This is correct. We don't maintain the leases file for Unbound any more. We only generate it once when the bridge starts from the DHCP Leases and then we continue to update Unbound using unbound-control.

> Thinking about it, I am not sure it is particularly safe this way as, if
> unbound restarts, it will revert to the earlier dhcp-leases.conf which is no
> longer valid. Possibly my approach was safer, running a separate script on
> dhcp reload or start.

Well, this is indeed a problem with this approach (all of them have their downsides). The bridge has to be restarted after either Unbound or DHCP are being restarted, otherwise we risk the two things getting out of sync.

A possible solution could be to rewrite the unbound configuration regularly, so that if there is an unexpected restart of Unbound, it won't really start very far in the past.

(In reply to Jon from comment #149)
> unbound-control list_local_data | grep <HOSTNAME>

This should absolutely show you the current status.

Comment 152 Nick Howitt 2024-05-15 11:29:03 UTC

Or just use my unbound-dhcp-leases script at start up and reload of unbound and remove that element from your script? Then we cover all bases, I think.

Comment 153 Michael Tremer 2024-05-15 11:31:48 UTC

(In reply to Nick Howitt from comment #152)
> Or just use my unbound-dhcp-leases script at start up and reload of unbound
> and remove that element from your script? Then we cover all bases, I think.

Yes, that should cover 99% of all cases. The rest can be fixed with a reboot as a worst-case option.

Comment 154 Jon 2024-05-15 15:25:42 UTC

(In reply to Michael Tremer from comment #151)
> We don't maintain the leases file for Unbound any more. 
I think it should _always_ be maintained only for IPFire reboots and unbound restarts. 

The proof-of-concept code updates the unbound cache AND updates the unbound dhcp-leases.conf file at every change.

Yes, there are problems with two network interface clients (e.g., Ethernet and Wi-Fi on a single desktop computer), and will cause an out-of-sync.  But I think this can be fixed.

There is no reason to keep parsing the dhcp dhcpd.leases file.  Not even on restart if the unbound dhcp-leases.conf is updated as the event handler / bridge finds changes.

Comment 155 Jon 2024-05-15 15:27:42 UTC

Created attachment 1548 [details]
overly simple flow

This is a (massively) overly simplified view of the dhcp event script proof-of-concept.

Comment 156 Nick Howitt 2024-05-15 15:39:42 UTC

I know it is the current state but Michael is concerned at the number of times we open and close the dhcp-leases.conf file. It could be thousands of times a day and will (does) give a high level of wear on SD card installations and possibly even SSD installations. (Michael's use case has 1000-2000 users who come in in the morning and go home at night. With a 1h lease time, that is a minimum of 2000-4000 writes to the file and more if the devices go off or sleep for any period during the day when the leases expires)

At the same time there is a risk of the dhcp-leases.conf drifting out of true. Regenerating it on each unbound restart fixes and drift from true.

We are continually maintaining a file which is only ever read when unbound (re-)starts so it seems a waste of effort to continually maintain it compared to regenerating it once per (re-start)

Comment 157 Jon 2024-05-15 15:57:05 UTC

Created attachment 1549 [details]
compare

(In reply to Jon from comment #149)
> I see items gone from the unbound cache (as they should), but they still
> exist in the /etc/unbound/dhcp-leases.conf file.

The above is for 33 devices after about 1 day of use of the new bridge.

I wrote a "diff" compare script for:
/etc/unbound/dhcp-leases.conf AND /etc/unbound/hosts.conf  
-vs-
unbound-control list_local_data 

and I see these missing items in the attached image.

Comment 158 Jon 2024-05-15 16:34:08 UTC

Also, to make items easier to compare, each item placed in the /etc/unbound/dhcp-leases.conf file should be an absolute name with a period at the end of the hostname and the arpa name.  Example:

```
local-data: "MM2.localdomain. 60 IN A 192.168.60.20"
local-data: "20.60.168.192.in-addr.arpa. 60 IN PTR MM2.localdomain."
```

this is how they are stored in unbound cache:
```
[root@ipfire ~] # unbound-control list_local_data | grep MM2
20.60.168.192.in-addr.arpa.	60	IN	PTR	MM2.localdomain.
MM2.localdomain.	60	IN	A	192.168.60.20
```

Comment 159 Jon 2024-05-15 19:31:52 UTC

Created attachment 1550 [details]
better flow chart

Comment 160 Jon 2024-05-27 18:31:59 UTC

just checking in - anything new for me to test?

Comment 161 Jon 2024-06-18 15:51:17 UTC

see this post in Dev Mailing List:
https://lists.ipfire.org/hyperkitty/list/development@lists.ipfire.org/message/SIMSXLGR7BBMIDXPUHNIGC2ZBR3R64YD/

"Concerning CU186, I may have stumbled across an issue.  It took a few days for it to pop up.  See the message log below from Jun 17 13:48:34 to the end.  I experienced a flurry of Reloads and DNS stalled for ~ 15 minutes."

Comment 162 Jon 2024-06-20 18:34:37 UTC

I was looking through the message log and found another Unbound Rebooting storm on Jun 18.

It lasted about 13 minutes and I could not access anything (no DNS).

Comment 163 Michael Tremer 2024-08-13 09:34:38 UTC

Since this project has fallen asleep once again, I went ahead and merged my proposed changes from my Git branch. They are at least better than what we had before and I don't want to waste the time and effort that has gone into this.

Please test this and report any regressions here.

Comment 164 Nick Howitt 2024-08-13 19:53:09 UTC

Sorry, but I became unable to contribute as I could not apply all your patches. I am afraid I struggle with the multiple commits especially when some of them update other commits. I really needed a repo I could pull the full update from. This should now exist.

If I do any more work on the updates it will be to regenerate the dhcp-leases.conf on each unbound restart to get round the issue raised in https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150, and remove the updates to the unbound dhcp-leases.conf file at every change to speed it up and and reduce disk wear as the continual updates are pointless. As stated before, dhcp-leases.conf only ever needs to be valid when unbound starts or restarts and not at any other time.

If this is not the direction IPFire wants to take then there is no point in me providing patches. Please advise.

Comment 165 Michael Tremer 2024-08-14 09:33:12 UTC

(In reply to Nick Howitt from comment #164)
> Sorry, but I became unable to contribute as I could not apply all your
> patches. I am afraid I struggle with the multiple commits especially when
> some of them update other commits. I really needed a repo I could pull the
> full update from. This should now exist.

In case of there being so many changes, it is easiest to just download the raw script from the Git web UI. In this case, there were only two or three files that were actually changed.

> If I do any more work on the updates it will be to regenerate the
> dhcp-leases.conf on each unbound restart to get round the issue raised in
> https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150, and remove the
> updates to the unbound dhcp-leases.conf file at every change to speed it up
> and and reduce disk wear as the continual updates are pointless. As stated
> before, dhcp-leases.conf only ever needs to be valid when unbound starts or
> restarts and not at any other time.

We should not continuously write a file that is never being read. If anything there could be some check if unbound is running (or has changed PID) and in case we detect that, we can simply call the reload function of the bridge which will rewrite the leases file and reload Unbound. That would put those things back in sync.

> If this is not the direction IPFire wants to take then there is no point in
> me providing patches. Please advise.

I don't quite understand the question. We cannot just always stop when things get complicated and dump things on my table. Not saying that this has happened in this case, but I think we are not very far away from it. It takes me months to pick up on hundreds of these little projects. That does not mean that I don't want the changes, I just move my attention to something else if things are not going. Silence does not send a signal to me.

Comment 166 Nick Howitt 2024-08-14 10:28:13 UTC

(In reply to Michael Tremer from comment #165)
> > If I do any more work on the updates it will be to regenerate the
> > dhcp-leases.conf on each unbound restart to get round the issue raised in
> > https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150, and remove the
> > updates to the unbound dhcp-leases.conf file at every change to speed it up
> > and and reduce disk wear as the continual updates are pointless. As stated
> > before, dhcp-leases.conf only ever needs to be valid when unbound starts or
> > restarts and not at any other time.
> 
> We should not continuously write a file that is never being read. If
> anything there could be some check if unbound is running (or has changed
> PID) and in case we detect that, we can simply call the reload function of
> the bridge which will rewrite the leases file and reload Unbound. That would
> put those things back in sync.
So, if you reload unbound, "we can simply call the reload function of
the bridge which will rewrite the leases file and reload Unbound. That would
put those things back in sync". That sounds like an infinite loop.

It also means creating a watcher on the process.

Your update has specifically been designed to continuously write to the leases file. Then. because of the issue I pointed out in https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150, you seem to be coming up with evermore complex workarounds which are all unnecessary if we give up maintaining the leases file except when unbound starts or restarts.

> 
> > If this is not the direction IPFire wants to take then there is no point in
> > me providing patches. Please advise.
> 
> I don't quite understand the question. We cannot just always stop when
> things get complicated and dump things on my table. Not saying that this has
> happened in this case, but I think we are not very far away from it. It
> takes me months to pick up on hundreds of these little projects. That does
> not mean that I don't want the changes, I just move my attention to
> something else if things are not going. Silence does not send a signal to me.

The silence came because of lack of feedback to the mailing list thread and because the direction I am happy to go in does not seem to meet with approval.

Comment 167 Bernhard Bitsch 2024-08-14 11:31:42 UTC

I think I must break my 'silence' also.

The problem of many restarts of unbound has two implications:
1) a restart reduces the availability of the DNS service.
2) a restart or even reload resets internal timers

Topic 1) can be handled by means of reduction of writes to config files. This needs an examination which DHCP information must be kept for restarts of unbound or IPFire, this includes power loss handling.
Topic 2) influences unbound features which use automatic update mechanisms. The main task in the development pipe at the moment is jon's approach to include PiHole functionality using RPZ. The files used include a SOA record which defines an update period. Usage of these show, that unbound resets the update counters with each reload of the config files. Michael's new version tries to eliminate the reload by another way to signal DHCP information to unbound. dhcpd allows the use of events for leases changes to start a little program. This program triggers unbound-dhcp-leases-bridge to do unbound-control operations to add/remove local data definitions. This changes the internal information of unbound only, without touching the update timers.
This means unbound-leases-bridge must update the settings read at a restart of unbound, also.

I test Michael's modifications in parallell to Jon's RPZ addition.
I can confirm that the number of reloads went to the necessary amount, and even RPZ files with an update interval of 1d are up-to-date.
So I think we should pusue this approach ( Thanks, Michael, for your work! ).

Comment 168 Nick Howitt 2024-08-14 13:24:13 UTC

(In reply to Bernhard Bitsch from comment #167)
> I test Michael's modifications in parallell to Jon's RPZ addition.
> I can confirm that the number of reloads went to the necessary amount, and
> even RPZ files with an update interval of 1d are up-to-date.
> So I think we should pusue this approach ( Thanks, Michael, for your work! ).

Then, I am afraid you have not understood https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150 and are accepting unnecessary workarounds.

My approach uses most of Michael's code and avoids the bridge restarting unbound when the bridge starts (so when dhcpd starts), so makes fractionally less restarts of unbound. It also avoids complicated workarounds for comment #150. It has the same impact on RPZ (or fractionally less as the unbound restart is removed from the bridge start up). Dhcpd can then be restarted as often as you want without impacting unbound and causing it to restart.

Comment 169 Bernhard Bitsch 2024-08-14 13:52:59 UTC

You may be wright.

But I think to discuss this problem further we should clarify some things:
1. where is what information stored? When is this info used?
2. which version of Michael's bridge program are we speaking of?
3. which version of your program are we speaking of?

My state about these three points is
1. don't know exactly, but I think Michael is experienced enough about the bridge. So I can trust, that he transfered the knowledge to his new version
2. I use the version downloaded shortly after his announcement
3. the main goal of your version is to reduce restarts/reloads to a minimum ( dhcpd restarts, new leases information )

I know, this may not be the newest status. But this thread has become very long and did not read all postings ( and the associated post in the dev list ).

Comment 170 Bernhard Bitsch 2024-08-14 14:17:32 UTC

An annotation more, 'unbound-control reload' does a restart of the internal state of unbound.
I haven't yet verified  by code inspection, but experience with both ways of announcing DHCP information shows this behaviour.

Comment 171 Michael Tremer 2024-08-14 14:32:06 UTC

(In reply to Nick Howitt from comment #166)
> (In reply to Michael Tremer from comment #165)
> > > If I do any more work on the updates it will be to regenerate the
> > > dhcp-leases.conf on each unbound restart to get round the issue raised in
> > > https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150, and remove the
> > > updates to the unbound dhcp-leases.conf file at every change to speed it up
> > > and and reduce disk wear as the continual updates are pointless. As stated
> > > before, dhcp-leases.conf only ever needs to be valid when unbound starts or
> > > restarts and not at any other time.
> > 
> > We should not continuously write a file that is never being read. If
> > anything there could be some check if unbound is running (or has changed
> > PID) and in case we detect that, we can simply call the reload function of
> > the bridge which will rewrite the leases file and reload Unbound. That would
> > put those things back in sync.
> So, if you reload unbound, "we can simply call the reload function of
> the bridge which will rewrite the leases file and reload Unbound. That would
> put those things back in sync". That sounds like an infinite loop.
> 
> It also means creating a watcher on the process.

There won't be a loop, because this should only be triggered when Unbound is being manually restarted. It will always load the old leases and therefore be out of sync. Depending on when the bridge was started last, this might be very far off.

> Your update has specifically been designed to continuously write to the
> leases file. Then. because of the issue I pointed out in
> https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150, you seem to be
> coming up with evermore complex workarounds which are all unnecessary if we
> give up maintaining the leases file except when unbound starts or restarts.

Either we want this thing to be very accurate or not.

The original motivation was that we have Unbound reload less. This seems to be the only viable way to me.

(In reply to Bernhard Bitsch from comment #167)
> Topic 2) influences unbound features which use automatic update mechanisms.
> The main task in the development pipe at the moment is jon's approach to
> include PiHole functionality using RPZ. The files used include a SOA record
> which defines an update period. Usage of these show, that unbound resets the
> update counters with each reload of the config files. Michael's new version
> tries to eliminate the reload by another way to signal DHCP information to
> unbound. dhcpd allows the use of events for leases changes to start a little
> program. This program triggers unbound-dhcp-leases-bridge to do
> unbound-control operations to add/remove local data definitions. This
> changes the internal information of unbound only, without touching the
> update timers.
> This means unbound-leases-bridge must update the settings read at a restart
> of unbound, also.

I am not interested in anything regarding the RPZs right now. They have not been properly put on the agenda and looking at how much time we have on our hands, this won't make it on the agenda for years.

I don't want to build blockers, but this ticket is about a different problem which I want to solve first.

(In reply to Bernhard Bitsch from comment #169)
> But I think to discuss this problem further we should clarify some things:
> 2. which version of Michael's bridge program are we speaking of?

Please let's not look at anything else than what is in next. The goal is to get that stable and ready for release.

Comment 172 Nick Howitt 2024-08-14 15:19:35 UTC

Ok, do what you want, but please consider this. If IPF has been running a month with no restarts of dhcpd and unbound, if unbound then gets restarted on its own, then all the accumulated DCHP state information from the last month is lost to unbound as it reads back in the dhcp-leases.conf from the last dhcpd restart. As a result the internal DNS resolution could be invalid or just partially valid.

I would struggle to find this acceptable, but it is the way the patching has gone.

Comment 173 Michael Tremer 2024-08-21 09:12:24 UTC

(In reply to Nick Howitt from comment #172)
> Ok, do what you want, but please consider this. If IPF has been running a
> month with no restarts of dhcpd and unbound, if unbound then gets restarted
> on its own, then all the accumulated DCHP state information from the last
> month is lost to unbound as it reads back in the dhcp-leases.conf from the
> last dhcpd restart. As a result the internal DNS resolution could be invalid
> or just partially valid.
> 
> I would struggle to find this acceptable, but it is the way the patching has
> gone.

No this is not acceptable indeed:

> https://git.ipfire.org/?p=ipfire-2.x.git;a=commitdiff;h=8ead2ddf3d0f1521c2a99b509373615f77a754a0

Comment 174 Nick Howitt 2024-09-04 19:30:27 UTC

I've been AFK for a while.

Looking at the changes, I can see what has been done and it should work, adding a listener, but it seems to me to be over-complicated.

Surely it would be easier if unbound looked after creating the leases file it needs itself during startup? It would avoid a listener and I don't think the code is any more complicated. It removes the leases file update from dhcpd and adds it to unbound which can run it only when it is needed, during startup and restart.

Comment 175 Michael Tremer 2024-09-05 10:26:54 UTC

(In reply to Nick Howitt from comment #174)
> I've been AFK for a while.

Welcome back.

> Looking at the changes, I can see what has been done and it should work,
> adding a listener, but it seems to me to be over-complicated.

It is not *that* complicated, but it is more code than I had anticipated.

> Surely it would be easier if unbound looked after creating the leases file
> it needs itself during startup? It would avoid a listener and I don't think
> the code is any more complicated. It removes the leases file update from
> dhcpd and adds it to unbound which can run it only when it is needed, during
> startup and restart.

As in: Let's start the script once again with a special option like "--one-shot" so that it will update the leases file, then terminate and Unbound is being launched with a fresh leases file? Yes, that would be acceptable to me. We would however have to restart the daemon once again to sync state. There is a small window where the leases could change between the one-off shot and the daemon being relaunched.

If people restart Unbound without the script, well, then they are on their own.

I had another thought in mind why we would need to watch the process, but I cannot think of it now. Maybe it was just that we will be out of sync again if we call the script more than once.

Comment 176 Nick Howitt 2024-09-05 18:18:03 UTC

Ages ago I created a cut-down version of the original python bridge script - https://bugzilla.ipfire.org/attachment.cgi?id=1516 - whose sole purpose is to generate the leases file. Perhaps more can be cut out of it as I am not a python expert. The intent was for it to be packaged with unbound and called from the init script before a start or restart.

Then all lease file maintenance can be removed from your new script and the listener can be removed.

It also makes unbound self-contained, updating its "own" leases file when it needs to.

Is there anything wrong with this approach?

Comment 177 Michael Tremer 2024-09-06 10:46:03 UTC

(In reply to Nick Howitt from comment #176)
> It also makes unbound self-contained, updating its "own" leases file when it
> needs to.
> 
> Is there anything wrong with this approach?

How would any continuous updates work? Generating the leases file at the start is a good idea, but what happens after that?

We cannot change the DHCP execute statements to call unbound-control as those will take a very long time to run and will block the DHCP server.

I think all in all, we are not very close to a very aesthetic solution. This all seems to be very complicated, confusing and over-engineered. However, when ever we want to make things simpler, we will create some disadvantages. Unless we have introduced any new problems, I believe the version that is currently in next has the smallest number of disadvantages (it mainly is complicated), but it does not lack any features like getting Unbound out of sync or stalling the DHCP server.

Comment 178 Nick Howitt 2024-09-06 12:56:26 UTC

Continuous updates should still happen from your script (or even Jon's), but remove the leases file updating from it. Package your script with the dhcp package.

The leases file should be updated from my script (which anyone is welcome to tweak) and the unbound init.d file should be modified to run it on start and reload. My script should be packaged with unbound.

Like this it does not lack any features like getting Unbound out of sync or stalling the DHCP server

Comment 179 Adolf Belka 2024-09-23 12:19:19 UTC

So the patches from @Michael's unbound-dhcp-leases-bridge repo branch have been merged into CU188 and that has had its Testing phase and has now been released.

Can some feedback data be provided on how this has dealt with, or not, the original issue of 'lots of unbound restarts'.

Comment 180 Jon 2024-09-23 15:25:51 UTC

So far so good for a short test period of almost 4 days.  6 total Unbound Restarts.

Restarts:
3 related to CU 188 upgrade (before 1st reboot)

1 related to the 1st reboot (part of 1st reboot)

1 related to "Reading fix leases from /var/ipfire/dhcp/fixleases"

1 related to me testing RPZ

---

Summary:
1 total for the testing period.  But I don't think this one really counts since it was so close to the 1st reboot.  


Logs:
Sep 19 11:15:00 ipfire syslogd 1.5.1: restart (remote reception).
. . .
Sep 19 11:15:23 ipfire unbound: [2125:0] info: start of service (unbound 1.21.0).
. . .
Sep 19 11:15:45 ipfire unbound: [2125:0] info: start of service (unbound 1.21.0).
. . .
Sep 19 11:16:40 ipfire dhcp[3876]: Reading static hosts from /var/ipfire/main/hosts
Sep 19 11:16:40 ipfire dhcp[3876]: Reading DHCP leases from /var/state/dhcp/dhcpd.leases
Sep 19 11:16:40 ipfire dhcp[3876]: Reading fix leases from /var/ipfire/dhcp/fixleases
Sep 19 11:16:40 ipfire unbound: [2125:0] info: service stopped (unbound 1.21.0).
. . .
Sep 19 11:17:01 ipfire unbound: [2125:0] info: start of service (unbound 1.21.0).

---

So far so good!!  I'll continue to keep an eye on it.

Comment 181 Jon 2024-09-23 15:41:47 UTC

Forgot one line:

and no unbound restarts, related to dhcp, in the past the past 3-4 days!  Yay!

Comment 182 Peter MÃ¼ller 2024-09-23 17:35:47 UTC

Given this positive feedback, I am taking the liberty of closing this bug as FIXED. If it is not, please feel free to reopen it. Thank you.

Comment 183 Michael Tremer 2024-09-23 17:42:12 UTC

This is not done as there is not the best process in place with what happens when unbound is restarted manually. There are also some smaller outstanding problems with static hosts.