Bug 13254 - CU 177 - LOTS of unbound restarts (1000s per week)
Summary: CU 177 - LOTS of unbound restarts (1000s per week)
Status: ASSIGNED
Alias: None
Product: IPFire
Classification: Unclassified
Component: --- (show other bugs)
Version: 2
Hardware: x86_64 Unspecified
: - Unknown - Minor Usability
Assignee: Michael Tremer
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-08-26 20:16 UTC by Jon
Modified: 2024-04-24 09:44 UTC (History)
4 users (show)

See Also:


Attachments
pcap file (2.28 KB, application/vnd.tcpdump.pcap)
2023-11-21 23:31 UTC, Jon
Details
pcap file for collegebrowser.com (1.76 KB, application/vnd.tcpdump.pcap)
2023-11-22 16:54 UTC, Jon
Details
comparison script for testing (2.09 KB, text/plain)
2023-12-05 17:09 UTC, Jon
Details
readme (3.88 KB, text/plain)
2024-03-29 02:28 UTC, Jon
Details
dhcpScript (6.88 KB, text/plain)
2024-03-29 02:29 UTC, Jon
Details
dhcpd.conf.local config file (1.28 KB, text/plain)
2024-03-29 02:32 UTC, Jon
Details
timing (451.10 KB, image/png)
2024-04-04 18:18 UTC, Jon
Details
dhcpEvent updated with speed ups and updating of dhcp-leases.conf removed (6.33 KB, application/x-shellscript)
2024-04-05 11:13 UTC, Nick Howitt
Details
dhcpEvent - v30 (7.62 KB, text/plain)
2024-04-05 15:07 UTC, Jon
Details
Script to generate /etc/unbound/dhcp-leases.conf (12.84 KB, text/x-python3)
2024-04-06 13:48 UTC, Nick Howitt
Details
Zip file for dhcpEvent (3.40 KB, application/zip)
2024-04-15 22:21 UTC, Jon
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jon 2023-08-26 20:16:22 UTC
I went digging through the messages log (old a new) and normally I see a few "unbound: [31414:0] info: start of service" messages every week.  Less than 10 per week.  

Aug  2 13:09:29 ipfire unbound: [31414:0] info: start of service (unbound 1.17.1).

But after upgrading from cu 177 to CU 178 I see thousands per week.

```
-rw-rw-r-- 1 root syslogd 2508718 Aug  6 00:01 /var/log/messages.3.gz

Aug  5 20:35:02 ipfire pakfire: CORE UPGR: Upgrading from release 176 to 177

Aug  2 13:09:29 ipfire unbound: [31414:0] info: start of service (unbound 1.17.1).
Aug  2 13:14:05 ipfire unbound: [32097:0] info: start of service (unbound 1.17.1).
Aug  2 13:46:10 ipfire unbound: [2004:0] info: start of service (unbound 1.17.1).
Aug  2 13:51:27 ipfire unbound: [2880:0] info: start of service (unbound 1.17.1).
Aug  2 17:10:31 ipfire unbound: [1923:0] info: start of service (unbound 1.17.1).
Aug  2 17:10:58 ipfire unbound: [1923:0] info: start of service (unbound 1.17.1).
Aug  5 20:42:07 ipfire unbound: [1923:0] info: start of service (unbound 1.17.1).
Aug  5 20:52:42 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1).
Aug  5 20:53:06 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1).
Aug  5 20:53:42 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1).
Aug  5 20:53:59 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1).
Aug  5 20:54:17 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1).
Aug  5 21:02:28 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1).
```

Here is the count of "start of service" messages.  I just from a few to over 3000 per week:
 
```
-rw-rw-r-- 1 root syslogd 2211885 Jul 23 00:01 /var/log/messages.5.gz
0

-rw-rw-r-- 1 root syslogd 2318649 Jul 30 00:01 /var/log/messages.4.gz
4

-rw-rw-r-- 1 root syslogd 2508718 Aug  6 00:01 /var/log/messages.3.gz
Aug  5 20:35:02 ipfire pakfire: CORE UPGR: Upgrading from release 176 to 177
73

-rw-rw-r-- 1 root syslogd 4108718 Aug 13 00:01 /var/log/messages.2.gz
3559

-rw-rw-r-- 1 root syslogd 3927388 Aug 20 00:01 /var/log/messages.1.gz
Aug 18 18:52:51 ipfire pakfire: CORE UPGR: Upgrading from release 177 to 178
3773

-rw-rw-r-- 1 root syslogd 35421724 Aug 26 15:01 /var/log/messages
3512
[root@ipfire ~] # 
```

Tested on:
APU4D4
IPFire 2.27 (x86_64) - Core-Update 178 (stable)
Comment 1 Jon 2023-08-26 20:19:56 UTC
my bad - this should state:

But after upgrading from cu 176 to CU 177 I see thousands per week.
Comment 3 Adolf Belka 2023-08-28 16:06:23 UTC
I have just checked my logs and I see some but not very many most of the weeks and there are some weeks with zero.

/var/log/messages.9.gz     15
/var/log/messages.8.gz     0
/var/log/messages.7.gz     2
/var/log/messages.6.gz     0
/var/log/messages.5.gz     12
/var/log/messages.4.gz     9
/var/log/messages.3.gz     0
/var/log/messages.2.gz     31
/var/log/messages.1.gz     0
/var/log/messages          0
Comment 4 Jon 2023-08-28 18:00:24 UTC
Ugh!  that is not good (for me!).  I see this on my production and high number on my test APU also.


/var/log/messages.9.gz
2

/var/log/messages.8.gz
0

/var/log/messages.7.gz
Jul 13 20:09:56 ipfire pakfire: CORE UPGR: Upgrading from release 175 to 176
2

/var/log/messages.6.gz
0

/var/log/messages.5.gz
4

/var/log/messages.4.gz
Aug  5 20:35:02 ipfire pakfire: CORE UPGR: Upgrading from release 176 to 177
73

/var/log/messages.3.gz
3559

/var/log/messages.2.gz
Aug 18 18:52:51 ipfire pakfire: CORE UPGR: Upgrading from release 177 to 178
3773

/var/log/messages.1.gz
3718

/var/log/messages
850
Comment 5 Jon 2023-08-28 18:16:26 UTC
here is my test box

APU4D4
IPFire 2.27 (x86_64) - Core-Update 179 Development Build: master/b5d85855

(just formatted differently)


/var/log/messages.9.gz 	0

/var/log/messages.8.gz 	2

/var/log/messages.7.gz 	2
Jul 11 15:42:55 ipfireAPU pakfire: CORE UPGR: Upgrading from release 175 to 176

/var/log/messages.6.gz 	2

/var/log/messages.5.gz 	0

/var/log/messages.4.gz 	484
Jul 30 13:09:21 ipfireAPU pakfire: CORE UPGR: Upgrading from release 175 to 177

/var/log/messages.3.gz 	417

/var/log/messages.2.gz 	493
Aug 14 21:44:42 ipfireAPU pakfire: CORE UPGR: Upgrading from release 176 to 178

/var/log/messages.1.gz 	856
Aug 21 11:43:32 ipfireAPU pakfire: CORE UPGR: Upgrading from release 177 to 179

/var/log/messages 	117
Comment 6 Jon 2023-08-28 19:49:01 UTC
Adolf & all,

I found a ipfire box that has CU 174 on it and not too many unbound stop/start messages.

[root@ipfireAPU2 ~] # find /var/log -iname "messages*" | sort -Vr | tail -10 | xargs -I % bash -c 'echo ; printf "% \t" % ; grep --color -c "start of service" <(zcat -f %) ; grep --color "CORE UPGR:" <(zcat -f %)'

/var/log/messages.8.gz 	2
/var/log/messages.7.gz 	4
/var/log/messages.6.gz 	4
/var/log/messages.5.gz 	10
/var/log/messages.4.gz 	14
/var/log/messages.3.gz 	2
/var/log/messages.2.gz 	2
/var/log/messages.1.gz 	0
/var/log/messages 	2
[root@ipfireAPU2 ~] # 

I want to walk the Updates from CU 174 to CU 176 and let it sit for a day.  Then walk the Update from CU 176 to CU 177.  And let it sit again to check for unbound stop/start messages.

Do you know a way to do this?
Comment 7 Jon 2023-08-28 23:22:03 UTC
Same issue with a 3rd ipfire device.  Today I Updated from CU 174 to CU 178. The update happened only ~2 hours so not too many issues.

Interesting item - the stopping / start seem to happen right after a DHCPREQUEST and DHCPACK.

=====

```
[root@ipfireAPU2 ~] # find /var/log -iname "messages*" | sort -Vr | tail -1 | xargs -I % bash -c 'echo ; printf "% \n" % ; grep --color -e dhcp -e "CORE UPGR:" -e "start of service" -e reboot -e restart -e "service stopped" <(zcat -f %)'

/var/log/messages 
. . .
Aug 28 16:03:20 ipfireAPU2 pakfire: CORE UPGR: Upgrading from release 174 to 178

. . .

Aug 28 16:23:33 ipfireAPU2 shutdown[23875]: shutting down for system reboot

. . .

Aug 28 16:26:04 ipfireAPU2 syslogd 1.5.1: restart (remote reception).
Aug 28 16:26:04 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1).
Aug 28 16:27:15 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1).
Aug 28 16:27:15 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1).
Aug 28 16:27:43 ipfireAPU2 dhcpd: Wrote 5 leases to leases file.
Aug 28 16:27:43 ipfireAPU2 dhcpd: Server starting service.
Aug 28 16:27:44 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1).
Aug 28 16:27:45 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1).

Aug 28 16:27:51 ipfireAPU2 dhcpd: DHCPDISCOVER from d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 16:27:51 ipfireAPU2 dhcpd: DHCPOFFER on 192.168.7.100 to d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 16:27:52 ipfireAPU2 dhcpd: DHCPREQUEST for 192.168.7.100 (192.168.7.1) from d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 16:27:52 ipfireAPU2 dhcpd: DHCPACK on 192.168.7.100 to d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 16:27:55 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1).
Aug 28 16:27:56 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1).

Aug 28 16:57:52 ipfireAPU2 dhcpd: DHCPREQUEST for 192.168.7.100 from d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 16:57:52 ipfireAPU2 dhcpd: DHCPACK on 192.168.7.100 to d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 16:57:56 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1).
Aug 28 16:57:57 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1).

Aug 28 17:57:52 ipfireAPU2 dhcpd: DHCPREQUEST for 192.168.7.100 from d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 17:57:52 ipfireAPU2 dhcpd: Wrote 5 leases to leases file.
Aug 28 17:57:52 ipfireAPU2 dhcpd: DHCPACK on 192.168.7.100 to d8:eb:97:b8:03:08 (iMac3) via green0
Aug 28 17:57:53 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1).
Aug 28 17:57:53 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1).
[root@ipfireAPU2 ~] # 
```
Comment 8 Adolf Belka 2023-09-04 09:59:23 UTC
I just ran your command on my min appliance system and got the following output.


/var/log/messages.19.gz 	2
/var/log/messages.18.gz 	0
/var/log/messages.17.gz 	0
/var/log/messages.16.gz 	0
/var/log/messages.15.gz 	7
/var/log/messages.14.gz 	2
/var/log/messages.13.gz 	0
/var/log/messages.12.gz 	5
Jun 13 11:51:09 iapetus pakfire: CORE UPGR: Upgrading from release 174 to 175
/var/log/messages.11.gz 	9
/var/log/messages.10.gz 	15
/var/log/messages.9.gz 	        0
/var/log/messages.8.gz 	        2
Jul 11 17:54:08 iapetus pakfire: CORE UPGR: Upgrading from release 175 to 176
/var/log/messages.7.gz 	        0
/var/log/messages.6.gz 	        12
/var/log/messages.5.gz 	        9
Aug  5 22:37:50 iapetus pakfire: CORE UPGR: Upgrading from release 176 to 177
/var/log/messages.4.gz 	        0
/var/log/messages.3.gz 	        31
Aug 18 12:30:42 iapetus pakfire: CORE UPGR: Upgrading from release 177 to 178
/var/log/messages.2.gz 	        0
/var/log/messages.1.gz 	        1
/var/log/messages 	        0


So I am not seeing what you are seeing.
A possible reason for that is your comment

"Interesting item - the stopping / start seem to happen right after a DHCPREQUEST and DHCPACK."

Are your clients running with dynamic IP assignments? All my clients have fixed leases on my mini system.
Comment 9 Jon 2023-09-04 15:30:53 UTC
(In reply to Adolf Belka from comment #8)
> So I am not seeing what you are seeing.
> A possible reason for that is your comment
> 
> "Interesting item - the stopping / start seem to happen right after a
> DHCPREQUEST and DHCPACK."
> 
> Are your clients running with dynamic IP assignments? All my clients have
> fixed leases on my mini system.

Yes.  Most everything in the BLUE is dynamic.  And only a few device in the GREEN are dynamic.
Comment 10 Jon 2023-09-20 19:24:23 UTC
Alexander @bmcs-alex has lots of unbound restarts also...

See:
https://community.ipfire.org/t/squid-stops-working-every-hour-at-x48/10364/5?u=jon
Comment 12 Michael Tremer 2023-10-30 10:19:47 UTC
Hello everyone,

we recently added some changes which make unbound reload its configuration when we need to update the internal local DNS configuration (i.e. a new DHCP lease was issued or has expired).

Technically unbound does not restart, as we only ask it to reload:

> https://git.ipfire.org/?p=ipfire-2.x.git;a=blob;f=config/unbound/unbound-dhcp-leases-bridge;hb=a85e9d4eb1c5a9d5b9bac3d90f59a321e4151871#l524

Internally, it probably makes some sense for unbound to halt the resolver, reload the configuration and then start a new resolver with the new configuration. The log message would then make sense.

However, are you experiencing any problems with DNS resolution or is this just an aesthetic issue with lots of those log messages being shown?
Comment 13 Jon 2023-11-14 00:16:50 UTC
I've been experimenting with DNS and I experience a delay when going to various websites.

Nov 13 17:31:05 ipfire unbound: [31555:0] notice: Restart of unbound 1.18.0.
Nov 13 17:31:05 ipfire unbound: [31555:0] notice: init module 0: respip
Nov 13 17:31:05 ipfire unbound: [31555:0] notice: init module 1: validator
Nov 13 17:31:05 ipfire unbound: [31555:0] notice: init module 2: iterator
Nov 13 17:31:17 ipfire unbound: [31555:0] info: start of service (unbound 1.18.0).

It seems like the delay, in obtaining a response from DNS, coincides with a ~12 second delay between "notice: init module 2: iterator" and "info: start of service (unbound 1.18.0)."

I happened to be monitoring the messages log with:
```
tail -f /var/log/messages | grep -e unbound -e dhcp
``` 
And I was searching for something in a web browser.  the delay on the webboorwser coincided with the unbound restart.

I had been seeing this browser delay for a few weeks but it did not make sense until today...

I'll do some more testing.
Comment 14 Michael Tremer 2023-11-20 15:47:13 UTC
(In reply to Jon from comment #13)
> I've been experimenting with DNS and I experience a delay when going to
> various websites.

Hmm, that would be a problem. I tested these changed on a network that has between 1200 and 1600 clients daily, but it is also running on an Enterprise appliance, so maybe the reload is slightly faster there? Can we rule out it isn't anything else?

Could you maybe run tcpdump and check if there are any DNS queries sent to IPFire which are not being answered until unbound is done reloading?
Comment 15 Jon 2023-11-20 18:03:58 UTC
Yes, I can do that (at least I think I can!) If you have a suggestion for the tcpdump command that would help me since I haven't used tcpdump since the DNS redirect days.


Via a browser I can see delays and I can see an occasional "drop".

The delays happen at every unbound restart.  The drops happen once in a while, maybe once every week or so.
Comment 16 Jon 2023-11-21 23:30:36 UTC
I wrote a script to pick 500 different (random) websites out of 1 million.  I did this because I wanted to avoid the dns caches.

I had to run the `unbound-control reload_keep_cache` to force the reload.  Otherwise this test is like watching paint dry...


I ran this tcpdump command:
```
tcpdump -i green0 host 192.168.60.212 and port 53 -w green_h60.212_p53.pcap
```


Each website is checked this command:

```
time dig +tries=3 +time=5 +stats @192.168.60.1 furchguitars.com
```

And I got this response every once in awhile:

furchguitars.com                                 FAIL: 9      15014

allResults =  
; <<>> DiG 9.10.6 <<>> dig +tries=3 +time=5 +stats @192.168.60.1 furchguitars.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

real	0m15.014s
user	0m0.002s
sys	0m0.003s



and I dug through the tcpdump list and found that `furchguitars.com` started at the time 16:25:46:
```
16:25:46.625255	192.168.60.212	192.168.60.1	DNS	87	Standard query 0x67a1 A furchguitars.com OPT
16:25:51.626709	192.168.60.212	192.168.60.1	DNS	87	Standard query 0x67a1 A furchguitars.com OPT
16:25:56.627352	192.168.60.212	192.168.60.1	DNS	87	Standard query 0x67a1 A furchguitars.com OPT
. . .
16:26:02.287736	192.168.60.1	192.168.60.212	DNS	103	Standard query response 0x67a1 A furchguitars.com A 89.221.213.64 OPT
16:26:02.287818	192.168.60.1	192.168.60.212	DNS	103	Standard query response 0x67a1 A furchguitars.com A 89.221.213.64 OPT
```


The first three commands are the three dig tries (+tries=3).  Each is 5 sec apart (+time=5).

So from the first dig send to the first response is 16 seconds.


I did a ping to make sure it was a real server and that worked:
```
iMac3:dns_speed_test jcmurphy$ ping furchguitars.com
PING furchguitars.com (89.221.213.64): 56 data bytes
64 bytes from 89.221.213.64: icmp_seq=0 ttl=45 time=131.229 ms
64 bytes from 89.221.213.64: icmp_seq=1 ttl=45 time=126.059 ms
64 bytes from 89.221.213.64: icmp_seq=2 ttl=45 time=134.259 ms
64 bytes from 89.221.213.64: icmp_seq=3 ttl=45 time=125.657 ms
^C
--- furchguitars.com ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 125.657/129.301/134.259/3.609 ms
```


and ran dig to make sure it really worked:
```
iMac3:dns_speed_test jcmurphy$ dig furchguitars.com

; <<>> DiG 9.10.6 <<>> furchguitars.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37336
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;furchguitars.com.		IN	A

;; ANSWER SECTION:
furchguitars.com.	581	IN	A	89.221.213.64

;; Query time: 37 msec
;; SERVER: 192.168.60.2#53(192.168.60.2)
;; WHEN: Tue Nov 21 16:46:21 CST 2023
;; MSG SIZE  rcvd: 61

```
Comment 17 Jon 2023-11-21 23:31:50 UTC
Created attachment 1394 [details]
pcap file

Here are the tcpdump lines related to furchguitars
Comment 18 Jon 2023-11-21 23:38:04 UTC
I forgot the unbound log.  

See the restart (service stopped  / start of service) that occurred during the same time as the bad dig command.

```
Nov 21 16:25:50 ipfire unbound: [29467:0] info: service stopped (unbound 1.18.0).
Nov 21 16:25:50 ipfire unbound: [29467:0] info: server stats for thread 0: 12 queries, 6 answers from cache, 6 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Nov 21 16:25:50 ipfire unbound: [29467:0] info: server stats for thread 0: requestlist max 4 avg 1.66667 exceeded 0 jostled 0
Nov 21 16:25:50 ipfire unbound: [29467:0] info: average recursion processing time 0.102877 sec
Nov 21 16:25:50 ipfire unbound: [29467:0] info: histogram of recursion processing times
Nov 21 16:25:50 ipfire unbound: [29467:0] info: [25%]=0.0709973 median[50%]=0.098304 [75%]=0.125611
Nov 21 16:25:50 ipfire unbound: [29467:0] info: lower(secs) upper(secs) recursions
Nov 21 16:25:50 ipfire unbound: [29467:0] info:    0.032768    0.065536 1
Nov 21 16:25:50 ipfire unbound: [29467:0] info:    0.065536    0.131072 3
Nov 21 16:25:50 ipfire unbound: [29467:0] info:    0.131072    0.262144 1
Nov 21 16:25:50 ipfire unbound: [29467:0] notice: Restart of unbound 1.18.0.
Nov 21 16:25:50 ipfire unbound: [29467:0] notice: init module 0: respip
Nov 21 16:25:50 ipfire unbound: [29467:0] notice: init module 1: validator
Nov 21 16:25:50 ipfire unbound: [29467:0] notice: init module 2: iterator
Nov 21 16:26:01 ipfire unbound: [29467:0] info: start of service (unbound 1.18.0).
Nov 21 16:26:01 ipfire unbound: [29467:0] info: generate keytag query _ta-4a5c-4f66. NULL IN
```
Comment 19 Jon 2023-11-22 03:30:41 UTC
dig has +tries and +time (e.g., dig +tries=3 +time=5)

I picked these because of the original "12 second delay" I saw early on.  I was thinking this "might" be similar to unbound.  But I thinking now it may be masking the unbound reset question.

Is there an equivalent setting with unbound?
Comment 20 Jon 2023-11-22 16:53:09 UTC
And another test...

Each website is checked with this (slightly different than last time) command:
```
time dig +stats @192.168.60.1 collegebrowser.com


I got this response:
```
collegebrowser.com                                  FAIL: 9      15010
allResults =  
; <<>> DiG 9.10.6 <<>> +stats @192.168.60.1 collegebrowser.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

real	0m15.010s
user	0m0.002s
sys	0m0.003s


I can ping and dig to make sure the site is OK:
```
[root@ipfire ~] # ping collegebrowser.com
PING collegebrowser.com (72.167.58.94) 56(84) bytes of data.
64 bytes from 94.58.167.72.host.secureserver.net (72.167.58.94): icmp_seq=1 ttl=51 time=67.3 ms
64 bytes from 94.58.167.72.host.secureserver.net (72.167.58.94): icmp_seq=2 ttl=51 time=64.0 ms
64 bytes from 94.58.167.72.host.secureserver.net (72.167.58.94): icmp_seq=3 ttl=51 time=65.0 ms
64 bytes from 94.58.167.72.host.secureserver.net (72.167.58.94): icmp_seq=4 ttl=51 time=63.5 ms
^C
--- collegebrowser.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3002ms
rtt min/avg/max/mdev = 63.546/64.944/67.298/1.454 ms
[root@ipfire ~] # 

[root@ipfire ~] # dig collegebrowser.com

; <<>> DiG 9.16.44 <<>> collegebrowser.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51445
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;collegebrowser.com.		IN	A

;; ANSWER SECTION:
collegebrowser.com.	590	IN	A	72.167.58.94

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Nov 22 10:22:25 CST 2023
;; MSG SIZE  rcvd: 63


Digging through the tcpdump list and found that `collegebrowser.com ` appeared at 10:01:09:
```
312	2023-11-22 10:01:09.294139	192.168.60.212	192.168.60.1	DNS	89	Standard query 0x3ec2 A collegebrowser.com OPT
313	2023-11-22 10:01:14.294642	192.168.60.212	192.168.60.1	DNS	89	Standard query 0x3ec2 A collegebrowser.com OPT
314	2023-11-22 10:01:19.296406	192.168.60.212	192.168.60.1	DNS	89	Standard query 0x3ec2 A collegebrowser.com OPT
. . .
323	2023-11-22 10:01:26.983005	192.168.60.1	192.168.60.212	DNS	105	Standard query response 0x3ec2 A collegebrowser.com A 72.167.58.94 OPT
324	2023-11-22 10:01:26.983077	192.168.60.1	192.168.60.212	DNS	105	Standard query response 0x3ec2 A collegebrowser.com A 72.167.58.94 OPT 


The unbound service stopped  / start of service occurs at the same time as the dig command:

```
Nov 22 10:01:11 ipfire unbound: [29467:0] info: service stopped (unbound 1.18.0).
Nov 22 10:01:11 ipfire unbound: [29467:0] info: server stats for thread 0: 25 queries, 4 answers from cache, 21 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Nov 22 10:01:11 ipfire unbound: [29467:0] info: server stats for thread 0: requestlist max 2 avg 0.857143 exceeded 0 jostled 0
Nov 22 10:01:11 ipfire unbound: [29467:0] info: average recursion processing time 0.519950 sec
Nov 22 10:01:11 ipfire unbound: [29467:0] info: histogram of recursion processing times
Nov 22 10:01:11 ipfire unbound: [29467:0] info: [25%]=0.049152 median[50%]=0.098304 [75%]=0.32768
Nov 22 10:01:11 ipfire unbound: [29467:0] info: lower(secs) upper(secs) recursions
Nov 22 10:01:11 ipfire unbound: [29467:0] info:    0.016384    0.032768 1
Nov 22 10:01:11 ipfire unbound: [29467:0] info:    0.032768    0.065536 8
Nov 22 10:01:11 ipfire unbound: [29467:0] info:    0.065536    0.131072 2
Nov 22 10:01:11 ipfire unbound: [29467:0] info:    0.131072    0.262144 3
Nov 22 10:01:11 ipfire unbound: [29467:0] info:    0.262144    0.524288 4
Nov 22 10:01:11 ipfire unbound: [29467:0] info:    2.000000    4.000000 2
Nov 22 10:01:11 ipfire unbound: [29467:0] notice: Restart of unbound 1.18.0.
Nov 22 10:01:11 ipfire unbound: [29467:0] notice: init module 0: respip
Nov 22 10:01:11 ipfire unbound: [29467:0] notice: init module 1: validator
Nov 22 10:01:11 ipfire unbound: [29467:0] notice: init module 2: iterator
Nov 22 10:01:26 ipfire unbound: [29467:0] info: start of service (unbound 1.18.0).
```
Comment 21 Jon 2023-11-22 16:54:48 UTC
Created attachment 1395 [details]
pcap file for collegebrowser.com

Here is the pcap file
Comment 22 Michael Tremer 2023-11-23 15:57:33 UTC
Hmm this is interesting. Because you are getting the responses from unbound - just quite late. About 15 seconds, which is definitely noticeable.

Should we take this upstream? Or what other ideas do we have?
Comment 23 Jon 2023-11-23 16:37:33 UTC
The two examples above (the pcap files) are fails.  So yes, there is a response but something breaks during the "reload"

I get this in my log:
```
; <<>> DiG 9.10.6 <<>> +stats @192.168.60.1 collegebrowser.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
```

I cannot answer about reporting this to the unbound team.

I am worried this may be caused by this change:
https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=f20ca78eff6e8baeb86361f55adf52819d1bae1f

and how "unbound-dhcp-leases-bridge" interface is written around the reload.

Is there an IPFire python developer that can help??
Comment 24 Michael Tremer 2023-11-23 16:39:18 UTC
(In reply to Jon from comment #23)
> I am worried this may be caused by this change:
> https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;
> h=f20ca78eff6e8baeb86361f55adf52819d1bae1f

It very likely is. However, I am not sure if rolling this back will help us, because that would mean major rework of the leases bridge. So, what is easier to solve now?
Comment 25 Jon 2023-11-23 20:20:32 UTC
Is it more than just replacing the new with the old? (I thought it would be easy to go back one step - shows you how little I know!)

I don't mind going forward, but I don't speak python.  So we'd need someone fluid in pythonese.
Comment 26 Jon 2023-11-25 04:22:35 UTC
for what it is worth...  All of the fixed leases work A-OK with no unbound reloads.

I'm still checking but it looks good so far!
Comment 27 Michael Tremer 2023-11-25 14:40:44 UTC
The problem before was that the bridge was not very good at figuring out which leases needed to be loaded into unbound. It kind of worked okay in some cases, but on larger networks, some random leases were dropped from DNS. That problem is fixed by having unbound reload the entire file every time something changes.
Comment 28 Jon 2023-11-25 16:40:42 UTC
I have to image that someone has solved this problem before. 

We cannot be the only group that uses ISC-DHCP and unbound together.

I'd hope either the unbound group or the dhcp group would have written a nice bridge between the two.  I've been looking but have not been lucky in my searches.

I've read through LOTS of other posts about others having similar issues.  It looks like opensense _may_ have solved it.  but pf sense has similar issues.
Comment 29 Jon 2023-11-30 16:03:27 UTC
I added a note the the Dev mailing list about On Commit/Release/Expiry (CRE) that is already included into ISC DHCP.

Hope it helps!
Comment 30 Jon 2023-12-02 19:33:41 UTC
and I am back to bugzilla...

I wrote a small command to grab the dhcp-leases.conf file and grab the unbound cache and compare them.

```
#!/bin/bash

#	need HOSTNAME and DOMAINNAME value
eval $(/usr/local/bin/readhash /var/ipfire/main/settings)

dhcp_leases=$( sed 's/\"//g ; s/local-data: //g' /etc/unbound/dhcp-leases.conf | sort --version-sort --ignore-case )

uc_list=$( unbound-control list_local_data | grep "${DOMAINNAME}" | sed "/${HOSTNAME}\.${DOMAINNAME}/d" | sort --version-sort --ignore-case  )

diff --side-by-side --width=150 --minimal --ignore-case --ignore-all-space <(echo "$dhcp_leases") <(echo "$uc_list")

```

===

One thing I _think_ I noticed (I need to keep testing!) is sometimes it takes time for unbound to place devices into the unbound cache.  So they appear instantly in dhcp-leases.conf but take a minute or three to appear in the unbound cache.

On one of the networks you Admin, maybe you can run the above to check for differences?


If nothing else, the above code could be added in the bridge (as a "hack"!),  would detect differences, then do a launch of the "reload".  Like I said, a hack!
Comment 31 Jon 2023-12-02 23:19:30 UTC
I dropped a line for the last post.  It adds a period after the FQDNs.


```
dhcp_leases=$( sed .....
dhcp_leases=$( echo "${dhcp_leases}" | sed -E 's/(\.in-addr\.arpa)[[:blank:]]+/\1. /g ; s/(\.localdomain)+/\1./g' )
```
Comment 32 Michael Tremer 2023-12-04 17:37:53 UTC
(In reply to Jon from comment #30)
> If nothing else, the above code could be added in the bridge (as a "hack"!),
> would detect differences, then do a launch of the "reload".  Like I said, a
> hack!

If we go down this path, we should avoid calling any reloads, because:

* We should load the current set of data into Unbound when it starts. We need to do that by parsing the leases file and writing an unbound configuration file with all entries as we do now. More on that below.
* We will then keep unbound up to data by using the DHCP events.

The problem is that we picked Python for this job because we need to parse a rather complicated file and then need to check to which subnet a lease belongs. That is something that is very painful to do in shell. I played around with awk, which might be a middle ground, but as mentioned in a previous email, this is not a great approach. I just want to avoid that it takes a minute to start Unbound if we have to process a thousand leases in shell...
Comment 33 Jon 2023-12-05 01:11:02 UTC
(In reply to Michael Tremer from comment #32)
> (In reply to Jon from comment #30)
> * We should load the current set of data into Unbound when it starts. We
> need to do that by parsing the leases file and writing an unbound
> configuration file with all entries as we do now. More on that below.

I don't think this needs to be done!  Please hear me out.  (Sorry if I am explaining things already known.)

in the /etc/unbound/unbound.conf there are includes files.  And the `/etc/unbound/dhcp-leases.conf` include file is "imported" into unbound (and the unbound cache) every time there is a restart (unbound restart or ipfire reboot).  So `/etc/unbound/dhcp-leases.conf` are already populated on restart and don't need to be re-populated.  We don't need to parse the DHCP leases file.

To test, comment out the `unbound-dhcp-leases-bridge` lines ipfire system and do a restart, you'll see this is correct.  So why parse the lease and re-populate the unbound dhcp-leases.conf file?

Now to get new stuff into the unbound cache (and the /etc/unbound/dhcp-leases.conf file) this is where "unbound-control load-data" comes in.

So if we begin the process with DHCP On Commit/Release/Expire, add A & PTR records to `/etc/unbound/dhcp-leases.conf`, and add A & PTR records to unbound cache via "unbound-control load-data", I think we have a nice system.

With the above there are no periodic unbound reloads/restarts.

Again, this is all a proof-of concept.

Make sense?
Comment 34 Jon 2023-12-05 17:09:06 UTC
Created attachment 1406 [details]
comparison script for testing

FYI - Enclosed is a script to compare the unbound `dhcp-leases.conf` file (generated by Bridge) to the unbound cache (see attached compareMe_v3.sh file).

It does not handle static hosts so those will appear as non-matching.

Jon
Comment 35 Jon 2023-12-06 02:19:41 UTC
Michael - with the current "unbound-dhcp-leases-bridge" code.  How do fixed leases make it into unbound DNS?

I am stuck on this at the moment since the fixed leases appear in the `/etc/unbound/dhcp-leases.conf`file but are missing from the unbound cache.  

And the fix leases are missing from the `/var/state/dhcp/dhcpd.leases`.

I see the right items in `/var/ipfire/dhcp/fixleases`

I did a unbound init restart and a dhcp init restart.

(there is always a really good chance that I screwed up something!)
Comment 36 Jon 2023-12-06 15:45:36 UTC
Ignore Comment 35.  It is me - I figured it out.  

Sorry for the noise.
Comment 37 Michael Tremer 2023-12-07 16:44:59 UTC
(In reply to Jon from comment #35)
> Michael - with the current "unbound-dhcp-leases-bridge" code.  How do fixed
> leases make it into unbound DNS?

We read the configuration that the user manages through the web UI and create entries in unbound. This is not really correct, because they are a static lease and that really does not mean that this should show up in DNS. But users seem to want this.

> I am stuck on this at the moment since the fixed leases appear in the
> `/etc/unbound/dhcp-leases.conf`file but are missing from the unbound cache.  

Here we parse the file and add them as a lease that never expires:

> https://git.ipfire.org/?p=ipfire-2.x.git;a=blob;f=config/unbound/unbound-dhcp-leases-bridge;h=e9f022affa98a2483ca1f41d267fe443b0f048cf;hb=HEAD#l161

> And the fix leases are missing from the `/var/state/dhcp/dhcpd.leases`.

Sadly, there is no way to convince the DHCP server to write the fixed leases to the file. That said, I doubt that we will see events for them.
Comment 38 Jon 2024-02-08 20:30:21 UTC
With the update of unbound 1.19.0 this may have been fixed.  I'll test and report back
Comment 39 Jon 2024-02-08 23:05:29 UTC
Still does restarts.  About 100 restarts per 60 minutes.
Comment 40 Nick Howitt 2024-03-20 18:29:01 UTC
I am new to IPFire and have also noticed this.

If the file /etc/unbound/dhcp-leases.conf is written in a deterministic way, can it initially be written somewhere else like /tmp or /var/tmp or to /etc/unbound/ with a slightly different name. Then compare the new file with the original and only if they are different, copy the file over /etc/unbound/dhcp-leases.conf and restart or reload unbound? Most DHCP actions are renewals which will not change the file.
Comment 41 Nick Howitt 2024-03-20 18:47:39 UTC
I don't speak python but it looks like there is something in /usr/sbin/unbound-dhcp-leases-bridge, write_dhcp_leases(self, leases), which creates a temporary file then copies it over the original. It is called from class UnboundConfigWriter(object) which then goes on to self._control("reload_keep_cache").

What some genius could do is diff the temporary file and original file in write_dhcp_leases(self, leases) and set a flag if they are different before the temp file is copied over the original.

Then in UnboundConfigWriter(object), only call self._control("reload_keep_cache") if the flag is set. (Remember to clear the flag after .....).
Comment 42 Nick Howitt 2024-03-24 20:16:21 UTC
I've proposed a patch to only restart Unbound when the /etc/unbound/dhcp-leases.conf file changes - https://lists.ipfire.org/hyperkitty/list/development@lists.ipfire.org/thread/4RK4PR7M57QLHJXHYFWQ32QUGPPW5PRQ/. It is a minor rework of a couple of functions in the current program. It still requires restarts when the leases change, but not often.

It is nowhere near as cute as Jon's proposal which avoids restarts of unbound altogether, but it completely ditches the current methodology.
Comment 43 Nick Howitt 2024-03-24 20:18:14 UTC
Better link to the patch submission - https://patchwork.ipfire.org/project/ipfire/patch/ec4d6d18-41d3-4ba0-9713-e6a388a3119f@howitts.co.uk/
Comment 44 Jon 2024-03-25 17:51:46 UTC
(In reply to Nick Howitt from comment #42)
> It is nowhere near as cute as Jon's proposal ...
cute?!?  cute???  Nobody wants cute code!!  That is like the kiss of death to have "cute" code!
Comment 45 Nick Howitt 2024-03-25 18:12:45 UTC
Not cute code, I won't judge that (but it it prettier than my coding). It is a cute approach which avoids any restarts and hooks directly into the API's. I think that, as a concept, it is a great way to go. My patch is only a sticking plaster on an approach way inferior to yours.

If my patch gets accepted, I'll move on to your code. From my patch submission emails, I think Michael Tremer just has not had the time to go through your approach
Comment 46 Michael Tremer 2024-03-26 11:00:54 UTC
(In reply to Jon from comment #44)
> (In reply to Nick Howitt from comment #42)
> > It is nowhere near as cute as Jon's proposal ...
> cute?!?  cute???  Nobody wants cute code!!  That is like the kiss of death
> to have "cute" code!

It can't hurt. Well-readable should be a must. Cute is probably a more subjective feeling.

(In reply to Nick Howitt from comment #45)
> Not cute code, I won't judge that (but it it prettier than my coding). It is
> a cute approach which avoids any restarts and hooks directly into the API's.

This is what the bridge used to do, and it became a huge problem if there were thousands of leases going around. There have been systems where the bridge processed used of 20% of a single Intel Xeon core. That is why we changed it.

> https://git.ipfire.org/?p=ipfire-2.x.git;a=commitdiff;h=f20ca78eff6e8baeb86361f55adf52819d1bae1f

> I think that, as a concept, it is a great way to go. My patch is only a
> sticking plaster on an approach way inferior to yours.

That is due to be tested. We right now have the option to update multiple leases at the same time whereas the shell approach would run an update command for each event. That is a lot of forked shell processes and we need to create some database that will store the current state or import it from the DHCP leases file that after a restart we won't have a blank page.

I don't know what the overhead of that will be, but I don't think it will be that much cheaper (if at all) than the bridge.
Comment 47 Jon 2024-03-26 14:33:03 UTC
Using the dhcp Event approach (the Proof-of-Concept) how can I check if multiple request can happen at one time?

The commit/release/expiry comes from the DHCP server.  And the `execute` command below launches a script (the Proof-of-Concept script).

```
#
#	commit
#
on commit {
set ClientIP = binary-to-ascii(10, 8, ".", leased-address);
set NoName = "";
set ClientName = pick-first-value(option host-name, config-option-host-name, client-name, NoName);
log(concat("CRE-Commit:  ClientIP: ", ClientIP, "  Name: ", ClientName));

execute("/root/dhcpdconf/dhcpEvent.sh", "commit", ClientIP, ClientName); }
```

Is there a way to add a different execute command to monitor for multiple requests?
Comment 48 Michael Tremer 2024-03-26 14:53:35 UTC
(In reply to Jon from comment #47)
> Using the dhcp Event approach (the Proof-of-Concept) how can I check if
> multiple request can happen at one time?

The script will only be called once at a time. This is because of dhcpd being single-threaded. That has a lot of implications that we need to consider:

* The process will not be able to do anything else until the script is done. That means that the script needs to be *very* fast. If the script blocks, dhcpd will not be able to even read incoming packets.
* We cannot cover up any performance problems here with running multiple scripts at the same time.
* Client won't wait very for long for a response from the DHCP server. They will retry, but that might make the situation a lot worse on our side since we are then running the script multiple times for the client and no reason.

The code for dhcpd is here:

> https://git.ipfire.org/?p=thirdparty/dhcp.git;a=blob;f=common/execute.c;hb=572032cb0e514606559de3784e3f7ca8e1539d17#l1

> Is there a way to add a different execute command to monitor for multiple
> requests?

I don't know what you are asking...
Comment 49 Jon 2024-03-29 02:23:29 UTC
(In reply to Michael Tremer from comment #48)
> (In reply to Jon from comment #47)
> > Using the dhcp Event approach (the Proof-of-Concept) how can I check if
> > multiple request can happen at one time?
> 
> The script will only be called once at a time. This is because of dhcpd
> being single-threaded. That has a lot of implications that we need to
> consider:
> 
> * The process will not be able to do anything else until the script is done.
> That means that the script needs to be *very* fast. If the script blocks,
> dhcpd will not be able to even read incoming packets.
> * We cannot cover up any performance problems here with running multiple
> scripts at the same time.
> * Client won't wait very for long for a response from the DHCP server. They
> will retry, but that might make the situation a lot worse on our side since
> we are then running the script multiple times for the client and no reason.

Knowing DHCP is single threaded helps!

The Proof-of-Concept script runs in 0.5 seconds.  But it is not elegant code.

I added logger code to get an idea of time and if it blocks DHCP.  Here is a typical log entry:

```
Mar 28 18:15:34 ipfire dhcpd: DHCPDISCOVER from 28:c2:dd:af:c7:33 via blue0
Mar 28 18:15:35 ipfire dhcpd: DHCPOFFER on 192.168.65.114 to 28:c2:dd:af:c7:33 via blue0
Mar 28 18:15:35 ipfire dhcpd: CRE-Commit:  ClientIP: 192.168.65.114  Name: Onelink   begin
Mar 28 18:15:35 ipfire dhcpd: execute_statement argv[0] = /usr/sbin/dhcpEvent
Mar 28 18:15:35 ipfire dhcpd: execute_statement argv[1] = commit
Mar 28 18:15:35 ipfire dhcpd: execute_statement argv[2] = 192.168.65.114
Mar 28 18:15:35 ipfire dhcpd: execute_statement argv[3] = Onelink
Mar 28 18:15:36 ipfire dhcpEv28: DNS Update (RFC2136) is enabled
Mar 28 18:15:36 ipfire dhcpEv28: commit: added A & PTR records to unbound
Mar 28 18:15:36 ipfire dhcpEv28: CRE script: complete - exiting
Mar 28 18:15:36 ipfire dhcpd: CRE-Commit:  ClientIP: 192.168.65.114  Name: Onelink   finish
Mar 28 18:15:36 ipfire dhcpd: DHCPREQUEST for 192.168.65.114 (192.168.65.1) from 28:c2:dd:af:c7:33 via blue0
Mar 28 18:15:36 ipfire dhcpd: DHCPACK on 192.168.65.114 to 28:c2:dd:af:c7:33 (Onelink) via blue0
```

`dhcpd: CRE-Commit . . . begin` is the start of the `dhcp.conf` "on commit" log entry.

`dhcpd: execute_statement argv` is passing info from DHCP to the Proof-of-Concept script.

`dhcpEv28` is the Proof-of-Concept script.  

`dhcpd: CRE-Commit . . . finish` is the end of the DHCP execute.

The total time for the "block" is 2-3 seconds.

I measured using `time dhcpEvent commit <ip addr> <hostname>` and got 0.5 seconds

How fast is "*very* fast"??  Is that OK?
Comment 50 Jon 2024-03-29 02:28:52 UTC
Created attachment 1493 [details]
readme
Comment 51 Jon 2024-03-29 02:29:47 UTC
Created attachment 1494 [details]
dhcpScript
Comment 52 Jon 2024-03-29 02:32:43 UTC
Created attachment 1495 [details]
dhcpd.conf.local config file

Added three files: README.md, dhcpEvent and dhcpd.conf.local.  This is the current PoC version.
Comment 53 Nick Howitt 2024-03-29 09:18:48 UTC
If HOSTNAME is available in scripts, is extraction of the domain using DOMAINNAME=${HOSTNAME#*.*} considered dangerous as it is not reading from file? If it is OK, it could be faster than and eval.
Comment 54 Jon 2024-03-29 16:01:36 UTC
(In reply to Nick Howitt from comment #53)
> If HOSTNAME is available in scripts, is extraction of the domain using
> DOMAINNAME=${HOSTNAME#*.*} considered dangerous as it is not reading from
> file? If it is OK, it could be faster than and eval.

Yes, the HOSTNAME is available.  It is sent from DHCP.  But it arrives as like this:
```
Mar 29 10:53:48 ipfire dhcpd: execute_statement argv[3] = MM2
```

and does not have the DOMAINNAME attached.  The only way I knew of to grab the "official" DOMAINNAME was via the: `eval $(/usr/local/bin/readhash /var/ipfire/main/settings)`
Comment 55 Jon 2024-03-29 16:07:07 UTC
I changed one line to grep-ing a fixed string so that should make this a bit faster.

#	get existing records from unbound-control list_local_data
ucList=$( /usr/sbin/unbound-control list_local_data | 
	/bin/grep --ignore-case --fixed-strings ".${DOMAINNAME}." | 
	expand -t1 )
Comment 56 Nick Howitt 2024-03-29 16:09:26 UTC
AFAIK HOSTNAME is a standard environment variable in many Linux distros:

[root@gateway ~]# echo $HOSTNAME
gateway.howitts.co.uk

On my IPFire it contains the FQDN as above.
Comment 57 Michael Tremer 2024-03-30 13:04:03 UTC
(In reply to Jon from comment #49)
> The total time for the "block" is 2-3 seconds.
> 
> I measured using `time dhcpEvent commit <ip addr> <hostname>` and got 0.5
> seconds
> 
> How fast is "*very* fast"??  Is that OK?

I think we are looking at an executing time of around 100ms. The reason is as follows:

Consider a group of people (let's say 60 for simple maths) enter a room with a wireless access point all having a phone in their pockets that connect simultaneously to the WiFi and therefore will request a DHCP Lease at exactly the same time.

dhcpd will then queue all requests and process them one at a time. If the script would take one second, that would initially mean that the last person is waiting around 60s to get an IP address offered by dhcpd. That is already disruptive.

In practise clients don't wait that long. Since DHCP is UDP-based they would assume that the request was lost and send another request. Considering that dhcpd processes all requests one after the other ordered by when they were received, it is suddenly processing requests that no client is waiting for any more. Whatever the timeout is depends on the client, but dhclient implements three seconds.

That means that if initially all clients send one request, dhcpd will process the first client and respond after 1 second. The second and third client will also get a response. The fourth client will already have timed out and therefore send another request that is going to be appended to the queue. The maths is now becoming huge.

If the queue initially has 60 requests and processes three before all clients time out and send another request, the queue will have grown to 117 requests. After another three seconds we will have responded to another three clients and the remaining 54 have timed out again adding a further 54 requests totalling 171... This will continue until all clients have received a response by what time the queue will have 570 requests queued.

If a new client now connects to the network, it will have to wait almost ten minutes for the DHCP server to respond. The client will long have gone away as most of the others in the first example.

I don't have a good name for this, but I think we can consider this an avalanche effect. Once the script cannot keep up, the situation is running out of control and no client will be able to connect to the network.

This is the huge risk I see with this approach. It will work in small test environments, but IPFire does not only run there. Therefore we need to solve this problem and have a script that is *fast*. With running so many shell commands this might be very difficult.

---

And yes, you can use $HOSTNAME.
Comment 58 Jon 2024-03-30 14:29:43 UTC
excellent.  all of this helps!

So now looking at the overall concept (after all, this is a proof-of-concept): 
- is getting info from DHCP using a Commit/Release/Expiry event   <--this is new
- and processing it with a script (a very very fast script!) 
- and sending it to unbound as  
    • unbound-control local_datas <data>                          <--this is new
    • and an update to the `/etc/unbound/dhcp-leases.conf` file


an OK way to go forward?
Comment 59 Nick Howitt 2024-03-30 16:38:27 UTC
How about this. What is the point of maintaining the /etc/unbound/dhcp-leases.conf file? This can be done initially as a pre-start task of dhcpd virtually using /usr/sbin/unbound-dhcp-leases-bridge, but as it is done as a pre-start task, unbound does not have to be restarted and no listener would be needed for any DNS events? /usr/sbin/unbound-dhcp-leases-bridge already gets run straight after unbound starts. This should be moved to before and then simplified so it is a one-shot task.

Once that happens none of the calls are then needed to maintain the on-the-fly. It is only ever read when dhcpd is started or reloaded so jt only ever needs to be valid at the point that unbound starts. Then Jon can remove all those bits of the his script and just do the calls.

If Jon strips out all his maintenance of the /etc/unbound/dhcp-leases.conf file and just leaves in the calls to the unbound cache, how quick does the processing become?

/usr/sbin/unbound-dhcp-leases-bridge would also then be massively simplified just to write the /etc/unbound/dhcp-leases.conf and exit.
Comment 60 Jon 2024-03-30 17:55:36 UTC
(In reply to Nick Howitt from comment #59)
> How about this. What is the point of maintaining the
> /etc/unbound/dhcp-leases.conf file? This can be done initially as a

It primary purpose is to load unbound DNS on a reboot/restart/power-up.  Otherwise it would have no DNS entries.

does that help?
Comment 61 Nick Howitt 2024-03-30 18:38:23 UTC
So just generate it at startup with a modified /usr/sbin/unbound-dhcp-leases-bridge, run just before the unbound start, rather than after it as it does at the moment. Then pull all the /etc/unbound/dhcp-leases.conf file maintenance from your scripts just leaving the parts related to the unbound cache calls. Then how fast does your script become? It cuts right down on the file i/o.

As a PoC, don't even bother generating the /etc/unbound/dhcp-leases.conf and just test your modified script on its own.
Comment 62 Jon 2024-03-30 19:34:56 UTC
(In reply to Nick Howitt from comment #61)
> So just generate it at startup with a modified
> /usr/sbin/unbound-dhcp-leases-bridge, run just before the unbound start,
> rather than after it as it does at the moment. Then pull all the
> /etc/unbound/dhcp-leases.conf file maintenance from your scripts just
> leaving the parts related to the unbound cache calls. 

I am not sure I understand your proposal.

Since the bash needs to go from more than 500ms to less than 100ms I am not sure the file i/o will make huge difference.  

It would be better to write as perl or python if @ms agrees the overall concept is correct.
Comment 63 Jon 2024-03-30 20:21:36 UTC
(In reply to Nick Howitt from comment #61)
> ...   Then pull all the
> /etc/unbound/dhcp-leases.conf file maintenance from your scripts just
> leaving the parts related to the unbound cache calls. Then how fast does
> your script become? It cuts right down on the file i/o.
> 
> As a PoC, don't even bother generating the /etc/unbound/dhcp-leases.conf and
> just test your modified script on its own.

I commented out the reads, writes, sed's, etc /etc/unbound/dhcp-leases.conf and it saves about 10ms.

This is on a APU4D4.
Comment 64 Nick Howitt 2024-03-30 22:46:01 UTC
Perhaps it is worth timing just the unbound cache commands. That should give you the absolute shortest time possible. If that is too long, then perhaps we are dead.

If those times are OK, then we need to speed up your stuff. You end up with small scripts so could they be rune directly from dhcpd.conf.local rather than calling /usr/sbin/dhcpEvent? Just ideas.
Comment 65 Michael Tremer 2024-04-01 11:29:10 UTC
(In reply to Nick Howitt from comment #64)
> Perhaps it is worth timing just the unbound cache commands. That should give
> you the absolute shortest time possible. If that is too long, then perhaps
> we are dead.

I think that is a good start. We also need to take into consideration that there are people with lots of leases. So the loop that is in the script rewriting the file might take a short moment in testing, but might take several seconds if there are 1000 leases to process.

Option C could be to have the most minimal script that just talks to a daemon that can hold and cache state. For example the bridge.
Comment 66 Nick Howitt 2024-04-01 11:59:34 UTC
I've been trying to do some basic testing and research. Can some things be spead up by avoiding external calls, for example, to get a reverse IP, from https://unix.stackexchange.com/questions/132779/how-to-read-an-ip-address-backwards you can do:

IFS=. read w x y z <<<'192.168.1.1'
printf '%d.%d.%d.%d.in-addr.arpa.' "$z" "$y" "$x" "$w"

or:

echo '192.168.1.1' | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; }

to avoid calling awk. Would that be faster.

Also, from empirical testing, there seems little point in trapping errors before   calling unbound-control. For example, in delete_unbound, perhaps just deleting the IP and reverse IP may be quicker than testing if it is there.

Would it just be faster to delete any A and PTR records before adding them rather than test if they exist
Comment 67 Nick Howitt 2024-04-02 09:36:19 UTC
There is a bug with the this new approach. It also exists in the current approach. The cache auto-expires all entries when it hits the TTL so this should be read from /etc/dhcp/dhcpd.conf every time a lease is renewed or, if it appears, from the renewal event. I also hope that a renewal event kicks off a commit event.

Currently it is hard-coded to 60min in both Jon's update and /usr/sbin/unbound-dhcp-leases-bridge. The issue probably does not hit in /usr/sbin/unbound-dhcp-leases-bridge because it regenerates and restarts so often but I think would come more apparent if someone increased the DHCP TTL. I need to check out my own updates to /etc/unbound/dhcp-leases.conf for the same issue.
Comment 68 Jon 2024-04-02 16:36:02 UTC
(In reply to Nick Howitt from comment #67)
> There is a bug with the this new approach. It also exists in the current
> approach. The cache auto-expires all entries when it hits the TTL so this
> should be read from /etc/dhcp/dhcpd.conf every time a lease is renewed or,
> if it appears, from the renewal event. I also hope that a renewal event
> kicks off a commit event.

I do not see a TTL in the dhcp.conf.  Are you referring to default-lease-time or max-lease-time?

> 
> Currently it is hard-coded to 60min in both Jon's update and
> /usr/sbin/unbound-dhcp-leases-bridge. The issue probably does not hit in
> /usr/sbin/unbound-dhcp-leases-bridge because it regenerates and restarts so
> often but I think would come more apparent if someone increased the DHCP
> TTL. I need to check out my own updates to /etc/unbound/dhcp-leases.conf for
> the same issue.

Are the the DHCP lease times and the unbound DNS TTL times really connected or related??
Comment 69 Nick Howitt 2024-04-02 17:51:00 UTC
I tried looking at the ISC docs and it really isn't clear to me. I was wondering if I could experiment. Both values are used in the UI. The IPF documentation says the default-lease-time and does not mention the max-lease-time.

https://dhcp-users.isc.narkive.com/jeVKUrHW/what-is-the-subtil-difference-between-max-lease-time-and-default-lease-time indicates the default is the one handed out if the client does not request a specific lease time and the max is the maximum DHCPD would allow if a client requested a specific one.

I tried looking at the on-commit and it does not seem to be passed as a parameter, unless I missed it. Then the only option that I can see is to calculate it from /var/state/dhcp/dhcpd.leases but this risks slowing down the script.
Comment 70 Jon 2024-04-02 18:15:55 UTC
(In reply to Nick Howitt from comment #69)
> I tried looking at the ISC docs and it really isn't clear to me. I was
> wondering if I could experiment. 

Of course!

> https://dhcp-users.isc.narkive.com/jeVKUrHW/what-is-the-subtil-difference-
> between-max-lease-time-and-default-lease-time indicates the default is the
> one handed out if the client does not request a specific lease time and the
> max is the maximum DHCPD would allow if a client requested a specific one.

Max is the expiry time.  When I leave home, DHCP sends an "on-expiry" after 2 hours.  My dhcp is set to: "Max lease time (mins): 120"


> I tried looking at the on-commit and it does not seem to be passed as a
> parameter, unless I missed it.

There is not one that I found.  I only found status (like "commit") IP address & hostname.
Comment 71 Jon 2024-04-02 18:20:37 UTC
And a MAC address!
Comment 72 Nick Howitt 2024-04-02 19:05:49 UTC
(In reply to Jon from comment #70)
> (In reply to Nick Howitt from comment #69)
> > I tried looking at the ISC docs and it really isn't clear to me. I was
> > wondering if I could experiment. 
> 
> Of course!
> 
> > https://dhcp-users.isc.narkive.com/jeVKUrHW/what-is-the-subtil-difference-
> > between-max-lease-time-and-default-lease-time indicates the default is the
> > one handed out if the client does not request a specific lease time and the
> > max is the maximum DHCPD would allow if a client requested a specific one.
> 
> Max is the expiry time.  When I leave home, DHCP sends an "on-expiry" after
> 2 hours.  My dhcp is set to: "Max lease time (mins): 120"

That isn't my reading. My reading is that many clients don't request a specific lease time. In that case they get the default lease time. However if the client explicitly requested 10 minutes, it would get 10 minutes and if it requested 10h, it would get max-lease-time if it is less than 10h.

https://dhcp-users.isc.narkive.com/jeVKUrHW/what-is-the-subtil-difference-between-max-lease-time-and-default-lease-time

> 
> 
> > I tried looking at the on-commit and it does not seem to be passed as a
> > parameter, unless I missed it.
> 
> There is not one that I found.  I only found status (like "commit") IP
> address & hostname.
Comment 73 Jon 2024-04-02 19:26:35 UTC
This feels like we may be going off topic...

How is this related to the multiple unbound restarts?
Comment 74 Nick Howitt 2024-04-02 20:00:39 UTC
Multiple restarts are caused by /usr/sbin/unbound-dhcp-leases-bridge updating, effectively, the hosts in in Unbound. I have tried to work around it by updating /usr/sbin/unbound-dhcp-leases-bridge, but you have a better solution (IMHO) of taking the output directly from dhcpd and populating the Unbound cache.

In testing, I have noticed that my records added to the Unbound cache disappear after the the TTL specified. e.g., with my testing simplification of your script:

IP=$1
PTR=$(echo $IP | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; })
DOMAIN=${HOSTNAME#*.*}

echo -e "$2.$DOMAIN"'\n'"$PTR" | /usr/sbin/unbound-control local_datas_remove

echo -e "$2.$DOMAIN 60 IN A $IP"'\n'"$PTR 60 IN PTR $2.$DOMAIN" | /usr/sbin/unbound-control local_datas

Anything I add disappears in 60 minutes. Now, If I had a lease time of 3h and this triggered the add to cache, for 60 mins the A record would appear, but then it would disappear until the lease were renewed. To me, that is a bug.

I am trying to work round it.

Now, if the default-lease-time is set to 60 mins, the bug won't generally be hit as releases are renewed from half the granted lease time onwards. But if the default were set to 8h (the IPF default is very low), then the bug would be continually hit as you will never get a message to renew the hosts entry in time. Similarly, if a client requested a long lease, you'd hit the same problem.

The current methodology works with a 60 minute default works because the hosts file gets regenerated so often but it is not good. It would also suffer if there were very few leases and a long lease time.

Why do you think it is not related to multiple unbound restarts and fixing the issue?
Comment 75 Jon 2024-04-02 20:23:05 UTC
(In reply to Nick Howitt from comment #74)
> Why do you think it is not related to multiple unbound restarts and fixing
> the issue?

Mostly because it was more related to DHCP...

If I understanding what you are saying:  Changing the DHCP "max lease time" from default of 120min (2h) to 3h, the Unbound cache record A (but not record PTR?) disappears from the unbound cache after 60 minutes (the TTL setting in dhcpEvent script)?

I am trying to understand your scenario so I can test also...
Comment 76 Nick Howitt 2024-04-02 20:35:04 UTC
The PTR also disappears.

The 60min TTL is used in both dhcpEvent and /usr/sbin/unbound-dhcp-leases-bridge.

It may be tempting to stick the TTL very high and rely on the dhcpd expiry event to remove the record from Unbound, but it won't be a safe approach for downstream which may have cached the record until the original TTL expires. Downstream won't then re-query Unbound, but Unbound's record may have changed.
Comment 77 Jon 2024-04-02 21:25:43 UTC
I think this is the lease time that could be grabbed with on-commit:
https://github.com/isc-projects/dhcp/blob/master/common/print.c#L979-L981
Comment 78 Nick Howitt 2024-04-03 09:53:06 UTC
Looks like you are right - https://linux.die.net/man/5/dhcp-eval
Comment 79 Nick Howitt 2024-04-03 17:11:00 UTC
Do you know how to use lease-time? I have tried all sorts just outputting to a log file. Most things end up outputting nothing and killing the log lone, or dhcpd refuses to start. The most progress I made was with:

set LeaseTime = encode-int(lease-time, 8);
set PTR = concat (binary-to-ascii (10, 8, ".", reverse (1, leased-address)), ".in-addr.arpa.");
log(concat("CRE-Commit:  ClientIP: ", ClientIP, "  Name: ", ClientName, "  PTR: ", PTR, "  Lease time: ", LeaseTime )); 

But this list gives:
Apr  3 17:57:10 gateway dhcpd: CRE-Commit:  ClientIP: 172.17.2.95  Name: LAPTOP-Rachel  PTR: 95.2.17.172.in-addr.arpa.  Lease time: ^P

The additional paramater of 10 or more stops dhcpd from starting. ChatGPT isn't helping as it suggests:

set lease_time_seconds = option lease-time; but lease-time is not an option. If I correct it, it switches to packet-lease-time and my log line fails.
Comment 80 Nick Howitt 2024-04-03 17:59:56 UTC
Just to add, if you change the encode-int second parameter to 16 you get ^N^P and if you change it to 32 you get a blank.
Comment 81 Jon 2024-04-03 18:11:51 UTC
having similar issues.

just found this thread:
https://lists.isc.org/mailman/htdig/dhcp-users/2011-December/014521.html

and "maybe" this solution:
https://lists.isc.org/mailman/htdig/dhcp-users/2012-January/014593.html
Comment 82 Jon 2024-04-03 18:29:14 UTC
(In reply to Nick Howitt from comment #80)
> Just to add, if you change the encode-int second parameter to 16 you get
> ^N^P and if you change it to 32 you get a blank.

^N^P = 0E10x = 3600

what it your lease time setting?
Comment 83 Jon 2024-04-03 18:40:51 UTC
mine is: either blank or ^Ah

^Ah = 0168 hex = 360 dec = Default lease time (mins): 5

So my number is correct (as a hex) for my 5 minute default time.

But I dont understand why some are blank...
Comment 84 Jon 2024-04-03 18:41:36 UTC
oops not quite correct...
Comment 85 Jon 2024-04-03 19:12:00 UTC
maybe this...

```
set LeaseTime = encode-int(lease-time, 16);
set LeaseTime2 = binary-to-ascii(10, 16, "", LeaseTime);
```
Comment 86 Nick Howitt 2024-04-03 19:30:23 UTC
Well done. I've tried:
set LeaseTime = binary-to-ascii(10, 16, "", encode-int(lease-time, 16));

and it works. I was trying ,, instead of ,"", when you posted.

I am also confused. All the documentation I read indicates the TTL is in seconds, which means the setting used currently in /etc/unbound/dhcp-leases.conf and in your replacement method may be wrong? I'm checking.
Comment 87 Nick Howitt 2024-04-03 19:41:45 UTC
It also means that my premise that the DNS record was expiring on the TTL is incorrect but something is expiring it.
Comment 88 Jon 2024-04-03 19:45:53 UTC
This was the part I did not understand...

I could not re-create what you saw.  What error or issue did you see that hinted this was true?
Comment 89 Nick Howitt 2024-04-03 19:54:57 UTC
I was adding a pair of DNS records with:
IP=$1
PTR=$(echo $IP | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; })
DOMAIN=${HOSTNAME#*.*}
echo -e "$2.$DOMAIN"'\n'"$PTR" | /usr/sbin/unbound-control local_datas_remove
echo -e "$2.$DOMAIN 60 IN A $IP"'\n'"$PTR 60 IN PTR $2.$DOMAIN" | /usr/sbin/unbound-control local_datas

so:
./test-add2 192.168.1.2 xxx

giving:
[root@gateway ~]# /usr/sbin/unbound-control list_local_data | grep xxx
xxx.howitts.co.uk.      60      IN      A       192.168.1.2
2.1.192.168.in-addr.arpa.       60      IN      PTR     xxx.howitts.co.uk.

Then at some point in the future they disappeared. I'll shoot myself quietly if it was an unbound restart that was causing it, which I guess it could have been. I am testing now. I've added a record and I'm watching it in a loop. If it disappears I'll see when, check for a restart and check back.

I am running my updated script so it may not restart....
Comment 90 Nick Howitt 2024-04-03 21:09:48 UTC
Still working. I'm thinking I've led you up the garden path. Apologies. I'll keep the loop running overnight.
Comment 91 Nick Howitt 2024-04-04 09:02:48 UTC
The manually cached entry was still there this morning, so I have led you up the garden path with the TTL issue.

When I restarted Unbound, the cache was flushed and the entry disappeared as I expected. It did not disappear when I restarted dhcp. To me, this then means the /etc/unbound/dhcp-leases.conf only needs to be written as a pre-start task of unbound. It can be maintained on-the-fly by dhcpScript but this puts an unnecessary overhead on dhcp, especially if the emphasis is speed so as to not block dhcpd from issuing leases.

What do others think?

Another question, which is quicker for generating the PTR? it can be done in the  dhcpd.conf.local with:

set PTR = concat (binary-to-ascii (10, 8, ".", reverse (1, leased-address)), ".in-addr.arpa.");

Or in dhcpScript as it is done now calling awk or directly in bash with:

PTR=$(echo $IP | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; })

I suspect the awk method is the slowest. I can try to set up a timing loop for the two methods in dhcpScript later today.
Comment 92 Nick Howitt 2024-04-04 13:31:55 UTC
Here are some timings for calculating the reverse IP over 256 IP's:

IP="192.168.1."
time for i in {0..255} ; do echo ${IP}$i | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; }; done
<snip>
real    0m0.166s
user    0m0.156s
sys     0m0.109s

time for i in {0..255} ; do echo "${IP}$i" | /usr/bin/awk -F. '{print $4 "." $3 "." $2 "." $1 ".in-addr.arpa." }'; done
<snip>
real    0m0.878s
user    0m0.617s
sys     0m0.368s

It is a saving of 0.003s per iteration. Perhaps not too significant, but a small reduction.

I don't know how to time the same thing in dhcpd.conf.local with:

set PTR = concat (binary-to-ascii (10, 8, ".", reverse (1, leased-address)), ".in-addr.arpa.");
Comment 93 Nick Howitt 2024-04-04 14:32:06 UTC
But for extracting the valid part of a hostname, these are near as dammit the same:

test="abcABD-_.khaoih"
time for i in {0..255} ; do expr "$i$test" : '\([A-Za-z0-9.-]*\)'; done
time for i in {0..255} ; do echo "$i$test" | /bin/sed 's/[^A-Za-z0-9.-]/_/g' | /usr/bin/awk -F_ '{print $1}'; done

But this is a touch faster:
time for i in {0..255} ; do echo "$i$test" | /bin/sed 's/[^A-Za-z0-9.-].*//'; done

but not significantly - about 0.0005s per iteration
Comment 94 Nick Howitt 2024-04-04 16:37:27 UTC
if you're allowed to us it for the domain name:
time for i in {0..255} ; do DOMAINNAME=${HOSTNAME#*.*}; echo $DOMAINNAME ; done
<snip>
real    0m0.010s
user    0m0.005s
sys     0m0.004s

vs
time for i in {0..255} ; do eval "$(/usr/local/bin/readhash /var/ipfire/main/settings)"; echo $DOMAINNAME; done
<snip>
real    0m2.485s
user    0m1.700s
sys     0m0.817s
Comment 95 Nick Howitt 2024-04-04 16:43:23 UTC
Do you need the "expand":

[root@gateway ~]# time for i in {0..255} ; do ucList=$( /usr/sbin/unbound-control list_local_data | /bin/grep -i "${DOMAINNAME}" | expand -t1 ) ; done

real    0m1.869s
user    0m1.785s
sys     0m1.141s
[root@gateway ~]# time for i in {0..255} ; do ucList=$( /usr/sbin/unbound-control list_local_data | /bin/grep -i "${DOMAINNAME}") ; done

real    0m1.577s
user    0m1.214s
sys     0m1.062s
Comment 96 Jon 2024-04-04 18:18:45 UTC
Created attachment 1513 [details]
timing

Hey Nick!  Sorry I haven't replied recently.  Just been busy with the weekend holiday and now Tax Day filings!  (ugh!).

I've been doing the same for each command and running time trials (see attached image).  For most everyone I can save near 5-20 ms for a command.  But I have not been able to find "the smoking gun" that is using a big chunk of time.  So as mentioned in Comment 62, I am not sure we can get things down from 500ms to 100 ms without perl (or something faster than bash).

I should be able to get back to this sometime soon!  

Let me know if you find the smoking gun!
Comment 97 Jon 2024-04-04 19:10:46 UTC
(In reply to Nick Howitt from comment #92)
> I don't know how to time the same thing in dhcpd.conf.local with:
> 
> set PTR = concat (binary-to-ascii (10, 8, ".", reverse (1, leased-address)),
> ".in-addr.arpa.");

This could be done.  I just don't know how to measure the execution time if PTR is moved to dhcpd.conf.   If we can find a way to measure (beside the time stamp on a log entry) that would be great!
Comment 98 Nick Howitt 2024-04-04 19:22:54 UTC
I have no idea there.

For IP validation over 500 iterations:
time for i in {0..499} ; do if [[ "$IP$i" =~ [0-9]{1,3}\.[0-9]{1,3}\[0-9]{1,3}\[0-9]{1,3} ]] ; then echo "good" ; else echo "bad"; fi ; done
real    0m0.079s
user    0m0.062s
sys     0m0.018s

vs

time for i in {0..499} ; do if ! /bin/grep --quiet --extended-regexp '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' <<< "${IP}${i}" ; then echo "bad"; else echo "good" ;fi ; done
real    0m1.842s
user    0m1.200s
sys     0m0.672s

Note the original test is negative. My other one is positive but can be negated. Again no smoking gun.

I guess I may have to load in your your code fully rather than do synthetic testing.

How did you do your timings?
Comment 99 Jon 2024-04-04 19:26:26 UTC
(In reply to Nick Howitt from comment #95)
> Do you need the "expand":

Yes, I just dont remember the "why".  It probably had to do the one of the greps on the ucList. I'll review
Comment 100 Jon 2024-04-04 19:35:29 UTC
(In reply to Nick Howitt from comment #98)
> How did you do your timings?


For command line: similar to yours except for 1000:
`time for i in {0..999} ; do DOMAINNAME=${HOSTNAME#*.*}; done`


for dhcpEvent just one time:
`time dhcpEvent commit <ip addr> <hostname>`


For multiple IPs and I time this script as a whole:
```
#!/bin/bash
for y in `seq 101 104`					#	4 x 250 = 1000 lines			
do
	for z in `seq 3 252`				#	250 lines total
	do
		IP="192.168.${y}.${z}"
		/root/dhcpdconf/dhcpEvent_v29b.sh commit "${IP}" "testme${y}${z}"
	done
done
```
Comment 101 Jon 2024-04-04 19:57:54 UTC
One more item:  The last command (the for loop) stresses everything.  Dont run on production.  It will add to unbound cache and /etc/unbound/dhcp-leases.conf
Comment 102 Nick Howitt 2024-04-04 20:00:44 UTC
I made a mistake in my testing and piped some rubbish into /bin/grep. I had to spin up a VM to grab a copy of the file to fix the firewall. I can try to copy over data to test there.
Comment 103 Jon 2024-04-04 20:53:39 UTC
Hmmm.  I may have made an error also.  The smoking gun may this:
```
eval $(/usr/local/bin/readhash /var/ipfire/dhcp/settings)

# Is "DNS Update > Enable DNS Update (RFC2136)" enabled or disabled?
if [[ "${DNS_UPDATE_ENABLED}" != on ]]; then
	/usr/bin/logger --tag dhcpEvent "DNS Update (RFC2136) is NOT enabled.  Bridge may be running"
else
	/usr/bin/logger --tag dhcpEvent "DNS Update (RFC2136) is enabled"
fi

pgrep -f unbound-dhcp-leases-bridge &&
	kill -s SIGTERM $(pgrep -d',' -f unbound-dhcp-leases-bridge) &&
	/usr/bin/logger --tag dhcpEvent "terminated unbound-dhcp-leases-bridge"
```

And secondly this:
```
#	send message to message log
msg_log () {
	if tty --silent ; then
 		echo "${tagName}:" "$*"
	else
		/usr/bin/logger --tag "${tagName}" "$*"
	fi
}
```

time to test!
Comment 104 Jon 2024-04-04 21:24:04 UTC
[root@ipfire ~] # time /root/dhcpdconf/dhcpEvent_v30.sh commit 192.168.200.16 testME20016

real	0m0.404s
user	0m0.270s
sys	0m0.202s


the above got rid of ~100ms.  300ms more to go!
Comment 105 Nick Howitt 2024-04-04 21:36:23 UTC
I had not decoded that part. I am not sure what you see wrong in the first bit or what to expect from DNS_UPDATE_ENABLED, but I don't think Unbound can handle RFC2136.

A killall unbound-dhcp-leases-bridge may be a bit better than a kill with a pgrep.

Does all that execute every time the app runs?

I know nothing about tty so skipped that bit of code, but too much debugging output will slow things.
Comment 106 Jon 2024-04-04 21:53:34 UTC
It does for the current dhcpEvent, but all of this will be removed if the idea is acceptable to the Core Devs.

I was just being paranoid and I wanted to make sure `unbound-dhcp-leases-bridge` was not running at the same time.  And yes, it did run and check every time.  So I've changed the test so that does not happen

It is not worth spending time on to make faster since it will disappear one day.  

But it is noteworthy since it is/was included the the original 500ms.  And this needs to be commented out for time tests:
```
eval $(/usr/local/bin/readhash /var/ipfire/dhcp/settings)

# Is "DNS Update > Enable DNS Update (RFC2136)" enabled or disabled?
if [[ "${DNS_UPDATE_ENABLED}" != on ]]; then
	/usr/bin/logger --tag dhcpEvent "DNS Update (RFC2136) is NOT enabled.  Bridge may be running"

	pgrep -f unbound-dhcp-leases-bridge &&
		kill -s SIGTERM $(pgrep -d',' -f unbound-dhcp-leases-bridge) &&
		/usr/bin/logger --tag dhcpEvent "terminated unbound-dhcp-leases-bridge"
else
 	/usr/bin/logger --tag dhcpEvent "DNS Update (RFC2136) is enabled"
fi
```
Comment 107 Nick Howitt 2024-04-05 11:13:40 UTC
Created attachment 1514 [details]
dhcpEvent updated with speed ups and updating of dhcp-leases.conf removed

I've commented out the "Kill the unbound-dhcp-leases-bridge process" section and added all the speed ups I mentioned. Note the RE I originally proposed to validate the IP address was buggy. THis has the correct version.

I have also removed the lines updating the dhcp-leases.conf file. This should be removed to the Unbound start up as a pre-start task. It hen becomes a one off task rather than a perpetual maintenance task. It is only used to load the cache at the beginning and, thereafter, the cache is maintained by this script.

Now the release is slower than the commit!

[root@ipfire ~]# time /root/dhcpEvent.njh commit 192.168.1.1 xxxx
added 2 datas
dhcpEv28: commit: added A & PTR records to unbound

real    0m0.067s
user    0m0.034s
sys     0m0.040s
[root@ipfire ~]# /usr/sbin/unbound-control list_local_data | grep xxx
1.1.192.168.in-addr.arpa.       60      IN      PTR     xxxx.localdomain.
xxxx.localdomain.       60      IN      A       192.168.1.1
[root@ipfire ~]# time /root/dhcpEvent.njh release 192.168.1.1
removed 2 datas
dhcpEv28: release: references to xxxx.localdomain. removed from unbound

real    0m0.075s
user    0m0.050s
sys     0m0.045s
[root@ipfire ~]# /usr/sbin/unbound-control list_local_data | grep xxx
[root@ipfire ~]# 

This was tested on a 2-core VM with 2GB RAM
Comment 108 Nick Howitt 2024-04-05 11:21:18 UTC
I have a feeling this can leave orphaned PTR records. If you get a lease on your phone, for example, then change the phone's name and refresh the lease on the phone (WiFi off then on, for example), the old A record will get overwritten, but I think the PTR will remain?
Comment 109 Nick Howitt 2024-04-05 14:31:12 UTC
The "revIP" line is not needed in delete_unbound.

unbound-control can take a -q which may have a benefit.

There are a couple of orphaned parameters in the script now e.g unboundDHCPleases and in delete_unbound, revIP, if the unnecessary line is removed.
Comment 110 Jon 2024-04-05 15:07:31 UTC
Created attachment 1515 [details]
dhcpEvent - v30

Here is the current dhcpEvent (v30).  It includes the changes up to April 4.  This is the 300ms version (after msg_log commands commented out).
Comment 111 Jon 2024-04-05 15:13:02 UTC
(In reply to Nick Howitt from comment #109)
> The "revIP" line is not needed in delete_unbound.

How are PTR / `.in-addr.arpa` lines deleted without reverse IP?

> 
> unbound-control can take a -q which may have a benefit.

Got it - I will give it a try!

> There are a couple of orphaned parameters in the script now e.g
> unboundDHCPleases and in delete_unbound, revIP, if the unnecessary line is
> removed.

I do not understand how I can get rid of code related to unbound DHCP leases.  How does `/etc/unbound/dhcp-leases.conf` get updated?
Comment 112 Nick Howitt 2024-04-05 15:22:15 UTC
(In reply to Jon from comment #111)
> (In reply to Nick Howitt from comment #109)
> > The "revIP" line is not needed in delete_unbound.
> 
> How are PTR / `.in-addr.arpa` lines deleted without reverse IP?
> 
Have a look at the code. It is not used. The reverse IP is derived from grepping the ucList for the FQDN
> > 
> > unbound-control can take a -q which may have a benefit.
> 
> Got it - I will give it a try!
> 
I don't think it makes much difference from my observations
> > There are a couple of orphaned parameters in the script now e.g
> > unboundDHCPleases and in delete_unbound, revIP, if the unnecessary line is
> > removed.
> 
> I do not understand how I can get rid of code related to unbound DHCP
> leases.  How does `/etc/unbound/dhcp-leases.conf` get updated?
`/etc/unbound/dhcp-leases.conf` only gets read during the start or reload of Unbound. While unbound is running, changing the file does nothing. As a logical extension, why, then, bother to maintain it while unbound is running? Its generation could then move to the Unbound start routine, /etc/init.d/unbound to run prior to unbound starting. There is a huge amount of pre-processing done in init file, so why not add the file generation there, where it is not so time critical and will rarely get called.
Comment 113 Jon 2024-04-05 16:52:51 UTC
(In reply to Nick Howitt from comment #112)
> (In reply to Jon from comment #111)
> > (In reply to Nick Howitt from comment #109)
> > > The "revIP" line is not needed in delete_unbound.
> > 
> > How are PTR / `.in-addr.arpa` lines deleted without reverse IP?
> > 
> Have a look at the code. It is not used. The reverse IP is derived from
> grepping the ucList for the FQDN
removed - thank you!

> > > unbound-control can take a -q which may have a benefit.
> > 
> > Got it - I will give it a try!
> > 
> I don't think it makes much difference from my observations
added -q and removed `>/dev/null`

> `/etc/unbound/dhcp-leases.conf` only gets read during the start or reload of
> Unbound. While unbound is running, changing the file does nothing. As a
> logical extension, why, then, bother to maintain it while unbound is
> running? Its generation could then move to the Unbound start routine,
> /etc/init.d/unbound to run prior to unbound starting. There is a huge amount
> of pre-processing done in init file, so why not add the file generation
> there, where it is not so time critical and will rarely get called.

You have to show me the code for this one.  I dont think this will work.  Records A & PTR cannot be created on the fly.  That is why `/etc/unbound/dhcp-leases.conf` exists.
Comment 114 Nick Howitt 2024-04-05 17:16:20 UTC
You've missed what I am trying to say. Another script would be needed. It could be based on a cut down /usr/sbin/unbound-dhcp-leases-bridge. It queries the leases file (/var/state/dhcp/dhcpd.leases) and static leases, eliminates anything expired, anything covered by the hosts file and anything without a name on the lease and then generates the /etc/unbound/dhcp-leases.conf. After that it reloads unbound (which we would not need). Problem is I don't speak python. The same thing could probably be scripted in bash but I don't know how to query the information out of the dhcpd.leases file. This should be run during the unbound start up.

So the code does not exist, but generating the /etc/unbound/dhcp-leases.conf is better done as part of the unbound start (IMHO).

Also generating the file during the unbound start removes the risk of the file diverging over time as it gives a full reset.
Comment 115 Bernhard Bitsch 2024-04-05 18:14:39 UTC
Just decided to join this discussion. Didn't read it as a whole, yet.

Therefore my motivation only.
I am examining filtering using the RPZ mechanism of unbound. The block lists are updated by the SOA functionality. I noticed that lists with a longer update period ( usually 1 day ) are not checked. I suppose, the reload of unbound resets the associated timers. This results in deactivation of the update mechanism.
Don't know, whether other SOA based lists lack actuality also.

Jon's concept of syncing DHCP and unbound by the CRE mechanism and unbound-control load-local-data seems a way out of this problem.
Comment 116 Jon 2024-04-06 01:05:06 UTC
Nick - Now I understand better.  But I am not going to touch this aspect of the current unbound bridge.  It is way over my head!

Hey Bernhard!

(In reply to Nick Howitt from comment #108)
> I have a feeling this can leave orphaned PTR records. If you get a lease on
> your phone, for example, then change the phone's name and refresh the lease
> on the phone (WiFi off then on, for example), the old A record will get
> overwritten, but I think the PTR will remain?

I do not see stay PTRs.  Is this with your code from Comment 107 or my code?  Does this still happen?
Comment 117 Nick Howitt 2024-04-06 08:47:44 UTC
I may be able to have a look at the unbound bridge, but I'll be slow!

The thing about orphaned PTR's is hypothetical and I have not tried to demonstrate. It goes roughly like this:

Obtain a lease with a name like Pixel7a e.g. for 192.168.1.10.
You get an A record Pixel7a.domain pointing to 192.168.1.10.
You get a PTR record 10.1.168.192.in-addr.arpa pointing to Pixel7a.domain
Cut the device off and rename it to Dad.
Get a new lease. As the MAC address has not changed, it will get the same IP
You get an A record Dad.domain pointing to 192.168.1.10. This appears to overwrite the current A record.
You get a PTR record, 10.1.168.192.in-addr.arpa pointing to Dad.domain. This does not overwrite the previous PTR record.

BUT I have just tested and I cannot reproduce it so possibly the code handles it. Originally I came across it when I mocked up a few lines of code, but not with this full routine.
Comment 118 Nick Howitt 2024-04-06 13:48:21 UTC
Created attachment 1516 [details]
Script to generate /etc/unbound/dhcp-leases.conf

This is a major hack job on the original /usr/sbin/unbound-dhcp-leases-bridge (not my proposed modified version as my modifications are redundant with this approach). It creates a new /etc/unbound/dhcp-leases.conf and is designed to run during the unbound startup.

I have removed the inotify bits (file monitoring), the perpetual loop and daemonizing elements. It also drops the restart. I have tried to cut out all the bits I can but may have left in some redundant "imports" or other bits so feel free to make any changes.

If you want to test and not overwrite your current /etc/unbound/dhcp-leases.conf, comment out line 487 and uncomment line 488. Modify line 488 to suit.

This script should be run from /etc/init.d/unbound after the write_forward_conf line (574).
Comment 119 Jon 2024-04-06 16:30:19 UTC
I'll keep an eye out for stray PTR also.  So far I have not seen any.  

(In reply to Nick Howitt from comment #117)
> Cut the device off and rename it to Dad.

The code in dhcpEvent looks at a hostname name change.
```
UC_FQDN=$( /bin/grep --word-regexp --ignore-case "^${clientHostname}\.${DOMAINNAME}" <<< "${ucList}" )
if [[ -z "${UC_FQDN}" ]] ; then
#	No, hostname.  So the hostname changed.
	#	delete A & PTR records from unbound
	delete_unbound "${clientIP}"

	#	add new to unbound!
	write-unbound "${clientIP}" "${clientHostname}"
fi
```

so hopefully this handles things correctly!


(In reply to Nick Howitt from comment #118)
> Created attachment 1516 [details]
> Script to generate /etc/unbound/dhcp-leases.conf
> 
> This is a major hack job on the original
> /usr/sbin/unbound-dhcp-leases-bridge (not my proposed modified version as my
> modifications are redundant with this approach). It creates a new
> /etc/unbound/dhcp-leases.conf and is designed to run during the unbound
> startup.

One of the reasons I wrote dhcpEvent was to find a way around parsing the file generated by DHCP @ `/var/state/dhcp/dhcpd.leases`.  So my preference is to stick with `dhcp-leases.conf` being populated with dhcpEvent.
Comment 120 Nick Howitt 2024-04-06 16:59:12 UTC
(In reply to Jon from comment #119)
> I'll keep an eye out for stray PTR also.  So far I have not seen any.  
> 
> (In reply to Nick Howitt from comment #117)
> > Cut the device off and rename it to Dad.
> 
> The code in dhcpEvent looks at a hostname name change.
> ```
> UC_FQDN=$( /bin/grep --word-regexp --ignore-case
> "^${clientHostname}\.${DOMAINNAME}" <<< "${ucList}" )
> if [[ -z "${UC_FQDN}" ]] ; then
> #	No, hostname.  So the hostname changed.
> 	#	delete A & PTR records from unbound
> 	delete_unbound "${clientIP}"
> 
> 	#	add new to unbound!
> 	write-unbound "${clientIP}" "${clientHostname}"
> fi
> ```
> 
> so hopefully this handles things correctly!
> 
> 

I'd missed the hostname change code. Looks like it it is handling it.

> (In reply to Nick Howitt from comment #118)
> > Created attachment 1516 [details]
> > Script to generate /etc/unbound/dhcp-leases.conf
> > 
> > This is a major hack job on the original
> > /usr/sbin/unbound-dhcp-leases-bridge (not my proposed modified version as my
> > modifications are redundant with this approach). It creates a new
> > /etc/unbound/dhcp-leases.conf and is designed to run during the unbound
> > startup.
> 
> One of the reasons I wrote dhcpEvent was to find a way around parsing the
> file generated by DHCP @ `/var/state/dhcp/dhcpd.leases`.  So my preference
> is to stick with `dhcp-leases.conf` being populated with dhcpEvent.
The code already existed to parse the /var/state/dhcp/dhcpd.leases - /usr/sbin/unbound-dhcp-leases-bridge. All I've done, more or less, is remove the daemonising elements so it runs as a one-shot.

By using this you can remove your lines which maintain /etc/unbound/dhcp-leases.conf and gain valuable time in your code execution and execution time is at a premium.

It also stops any possibility of /etc/unbound/dhcp-leases.conf deviating from where it should be, for example on a crash or termination of a routine. The file only ever needs to be correct when unbound starts and it will be with this script. This is the safer and faster option.
Comment 121 Nick Howitt 2024-04-06 17:36:18 UTC
(In reply to Nick Howitt from comment #118)
> Created attachment 1516 [details]
> Script to generate /etc/unbound/dhcp-leases.conf
> 
> This is a major hack job on the original
> /usr/sbin/unbound-dhcp-leases-bridge (not my proposed modified version as my
> modifications are redundant with this approach). It creates a new
> /etc/unbound/dhcp-leases.conf and is designed to run during the unbound
> startup.
> 
> I have removed the inotify bits (file monitoring), the perpetual loop and
> daemonizing elements. It also drops the restart. I have tried to cut out all
> the bits I can but may have left in some redundant "imports" or other bits
> so feel free to make any changes.
> 
> If you want to test and not overwrite your current
> /etc/unbound/dhcp-leases.conf, comment out line 487 and uncomment line 488.
> Modify line 488 to suit.
> 
> This script should be run from /etc/init.d/unbound after the
> write_forward_conf line (574).

The code should also be run on a reload, so after write_hosts_conf on line 598.
Comment 122 Jon 2024-04-15 22:21:29 UTC
Created attachment 1524 [details]
Zip file for dhcpEvent

Update dhcpd.conf.local
- Added commit check for a hostname.  If there is no commit hostname, then do not execute dhcpEvent script.  (small speed improvement)
- General clean-up of config file.

Update dhcpEvent
- removed hostname check for commit.  Now done as part of "on commit" in dhcpd.conf (small speed improvement)
- General clean up.
Comment 123 Jon 2024-04-16 19:17:46 UTC
Michael - I need your input.

I'd like to keep moving forward.  And I am thinking of re-writing the bash script into perl.  And I don't want to go forward without your approval.

So, do you agree with the proof-of-concept and using the DHCP on-commit / on-release / on-expire functions to send IP address and Hostname to Unbound DNS?

If yes, are you OK with re-writing the dchpEvent using perl?  (I have not written anything using perl so this will be a bit of a learning curve)
Comment 124 Nick Howitt 2024-04-16 19:35:16 UTC
I think we also need input on 2 other issues:

1 - can we use `DOMAINNAME=${HOSTNAME#*.*}` using the built-in HOSTNAME do derive the DOMAINNAME or do we have to do `eval "$(/usr/local/bin/readhash /var/ipfire/main/settings)"` which is much slower (in bash), so I assume similar in perl.

2 - does the dhcp-leases file have to be maintained during cache updates? As far as I can see, the file is only ever read on startup of unbound so a cut down version of /usr/sbin/unbound-dhcp-leases-bridge could be run as a pre-start and pre-reload task in init.d/unbound to generate the dhcp-leases file. All the script needs to do is regenerate dhcp-leases with out any restarts. I have attached a working cut down version of the script which removes all deamonising elements and restarts. It also removes the need for my mods which I submitted a couple of weeks ago. It runs very quickly on my machine and makes no noticeable difference to the start up time of unbound. It also, then, removes any requirement to maintain dhcp-leases when doing cache updates, speeding them up.
Comment 125 Jon 2024-04-22 16:21:27 UTC
(In reply to Jon from comment #123)
> Michael - I need your input.
> 
> I'd like to keep moving forward.  And I am thinking of re-writing the bash
> script into perl.  And I don't want to go forward without your approval.
> 
> So, do you agree with the proof-of-concept and using the DHCP on-commit /
> on-release / on-expire functions to send IP address and Hostname to Unbound
> DNS?
> 
> If yes, are you OK with re-writing the dchpEvent using perl?  (I have not
> written anything using perl so this will be a bit of a learning curve)


Michael - are you OK with me going forward with a perl script to do what dhcpEvent.sh does?
Comment 126 Michael Tremer 2024-04-23 11:28:24 UTC
(In reply to Jon from comment #125)
> Michael - are you OK with me going forward with a perl script to do what
> dhcpEvent.sh does?

Erm, why Perl?
Comment 127 Jon 2024-04-23 13:42:07 UTC
It seems to be MUCH faster than bash and had the possibility of getting from 300 ms (w/ bash) down to 100 ms.  And I did not want to learn C.
Comment 128 Jon 2024-04-23 13:43:46 UTC
and I have someone to help me write perl!
Comment 129 Nick Howitt 2024-04-23 13:51:17 UTC
When I ran your dhcpEvent in bash but removed the updates to the /etc/unbound/dhcp-leases.conf and all your pre-amble at the top to kill the unbound-dhcp-leases-bridge process and used my alt method for deriving the domain name, my dhcpEvent took about 70 ms to execute.
Comment 130 Bernhard Bitsch 2024-04-23 16:03:24 UTC
I suppose Nick's changes have found their way into CU185.
The restarts are really minimised, but the transfer of new DHCP info needs a reload nevertheless.
This resets the update timers for RPZ list, for example.
Jon's approach with unbound-control avoids this.
But it is necessary to separate the DHCP events handling from the update process:
- the event writes the parameters to a pipe
- the bridge process works as a server reading this pipe

My idea.
Comment 131 Nick Howitt 2024-04-23 16:08:56 UTC
Hmm. I never received any feedback for my patch. I was going to chase it again on Thursday, 28 days after I submitted it and I have not updated to CU185 to see if it is there.

However, Jon's approach is way better and mine should only be considered an interim fix. I'll update my machine to see if my fix has been incorporated.
Comment 132 Jon 2024-04-23 22:11:51 UTC
I did not locate any recent changes to the bridge.  The last change I found is July 2023.

https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=f20ca78eff6e8baeb86361f55adf52819d1bae1f


But I do see an update to Unbound:

https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=4e57a4f31ac0b3bebac4fd43ad1dd282a4ded352

Maybe that changed the "restarts"
Comment 133 Nick Howitt 2024-04-24 09:44:10 UTC
The only way that could affect the restarts is if it no longer wrote DHCP leases to the cache at /var/state/dhcp/dhcpd.leases

I can't test for the moment as my test system is bust and won't do DNS so I can't update it and I've had to decommission my production system to test a work related issue on another firewall.