I went digging through the messages log (old a new) and normally I see a few "unbound: [31414:0] info: start of service" messages every week. Less than 10 per week. Aug 2 13:09:29 ipfire unbound: [31414:0] info: start of service (unbound 1.17.1). But after upgrading from cu 177 to CU 178 I see thousands per week. ``` -rw-rw-r-- 1 root syslogd 2508718 Aug 6 00:01 /var/log/messages.3.gz Aug 5 20:35:02 ipfire pakfire: CORE UPGR: Upgrading from release 176 to 177 Aug 2 13:09:29 ipfire unbound: [31414:0] info: start of service (unbound 1.17.1). Aug 2 13:14:05 ipfire unbound: [32097:0] info: start of service (unbound 1.17.1). Aug 2 13:46:10 ipfire unbound: [2004:0] info: start of service (unbound 1.17.1). Aug 2 13:51:27 ipfire unbound: [2880:0] info: start of service (unbound 1.17.1). Aug 2 17:10:31 ipfire unbound: [1923:0] info: start of service (unbound 1.17.1). Aug 2 17:10:58 ipfire unbound: [1923:0] info: start of service (unbound 1.17.1). Aug 5 20:42:07 ipfire unbound: [1923:0] info: start of service (unbound 1.17.1). Aug 5 20:52:42 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1). Aug 5 20:53:06 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1). Aug 5 20:53:42 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1). Aug 5 20:53:59 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1). Aug 5 20:54:17 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1). Aug 5 21:02:28 ipfire unbound: [2058:0] info: start of service (unbound 1.17.1). ``` Here is the count of "start of service" messages. I just from a few to over 3000 per week: ``` -rw-rw-r-- 1 root syslogd 2211885 Jul 23 00:01 /var/log/messages.5.gz 0 -rw-rw-r-- 1 root syslogd 2318649 Jul 30 00:01 /var/log/messages.4.gz 4 -rw-rw-r-- 1 root syslogd 2508718 Aug 6 00:01 /var/log/messages.3.gz Aug 5 20:35:02 ipfire pakfire: CORE UPGR: Upgrading from release 176 to 177 73 -rw-rw-r-- 1 root syslogd 4108718 Aug 13 00:01 /var/log/messages.2.gz 3559 -rw-rw-r-- 1 root syslogd 3927388 Aug 20 00:01 /var/log/messages.1.gz Aug 18 18:52:51 ipfire pakfire: CORE UPGR: Upgrading from release 177 to 178 3773 -rw-rw-r-- 1 root syslogd 35421724 Aug 26 15:01 /var/log/messages 3512 [root@ipfire ~] # ``` Tested on: APU4D4 IPFire 2.27 (x86_64) - Core-Update 178 (stable)
my bad - this should state: But after upgrading from cu 176 to CU 177 I see thousands per week.
maybe related to: https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=f20ca78eff6e8baeb86361f55adf52819d1bae1f
I have just checked my logs and I see some but not very many most of the weeks and there are some weeks with zero. /var/log/messages.9.gz 15 /var/log/messages.8.gz 0 /var/log/messages.7.gz 2 /var/log/messages.6.gz 0 /var/log/messages.5.gz 12 /var/log/messages.4.gz 9 /var/log/messages.3.gz 0 /var/log/messages.2.gz 31 /var/log/messages.1.gz 0 /var/log/messages 0
Ugh! that is not good (for me!). I see this on my production and high number on my test APU also. /var/log/messages.9.gz 2 /var/log/messages.8.gz 0 /var/log/messages.7.gz Jul 13 20:09:56 ipfire pakfire: CORE UPGR: Upgrading from release 175 to 176 2 /var/log/messages.6.gz 0 /var/log/messages.5.gz 4 /var/log/messages.4.gz Aug 5 20:35:02 ipfire pakfire: CORE UPGR: Upgrading from release 176 to 177 73 /var/log/messages.3.gz 3559 /var/log/messages.2.gz Aug 18 18:52:51 ipfire pakfire: CORE UPGR: Upgrading from release 177 to 178 3773 /var/log/messages.1.gz 3718 /var/log/messages 850
here is my test box APU4D4 IPFire 2.27 (x86_64) - Core-Update 179 Development Build: master/b5d85855 (just formatted differently) /var/log/messages.9.gz 0 /var/log/messages.8.gz 2 /var/log/messages.7.gz 2 Jul 11 15:42:55 ipfireAPU pakfire: CORE UPGR: Upgrading from release 175 to 176 /var/log/messages.6.gz 2 /var/log/messages.5.gz 0 /var/log/messages.4.gz 484 Jul 30 13:09:21 ipfireAPU pakfire: CORE UPGR: Upgrading from release 175 to 177 /var/log/messages.3.gz 417 /var/log/messages.2.gz 493 Aug 14 21:44:42 ipfireAPU pakfire: CORE UPGR: Upgrading from release 176 to 178 /var/log/messages.1.gz 856 Aug 21 11:43:32 ipfireAPU pakfire: CORE UPGR: Upgrading from release 177 to 179 /var/log/messages 117
Adolf & all, I found a ipfire box that has CU 174 on it and not too many unbound stop/start messages. [root@ipfireAPU2 ~] # find /var/log -iname "messages*" | sort -Vr | tail -10 | xargs -I % bash -c 'echo ; printf "% \t" % ; grep --color -c "start of service" <(zcat -f %) ; grep --color "CORE UPGR:" <(zcat -f %)' /var/log/messages.8.gz 2 /var/log/messages.7.gz 4 /var/log/messages.6.gz 4 /var/log/messages.5.gz 10 /var/log/messages.4.gz 14 /var/log/messages.3.gz 2 /var/log/messages.2.gz 2 /var/log/messages.1.gz 0 /var/log/messages 2 [root@ipfireAPU2 ~] # I want to walk the Updates from CU 174 to CU 176 and let it sit for a day. Then walk the Update from CU 176 to CU 177. And let it sit again to check for unbound stop/start messages. Do you know a way to do this?
Same issue with a 3rd ipfire device. Today I Updated from CU 174 to CU 178. The update happened only ~2 hours so not too many issues. Interesting item - the stopping / start seem to happen right after a DHCPREQUEST and DHCPACK. ===== ``` [root@ipfireAPU2 ~] # find /var/log -iname "messages*" | sort -Vr | tail -1 | xargs -I % bash -c 'echo ; printf "% \n" % ; grep --color -e dhcp -e "CORE UPGR:" -e "start of service" -e reboot -e restart -e "service stopped" <(zcat -f %)' /var/log/messages . . . Aug 28 16:03:20 ipfireAPU2 pakfire: CORE UPGR: Upgrading from release 174 to 178 . . . Aug 28 16:23:33 ipfireAPU2 shutdown[23875]: shutting down for system reboot . . . Aug 28 16:26:04 ipfireAPU2 syslogd 1.5.1: restart (remote reception). Aug 28 16:26:04 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1). Aug 28 16:27:15 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1). Aug 28 16:27:15 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1). Aug 28 16:27:43 ipfireAPU2 dhcpd: Wrote 5 leases to leases file. Aug 28 16:27:43 ipfireAPU2 dhcpd: Server starting service. Aug 28 16:27:44 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1). Aug 28 16:27:45 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1). Aug 28 16:27:51 ipfireAPU2 dhcpd: DHCPDISCOVER from d8:eb:97:b8:03:08 (iMac3) via green0 Aug 28 16:27:51 ipfireAPU2 dhcpd: DHCPOFFER on 192.168.7.100 to d8:eb:97:b8:03:08 (iMac3) via green0 Aug 28 16:27:52 ipfireAPU2 dhcpd: DHCPREQUEST for 192.168.7.100 (192.168.7.1) from d8:eb:97:b8:03:08 (iMac3) via green0 Aug 28 16:27:52 ipfireAPU2 dhcpd: DHCPACK on 192.168.7.100 to d8:eb:97:b8:03:08 (iMac3) via green0 Aug 28 16:27:55 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1). Aug 28 16:27:56 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1). Aug 28 16:57:52 ipfireAPU2 dhcpd: DHCPREQUEST for 192.168.7.100 from d8:eb:97:b8:03:08 (iMac3) via green0 Aug 28 16:57:52 ipfireAPU2 dhcpd: DHCPACK on 192.168.7.100 to d8:eb:97:b8:03:08 (iMac3) via green0 Aug 28 16:57:56 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1). Aug 28 16:57:57 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1). Aug 28 17:57:52 ipfireAPU2 dhcpd: DHCPREQUEST for 192.168.7.100 from d8:eb:97:b8:03:08 (iMac3) via green0 Aug 28 17:57:52 ipfireAPU2 dhcpd: Wrote 5 leases to leases file. Aug 28 17:57:52 ipfireAPU2 dhcpd: DHCPACK on 192.168.7.100 to d8:eb:97:b8:03:08 (iMac3) via green0 Aug 28 17:57:53 ipfireAPU2 unbound: [3138:0] info: service stopped (unbound 1.17.1). Aug 28 17:57:53 ipfireAPU2 unbound: [3138:0] info: start of service (unbound 1.17.1). [root@ipfireAPU2 ~] # ```
I just ran your command on my min appliance system and got the following output. /var/log/messages.19.gz 2 /var/log/messages.18.gz 0 /var/log/messages.17.gz 0 /var/log/messages.16.gz 0 /var/log/messages.15.gz 7 /var/log/messages.14.gz 2 /var/log/messages.13.gz 0 /var/log/messages.12.gz 5 Jun 13 11:51:09 iapetus pakfire: CORE UPGR: Upgrading from release 174 to 175 /var/log/messages.11.gz 9 /var/log/messages.10.gz 15 /var/log/messages.9.gz 0 /var/log/messages.8.gz 2 Jul 11 17:54:08 iapetus pakfire: CORE UPGR: Upgrading from release 175 to 176 /var/log/messages.7.gz 0 /var/log/messages.6.gz 12 /var/log/messages.5.gz 9 Aug 5 22:37:50 iapetus pakfire: CORE UPGR: Upgrading from release 176 to 177 /var/log/messages.4.gz 0 /var/log/messages.3.gz 31 Aug 18 12:30:42 iapetus pakfire: CORE UPGR: Upgrading from release 177 to 178 /var/log/messages.2.gz 0 /var/log/messages.1.gz 1 /var/log/messages 0 So I am not seeing what you are seeing. A possible reason for that is your comment "Interesting item - the stopping / start seem to happen right after a DHCPREQUEST and DHCPACK." Are your clients running with dynamic IP assignments? All my clients have fixed leases on my mini system.
(In reply to Adolf Belka from comment #8) > So I am not seeing what you are seeing. > A possible reason for that is your comment > > "Interesting item - the stopping / start seem to happen right after a > DHCPREQUEST and DHCPACK." > > Are your clients running with dynamic IP assignments? All my clients have > fixed leases on my mini system. Yes. Most everything in the BLUE is dynamic. And only a few device in the GREEN are dynamic.
Alexander @bmcs-alex has lots of unbound restarts also... See: https://community.ipfire.org/t/squid-stops-working-every-hour-at-x48/10364/5?u=jon
Also see: https://community.ipfire.org/t/unbound-starts-and-stops-a-lot-for-unknown-reason/10425/1
Hello everyone, we recently added some changes which make unbound reload its configuration when we need to update the internal local DNS configuration (i.e. a new DHCP lease was issued or has expired). Technically unbound does not restart, as we only ask it to reload: > https://git.ipfire.org/?p=ipfire-2.x.git;a=blob;f=config/unbound/unbound-dhcp-leases-bridge;hb=a85e9d4eb1c5a9d5b9bac3d90f59a321e4151871#l524 Internally, it probably makes some sense for unbound to halt the resolver, reload the configuration and then start a new resolver with the new configuration. The log message would then make sense. However, are you experiencing any problems with DNS resolution or is this just an aesthetic issue with lots of those log messages being shown?
I've been experimenting with DNS and I experience a delay when going to various websites. Nov 13 17:31:05 ipfire unbound: [31555:0] notice: Restart of unbound 1.18.0. Nov 13 17:31:05 ipfire unbound: [31555:0] notice: init module 0: respip Nov 13 17:31:05 ipfire unbound: [31555:0] notice: init module 1: validator Nov 13 17:31:05 ipfire unbound: [31555:0] notice: init module 2: iterator Nov 13 17:31:17 ipfire unbound: [31555:0] info: start of service (unbound 1.18.0). It seems like the delay, in obtaining a response from DNS, coincides with a ~12 second delay between "notice: init module 2: iterator" and "info: start of service (unbound 1.18.0)." I happened to be monitoring the messages log with: ``` tail -f /var/log/messages | grep -e unbound -e dhcp ``` And I was searching for something in a web browser. the delay on the webboorwser coincided with the unbound restart. I had been seeing this browser delay for a few weeks but it did not make sense until today... I'll do some more testing.
(In reply to Jon from comment #13) > I've been experimenting with DNS and I experience a delay when going to > various websites. Hmm, that would be a problem. I tested these changed on a network that has between 1200 and 1600 clients daily, but it is also running on an Enterprise appliance, so maybe the reload is slightly faster there? Can we rule out it isn't anything else? Could you maybe run tcpdump and check if there are any DNS queries sent to IPFire which are not being answered until unbound is done reloading?
Yes, I can do that (at least I think I can!) If you have a suggestion for the tcpdump command that would help me since I haven't used tcpdump since the DNS redirect days. Via a browser I can see delays and I can see an occasional "drop". The delays happen at every unbound restart. The drops happen once in a while, maybe once every week or so.
I wrote a script to pick 500 different (random) websites out of 1 million. I did this because I wanted to avoid the dns caches. I had to run the `unbound-control reload_keep_cache` to force the reload. Otherwise this test is like watching paint dry... I ran this tcpdump command: ``` tcpdump -i green0 host 192.168.60.212 and port 53 -w green_h60.212_p53.pcap ``` Each website is checked this command: ``` time dig +tries=3 +time=5 +stats @192.168.60.1 furchguitars.com ``` And I got this response every once in awhile: furchguitars.com FAIL: 9 15014 allResults = ; <<>> DiG 9.10.6 <<>> dig +tries=3 +time=5 +stats @192.168.60.1 furchguitars.com ; (1 server found) ;; global options: +cmd ;; connection timed out; no servers could be reached real 0m15.014s user 0m0.002s sys 0m0.003s and I dug through the tcpdump list and found that `furchguitars.com` started at the time 16:25:46: ``` 16:25:46.625255 192.168.60.212 192.168.60.1 DNS 87 Standard query 0x67a1 A furchguitars.com OPT 16:25:51.626709 192.168.60.212 192.168.60.1 DNS 87 Standard query 0x67a1 A furchguitars.com OPT 16:25:56.627352 192.168.60.212 192.168.60.1 DNS 87 Standard query 0x67a1 A furchguitars.com OPT . . . 16:26:02.287736 192.168.60.1 192.168.60.212 DNS 103 Standard query response 0x67a1 A furchguitars.com A 89.221.213.64 OPT 16:26:02.287818 192.168.60.1 192.168.60.212 DNS 103 Standard query response 0x67a1 A furchguitars.com A 89.221.213.64 OPT ``` The first three commands are the three dig tries (+tries=3). Each is 5 sec apart (+time=5). So from the first dig send to the first response is 16 seconds. I did a ping to make sure it was a real server and that worked: ``` iMac3:dns_speed_test jcmurphy$ ping furchguitars.com PING furchguitars.com (89.221.213.64): 56 data bytes 64 bytes from 89.221.213.64: icmp_seq=0 ttl=45 time=131.229 ms 64 bytes from 89.221.213.64: icmp_seq=1 ttl=45 time=126.059 ms 64 bytes from 89.221.213.64: icmp_seq=2 ttl=45 time=134.259 ms 64 bytes from 89.221.213.64: icmp_seq=3 ttl=45 time=125.657 ms ^C --- furchguitars.com ping statistics --- 4 packets transmitted, 4 packets received, 0.0% packet loss round-trip min/avg/max/stddev = 125.657/129.301/134.259/3.609 ms ``` and ran dig to make sure it really worked: ``` iMac3:dns_speed_test jcmurphy$ dig furchguitars.com ; <<>> DiG 9.10.6 <<>> furchguitars.com ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37336 ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 1232 ;; QUESTION SECTION: ;furchguitars.com. IN A ;; ANSWER SECTION: furchguitars.com. 581 IN A 89.221.213.64 ;; Query time: 37 msec ;; SERVER: 192.168.60.2#53(192.168.60.2) ;; WHEN: Tue Nov 21 16:46:21 CST 2023 ;; MSG SIZE rcvd: 61 ```
Created attachment 1394 [details] pcap file Here are the tcpdump lines related to furchguitars
I forgot the unbound log. See the restart (service stopped / start of service) that occurred during the same time as the bad dig command. ``` Nov 21 16:25:50 ipfire unbound: [29467:0] info: service stopped (unbound 1.18.0). Nov 21 16:25:50 ipfire unbound: [29467:0] info: server stats for thread 0: 12 queries, 6 answers from cache, 6 recursions, 0 prefetch, 0 rejected by ip ratelimiting Nov 21 16:25:50 ipfire unbound: [29467:0] info: server stats for thread 0: requestlist max 4 avg 1.66667 exceeded 0 jostled 0 Nov 21 16:25:50 ipfire unbound: [29467:0] info: average recursion processing time 0.102877 sec Nov 21 16:25:50 ipfire unbound: [29467:0] info: histogram of recursion processing times Nov 21 16:25:50 ipfire unbound: [29467:0] info: [25%]=0.0709973 median[50%]=0.098304 [75%]=0.125611 Nov 21 16:25:50 ipfire unbound: [29467:0] info: lower(secs) upper(secs) recursions Nov 21 16:25:50 ipfire unbound: [29467:0] info: 0.032768 0.065536 1 Nov 21 16:25:50 ipfire unbound: [29467:0] info: 0.065536 0.131072 3 Nov 21 16:25:50 ipfire unbound: [29467:0] info: 0.131072 0.262144 1 Nov 21 16:25:50 ipfire unbound: [29467:0] notice: Restart of unbound 1.18.0. Nov 21 16:25:50 ipfire unbound: [29467:0] notice: init module 0: respip Nov 21 16:25:50 ipfire unbound: [29467:0] notice: init module 1: validator Nov 21 16:25:50 ipfire unbound: [29467:0] notice: init module 2: iterator Nov 21 16:26:01 ipfire unbound: [29467:0] info: start of service (unbound 1.18.0). Nov 21 16:26:01 ipfire unbound: [29467:0] info: generate keytag query _ta-4a5c-4f66. NULL IN ```
dig has +tries and +time (e.g., dig +tries=3 +time=5) I picked these because of the original "12 second delay" I saw early on. I was thinking this "might" be similar to unbound. But I thinking now it may be masking the unbound reset question. Is there an equivalent setting with unbound?
And another test... Each website is checked with this (slightly different than last time) command: ``` time dig +stats @192.168.60.1 collegebrowser.com I got this response: ``` collegebrowser.com FAIL: 9 15010 allResults = ; <<>> DiG 9.10.6 <<>> +stats @192.168.60.1 collegebrowser.com ; (1 server found) ;; global options: +cmd ;; connection timed out; no servers could be reached real 0m15.010s user 0m0.002s sys 0m0.003s I can ping and dig to make sure the site is OK: ``` [root@ipfire ~] # ping collegebrowser.com PING collegebrowser.com (72.167.58.94) 56(84) bytes of data. 64 bytes from 94.58.167.72.host.secureserver.net (72.167.58.94): icmp_seq=1 ttl=51 time=67.3 ms 64 bytes from 94.58.167.72.host.secureserver.net (72.167.58.94): icmp_seq=2 ttl=51 time=64.0 ms 64 bytes from 94.58.167.72.host.secureserver.net (72.167.58.94): icmp_seq=3 ttl=51 time=65.0 ms 64 bytes from 94.58.167.72.host.secureserver.net (72.167.58.94): icmp_seq=4 ttl=51 time=63.5 ms ^C --- collegebrowser.com ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3002ms rtt min/avg/max/mdev = 63.546/64.944/67.298/1.454 ms [root@ipfire ~] # [root@ipfire ~] # dig collegebrowser.com ; <<>> DiG 9.16.44 <<>> collegebrowser.com ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51445 ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 1232 ;; QUESTION SECTION: ;collegebrowser.com. IN A ;; ANSWER SECTION: collegebrowser.com. 590 IN A 72.167.58.94 ;; Query time: 0 msec ;; SERVER: 127.0.0.1#53(127.0.0.1) ;; WHEN: Wed Nov 22 10:22:25 CST 2023 ;; MSG SIZE rcvd: 63 Digging through the tcpdump list and found that `collegebrowser.com ` appeared at 10:01:09: ``` 312 2023-11-22 10:01:09.294139 192.168.60.212 192.168.60.1 DNS 89 Standard query 0x3ec2 A collegebrowser.com OPT 313 2023-11-22 10:01:14.294642 192.168.60.212 192.168.60.1 DNS 89 Standard query 0x3ec2 A collegebrowser.com OPT 314 2023-11-22 10:01:19.296406 192.168.60.212 192.168.60.1 DNS 89 Standard query 0x3ec2 A collegebrowser.com OPT . . . 323 2023-11-22 10:01:26.983005 192.168.60.1 192.168.60.212 DNS 105 Standard query response 0x3ec2 A collegebrowser.com A 72.167.58.94 OPT 324 2023-11-22 10:01:26.983077 192.168.60.1 192.168.60.212 DNS 105 Standard query response 0x3ec2 A collegebrowser.com A 72.167.58.94 OPT The unbound service stopped / start of service occurs at the same time as the dig command: ``` Nov 22 10:01:11 ipfire unbound: [29467:0] info: service stopped (unbound 1.18.0). Nov 22 10:01:11 ipfire unbound: [29467:0] info: server stats for thread 0: 25 queries, 4 answers from cache, 21 recursions, 0 prefetch, 0 rejected by ip ratelimiting Nov 22 10:01:11 ipfire unbound: [29467:0] info: server stats for thread 0: requestlist max 2 avg 0.857143 exceeded 0 jostled 0 Nov 22 10:01:11 ipfire unbound: [29467:0] info: average recursion processing time 0.519950 sec Nov 22 10:01:11 ipfire unbound: [29467:0] info: histogram of recursion processing times Nov 22 10:01:11 ipfire unbound: [29467:0] info: [25%]=0.049152 median[50%]=0.098304 [75%]=0.32768 Nov 22 10:01:11 ipfire unbound: [29467:0] info: lower(secs) upper(secs) recursions Nov 22 10:01:11 ipfire unbound: [29467:0] info: 0.016384 0.032768 1 Nov 22 10:01:11 ipfire unbound: [29467:0] info: 0.032768 0.065536 8 Nov 22 10:01:11 ipfire unbound: [29467:0] info: 0.065536 0.131072 2 Nov 22 10:01:11 ipfire unbound: [29467:0] info: 0.131072 0.262144 3 Nov 22 10:01:11 ipfire unbound: [29467:0] info: 0.262144 0.524288 4 Nov 22 10:01:11 ipfire unbound: [29467:0] info: 2.000000 4.000000 2 Nov 22 10:01:11 ipfire unbound: [29467:0] notice: Restart of unbound 1.18.0. Nov 22 10:01:11 ipfire unbound: [29467:0] notice: init module 0: respip Nov 22 10:01:11 ipfire unbound: [29467:0] notice: init module 1: validator Nov 22 10:01:11 ipfire unbound: [29467:0] notice: init module 2: iterator Nov 22 10:01:26 ipfire unbound: [29467:0] info: start of service (unbound 1.18.0). ```
Created attachment 1395 [details] pcap file for collegebrowser.com Here is the pcap file
Hmm this is interesting. Because you are getting the responses from unbound - just quite late. About 15 seconds, which is definitely noticeable. Should we take this upstream? Or what other ideas do we have?
The two examples above (the pcap files) are fails. So yes, there is a response but something breaks during the "reload" I get this in my log: ``` ; <<>> DiG 9.10.6 <<>> +stats @192.168.60.1 collegebrowser.com ; (1 server found) ;; global options: +cmd ;; connection timed out; no servers could be reached ``` I cannot answer about reporting this to the unbound team. I am worried this may be caused by this change: https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=f20ca78eff6e8baeb86361f55adf52819d1bae1f and how "unbound-dhcp-leases-bridge" interface is written around the reload. Is there an IPFire python developer that can help??
(In reply to Jon from comment #23) > I am worried this may be caused by this change: > https://git.ipfire.org/?p=ipfire-2.x.git;a=commit; > h=f20ca78eff6e8baeb86361f55adf52819d1bae1f It very likely is. However, I am not sure if rolling this back will help us, because that would mean major rework of the leases bridge. So, what is easier to solve now?
Is it more than just replacing the new with the old? (I thought it would be easy to go back one step - shows you how little I know!) I don't mind going forward, but I don't speak python. So we'd need someone fluid in pythonese.
for what it is worth... All of the fixed leases work A-OK with no unbound reloads. I'm still checking but it looks good so far!
The problem before was that the bridge was not very good at figuring out which leases needed to be loaded into unbound. It kind of worked okay in some cases, but on larger networks, some random leases were dropped from DNS. That problem is fixed by having unbound reload the entire file every time something changes.
I have to image that someone has solved this problem before. We cannot be the only group that uses ISC-DHCP and unbound together. I'd hope either the unbound group or the dhcp group would have written a nice bridge between the two. I've been looking but have not been lucky in my searches. I've read through LOTS of other posts about others having similar issues. It looks like opensense _may_ have solved it. but pf sense has similar issues.
I added a note the the Dev mailing list about On Commit/Release/Expiry (CRE) that is already included into ISC DHCP. Hope it helps!
and I am back to bugzilla... I wrote a small command to grab the dhcp-leases.conf file and grab the unbound cache and compare them. ``` #!/bin/bash # need HOSTNAME and DOMAINNAME value eval $(/usr/local/bin/readhash /var/ipfire/main/settings) dhcp_leases=$( sed 's/\"//g ; s/local-data: //g' /etc/unbound/dhcp-leases.conf | sort --version-sort --ignore-case ) uc_list=$( unbound-control list_local_data | grep "${DOMAINNAME}" | sed "/${HOSTNAME}\.${DOMAINNAME}/d" | sort --version-sort --ignore-case ) diff --side-by-side --width=150 --minimal --ignore-case --ignore-all-space <(echo "$dhcp_leases") <(echo "$uc_list") ``` === One thing I _think_ I noticed (I need to keep testing!) is sometimes it takes time for unbound to place devices into the unbound cache. So they appear instantly in dhcp-leases.conf but take a minute or three to appear in the unbound cache. On one of the networks you Admin, maybe you can run the above to check for differences? If nothing else, the above code could be added in the bridge (as a "hack"!), would detect differences, then do a launch of the "reload". Like I said, a hack!
I dropped a line for the last post. It adds a period after the FQDNs. ``` dhcp_leases=$( sed ..... dhcp_leases=$( echo "${dhcp_leases}" | sed -E 's/(\.in-addr\.arpa)[[:blank:]]+/\1. /g ; s/(\.localdomain)+/\1./g' ) ```
(In reply to Jon from comment #30) > If nothing else, the above code could be added in the bridge (as a "hack"!), > would detect differences, then do a launch of the "reload". Like I said, a > hack! If we go down this path, we should avoid calling any reloads, because: * We should load the current set of data into Unbound when it starts. We need to do that by parsing the leases file and writing an unbound configuration file with all entries as we do now. More on that below. * We will then keep unbound up to data by using the DHCP events. The problem is that we picked Python for this job because we need to parse a rather complicated file and then need to check to which subnet a lease belongs. That is something that is very painful to do in shell. I played around with awk, which might be a middle ground, but as mentioned in a previous email, this is not a great approach. I just want to avoid that it takes a minute to start Unbound if we have to process a thousand leases in shell...
(In reply to Michael Tremer from comment #32) > (In reply to Jon from comment #30) > * We should load the current set of data into Unbound when it starts. We > need to do that by parsing the leases file and writing an unbound > configuration file with all entries as we do now. More on that below. I don't think this needs to be done! Please hear me out. (Sorry if I am explaining things already known.) in the /etc/unbound/unbound.conf there are includes files. And the `/etc/unbound/dhcp-leases.conf` include file is "imported" into unbound (and the unbound cache) every time there is a restart (unbound restart or ipfire reboot). So `/etc/unbound/dhcp-leases.conf` are already populated on restart and don't need to be re-populated. We don't need to parse the DHCP leases file. To test, comment out the `unbound-dhcp-leases-bridge` lines ipfire system and do a restart, you'll see this is correct. So why parse the lease and re-populate the unbound dhcp-leases.conf file? Now to get new stuff into the unbound cache (and the /etc/unbound/dhcp-leases.conf file) this is where "unbound-control load-data" comes in. So if we begin the process with DHCP On Commit/Release/Expire, add A & PTR records to `/etc/unbound/dhcp-leases.conf`, and add A & PTR records to unbound cache via "unbound-control load-data", I think we have a nice system. With the above there are no periodic unbound reloads/restarts. Again, this is all a proof-of concept. Make sense?
Created attachment 1406 [details] comparison script for testing FYI - Enclosed is a script to compare the unbound `dhcp-leases.conf` file (generated by Bridge) to the unbound cache (see attached compareMe_v3.sh file). It does not handle static hosts so those will appear as non-matching. Jon
Michael - with the current "unbound-dhcp-leases-bridge" code. How do fixed leases make it into unbound DNS? I am stuck on this at the moment since the fixed leases appear in the `/etc/unbound/dhcp-leases.conf`file but are missing from the unbound cache. And the fix leases are missing from the `/var/state/dhcp/dhcpd.leases`. I see the right items in `/var/ipfire/dhcp/fixleases` I did a unbound init restart and a dhcp init restart. (there is always a really good chance that I screwed up something!)
Ignore Comment 35. It is me - I figured it out. Sorry for the noise.
(In reply to Jon from comment #35) > Michael - with the current "unbound-dhcp-leases-bridge" code. How do fixed > leases make it into unbound DNS? We read the configuration that the user manages through the web UI and create entries in unbound. This is not really correct, because they are a static lease and that really does not mean that this should show up in DNS. But users seem to want this. > I am stuck on this at the moment since the fixed leases appear in the > `/etc/unbound/dhcp-leases.conf`file but are missing from the unbound cache. Here we parse the file and add them as a lease that never expires: > https://git.ipfire.org/?p=ipfire-2.x.git;a=blob;f=config/unbound/unbound-dhcp-leases-bridge;h=e9f022affa98a2483ca1f41d267fe443b0f048cf;hb=HEAD#l161 > And the fix leases are missing from the `/var/state/dhcp/dhcpd.leases`. Sadly, there is no way to convince the DHCP server to write the fixed leases to the file. That said, I doubt that we will see events for them.
With the update of unbound 1.19.0 this may have been fixed. I'll test and report back
Still does restarts. About 100 restarts per 60 minutes.
I am new to IPFire and have also noticed this. If the file /etc/unbound/dhcp-leases.conf is written in a deterministic way, can it initially be written somewhere else like /tmp or /var/tmp or to /etc/unbound/ with a slightly different name. Then compare the new file with the original and only if they are different, copy the file over /etc/unbound/dhcp-leases.conf and restart or reload unbound? Most DHCP actions are renewals which will not change the file.
I don't speak python but it looks like there is something in /usr/sbin/unbound-dhcp-leases-bridge, write_dhcp_leases(self, leases), which creates a temporary file then copies it over the original. It is called from class UnboundConfigWriter(object) which then goes on to self._control("reload_keep_cache"). What some genius could do is diff the temporary file and original file in write_dhcp_leases(self, leases) and set a flag if they are different before the temp file is copied over the original. Then in UnboundConfigWriter(object), only call self._control("reload_keep_cache") if the flag is set. (Remember to clear the flag after .....).
I've proposed a patch to only restart Unbound when the /etc/unbound/dhcp-leases.conf file changes - https://lists.ipfire.org/hyperkitty/list/development@lists.ipfire.org/thread/4RK4PR7M57QLHJXHYFWQ32QUGPPW5PRQ/. It is a minor rework of a couple of functions in the current program. It still requires restarts when the leases change, but not often. It is nowhere near as cute as Jon's proposal which avoids restarts of unbound altogether, but it completely ditches the current methodology.
Better link to the patch submission - https://patchwork.ipfire.org/project/ipfire/patch/ec4d6d18-41d3-4ba0-9713-e6a388a3119f@howitts.co.uk/
(In reply to Nick Howitt from comment #42) > It is nowhere near as cute as Jon's proposal ... cute?!? cute??? Nobody wants cute code!! That is like the kiss of death to have "cute" code!
Not cute code, I won't judge that (but it it prettier than my coding). It is a cute approach which avoids any restarts and hooks directly into the API's. I think that, as a concept, it is a great way to go. My patch is only a sticking plaster on an approach way inferior to yours. If my patch gets accepted, I'll move on to your code. From my patch submission emails, I think Michael Tremer just has not had the time to go through your approach
(In reply to Jon from comment #44) > (In reply to Nick Howitt from comment #42) > > It is nowhere near as cute as Jon's proposal ... > cute?!? cute??? Nobody wants cute code!! That is like the kiss of death > to have "cute" code! It can't hurt. Well-readable should be a must. Cute is probably a more subjective feeling. (In reply to Nick Howitt from comment #45) > Not cute code, I won't judge that (but it it prettier than my coding). It is > a cute approach which avoids any restarts and hooks directly into the API's. This is what the bridge used to do, and it became a huge problem if there were thousands of leases going around. There have been systems where the bridge processed used of 20% of a single Intel Xeon core. That is why we changed it. > https://git.ipfire.org/?p=ipfire-2.x.git;a=commitdiff;h=f20ca78eff6e8baeb86361f55adf52819d1bae1f > I think that, as a concept, it is a great way to go. My patch is only a > sticking plaster on an approach way inferior to yours. That is due to be tested. We right now have the option to update multiple leases at the same time whereas the shell approach would run an update command for each event. That is a lot of forked shell processes and we need to create some database that will store the current state or import it from the DHCP leases file that after a restart we won't have a blank page. I don't know what the overhead of that will be, but I don't think it will be that much cheaper (if at all) than the bridge.
Using the dhcp Event approach (the Proof-of-Concept) how can I check if multiple request can happen at one time? The commit/release/expiry comes from the DHCP server. And the `execute` command below launches a script (the Proof-of-Concept script). ``` # # commit # on commit { set ClientIP = binary-to-ascii(10, 8, ".", leased-address); set NoName = ""; set ClientName = pick-first-value(option host-name, config-option-host-name, client-name, NoName); log(concat("CRE-Commit: ClientIP: ", ClientIP, " Name: ", ClientName)); execute("/root/dhcpdconf/dhcpEvent.sh", "commit", ClientIP, ClientName); } ``` Is there a way to add a different execute command to monitor for multiple requests?
(In reply to Jon from comment #47) > Using the dhcp Event approach (the Proof-of-Concept) how can I check if > multiple request can happen at one time? The script will only be called once at a time. This is because of dhcpd being single-threaded. That has a lot of implications that we need to consider: * The process will not be able to do anything else until the script is done. That means that the script needs to be *very* fast. If the script blocks, dhcpd will not be able to even read incoming packets. * We cannot cover up any performance problems here with running multiple scripts at the same time. * Client won't wait very for long for a response from the DHCP server. They will retry, but that might make the situation a lot worse on our side since we are then running the script multiple times for the client and no reason. The code for dhcpd is here: > https://git.ipfire.org/?p=thirdparty/dhcp.git;a=blob;f=common/execute.c;hb=572032cb0e514606559de3784e3f7ca8e1539d17#l1 > Is there a way to add a different execute command to monitor for multiple > requests? I don't know what you are asking...
(In reply to Michael Tremer from comment #48) > (In reply to Jon from comment #47) > > Using the dhcp Event approach (the Proof-of-Concept) how can I check if > > multiple request can happen at one time? > > The script will only be called once at a time. This is because of dhcpd > being single-threaded. That has a lot of implications that we need to > consider: > > * The process will not be able to do anything else until the script is done. > That means that the script needs to be *very* fast. If the script blocks, > dhcpd will not be able to even read incoming packets. > * We cannot cover up any performance problems here with running multiple > scripts at the same time. > * Client won't wait very for long for a response from the DHCP server. They > will retry, but that might make the situation a lot worse on our side since > we are then running the script multiple times for the client and no reason. Knowing DHCP is single threaded helps! The Proof-of-Concept script runs in 0.5 seconds. But it is not elegant code. I added logger code to get an idea of time and if it blocks DHCP. Here is a typical log entry: ``` Mar 28 18:15:34 ipfire dhcpd: DHCPDISCOVER from 28:c2:dd:af:c7:33 via blue0 Mar 28 18:15:35 ipfire dhcpd: DHCPOFFER on 192.168.65.114 to 28:c2:dd:af:c7:33 via blue0 Mar 28 18:15:35 ipfire dhcpd: CRE-Commit: ClientIP: 192.168.65.114 Name: Onelink begin Mar 28 18:15:35 ipfire dhcpd: execute_statement argv[0] = /usr/sbin/dhcpEvent Mar 28 18:15:35 ipfire dhcpd: execute_statement argv[1] = commit Mar 28 18:15:35 ipfire dhcpd: execute_statement argv[2] = 192.168.65.114 Mar 28 18:15:35 ipfire dhcpd: execute_statement argv[3] = Onelink Mar 28 18:15:36 ipfire dhcpEv28: DNS Update (RFC2136) is enabled Mar 28 18:15:36 ipfire dhcpEv28: commit: added A & PTR records to unbound Mar 28 18:15:36 ipfire dhcpEv28: CRE script: complete - exiting Mar 28 18:15:36 ipfire dhcpd: CRE-Commit: ClientIP: 192.168.65.114 Name: Onelink finish Mar 28 18:15:36 ipfire dhcpd: DHCPREQUEST for 192.168.65.114 (192.168.65.1) from 28:c2:dd:af:c7:33 via blue0 Mar 28 18:15:36 ipfire dhcpd: DHCPACK on 192.168.65.114 to 28:c2:dd:af:c7:33 (Onelink) via blue0 ``` `dhcpd: CRE-Commit . . . begin` is the start of the `dhcp.conf` "on commit" log entry. `dhcpd: execute_statement argv` is passing info from DHCP to the Proof-of-Concept script. `dhcpEv28` is the Proof-of-Concept script. `dhcpd: CRE-Commit . . . finish` is the end of the DHCP execute. The total time for the "block" is 2-3 seconds. I measured using `time dhcpEvent commit <ip addr> <hostname>` and got 0.5 seconds How fast is "*very* fast"?? Is that OK?
Created attachment 1493 [details] readme
Created attachment 1494 [details] dhcpScript
Created attachment 1495 [details] dhcpd.conf.local config file Added three files: README.md, dhcpEvent and dhcpd.conf.local. This is the current PoC version.
If HOSTNAME is available in scripts, is extraction of the domain using DOMAINNAME=${HOSTNAME#*.*} considered dangerous as it is not reading from file? If it is OK, it could be faster than and eval.
(In reply to Nick Howitt from comment #53) > If HOSTNAME is available in scripts, is extraction of the domain using > DOMAINNAME=${HOSTNAME#*.*} considered dangerous as it is not reading from > file? If it is OK, it could be faster than and eval. Yes, the HOSTNAME is available. It is sent from DHCP. But it arrives as like this: ``` Mar 29 10:53:48 ipfire dhcpd: execute_statement argv[3] = MM2 ``` and does not have the DOMAINNAME attached. The only way I knew of to grab the "official" DOMAINNAME was via the: `eval $(/usr/local/bin/readhash /var/ipfire/main/settings)`
I changed one line to grep-ing a fixed string so that should make this a bit faster. # get existing records from unbound-control list_local_data ucList=$( /usr/sbin/unbound-control list_local_data | /bin/grep --ignore-case --fixed-strings ".${DOMAINNAME}." | expand -t1 )
AFAIK HOSTNAME is a standard environment variable in many Linux distros: [root@gateway ~]# echo $HOSTNAME gateway.howitts.co.uk On my IPFire it contains the FQDN as above.
(In reply to Jon from comment #49) > The total time for the "block" is 2-3 seconds. > > I measured using `time dhcpEvent commit <ip addr> <hostname>` and got 0.5 > seconds > > How fast is "*very* fast"?? Is that OK? I think we are looking at an executing time of around 100ms. The reason is as follows: Consider a group of people (let's say 60 for simple maths) enter a room with a wireless access point all having a phone in their pockets that connect simultaneously to the WiFi and therefore will request a DHCP Lease at exactly the same time. dhcpd will then queue all requests and process them one at a time. If the script would take one second, that would initially mean that the last person is waiting around 60s to get an IP address offered by dhcpd. That is already disruptive. In practise clients don't wait that long. Since DHCP is UDP-based they would assume that the request was lost and send another request. Considering that dhcpd processes all requests one after the other ordered by when they were received, it is suddenly processing requests that no client is waiting for any more. Whatever the timeout is depends on the client, but dhclient implements three seconds. That means that if initially all clients send one request, dhcpd will process the first client and respond after 1 second. The second and third client will also get a response. The fourth client will already have timed out and therefore send another request that is going to be appended to the queue. The maths is now becoming huge. If the queue initially has 60 requests and processes three before all clients time out and send another request, the queue will have grown to 117 requests. After another three seconds we will have responded to another three clients and the remaining 54 have timed out again adding a further 54 requests totalling 171... This will continue until all clients have received a response by what time the queue will have 570 requests queued. If a new client now connects to the network, it will have to wait almost ten minutes for the DHCP server to respond. The client will long have gone away as most of the others in the first example. I don't have a good name for this, but I think we can consider this an avalanche effect. Once the script cannot keep up, the situation is running out of control and no client will be able to connect to the network. This is the huge risk I see with this approach. It will work in small test environments, but IPFire does not only run there. Therefore we need to solve this problem and have a script that is *fast*. With running so many shell commands this might be very difficult. --- And yes, you can use $HOSTNAME.
excellent. all of this helps! So now looking at the overall concept (after all, this is a proof-of-concept): - is getting info from DHCP using a Commit/Release/Expiry event <--this is new - and processing it with a script (a very very fast script!) - and sending it to unbound as • unbound-control local_datas <data> <--this is new • and an update to the `/etc/unbound/dhcp-leases.conf` file an OK way to go forward?
How about this. What is the point of maintaining the /etc/unbound/dhcp-leases.conf file? This can be done initially as a pre-start task of dhcpd virtually using /usr/sbin/unbound-dhcp-leases-bridge, but as it is done as a pre-start task, unbound does not have to be restarted and no listener would be needed for any DNS events? /usr/sbin/unbound-dhcp-leases-bridge already gets run straight after unbound starts. This should be moved to before and then simplified so it is a one-shot task. Once that happens none of the calls are then needed to maintain the on-the-fly. It is only ever read when dhcpd is started or reloaded so jt only ever needs to be valid at the point that unbound starts. Then Jon can remove all those bits of the his script and just do the calls. If Jon strips out all his maintenance of the /etc/unbound/dhcp-leases.conf file and just leaves in the calls to the unbound cache, how quick does the processing become? /usr/sbin/unbound-dhcp-leases-bridge would also then be massively simplified just to write the /etc/unbound/dhcp-leases.conf and exit.
(In reply to Nick Howitt from comment #59) > How about this. What is the point of maintaining the > /etc/unbound/dhcp-leases.conf file? This can be done initially as a It primary purpose is to load unbound DNS on a reboot/restart/power-up. Otherwise it would have no DNS entries. does that help?
So just generate it at startup with a modified /usr/sbin/unbound-dhcp-leases-bridge, run just before the unbound start, rather than after it as it does at the moment. Then pull all the /etc/unbound/dhcp-leases.conf file maintenance from your scripts just leaving the parts related to the unbound cache calls. Then how fast does your script become? It cuts right down on the file i/o. As a PoC, don't even bother generating the /etc/unbound/dhcp-leases.conf and just test your modified script on its own.
(In reply to Nick Howitt from comment #61) > So just generate it at startup with a modified > /usr/sbin/unbound-dhcp-leases-bridge, run just before the unbound start, > rather than after it as it does at the moment. Then pull all the > /etc/unbound/dhcp-leases.conf file maintenance from your scripts just > leaving the parts related to the unbound cache calls. I am not sure I understand your proposal. Since the bash needs to go from more than 500ms to less than 100ms I am not sure the file i/o will make huge difference. It would be better to write as perl or python if @ms agrees the overall concept is correct.
(In reply to Nick Howitt from comment #61) > ... Then pull all the > /etc/unbound/dhcp-leases.conf file maintenance from your scripts just > leaving the parts related to the unbound cache calls. Then how fast does > your script become? It cuts right down on the file i/o. > > As a PoC, don't even bother generating the /etc/unbound/dhcp-leases.conf and > just test your modified script on its own. I commented out the reads, writes, sed's, etc /etc/unbound/dhcp-leases.conf and it saves about 10ms. This is on a APU4D4.
Perhaps it is worth timing just the unbound cache commands. That should give you the absolute shortest time possible. If that is too long, then perhaps we are dead. If those times are OK, then we need to speed up your stuff. You end up with small scripts so could they be rune directly from dhcpd.conf.local rather than calling /usr/sbin/dhcpEvent? Just ideas.
(In reply to Nick Howitt from comment #64) > Perhaps it is worth timing just the unbound cache commands. That should give > you the absolute shortest time possible. If that is too long, then perhaps > we are dead. I think that is a good start. We also need to take into consideration that there are people with lots of leases. So the loop that is in the script rewriting the file might take a short moment in testing, but might take several seconds if there are 1000 leases to process. Option C could be to have the most minimal script that just talks to a daemon that can hold and cache state. For example the bridge.
I've been trying to do some basic testing and research. Can some things be spead up by avoiding external calls, for example, to get a reverse IP, from https://unix.stackexchange.com/questions/132779/how-to-read-an-ip-address-backwards you can do: IFS=. read w x y z <<<'192.168.1.1' printf '%d.%d.%d.%d.in-addr.arpa.' "$z" "$y" "$x" "$w" or: echo '192.168.1.1' | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; } to avoid calling awk. Would that be faster. Also, from empirical testing, there seems little point in trapping errors before calling unbound-control. For example, in delete_unbound, perhaps just deleting the IP and reverse IP may be quicker than testing if it is there. Would it just be faster to delete any A and PTR records before adding them rather than test if they exist
There is a bug with the this new approach. It also exists in the current approach. The cache auto-expires all entries when it hits the TTL so this should be read from /etc/dhcp/dhcpd.conf every time a lease is renewed or, if it appears, from the renewal event. I also hope that a renewal event kicks off a commit event. Currently it is hard-coded to 60min in both Jon's update and /usr/sbin/unbound-dhcp-leases-bridge. The issue probably does not hit in /usr/sbin/unbound-dhcp-leases-bridge because it regenerates and restarts so often but I think would come more apparent if someone increased the DHCP TTL. I need to check out my own updates to /etc/unbound/dhcp-leases.conf for the same issue.
(In reply to Nick Howitt from comment #67) > There is a bug with the this new approach. It also exists in the current > approach. The cache auto-expires all entries when it hits the TTL so this > should be read from /etc/dhcp/dhcpd.conf every time a lease is renewed or, > if it appears, from the renewal event. I also hope that a renewal event > kicks off a commit event. I do not see a TTL in the dhcp.conf. Are you referring to default-lease-time or max-lease-time? > > Currently it is hard-coded to 60min in both Jon's update and > /usr/sbin/unbound-dhcp-leases-bridge. The issue probably does not hit in > /usr/sbin/unbound-dhcp-leases-bridge because it regenerates and restarts so > often but I think would come more apparent if someone increased the DHCP > TTL. I need to check out my own updates to /etc/unbound/dhcp-leases.conf for > the same issue. Are the the DHCP lease times and the unbound DNS TTL times really connected or related??
I tried looking at the ISC docs and it really isn't clear to me. I was wondering if I could experiment. Both values are used in the UI. The IPF documentation says the default-lease-time and does not mention the max-lease-time. https://dhcp-users.isc.narkive.com/jeVKUrHW/what-is-the-subtil-difference-between-max-lease-time-and-default-lease-time indicates the default is the one handed out if the client does not request a specific lease time and the max is the maximum DHCPD would allow if a client requested a specific one. I tried looking at the on-commit and it does not seem to be passed as a parameter, unless I missed it. Then the only option that I can see is to calculate it from /var/state/dhcp/dhcpd.leases but this risks slowing down the script.
(In reply to Nick Howitt from comment #69) > I tried looking at the ISC docs and it really isn't clear to me. I was > wondering if I could experiment. Of course! > https://dhcp-users.isc.narkive.com/jeVKUrHW/what-is-the-subtil-difference- > between-max-lease-time-and-default-lease-time indicates the default is the > one handed out if the client does not request a specific lease time and the > max is the maximum DHCPD would allow if a client requested a specific one. Max is the expiry time. When I leave home, DHCP sends an "on-expiry" after 2 hours. My dhcp is set to: "Max lease time (mins): 120" > I tried looking at the on-commit and it does not seem to be passed as a > parameter, unless I missed it. There is not one that I found. I only found status (like "commit") IP address & hostname.
And a MAC address!
(In reply to Jon from comment #70) > (In reply to Nick Howitt from comment #69) > > I tried looking at the ISC docs and it really isn't clear to me. I was > > wondering if I could experiment. > > Of course! > > > https://dhcp-users.isc.narkive.com/jeVKUrHW/what-is-the-subtil-difference- > > between-max-lease-time-and-default-lease-time indicates the default is the > > one handed out if the client does not request a specific lease time and the > > max is the maximum DHCPD would allow if a client requested a specific one. > > Max is the expiry time. When I leave home, DHCP sends an "on-expiry" after > 2 hours. My dhcp is set to: "Max lease time (mins): 120" That isn't my reading. My reading is that many clients don't request a specific lease time. In that case they get the default lease time. However if the client explicitly requested 10 minutes, it would get 10 minutes and if it requested 10h, it would get max-lease-time if it is less than 10h. https://dhcp-users.isc.narkive.com/jeVKUrHW/what-is-the-subtil-difference-between-max-lease-time-and-default-lease-time > > > > I tried looking at the on-commit and it does not seem to be passed as a > > parameter, unless I missed it. > > There is not one that I found. I only found status (like "commit") IP > address & hostname.
This feels like we may be going off topic... How is this related to the multiple unbound restarts?
Multiple restarts are caused by /usr/sbin/unbound-dhcp-leases-bridge updating, effectively, the hosts in in Unbound. I have tried to work around it by updating /usr/sbin/unbound-dhcp-leases-bridge, but you have a better solution (IMHO) of taking the output directly from dhcpd and populating the Unbound cache. In testing, I have noticed that my records added to the Unbound cache disappear after the the TTL specified. e.g., with my testing simplification of your script: IP=$1 PTR=$(echo $IP | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; }) DOMAIN=${HOSTNAME#*.*} echo -e "$2.$DOMAIN"'\n'"$PTR" | /usr/sbin/unbound-control local_datas_remove echo -e "$2.$DOMAIN 60 IN A $IP"'\n'"$PTR 60 IN PTR $2.$DOMAIN" | /usr/sbin/unbound-control local_datas Anything I add disappears in 60 minutes. Now, If I had a lease time of 3h and this triggered the add to cache, for 60 mins the A record would appear, but then it would disappear until the lease were renewed. To me, that is a bug. I am trying to work round it. Now, if the default-lease-time is set to 60 mins, the bug won't generally be hit as releases are renewed from half the granted lease time onwards. But if the default were set to 8h (the IPF default is very low), then the bug would be continually hit as you will never get a message to renew the hosts entry in time. Similarly, if a client requested a long lease, you'd hit the same problem. The current methodology works with a 60 minute default works because the hosts file gets regenerated so often but it is not good. It would also suffer if there were very few leases and a long lease time. Why do you think it is not related to multiple unbound restarts and fixing the issue?
(In reply to Nick Howitt from comment #74) > Why do you think it is not related to multiple unbound restarts and fixing > the issue? Mostly because it was more related to DHCP... If I understanding what you are saying: Changing the DHCP "max lease time" from default of 120min (2h) to 3h, the Unbound cache record A (but not record PTR?) disappears from the unbound cache after 60 minutes (the TTL setting in dhcpEvent script)? I am trying to understand your scenario so I can test also...
The PTR also disappears. The 60min TTL is used in both dhcpEvent and /usr/sbin/unbound-dhcp-leases-bridge. It may be tempting to stick the TTL very high and rely on the dhcpd expiry event to remove the record from Unbound, but it won't be a safe approach for downstream which may have cached the record until the original TTL expires. Downstream won't then re-query Unbound, but Unbound's record may have changed.
I think this is the lease time that could be grabbed with on-commit: https://github.com/isc-projects/dhcp/blob/master/common/print.c#L979-L981
Looks like you are right - https://linux.die.net/man/5/dhcp-eval
Do you know how to use lease-time? I have tried all sorts just outputting to a log file. Most things end up outputting nothing and killing the log lone, or dhcpd refuses to start. The most progress I made was with: set LeaseTime = encode-int(lease-time, 8); set PTR = concat (binary-to-ascii (10, 8, ".", reverse (1, leased-address)), ".in-addr.arpa."); log(concat("CRE-Commit: ClientIP: ", ClientIP, " Name: ", ClientName, " PTR: ", PTR, " Lease time: ", LeaseTime )); But this list gives: Apr 3 17:57:10 gateway dhcpd: CRE-Commit: ClientIP: 172.17.2.95 Name: LAPTOP-Rachel PTR: 95.2.17.172.in-addr.arpa. Lease time: ^P The additional paramater of 10 or more stops dhcpd from starting. ChatGPT isn't helping as it suggests: set lease_time_seconds = option lease-time; but lease-time is not an option. If I correct it, it switches to packet-lease-time and my log line fails.
Just to add, if you change the encode-int second parameter to 16 you get ^N^P and if you change it to 32 you get a blank.
having similar issues. just found this thread: https://lists.isc.org/mailman/htdig/dhcp-users/2011-December/014521.html and "maybe" this solution: https://lists.isc.org/mailman/htdig/dhcp-users/2012-January/014593.html
(In reply to Nick Howitt from comment #80) > Just to add, if you change the encode-int second parameter to 16 you get > ^N^P and if you change it to 32 you get a blank. ^N^P = 0E10x = 3600 what it your lease time setting?
mine is: either blank or ^Ah ^Ah = 0168 hex = 360 dec = Default lease time (mins): 5 So my number is correct (as a hex) for my 5 minute default time. But I dont understand why some are blank...
oops not quite correct...
maybe this... ``` set LeaseTime = encode-int(lease-time, 16); set LeaseTime2 = binary-to-ascii(10, 16, "", LeaseTime); ```
Well done. I've tried: set LeaseTime = binary-to-ascii(10, 16, "", encode-int(lease-time, 16)); and it works. I was trying ,, instead of ,"", when you posted. I am also confused. All the documentation I read indicates the TTL is in seconds, which means the setting used currently in /etc/unbound/dhcp-leases.conf and in your replacement method may be wrong? I'm checking.
It also means that my premise that the DNS record was expiring on the TTL is incorrect but something is expiring it.
This was the part I did not understand... I could not re-create what you saw. What error or issue did you see that hinted this was true?
I was adding a pair of DNS records with: IP=$1 PTR=$(echo $IP | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; }) DOMAIN=${HOSTNAME#*.*} echo -e "$2.$DOMAIN"'\n'"$PTR" | /usr/sbin/unbound-control local_datas_remove echo -e "$2.$DOMAIN 60 IN A $IP"'\n'"$PTR 60 IN PTR $2.$DOMAIN" | /usr/sbin/unbound-control local_datas so: ./test-add2 192.168.1.2 xxx giving: [root@gateway ~]# /usr/sbin/unbound-control list_local_data | grep xxx xxx.howitts.co.uk. 60 IN A 192.168.1.2 2.1.192.168.in-addr.arpa. 60 IN PTR xxx.howitts.co.uk. Then at some point in the future they disappeared. I'll shoot myself quietly if it was an unbound restart that was causing it, which I guess it could have been. I am testing now. I've added a record and I'm watching it in a loop. If it disappears I'll see when, check for a restart and check back. I am running my updated script so it may not restart....
Still working. I'm thinking I've led you up the garden path. Apologies. I'll keep the loop running overnight.
The manually cached entry was still there this morning, so I have led you up the garden path with the TTL issue. When I restarted Unbound, the cache was flushed and the entry disappeared as I expected. It did not disappear when I restarted dhcp. To me, this then means the /etc/unbound/dhcp-leases.conf only needs to be written as a pre-start task of unbound. It can be maintained on-the-fly by dhcpScript but this puts an unnecessary overhead on dhcp, especially if the emphasis is speed so as to not block dhcpd from issuing leases. What do others think? Another question, which is quicker for generating the PTR? it can be done in the dhcpd.conf.local with: set PTR = concat (binary-to-ascii (10, 8, ".", reverse (1, leased-address)), ".in-addr.arpa."); Or in dhcpScript as it is done now calling awk or directly in bash with: PTR=$(echo $IP | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; }) I suspect the awk method is the slowest. I can try to set up a timing loop for the two methods in dhcpScript later today.
Here are some timings for calculating the reverse IP over 256 IP's: IP="192.168.1." time for i in {0..255} ; do echo ${IP}$i | { IFS=. read w x y z; echo "$z.$y.$w.$x.in-addr.arpa."; }; done <snip> real 0m0.166s user 0m0.156s sys 0m0.109s time for i in {0..255} ; do echo "${IP}$i" | /usr/bin/awk -F. '{print $4 "." $3 "." $2 "." $1 ".in-addr.arpa." }'; done <snip> real 0m0.878s user 0m0.617s sys 0m0.368s It is a saving of 0.003s per iteration. Perhaps not too significant, but a small reduction. I don't know how to time the same thing in dhcpd.conf.local with: set PTR = concat (binary-to-ascii (10, 8, ".", reverse (1, leased-address)), ".in-addr.arpa.");
But for extracting the valid part of a hostname, these are near as dammit the same: test="abcABD-_.khaoih" time for i in {0..255} ; do expr "$i$test" : '\([A-Za-z0-9.-]*\)'; done time for i in {0..255} ; do echo "$i$test" | /bin/sed 's/[^A-Za-z0-9.-]/_/g' | /usr/bin/awk -F_ '{print $1}'; done But this is a touch faster: time for i in {0..255} ; do echo "$i$test" | /bin/sed 's/[^A-Za-z0-9.-].*//'; done but not significantly - about 0.0005s per iteration
if you're allowed to us it for the domain name: time for i in {0..255} ; do DOMAINNAME=${HOSTNAME#*.*}; echo $DOMAINNAME ; done <snip> real 0m0.010s user 0m0.005s sys 0m0.004s vs time for i in {0..255} ; do eval "$(/usr/local/bin/readhash /var/ipfire/main/settings)"; echo $DOMAINNAME; done <snip> real 0m2.485s user 0m1.700s sys 0m0.817s
Do you need the "expand": [root@gateway ~]# time for i in {0..255} ; do ucList=$( /usr/sbin/unbound-control list_local_data | /bin/grep -i "${DOMAINNAME}" | expand -t1 ) ; done real 0m1.869s user 0m1.785s sys 0m1.141s [root@gateway ~]# time for i in {0..255} ; do ucList=$( /usr/sbin/unbound-control list_local_data | /bin/grep -i "${DOMAINNAME}") ; done real 0m1.577s user 0m1.214s sys 0m1.062s
Created attachment 1513 [details] timing Hey Nick! Sorry I haven't replied recently. Just been busy with the weekend holiday and now Tax Day filings! (ugh!). I've been doing the same for each command and running time trials (see attached image). For most everyone I can save near 5-20 ms for a command. But I have not been able to find "the smoking gun" that is using a big chunk of time. So as mentioned in Comment 62, I am not sure we can get things down from 500ms to 100 ms without perl (or something faster than bash). I should be able to get back to this sometime soon! Let me know if you find the smoking gun!
(In reply to Nick Howitt from comment #92) > I don't know how to time the same thing in dhcpd.conf.local with: > > set PTR = concat (binary-to-ascii (10, 8, ".", reverse (1, leased-address)), > ".in-addr.arpa."); This could be done. I just don't know how to measure the execution time if PTR is moved to dhcpd.conf. If we can find a way to measure (beside the time stamp on a log entry) that would be great!
I have no idea there. For IP validation over 500 iterations: time for i in {0..499} ; do if [[ "$IP$i" =~ [0-9]{1,3}\.[0-9]{1,3}\[0-9]{1,3}\[0-9]{1,3} ]] ; then echo "good" ; else echo "bad"; fi ; done real 0m0.079s user 0m0.062s sys 0m0.018s vs time for i in {0..499} ; do if ! /bin/grep --quiet --extended-regexp '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' <<< "${IP}${i}" ; then echo "bad"; else echo "good" ;fi ; done real 0m1.842s user 0m1.200s sys 0m0.672s Note the original test is negative. My other one is positive but can be negated. Again no smoking gun. I guess I may have to load in your your code fully rather than do synthetic testing. How did you do your timings?
(In reply to Nick Howitt from comment #95) > Do you need the "expand": Yes, I just dont remember the "why". It probably had to do the one of the greps on the ucList. I'll review
(In reply to Nick Howitt from comment #98) > How did you do your timings? For command line: similar to yours except for 1000: `time for i in {0..999} ; do DOMAINNAME=${HOSTNAME#*.*}; done` for dhcpEvent just one time: `time dhcpEvent commit <ip addr> <hostname>` For multiple IPs and I time this script as a whole: ``` #!/bin/bash for y in `seq 101 104` # 4 x 250 = 1000 lines do for z in `seq 3 252` # 250 lines total do IP="192.168.${y}.${z}" /root/dhcpdconf/dhcpEvent_v29b.sh commit "${IP}" "testme${y}${z}" done done ```
One more item: The last command (the for loop) stresses everything. Dont run on production. It will add to unbound cache and /etc/unbound/dhcp-leases.conf
I made a mistake in my testing and piped some rubbish into /bin/grep. I had to spin up a VM to grab a copy of the file to fix the firewall. I can try to copy over data to test there.
Hmmm. I may have made an error also. The smoking gun may this: ``` eval $(/usr/local/bin/readhash /var/ipfire/dhcp/settings) # Is "DNS Update > Enable DNS Update (RFC2136)" enabled or disabled? if [[ "${DNS_UPDATE_ENABLED}" != on ]]; then /usr/bin/logger --tag dhcpEvent "DNS Update (RFC2136) is NOT enabled. Bridge may be running" else /usr/bin/logger --tag dhcpEvent "DNS Update (RFC2136) is enabled" fi pgrep -f unbound-dhcp-leases-bridge && kill -s SIGTERM $(pgrep -d',' -f unbound-dhcp-leases-bridge) && /usr/bin/logger --tag dhcpEvent "terminated unbound-dhcp-leases-bridge" ``` And secondly this: ``` # send message to message log msg_log () { if tty --silent ; then echo "${tagName}:" "$*" else /usr/bin/logger --tag "${tagName}" "$*" fi } ``` time to test!
[root@ipfire ~] # time /root/dhcpdconf/dhcpEvent_v30.sh commit 192.168.200.16 testME20016 real 0m0.404s user 0m0.270s sys 0m0.202s the above got rid of ~100ms. 300ms more to go!
I had not decoded that part. I am not sure what you see wrong in the first bit or what to expect from DNS_UPDATE_ENABLED, but I don't think Unbound can handle RFC2136. A killall unbound-dhcp-leases-bridge may be a bit better than a kill with a pgrep. Does all that execute every time the app runs? I know nothing about tty so skipped that bit of code, but too much debugging output will slow things.
It does for the current dhcpEvent, but all of this will be removed if the idea is acceptable to the Core Devs. I was just being paranoid and I wanted to make sure `unbound-dhcp-leases-bridge` was not running at the same time. And yes, it did run and check every time. So I've changed the test so that does not happen It is not worth spending time on to make faster since it will disappear one day. But it is noteworthy since it is/was included the the original 500ms. And this needs to be commented out for time tests: ``` eval $(/usr/local/bin/readhash /var/ipfire/dhcp/settings) # Is "DNS Update > Enable DNS Update (RFC2136)" enabled or disabled? if [[ "${DNS_UPDATE_ENABLED}" != on ]]; then /usr/bin/logger --tag dhcpEvent "DNS Update (RFC2136) is NOT enabled. Bridge may be running" pgrep -f unbound-dhcp-leases-bridge && kill -s SIGTERM $(pgrep -d',' -f unbound-dhcp-leases-bridge) && /usr/bin/logger --tag dhcpEvent "terminated unbound-dhcp-leases-bridge" else /usr/bin/logger --tag dhcpEvent "DNS Update (RFC2136) is enabled" fi ```
Created attachment 1514 [details] dhcpEvent updated with speed ups and updating of dhcp-leases.conf removed I've commented out the "Kill the unbound-dhcp-leases-bridge process" section and added all the speed ups I mentioned. Note the RE I originally proposed to validate the IP address was buggy. THis has the correct version. I have also removed the lines updating the dhcp-leases.conf file. This should be removed to the Unbound start up as a pre-start task. It hen becomes a one off task rather than a perpetual maintenance task. It is only used to load the cache at the beginning and, thereafter, the cache is maintained by this script. Now the release is slower than the commit! [root@ipfire ~]# time /root/dhcpEvent.njh commit 192.168.1.1 xxxx added 2 datas dhcpEv28: commit: added A & PTR records to unbound real 0m0.067s user 0m0.034s sys 0m0.040s [root@ipfire ~]# /usr/sbin/unbound-control list_local_data | grep xxx 1.1.192.168.in-addr.arpa. 60 IN PTR xxxx.localdomain. xxxx.localdomain. 60 IN A 192.168.1.1 [root@ipfire ~]# time /root/dhcpEvent.njh release 192.168.1.1 removed 2 datas dhcpEv28: release: references to xxxx.localdomain. removed from unbound real 0m0.075s user 0m0.050s sys 0m0.045s [root@ipfire ~]# /usr/sbin/unbound-control list_local_data | grep xxx [root@ipfire ~]# This was tested on a 2-core VM with 2GB RAM
I have a feeling this can leave orphaned PTR records. If you get a lease on your phone, for example, then change the phone's name and refresh the lease on the phone (WiFi off then on, for example), the old A record will get overwritten, but I think the PTR will remain?
The "revIP" line is not needed in delete_unbound. unbound-control can take a -q which may have a benefit. There are a couple of orphaned parameters in the script now e.g unboundDHCPleases and in delete_unbound, revIP, if the unnecessary line is removed.
Created attachment 1515 [details] dhcpEvent - v30 Here is the current dhcpEvent (v30). It includes the changes up to April 4. This is the 300ms version (after msg_log commands commented out).
(In reply to Nick Howitt from comment #109) > The "revIP" line is not needed in delete_unbound. How are PTR / `.in-addr.arpa` lines deleted without reverse IP? > > unbound-control can take a -q which may have a benefit. Got it - I will give it a try! > There are a couple of orphaned parameters in the script now e.g > unboundDHCPleases and in delete_unbound, revIP, if the unnecessary line is > removed. I do not understand how I can get rid of code related to unbound DHCP leases. How does `/etc/unbound/dhcp-leases.conf` get updated?
(In reply to Jon from comment #111) > (In reply to Nick Howitt from comment #109) > > The "revIP" line is not needed in delete_unbound. > > How are PTR / `.in-addr.arpa` lines deleted without reverse IP? > Have a look at the code. It is not used. The reverse IP is derived from grepping the ucList for the FQDN > > > > unbound-control can take a -q which may have a benefit. > > Got it - I will give it a try! > I don't think it makes much difference from my observations > > There are a couple of orphaned parameters in the script now e.g > > unboundDHCPleases and in delete_unbound, revIP, if the unnecessary line is > > removed. > > I do not understand how I can get rid of code related to unbound DHCP > leases. How does `/etc/unbound/dhcp-leases.conf` get updated? `/etc/unbound/dhcp-leases.conf` only gets read during the start or reload of Unbound. While unbound is running, changing the file does nothing. As a logical extension, why, then, bother to maintain it while unbound is running? Its generation could then move to the Unbound start routine, /etc/init.d/unbound to run prior to unbound starting. There is a huge amount of pre-processing done in init file, so why not add the file generation there, where it is not so time critical and will rarely get called.
(In reply to Nick Howitt from comment #112) > (In reply to Jon from comment #111) > > (In reply to Nick Howitt from comment #109) > > > The "revIP" line is not needed in delete_unbound. > > > > How are PTR / `.in-addr.arpa` lines deleted without reverse IP? > > > Have a look at the code. It is not used. The reverse IP is derived from > grepping the ucList for the FQDN removed - thank you! > > > unbound-control can take a -q which may have a benefit. > > > > Got it - I will give it a try! > > > I don't think it makes much difference from my observations added -q and removed `>/dev/null` > `/etc/unbound/dhcp-leases.conf` only gets read during the start or reload of > Unbound. While unbound is running, changing the file does nothing. As a > logical extension, why, then, bother to maintain it while unbound is > running? Its generation could then move to the Unbound start routine, > /etc/init.d/unbound to run prior to unbound starting. There is a huge amount > of pre-processing done in init file, so why not add the file generation > there, where it is not so time critical and will rarely get called. You have to show me the code for this one. I dont think this will work. Records A & PTR cannot be created on the fly. That is why `/etc/unbound/dhcp-leases.conf` exists.
You've missed what I am trying to say. Another script would be needed. It could be based on a cut down /usr/sbin/unbound-dhcp-leases-bridge. It queries the leases file (/var/state/dhcp/dhcpd.leases) and static leases, eliminates anything expired, anything covered by the hosts file and anything without a name on the lease and then generates the /etc/unbound/dhcp-leases.conf. After that it reloads unbound (which we would not need). Problem is I don't speak python. The same thing could probably be scripted in bash but I don't know how to query the information out of the dhcpd.leases file. This should be run during the unbound start up. So the code does not exist, but generating the /etc/unbound/dhcp-leases.conf is better done as part of the unbound start (IMHO). Also generating the file during the unbound start removes the risk of the file diverging over time as it gives a full reset.
Just decided to join this discussion. Didn't read it as a whole, yet. Therefore my motivation only. I am examining filtering using the RPZ mechanism of unbound. The block lists are updated by the SOA functionality. I noticed that lists with a longer update period ( usually 1 day ) are not checked. I suppose, the reload of unbound resets the associated timers. This results in deactivation of the update mechanism. Don't know, whether other SOA based lists lack actuality also. Jon's concept of syncing DHCP and unbound by the CRE mechanism and unbound-control load-local-data seems a way out of this problem.
Nick - Now I understand better. But I am not going to touch this aspect of the current unbound bridge. It is way over my head! Hey Bernhard! (In reply to Nick Howitt from comment #108) > I have a feeling this can leave orphaned PTR records. If you get a lease on > your phone, for example, then change the phone's name and refresh the lease > on the phone (WiFi off then on, for example), the old A record will get > overwritten, but I think the PTR will remain? I do not see stay PTRs. Is this with your code from Comment 107 or my code? Does this still happen?
I may be able to have a look at the unbound bridge, but I'll be slow! The thing about orphaned PTR's is hypothetical and I have not tried to demonstrate. It goes roughly like this: Obtain a lease with a name like Pixel7a e.g. for 192.168.1.10. You get an A record Pixel7a.domain pointing to 192.168.1.10. You get a PTR record 10.1.168.192.in-addr.arpa pointing to Pixel7a.domain Cut the device off and rename it to Dad. Get a new lease. As the MAC address has not changed, it will get the same IP You get an A record Dad.domain pointing to 192.168.1.10. This appears to overwrite the current A record. You get a PTR record, 10.1.168.192.in-addr.arpa pointing to Dad.domain. This does not overwrite the previous PTR record. BUT I have just tested and I cannot reproduce it so possibly the code handles it. Originally I came across it when I mocked up a few lines of code, but not with this full routine.
Created attachment 1516 [details] Script to generate /etc/unbound/dhcp-leases.conf This is a major hack job on the original /usr/sbin/unbound-dhcp-leases-bridge (not my proposed modified version as my modifications are redundant with this approach). It creates a new /etc/unbound/dhcp-leases.conf and is designed to run during the unbound startup. I have removed the inotify bits (file monitoring), the perpetual loop and daemonizing elements. It also drops the restart. I have tried to cut out all the bits I can but may have left in some redundant "imports" or other bits so feel free to make any changes. If you want to test and not overwrite your current /etc/unbound/dhcp-leases.conf, comment out line 487 and uncomment line 488. Modify line 488 to suit. This script should be run from /etc/init.d/unbound after the write_forward_conf line (574).
I'll keep an eye out for stray PTR also. So far I have not seen any. (In reply to Nick Howitt from comment #117) > Cut the device off and rename it to Dad. The code in dhcpEvent looks at a hostname name change. ``` UC_FQDN=$( /bin/grep --word-regexp --ignore-case "^${clientHostname}\.${DOMAINNAME}" <<< "${ucList}" ) if [[ -z "${UC_FQDN}" ]] ; then # No, hostname. So the hostname changed. # delete A & PTR records from unbound delete_unbound "${clientIP}" # add new to unbound! write-unbound "${clientIP}" "${clientHostname}" fi ``` so hopefully this handles things correctly! (In reply to Nick Howitt from comment #118) > Created attachment 1516 [details] > Script to generate /etc/unbound/dhcp-leases.conf > > This is a major hack job on the original > /usr/sbin/unbound-dhcp-leases-bridge (not my proposed modified version as my > modifications are redundant with this approach). It creates a new > /etc/unbound/dhcp-leases.conf and is designed to run during the unbound > startup. One of the reasons I wrote dhcpEvent was to find a way around parsing the file generated by DHCP @ `/var/state/dhcp/dhcpd.leases`. So my preference is to stick with `dhcp-leases.conf` being populated with dhcpEvent.
(In reply to Jon from comment #119) > I'll keep an eye out for stray PTR also. So far I have not seen any. > > (In reply to Nick Howitt from comment #117) > > Cut the device off and rename it to Dad. > > The code in dhcpEvent looks at a hostname name change. > ``` > UC_FQDN=$( /bin/grep --word-regexp --ignore-case > "^${clientHostname}\.${DOMAINNAME}" <<< "${ucList}" ) > if [[ -z "${UC_FQDN}" ]] ; then > # No, hostname. So the hostname changed. > # delete A & PTR records from unbound > delete_unbound "${clientIP}" > > # add new to unbound! > write-unbound "${clientIP}" "${clientHostname}" > fi > ``` > > so hopefully this handles things correctly! > > I'd missed the hostname change code. Looks like it it is handling it. > (In reply to Nick Howitt from comment #118) > > Created attachment 1516 [details] > > Script to generate /etc/unbound/dhcp-leases.conf > > > > This is a major hack job on the original > > /usr/sbin/unbound-dhcp-leases-bridge (not my proposed modified version as my > > modifications are redundant with this approach). It creates a new > > /etc/unbound/dhcp-leases.conf and is designed to run during the unbound > > startup. > > One of the reasons I wrote dhcpEvent was to find a way around parsing the > file generated by DHCP @ `/var/state/dhcp/dhcpd.leases`. So my preference > is to stick with `dhcp-leases.conf` being populated with dhcpEvent. The code already existed to parse the /var/state/dhcp/dhcpd.leases - /usr/sbin/unbound-dhcp-leases-bridge. All I've done, more or less, is remove the daemonising elements so it runs as a one-shot. By using this you can remove your lines which maintain /etc/unbound/dhcp-leases.conf and gain valuable time in your code execution and execution time is at a premium. It also stops any possibility of /etc/unbound/dhcp-leases.conf deviating from where it should be, for example on a crash or termination of a routine. The file only ever needs to be correct when unbound starts and it will be with this script. This is the safer and faster option.
(In reply to Nick Howitt from comment #118) > Created attachment 1516 [details] > Script to generate /etc/unbound/dhcp-leases.conf > > This is a major hack job on the original > /usr/sbin/unbound-dhcp-leases-bridge (not my proposed modified version as my > modifications are redundant with this approach). It creates a new > /etc/unbound/dhcp-leases.conf and is designed to run during the unbound > startup. > > I have removed the inotify bits (file monitoring), the perpetual loop and > daemonizing elements. It also drops the restart. I have tried to cut out all > the bits I can but may have left in some redundant "imports" or other bits > so feel free to make any changes. > > If you want to test and not overwrite your current > /etc/unbound/dhcp-leases.conf, comment out line 487 and uncomment line 488. > Modify line 488 to suit. > > This script should be run from /etc/init.d/unbound after the > write_forward_conf line (574). The code should also be run on a reload, so after write_hosts_conf on line 598.
Created attachment 1524 [details] Zip file for dhcpEvent Update dhcpd.conf.local - Added commit check for a hostname. If there is no commit hostname, then do not execute dhcpEvent script. (small speed improvement) - General clean-up of config file. Update dhcpEvent - removed hostname check for commit. Now done as part of "on commit" in dhcpd.conf (small speed improvement) - General clean up.
Michael - I need your input. I'd like to keep moving forward. And I am thinking of re-writing the bash script into perl. And I don't want to go forward without your approval. So, do you agree with the proof-of-concept and using the DHCP on-commit / on-release / on-expire functions to send IP address and Hostname to Unbound DNS? If yes, are you OK with re-writing the dchpEvent using perl? (I have not written anything using perl so this will be a bit of a learning curve)
I think we also need input on 2 other issues: 1 - can we use `DOMAINNAME=${HOSTNAME#*.*}` using the built-in HOSTNAME do derive the DOMAINNAME or do we have to do `eval "$(/usr/local/bin/readhash /var/ipfire/main/settings)"` which is much slower (in bash), so I assume similar in perl. 2 - does the dhcp-leases file have to be maintained during cache updates? As far as I can see, the file is only ever read on startup of unbound so a cut down version of /usr/sbin/unbound-dhcp-leases-bridge could be run as a pre-start and pre-reload task in init.d/unbound to generate the dhcp-leases file. All the script needs to do is regenerate dhcp-leases with out any restarts. I have attached a working cut down version of the script which removes all deamonising elements and restarts. It also removes the need for my mods which I submitted a couple of weeks ago. It runs very quickly on my machine and makes no noticeable difference to the start up time of unbound. It also, then, removes any requirement to maintain dhcp-leases when doing cache updates, speeding them up.
(In reply to Jon from comment #123) > Michael - I need your input. > > I'd like to keep moving forward. And I am thinking of re-writing the bash > script into perl. And I don't want to go forward without your approval. > > So, do you agree with the proof-of-concept and using the DHCP on-commit / > on-release / on-expire functions to send IP address and Hostname to Unbound > DNS? > > If yes, are you OK with re-writing the dchpEvent using perl? (I have not > written anything using perl so this will be a bit of a learning curve) Michael - are you OK with me going forward with a perl script to do what dhcpEvent.sh does?
(In reply to Jon from comment #125) > Michael - are you OK with me going forward with a perl script to do what > dhcpEvent.sh does? Erm, why Perl?
It seems to be MUCH faster than bash and had the possibility of getting from 300 ms (w/ bash) down to 100 ms. And I did not want to learn C.
and I have someone to help me write perl!
When I ran your dhcpEvent in bash but removed the updates to the /etc/unbound/dhcp-leases.conf and all your pre-amble at the top to kill the unbound-dhcp-leases-bridge process and used my alt method for deriving the domain name, my dhcpEvent took about 70 ms to execute.
I suppose Nick's changes have found their way into CU185. The restarts are really minimised, but the transfer of new DHCP info needs a reload nevertheless. This resets the update timers for RPZ list, for example. Jon's approach with unbound-control avoids this. But it is necessary to separate the DHCP events handling from the update process: - the event writes the parameters to a pipe - the bridge process works as a server reading this pipe My idea.
Hmm. I never received any feedback for my patch. I was going to chase it again on Thursday, 28 days after I submitted it and I have not updated to CU185 to see if it is there. However, Jon's approach is way better and mine should only be considered an interim fix. I'll update my machine to see if my fix has been incorporated.
I did not locate any recent changes to the bridge. The last change I found is July 2023. https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=f20ca78eff6e8baeb86361f55adf52819d1bae1f But I do see an update to Unbound: https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=4e57a4f31ac0b3bebac4fd43ad1dd282a4ded352 Maybe that changed the "restarts"
The only way that could affect the restarts is if it no longer wrote DHCP leases to the cache at /var/state/dhcp/dhcpd.leases I can't test for the moment as my test system is bust and won't do DNS so I can't update it and I've had to decommission my production system to test a work related issue on another firewall.
So I have implemented the version that I had in mind where a simple script (as simple as possible really) communities with the bridge over a UNIX domain socket: > https://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=shortlog;h=refs/heads/unbound-socket The client currently waits for a response from the bridge which takes around 100 ms. Without the wait it finishes in less than 10 ms on an IPFire Mini Appliance. That should pretty much be the goal. The bridge will then take all the events and put them into a queue so that it is ready to read the event as soon as possible. The queue is being processed by a worker thread which will update all existing leases and apply any changes to Unbound if necessary. Please give this a test and let me know if this is all working for you, too.
I’d be happy to test but I really don’t know how to install this. Sorry for being stupid. I saw ~22 patches. Is this part of CU 186 or CU 187?
Michael, I started looking through the patches (they are not installed). I came across the "Add event handlers" which looks like the code that goes into a "dhcpd.conf.local" file for me: https://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=blob;f=html/cgi-bin/dhcp.cgi;h=83cd809656ad0076fd5d4646b8ce64d09aff3a55;hb=83cd809656ad0076fd5d4646b8ce64d09aff3a55#l1377 If it is the same code for "dhcpd.conf.local", then there is a parameter missing. Bernhard noted the dhcp.cgi WebGUI page "Domain name suffix" should change the domain name. So this was added to the dhcpd.conf code: set NoDomain = "localdomain"; set ClientDomain = pick-first-value(config-option domain-name, NoDomain); . . . execute("/usr/sbin/dhcpEvent", "commit", ClientIP, ClientName, ClientDomain);
(In reply to Jon from comment #135) > I’d be happy to test but I really don’t know how to install this. Sorry for > being stupid. > > I saw ~22 patches. Is this part of CU 186 or CU 187? You just need to copy the three files that I have touched: dhcp.cgi and the bridge and the new client. The scripts go to /usr/sbin. Make sure they are executable. Then hit the Save button on the DHCP page to rewrite the configuration file and everything should be up and running. This is not yet scheduled for any of the updates. (In reply to Jon from comment #136) > Michael, > I started looking through the patches (they are not installed). > > I came across the "Add event handlers" which looks like the code that goes > into a "dhcpd.conf.local" file for me: > > https://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=blob;f=html/cgi-bin/ > dhcp.cgi;h=83cd809656ad0076fd5d4646b8ce64d09aff3a55; > hb=83cd809656ad0076fd5d4646b8ce64d09aff3a55#l1377 > > If it is the same code for "dhcpd.conf.local", then there is a parameter > missing. Yeah, I took this from the file that you uploaded in the ZIP container in an earlier comment. Maybe I messed this up when copy-and-pasting it? > Bernhard noted the dhcp.cgi WebGUI page "Domain name suffix" should change > the domain name. So this was added to the dhcpd.conf code: > > set NoDomain = "localdomain"; > set ClientDomain = pick-first-value(config-option domain-name, NoDomain); > . . . > execute("/usr/sbin/dhcpEvent", "commit", ClientIP, ClientName, ClientDomain); I did experiment with that, because we currently have a rather complicated solution to determine the full FQDN. However, on my test system the domain-name field was always empty. The domain-name is also empty in the leases file. Not sure if this requires some more changes elsewhere in the configuration? The protocol that the bridge is talking over the socket should be easy to extend without even touching the client. It just pipes the event into the socket and waits for a response :)
I have to admit I have not read the code in detail, but @Jon, didn't you move the domain check to your dhcpd.conf.local file because it was faster there than in the dhcpScript file? Would it now be faster in the new dhcp-unbound-leases-bridge/script routine as things like the DOMAINNAME only need to be determined once and, because the data has been piped at this point, the dhcpd.conf.local will return quicker?
@Jon, the latest dhcpd.conf.local is only on your github repo. It did not make it to this bug which is probably why @Michael does not know about it.
Created attachment 1547 [details] dhcp settings (In reply to Michael Tremer from comment #137) > (In reply to Jon from comment #135) > > If it is the same code for "dhcpd.conf.local", then there is a parameter > > missing. > > Yeah, I took this from the file that you uploaded in the ZIP container in an > earlier comment. Maybe I messed this up when copy-and-pasting it? > not at all. it is a recent change. > I did experiment with that, because we currently have a rather complicated > solution to determine the full FQDN. However, on my test system the > domain-name field was always empty. The domain-name is also empty in the > leases file. Not sure if this requires some more changes elsewhere in the > configuration? No additional changes. Just adding three lines to dhcpd.conf: ``` set NoDomain = "localdomain"; set ClientDomain = pick-first-value(config-option domain-name, NoDomain); . . . execute("/usr/sbin/dhcpEvent", "commit", ClientIP, ClientName, ClientDomain); ``` maybe a dash is missing from "config-option . . ." or a misspelled name?? It should never be blank since it should grab the variable "NoDomain" This is the current dhcpd.conf: ``` # This file CAN be modified by the user. # If this file is not needed, just delete the text. # But, DO NOT completely delete this file! # # commit # on commit { set ClientIP = binary-to-ascii(10, 8, ".", leased-address); set NoName = ""; set ClientName = pick-first-value(option host-name, config-option-host-name, client-name, NoName); set NoDomain = "localdomain"; set ClientDomain = pick-first-value(config-option domain-name, NoDomain); #log(concat("info: commit-CRE: ClientIP: ", ClientIP, " Name: ", ClientName, " Domain: ", ClientDomain)); if ClientName != "" { execute("/usr/sbin/dhcpEvent", "commit", ClientIP, ClientName, ClientDomain); } else { log(concat("info: commit-CRE: missing client hostname from ", ClientIP, ". Nothing to send to unbound")); } } # # release # on release { set ClientIP = binary-to-ascii(10, 8, ".", leased-address); set NoName = ""; set ClientName = pick-first-value(option host-name, config-option-host-name, client-name, NoName); #log(concat("info: release-CRE: ClientIP: ", ClientIP, " Name: ", ClientName)); execute("/usr/sbin/dhcpEvent", "release", ClientIP, ClientName); } # # expiry # on expiry { set ClientIP = binary-to-ascii(10, 8, ".", leased-address); #log(concat("info: expiry-CRE: ClientIP: ", ClientIP)); execute("/usr/sbin/dhcpEvent", "expiry", ClientIP, ""); } ``` do the domain names appear in the dhcpd.conf file? > The protocol that the bridge is talking over the socket should be easy to > extend without even touching the client. It just pipes the event into the > socket and waits for a response :) curious - How does the unbound-dhcp-leases-bridge know when to grab something from the client / socket?
(In reply to Michael Tremer from comment #137) > (In reply to Jon from comment #135) > The domain-name is also empty in the > leases file. Not sure if this requires some more changes elsewhere in the > configuration? In the dhcpd.leases file, do you see the "on-expiry/on-release" at the bottom of some leases? Or do you also have the "set VariAble" lines? ``` lease 192.168.65.106 { starts 0 2024/05/12 15:30:20; ends 0 2024/05/12 17:30:20; cltt 0 2024/05/12 15:30:20; binding state active; next binding state free; rewind binding state free; hardware ethernet 40:c7:00:01:02:03; uid "\001@\307\000\001\002\003"; set ClientDomain = "localdomain"; set NoDomain = "localdomain"; set ClientName = "SiPhone12"; set ClientIP = "192.168.65.106"; client-hostname "SiPhone12"; on expiry { set ClientIP = binary-to-ascii (10, 8, ".", leased-address) ; execute ("/usr/sbin/dhcpEvent", "expiry", ClientIP, ""); } on release { set ClientIP = binary-to-ascii (10, 8, ".", leased-address) ; set NoName = "" ; set ClientName = pick-first-value (option host-name, config-option-host-name, client-name, NoName) ; execute ("/usr/sbin/dhcpEvent", "release", ClientIP, ClientName); } }
also, I get this error when running `unbound-dhcp-leases-client` [root@ipfireAPU2 ~]# /usr/sbin/unbound-dhcp-leases-client release 192.168.7.250 /usr/sbin/unbound-dhcp-leases-client: line 41: nc: command not found
looks like the client is dependent on `ncat`. I added nat via pakfire, but I get "ERROR": ``` [root@ipfireAPU2 ~]# /usr/sbin/unbound-dhcp-leases-client release 192.168.7.250 ERROR ``` but I am not sure where it comes from... May be the unbound bridge??
I added this to the `dhcp.cgi`: ``` set NoDomain = "localdomain"; set ClientDomain = pick-first-value(config-option domain-name, NoDomain); log(concat("info: commit-CRE: ClientIP: ", ClientIP, " Name: ", ClientName, " Domain: ", ClientDomain)); if (ClientName != "") { execute("/usr/sbin/unbound-dhcp-leases-client", "commit", ClientAddress, ClientName, ClientDomain); } ``` and it worked for getting the domain. It arrives as `execute_statement argv[4]`: ``` May 12 15:25:03 ipfireAPU2 dhcpd: execute_statement argv[4] = localdomain ``` BUT, the get it to "take" I had to reload the dhcp.cgi WebGUI page _AND_ do a `/etc/init.d/dhcp restart`. Otherwise it did not work. And I get these errors: ```` May 12 15:25:03 ipfireAPU2 dhcp[18015]: Could not handle message: No value given May 12 15:25:03 ipfireAPU2 dhcpd: execute: /usr/sbin/unbound-dhcp-leases-client exit status 256 ``` here is the section of the message log: ``` May 12 15:25:01 ipfireAPU2 dhcpd: DHCPDISCOVER from 02:d2:ef:b1:e2:5d via green0 May 12 15:25:02 ipfireAPU2 dhcpd: DHCPOFFER on 192.168.7.104 to 02:d2:ef:b1:e2:5d via green0 May 12 15:25:03 ipfireAPU2 dhcpd: execute_statement argv[0] = /usr/sbin/unbound-dhcp-leases-client May 12 15:25:03 ipfireAPU2 dhcpd: execute_statement argv[1] = commit May 12 15:25:03 ipfireAPU2 dhcpd: execute_statement argv[2] = ADDRESS=192.168.7.104 May 12 15:25:03 ipfireAPU2 dhcpd: execute_statement argv[3] = NAME= May 12 15:25:03 ipfireAPU2 dhcpd: execute_statement argv[4] = localdomain May 12 15:25:03 ipfireAPU2 dhcp[18015]: Could not handle message: No value given May 12 15:25:03 ipfireAPU2 dhcpd: execute: /usr/sbin/unbound-dhcp-leases-client exit status 256 May 12 15:25:03 ipfireAPU2 dhcpd: DHCPREQUEST for 192.168.7.104 (192.168.7.1) from 02:d2:ef:b1:e2:5d via green0 May 12 15:25:03 ipfireAPU2 dhcpd: DHCPACK on 192.168.7.104 to 02:d2:ef:b1:e2:5d via green0 ``` maybe related to adding an extra argument?!?
(In reply to Nick Howitt from comment #139) > @Jon, the latest dhcpd.conf.local is only on your github repo. It did not > make it to this bug which is probably why @Michael does not know about it. I didn't even know there was a GitHub repository. It has not been linked on here. Do you not have one on our infrastructure? (In reply to Jon from comment #140) > do the domain names appear in the dhcpd.conf file? Yes, I had domain names configured. > > The protocol that the bridge is talking over the socket should be easy to > > extend without even touching the client. It just pipes the event into the > > socket and waits for a response :) > > curious - How does the unbound-dhcp-leases-bridge know when to grab > something from the client / socket? It just listens all the time. When a connection comes in, it reads exactly one message from it, processes it and then sends a response. The "processing" part is simply to check if the message is roughly valid. Properly handling happens later and the client will never know if it has been properly processed. But that is in line with dhpcd's fire and forget approach. (In reply to Jon from comment #141) > (In reply to Michael Tremer from comment #137) > > (In reply to Jon from comment #135) > > The domain-name is also empty in the > > leases file. Not sure if this requires some more changes elsewhere in the > > configuration? > > In the dhcpd.leases file, do you see the "on-expiry/on-release" at the > bottom of some leases? Or do you also have the "set VariAble" lines? Yes, my leases have this set. Weird. (In reply to Jon from comment #144) > maybe related to adding an extra argument?!? You seem to have two problems here. First, NAME is empty. That makes the bridge consider that this is an invalid message. Maybe it should not do that. Then you added the domain, but it is missing its key. It should be an argument of the format KEY=VALUE, or in this case DOMAIN=localdomain.
(In reply to Michael Tremer from comment #145) > I didn't even know there was a GitHub repository. It has not been linked > on here. Do you not have one on our infrastructure? The GitHub was private and planned to move it all once you had blessed the proof-of-concept. > (In reply to Jon from comment #144) > > maybe related to adding an extra argument?!? > > You seem to have two problems here. First, NAME is empty. That makes the > bridge consider that this is an invalid message. Maybe it should not do that. > > Then you added the domain, but it is missing its key. It should be an > argument of the format KEY=VALUE, or in this case DOMAIN=localdomain. Got it. I'll change my testing to KEY=VALUE. Is DOMAIN=VALUE in the bridge code? Or is it always set to "localdomain?
(In reply to Jon from comment #146) > (In reply to Michael Tremer from comment #145) > > I didn't even know there was a GitHub repository. It has not been linked > > on here. Do you not have one on our infrastructure? > > The GitHub was private and planned to move it all once you had blessed the > proof-of-concept. If you don't have a repository on it already, we also have a Git server :) > > (In reply to Jon from comment #144) > > > maybe related to adding an extra argument?!? > > > > You seem to have two problems here. First, NAME is empty. That makes the > > bridge consider that this is an invalid message. Maybe it should not do that. > > > > Then you added the domain, but it is missing its key. It should be an > > argument of the format KEY=VALUE, or in this case DOMAIN=localdomain. > > Got it. I'll change my testing to KEY=VALUE. > > Is DOMAIN=VALUE in the bridge code? Or is it always set to "localdomain? Currently the bridge doesn't know about a DOMAIN key. But you can still add more data to the message that the bridge receives and nothing will break. So that should help us to verify whether this works and we can add the part in the bridge after that.
got it! I did added/changed this in the dhcp.cgi for DOMAIN: ``` set ClientDomain = concat( "DOMAIN=", pick-first-value(config-option domain-name, "localdomain"); if (ClientName != "") { execute("/usr/sbin/unbound-dhcp-leases-client", "commit", ClientAddress, ClientName, ClientDomain); }
When an item is released, should it be deleted from unbound cache AND dhcp-leases.conf? I see items gone from the unbound cache (as they should), but they still exist in the /etc/unbound/dhcp-leases.conf file. I am using these commands to check: ``` unbound-control list_local_data | grep <HOSTNAME> cat /etc/unbound/dhcp-leases.conf | grep <HOSTNAME> ```
It is not working like that. The new programs do not maintain dhcp-leases.conf dynamically. It works more like I proposed, but a little differently. When dhcp starts, it launches the bridge routine. In its first pass it generates a new dhcp-leases.conf file then unbound is reloaded. There is then no further attempt to maintain the dhcp-leases.conf until dhcp is next started. Thinking about it, I am not sure it is particularly safe this way as, if unbound restarts, it will revert to the earlier dhcp-leases.conf which is no longer valid. Possibly my approach was safer, running a separate script on dhcp reload or start. Perhaps Michael can comment?
(In reply to Nick Howitt from comment #150) > It is not working like that. The new programs do not maintain > dhcp-leases.conf dynamically. It works more like I proposed, but a little > differently. When dhcp starts, it launches the bridge routine. In its first > pass it generates a new dhcp-leases.conf file then unbound is reloaded. > There is then no further attempt to maintain the dhcp-leases.conf until dhcp > is next started. This is correct. We don't maintain the leases file for Unbound any more. We only generate it once when the bridge starts from the DHCP Leases and then we continue to update Unbound using unbound-control. > Thinking about it, I am not sure it is particularly safe this way as, if > unbound restarts, it will revert to the earlier dhcp-leases.conf which is no > longer valid. Possibly my approach was safer, running a separate script on > dhcp reload or start. Well, this is indeed a problem with this approach (all of them have their downsides). The bridge has to be restarted after either Unbound or DHCP are being restarted, otherwise we risk the two things getting out of sync. A possible solution could be to rewrite the unbound configuration regularly, so that if there is an unexpected restart of Unbound, it won't really start very far in the past. (In reply to Jon from comment #149) > unbound-control list_local_data | grep <HOSTNAME> This should absolutely show you the current status.
Or just use my unbound-dhcp-leases script at start up and reload of unbound and remove that element from your script? Then we cover all bases, I think.
(In reply to Nick Howitt from comment #152) > Or just use my unbound-dhcp-leases script at start up and reload of unbound > and remove that element from your script? Then we cover all bases, I think. Yes, that should cover 99% of all cases. The rest can be fixed with a reboot as a worst-case option.
(In reply to Michael Tremer from comment #151) > We don't maintain the leases file for Unbound any more. I think it should _always_ be maintained only for IPFire reboots and unbound restarts. The proof-of-concept code updates the unbound cache AND updates the unbound dhcp-leases.conf file at every change. Yes, there are problems with two network interface clients (e.g., Ethernet and Wi-Fi on a single desktop computer), and will cause an out-of-sync. But I think this can be fixed. There is no reason to keep parsing the dhcp dhcpd.leases file. Not even on restart if the unbound dhcp-leases.conf is updated as the event handler / bridge finds changes.
Created attachment 1548 [details] overly simple flow This is a (massively) overly simplified view of the dhcp event script proof-of-concept.
I know it is the current state but Michael is concerned at the number of times we open and close the dhcp-leases.conf file. It could be thousands of times a day and will (does) give a high level of wear on SD card installations and possibly even SSD installations. (Michael's use case has 1000-2000 users who come in in the morning and go home at night. With a 1h lease time, that is a minimum of 2000-4000 writes to the file and more if the devices go off or sleep for any period during the day when the leases expires) At the same time there is a risk of the dhcp-leases.conf drifting out of true. Regenerating it on each unbound restart fixes and drift from true. We are continually maintaining a file which is only ever read when unbound (re-)starts so it seems a waste of effort to continually maintain it compared to regenerating it once per (re-start)
Created attachment 1549 [details] compare (In reply to Jon from comment #149) > I see items gone from the unbound cache (as they should), but they still > exist in the /etc/unbound/dhcp-leases.conf file. The above is for 33 devices after about 1 day of use of the new bridge. I wrote a "diff" compare script for: /etc/unbound/dhcp-leases.conf AND /etc/unbound/hosts.conf -vs- unbound-control list_local_data and I see these missing items in the attached image.
Also, to make items easier to compare, each item placed in the /etc/unbound/dhcp-leases.conf file should be an absolute name with a period at the end of the hostname and the arpa name. Example: ``` local-data: "MM2.localdomain. 60 IN A 192.168.60.20" local-data: "20.60.168.192.in-addr.arpa. 60 IN PTR MM2.localdomain." ``` this is how they are stored in unbound cache: ``` [root@ipfire ~] # unbound-control list_local_data | grep MM2 20.60.168.192.in-addr.arpa. 60 IN PTR MM2.localdomain. MM2.localdomain. 60 IN A 192.168.60.20 ```
Created attachment 1550 [details] better flow chart
just checking in - anything new for me to test?
see this post in Dev Mailing List: https://lists.ipfire.org/hyperkitty/list/development@lists.ipfire.org/message/SIMSXLGR7BBMIDXPUHNIGC2ZBR3R64YD/ "Concerning CU186, I may have stumbled across an issue. It took a few days for it to pop up. See the message log below from Jun 17 13:48:34 to the end. I experienced a flurry of Reloads and DNS stalled for ~ 15 minutes."
I was looking through the message log and found another Unbound Rebooting storm on Jun 18. It lasted about 13 minutes and I could not access anything (no DNS).
Since this project has fallen asleep once again, I went ahead and merged my proposed changes from my Git branch. They are at least better than what we had before and I don't want to waste the time and effort that has gone into this. Please test this and report any regressions here.
Sorry, but I became unable to contribute as I could not apply all your patches. I am afraid I struggle with the multiple commits especially when some of them update other commits. I really needed a repo I could pull the full update from. This should now exist. If I do any more work on the updates it will be to regenerate the dhcp-leases.conf on each unbound restart to get round the issue raised in https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150, and remove the updates to the unbound dhcp-leases.conf file at every change to speed it up and and reduce disk wear as the continual updates are pointless. As stated before, dhcp-leases.conf only ever needs to be valid when unbound starts or restarts and not at any other time. If this is not the direction IPFire wants to take then there is no point in me providing patches. Please advise.
(In reply to Nick Howitt from comment #164) > Sorry, but I became unable to contribute as I could not apply all your > patches. I am afraid I struggle with the multiple commits especially when > some of them update other commits. I really needed a repo I could pull the > full update from. This should now exist. In case of there being so many changes, it is easiest to just download the raw script from the Git web UI. In this case, there were only two or three files that were actually changed. > If I do any more work on the updates it will be to regenerate the > dhcp-leases.conf on each unbound restart to get round the issue raised in > https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150, and remove the > updates to the unbound dhcp-leases.conf file at every change to speed it up > and and reduce disk wear as the continual updates are pointless. As stated > before, dhcp-leases.conf only ever needs to be valid when unbound starts or > restarts and not at any other time. We should not continuously write a file that is never being read. If anything there could be some check if unbound is running (or has changed PID) and in case we detect that, we can simply call the reload function of the bridge which will rewrite the leases file and reload Unbound. That would put those things back in sync. > If this is not the direction IPFire wants to take then there is no point in > me providing patches. Please advise. I don't quite understand the question. We cannot just always stop when things get complicated and dump things on my table. Not saying that this has happened in this case, but I think we are not very far away from it. It takes me months to pick up on hundreds of these little projects. That does not mean that I don't want the changes, I just move my attention to something else if things are not going. Silence does not send a signal to me.
(In reply to Michael Tremer from comment #165) > > If I do any more work on the updates it will be to regenerate the > > dhcp-leases.conf on each unbound restart to get round the issue raised in > > https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150, and remove the > > updates to the unbound dhcp-leases.conf file at every change to speed it up > > and and reduce disk wear as the continual updates are pointless. As stated > > before, dhcp-leases.conf only ever needs to be valid when unbound starts or > > restarts and not at any other time. > > We should not continuously write a file that is never being read. If > anything there could be some check if unbound is running (or has changed > PID) and in case we detect that, we can simply call the reload function of > the bridge which will rewrite the leases file and reload Unbound. That would > put those things back in sync. So, if you reload unbound, "we can simply call the reload function of the bridge which will rewrite the leases file and reload Unbound. That would put those things back in sync". That sounds like an infinite loop. It also means creating a watcher on the process. Your update has specifically been designed to continuously write to the leases file. Then. because of the issue I pointed out in https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150, you seem to be coming up with evermore complex workarounds which are all unnecessary if we give up maintaining the leases file except when unbound starts or restarts. > > > If this is not the direction IPFire wants to take then there is no point in > > me providing patches. Please advise. > > I don't quite understand the question. We cannot just always stop when > things get complicated and dump things on my table. Not saying that this has > happened in this case, but I think we are not very far away from it. It > takes me months to pick up on hundreds of these little projects. That does > not mean that I don't want the changes, I just move my attention to > something else if things are not going. Silence does not send a signal to me. The silence came because of lack of feedback to the mailing list thread and because the direction I am happy to go in does not seem to meet with approval.
I think I must break my 'silence' also. The problem of many restarts of unbound has two implications: 1) a restart reduces the availability of the DNS service. 2) a restart or even reload resets internal timers Topic 1) can be handled by means of reduction of writes to config files. This needs an examination which DHCP information must be kept for restarts of unbound or IPFire, this includes power loss handling. Topic 2) influences unbound features which use automatic update mechanisms. The main task in the development pipe at the moment is jon's approach to include PiHole functionality using RPZ. The files used include a SOA record which defines an update period. Usage of these show, that unbound resets the update counters with each reload of the config files. Michael's new version tries to eliminate the reload by another way to signal DHCP information to unbound. dhcpd allows the use of events for leases changes to start a little program. This program triggers unbound-dhcp-leases-bridge to do unbound-control operations to add/remove local data definitions. This changes the internal information of unbound only, without touching the update timers. This means unbound-leases-bridge must update the settings read at a restart of unbound, also. I test Michael's modifications in parallell to Jon's RPZ addition. I can confirm that the number of reloads went to the necessary amount, and even RPZ files with an update interval of 1d are up-to-date. So I think we should pusue this approach ( Thanks, Michael, for your work! ).
(In reply to Bernhard Bitsch from comment #167) > I test Michael's modifications in parallell to Jon's RPZ addition. > I can confirm that the number of reloads went to the necessary amount, and > even RPZ files with an update interval of 1d are up-to-date. > So I think we should pusue this approach ( Thanks, Michael, for your work! ). Then, I am afraid you have not understood https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150 and are accepting unnecessary workarounds. My approach uses most of Michael's code and avoids the bridge restarting unbound when the bridge starts (so when dhcpd starts), so makes fractionally less restarts of unbound. It also avoids complicated workarounds for comment #150. It has the same impact on RPZ (or fractionally less as the unbound restart is removed from the bridge start up). Dhcpd can then be restarted as often as you want without impacting unbound and causing it to restart.
You may be wright. But I think to discuss this problem further we should clarify some things: 1. where is what information stored? When is this info used? 2. which version of Michael's bridge program are we speaking of? 3. which version of your program are we speaking of? My state about these three points is 1. don't know exactly, but I think Michael is experienced enough about the bridge. So I can trust, that he transfered the knowledge to his new version 2. I use the version downloaded shortly after his announcement 3. the main goal of your version is to reduce restarts/reloads to a minimum ( dhcpd restarts, new leases information ) I know, this may not be the newest status. But this thread has become very long and did not read all postings ( and the associated post in the dev list ).
An annotation more, 'unbound-control reload' does a restart of the internal state of unbound. I haven't yet verified by code inspection, but experience with both ways of announcing DHCP information shows this behaviour.
(In reply to Nick Howitt from comment #166) > (In reply to Michael Tremer from comment #165) > > > If I do any more work on the updates it will be to regenerate the > > > dhcp-leases.conf on each unbound restart to get round the issue raised in > > > https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150, and remove the > > > updates to the unbound dhcp-leases.conf file at every change to speed it up > > > and and reduce disk wear as the continual updates are pointless. As stated > > > before, dhcp-leases.conf only ever needs to be valid when unbound starts or > > > restarts and not at any other time. > > > > We should not continuously write a file that is never being read. If > > anything there could be some check if unbound is running (or has changed > > PID) and in case we detect that, we can simply call the reload function of > > the bridge which will rewrite the leases file and reload Unbound. That would > > put those things back in sync. > So, if you reload unbound, "we can simply call the reload function of > the bridge which will rewrite the leases file and reload Unbound. That would > put those things back in sync". That sounds like an infinite loop. > > It also means creating a watcher on the process. There won't be a loop, because this should only be triggered when Unbound is being manually restarted. It will always load the old leases and therefore be out of sync. Depending on when the bridge was started last, this might be very far off. > Your update has specifically been designed to continuously write to the > leases file. Then. because of the issue I pointed out in > https://bugzilla.ipfire.org/show_bug.cgi?id=13254#c150, you seem to be > coming up with evermore complex workarounds which are all unnecessary if we > give up maintaining the leases file except when unbound starts or restarts. Either we want this thing to be very accurate or not. The original motivation was that we have Unbound reload less. This seems to be the only viable way to me. (In reply to Bernhard Bitsch from comment #167) > Topic 2) influences unbound features which use automatic update mechanisms. > The main task in the development pipe at the moment is jon's approach to > include PiHole functionality using RPZ. The files used include a SOA record > which defines an update period. Usage of these show, that unbound resets the > update counters with each reload of the config files. Michael's new version > tries to eliminate the reload by another way to signal DHCP information to > unbound. dhcpd allows the use of events for leases changes to start a little > program. This program triggers unbound-dhcp-leases-bridge to do > unbound-control operations to add/remove local data definitions. This > changes the internal information of unbound only, without touching the > update timers. > This means unbound-leases-bridge must update the settings read at a restart > of unbound, also. I am not interested in anything regarding the RPZs right now. They have not been properly put on the agenda and looking at how much time we have on our hands, this won't make it on the agenda for years. I don't want to build blockers, but this ticket is about a different problem which I want to solve first. (In reply to Bernhard Bitsch from comment #169) > But I think to discuss this problem further we should clarify some things: > 2. which version of Michael's bridge program are we speaking of? Please let's not look at anything else than what is in next. The goal is to get that stable and ready for release.
Ok, do what you want, but please consider this. If IPF has been running a month with no restarts of dhcpd and unbound, if unbound then gets restarted on its own, then all the accumulated DCHP state information from the last month is lost to unbound as it reads back in the dhcp-leases.conf from the last dhcpd restart. As a result the internal DNS resolution could be invalid or just partially valid. I would struggle to find this acceptable, but it is the way the patching has gone.
(In reply to Nick Howitt from comment #172) > Ok, do what you want, but please consider this. If IPF has been running a > month with no restarts of dhcpd and unbound, if unbound then gets restarted > on its own, then all the accumulated DCHP state information from the last > month is lost to unbound as it reads back in the dhcp-leases.conf from the > last dhcpd restart. As a result the internal DNS resolution could be invalid > or just partially valid. > > I would struggle to find this acceptable, but it is the way the patching has > gone. No this is not acceptable indeed: > https://git.ipfire.org/?p=ipfire-2.x.git;a=commitdiff;h=8ead2ddf3d0f1521c2a99b509373615f77a754a0
I've been AFK for a while. Looking at the changes, I can see what has been done and it should work, adding a listener, but it seems to me to be over-complicated. Surely it would be easier if unbound looked after creating the leases file it needs itself during startup? It would avoid a listener and I don't think the code is any more complicated. It removes the leases file update from dhcpd and adds it to unbound which can run it only when it is needed, during startup and restart.
(In reply to Nick Howitt from comment #174) > I've been AFK for a while. Welcome back. > Looking at the changes, I can see what has been done and it should work, > adding a listener, but it seems to me to be over-complicated. It is not *that* complicated, but it is more code than I had anticipated. > Surely it would be easier if unbound looked after creating the leases file > it needs itself during startup? It would avoid a listener and I don't think > the code is any more complicated. It removes the leases file update from > dhcpd and adds it to unbound which can run it only when it is needed, during > startup and restart. As in: Let's start the script once again with a special option like "--one-shot" so that it will update the leases file, then terminate and Unbound is being launched with a fresh leases file? Yes, that would be acceptable to me. We would however have to restart the daemon once again to sync state. There is a small window where the leases could change between the one-off shot and the daemon being relaunched. If people restart Unbound without the script, well, then they are on their own. I had another thought in mind why we would need to watch the process, but I cannot think of it now. Maybe it was just that we will be out of sync again if we call the script more than once.
Ages ago I created a cut-down version of the original python bridge script - https://bugzilla.ipfire.org/attachment.cgi?id=1516 - whose sole purpose is to generate the leases file. Perhaps more can be cut out of it as I am not a python expert. The intent was for it to be packaged with unbound and called from the init script before a start or restart. Then all lease file maintenance can be removed from your new script and the listener can be removed. It also makes unbound self-contained, updating its "own" leases file when it needs to. Is there anything wrong with this approach?
(In reply to Nick Howitt from comment #176) > It also makes unbound self-contained, updating its "own" leases file when it > needs to. > > Is there anything wrong with this approach? How would any continuous updates work? Generating the leases file at the start is a good idea, but what happens after that? We cannot change the DHCP execute statements to call unbound-control as those will take a very long time to run and will block the DHCP server. I think all in all, we are not very close to a very aesthetic solution. This all seems to be very complicated, confusing and over-engineered. However, when ever we want to make things simpler, we will create some disadvantages. Unless we have introduced any new problems, I believe the version that is currently in next has the smallest number of disadvantages (it mainly is complicated), but it does not lack any features like getting Unbound out of sync or stalling the DHCP server.
Continuous updates should still happen from your script (or even Jon's), but remove the leases file updating from it. Package your script with the dhcp package. The leases file should be updated from my script (which anyone is welcome to tweak) and the unbound init.d file should be modified to run it on start and reload. My script should be packaged with unbound. Like this it does not lack any features like getting Unbound out of sync or stalling the DHCP server
So the patches from @Michael's unbound-dhcp-leases-bridge repo branch have been merged into CU188 and that has had its Testing phase and has now been released. Can some feedback data be provided on how this has dealt with, or not, the original issue of 'lots of unbound restarts'.
So far so good for a short test period of almost 4 days. 6 total Unbound Restarts. Restarts: 3 related to CU 188 upgrade (before 1st reboot) 1 related to the 1st reboot (part of 1st reboot) 1 related to "Reading fix leases from /var/ipfire/dhcp/fixleases" 1 related to me testing RPZ --- Summary: 1 total for the testing period. But I don't think this one really counts since it was so close to the 1st reboot. Logs: Sep 19 11:15:00 ipfire syslogd 1.5.1: restart (remote reception). . . . Sep 19 11:15:23 ipfire unbound: [2125:0] info: start of service (unbound 1.21.0). . . . Sep 19 11:15:45 ipfire unbound: [2125:0] info: start of service (unbound 1.21.0). . . . Sep 19 11:16:40 ipfire dhcp[3876]: Reading static hosts from /var/ipfire/main/hosts Sep 19 11:16:40 ipfire dhcp[3876]: Reading DHCP leases from /var/state/dhcp/dhcpd.leases Sep 19 11:16:40 ipfire dhcp[3876]: Reading fix leases from /var/ipfire/dhcp/fixleases Sep 19 11:16:40 ipfire unbound: [2125:0] info: service stopped (unbound 1.21.0). . . . Sep 19 11:17:01 ipfire unbound: [2125:0] info: start of service (unbound 1.21.0). --- So far so good!! I'll continue to keep an eye on it.
Forgot one line: and no unbound restarts, related to dhcp, in the past the past 3-4 days! Yay!
Given this positive feedback, I am taking the liberty of closing this bug as FIXED. If it is not, please feel free to reopen it. Thank you.
This is not done as there is not the best process in place with what happens when unbound is restarted manually. There are also some smaller outstanding problems with static hosts.