| Summary: | Stale Conntrack Entries for SIP Devices After Tunnel Drops | | |
| --- | --- | --- | --- |
| Product: | IPFire | Reporter: | Tom Rymes <tomvend> |
| Component: | --- | Assignee: | Michael Tremer <michael.tremer> |
| Status: | CLOSED FIXED | QA Contact: | |
| Severity: | Minor Usability | | |
| Priority: | - Unknown - | CC: | daniel.weismueller, michael.tremer, morlix |
| Version: | 2 | | |
| Hardware: | x86_64 | | |
| OS: | Other | | |
| Bug Depends on: | 10844 | | |
| Bug Blocks: | 10665 | | |
Description: Tom Rymes, 2015-07-30 15:11:25 UTC
Tom, I have been thinking about whether this is a good idea or not. I guess that the solution is probably not a good one, because this must be a different problem: the addresses remain the same. It is not that there is a public IP address that changes. Hence I do not think that we should flush the connection tracking table when a VPN goes up or down; in the end there would be no point in having SIP connection tracking any more.

However, I agree that SIP registration is still an issue with IPFire. This is not really an issue with IPFire, but rather a problem in the Linux kernel. For that reason I would like to contact the author of the SIP conntrack module and see if we can work something out together.

I am currently travelling and cannot really simulate a test scenario. Would you be able to collect some debugging data? I would be very much interested to see on which site of the VPN the problem is actually happening. Does site A send out the packet with the registration, and is it dropped on the other site? Does site A never send the packet through the VPN? Is the REGISTER message received and the reply just dropped? Or does the conntrack module change the packet in a way that the SIP server will just discard it? Are you able to collect some dumps with tcpdump?

---

Michael,

The addresses are not changing, like they were for the earlier fix, but perhaps this issue is due to the routing entries in table 200 not being present when the tunnel is down and these conntrack entries are created? Thus, this traffic may be pointed at the default gateway instead of the proper tunnel?

Regardless, the same situation is happening, where the creation of a connection-tracking entry while the tunnel is down results in the blocking of all traffic between those two devices until the admin manually intervenes. I can say that I have seen this problem happen in both directions. My working understanding is that if only one side attempts to transmit data on port 5060 while the tunnel is down, then only that side will have the issue, and deleting the offending entry on that side resolves the problem and the phone almost instantly registers. This is almost always the remote side with the phone, but sometimes it will be the local side with the PBX, and other times we must delete entries from both sides.

From what I can tell, though, when it occurs, the traffic originating from the side with the problem is simply routed into the bit bucket and never reaches the far side. The easiest way to tell the problem exists is to check the conntrack output for [UNREPLIED] instead of [ASSURED].

I will do my best to get you more details to help track down the issue, but I may need a little hand-holding on the specifics of what you would like.

---

I meant to say table 220 in that last comment, not 200...

---

(In reply to Tom Rymes from comment #2)

> Michael,
>
> The addresses are not changing, like they were for the earlier fix, but
> perhaps this issue is due to the routing entries in table 200 not being
> present when the tunnel is down and these conntrack entries are created?
> Thus, this traffic may be pointed at the default gateway instead of the
> proper tunnel?

The Linux kernel does not use routes for that. It uses the XFRM framework, a rather complicated approach for everyone who does not deal with the internals of it.
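The two mechanisms under discussion can be inspected side by side on the firewall itself. A quick sketch, assuming a stock strongSwan setup where charon installs its routes into table 220:

```bash
# XFRM policies decide which packets get encapsulated into the tunnel;
# these exist independently of the ordinary routing tables:
ip xfrm policy show

# strongSwan additionally installs routes into table 220; this is what
# the "table 220" remarks above refer to. The table may be empty while
# a tunnel is down, but the XFRM policies still govern matching traffic:
ip route show table 220
```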
> Regardless, the same situation is happening, where the creation of a
> connection-tracking entry while the tunnel is down results in the blocking
> of all traffic between those two devices until the admin manually intervenes.

I misunderstood your description then. The issue only happens when the tunnel is down? That changes things... We could flush everything when the tunnel comes up again, but we cannot know which connections were meant to use the tunnel and not the default gateway. The more I think about it, the more I hate the SIP connection tracking. With the dynamic setups that we have, it really doesn't always do any good.

---

Yes, the problem only presents itself when the tunnels drop; in other words, whenever we have a connection interruption on either end, and almost every time we upgrade from one version of IPFire to another.

---

Just another data point: I just rebooted our central office server (the one with the PBX behind it) and experienced the problem. All remote extensions were unreachable until I deleted the connection tracking entries on the freshly rebooted machine. Once that was done, all of the phones came right up and started working.

---

I am not really sure what a solution would be. We cannot simply drop *all* connections that use port 5060. That does not make any sense, I guess.

---

Could you please try the following changes?

http://git.ipfire.org/?p=people/ms/ipfire-2.x.git;a=shortlog;h=refs/heads/iptables-conntrack

Please copy src/initscripts/init.d/firewall from that branch to /etc/init.d/firewall on your system(s), add "CONNTRACK_SIP=off" to /var/ipfire/optionsfw/settings and execute "/etc/init.d/firewall restart". That will disable the SIP connection tracking, and I guess the issue will be gone.

---

Michael,

I will give that a shot, thank you. Another possible solution might be to modify the StrongSwan updown script to include a conntrack command that would only affect the subnet(s) served by the tunnel. I presume a separate script would need to be created and defined per tunnel.

Tom

---

(In reply to Tom Rymes from comment #9)

> I will give that a shot, thank you. Another possible solution might be to
> modify the StrongSwan updown script to include a conntrack command that
> would only affect the subnet(s) served by the tunnel. I presume a separate
> script would need to be created and defined per tunnel.

That would be the obvious solution, but it won't work. The connections are in the table, but not with the correct destination. Hence we cannot identify them correctly.

---

Michael,

My apologies for not getting back to you yet. Time has not allowed me to test your fix, but I had another thought about the IPSec updown script: it seems to me that we do already know the correct IP address, at least in the IPSec setting (it is specified in the tunnel config). In other words, delete any connections that are bound for the other end of the tunnel.

Here is a transcript of a console session where I log into the phone, show the registration status, list the relevant connection tracking entry, then delete it. Lastly, I log back onto the phone to show that it is now registered (look for the "1-BU" entry).

```
[root@ipfire ~]# telnet 10.100.1.14
Trying 10.100.1.14...
Connected to 10.100.1.14.
Escape character is '^]'.
Password :***
Cisco Systems, Inc.
Copyright 2000-2005
Cisco IP phone
MAC: 0017:9486:b201
Loadid: SW: P0S3-8-12-00 ARM: PAS3ARM1 Boot: PC030301 DSP: 4.0(5.0)[A0]

7960> > sho reg
LINE REGISTRATION TABLE
Proxy Registration: ENABLED, state: REGISTERED
line APR state         timer      expires    proxy:port
---- --- ------------- ---------- ---------- ----------------------------
1    111 REGISTERED    3595       2936       10.100.1.2:5060
2    111 REGISTERED    3595       2936       10.100.0.53:5060
3    ... NONE          0          0          undefined:0
4    ... NONE          0          0          undefined:0
5    ... NONE          0          0          undefined:0
6    ... NONE          0          0          undefined:0
1-BU .1x REGISTERING   3600       18         192.168.0.3:5060
Note: APR is Authenticated, Provisioned, Registered
7960> > exit
Connection closed by foreign host.

[root@ipfire ~]# conntrack -L -p udp --orig-port-src 5060 --dst=192.168.0.3
udp 17 26 src=10.100.1.14 dst=192.168.0.3 sport=5060 dport=5060 packets=41752 bytes=23422839 [UNREPLIED] src=192.168.0.3 dst=70.90.103.89 sport=5060 dport=5060 packets=0 bytes=0 mark=0 use=1
conntrack v1.4.2 (conntrack-tools): 1 flow entries have been shown.

[root@ipfire ~]# conntrack -D -p udp --orig-port-src 5060 --dst=192.168.0.3
udp 17 27 src=10.100.1.14 dst=192.168.0.3 sport=5060 dport=5060 packets=41758 bytes=23426205 [UNREPLIED] src=192.168.0.3 dst=70.90.103.89 sport=5060 dport=5060 packets=0 bytes=0 mark=0 use=1
conntrack v1.4.2 (conntrack-tools): 1 flow entries have been deleted.

[root@ipfire ~]# telnet 10.100.1.14
Trying 10.100.1.14...
Connected to 10.100.1.14.
Escape character is '^]'.
Password :***
Cisco Systems, Inc.
Copyright 2000-2005
Cisco IP phone
MAC: 0017:9486:b201
Loadid: SW: P0S3-8-12-00 ARM: PAS3ARM1 Boot: PC030301 DSP: 4.0(5.0)[A0]

7960> > sho reg
LINE REGISTRATION TABLE
Proxy Registration: ENABLED, state: REGISTERED
line APR state         timer      expires    proxy:port
---- --- ------------- ---------- ---------- ----------------------------
1    111 REGISTERED    3595       2905       10.100.1.2:5060
2    111 REGISTERED    3595       2905       10.100.0.53:5060
3    ... NONE          0          0          undefined:0
4    ... NONE          0          0          undefined:0
5    ... NONE          0          0          undefined:0
6    ... NONE          0          0          undefined:0
1-BU 111 REGISTERED    3595       3590       192.168.0.3:5060
Note: APR is Authenticated, Provisioned, Registered
```

---

Is that after you applied the changes?

---

No, that was before I applied the patch. I was actually replying to your earlier comment saying that using the IPSec updown script would have made sense, but you didn't know what IP address to use. Technically, we do know what address to use, because the subnet(s) are defined in the IPSec configuration, so dropping the entries for that subnet would make sense. Unfortunately, the conntrack command does not support dropping connections to an entire subnet, so it might do us no good to have that information.

I'm also confused by your comment that "The connections are in the table but not with the correct destination. Hence we cannot identify them correctly." It appears to me (in my last post) that the destination is correct (192.168.0.3), just not the route to get there. Is this what you meant, or is something different happening here?

I'm guessing that what we're running into here is the fact that StrongSwan uses table 220 policy routing, and connection tracking does not do well with that (hence the issues with the tftp and ftp modules I have run into). I will carve out some time to test the change you posted earlier, sorry for not doing that already, but I also wonder if adding routes to the general routing table would somehow help connection tracking work with StrongSwan?
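To make the updown idea above concrete, here is a minimal sketch of a subnet-wide flush, assuming strongSwan's standard updown environment (PLUTO_PEER_CLIENT carries the remote subnet) and a /24 remote network. The loop is a hypothetical workaround for conntrack's lack of subnet filtering, and, as Michael points out above, it may fail to match entries whose destination was rewritten:

```bash
#!/bin/bash
# Hypothetical updown hook: flush SIP conntrack entries pointing at the
# remote subnet once the tunnel is (re-)established.
# PLUTO_PEER_CLIENT is set by strongSwan, e.g. "192.168.0.0/24".
REMOTE_NET="${PLUTO_PEER_CLIENT:-192.168.0.0/24}"

# Derive a dotted prefix for a simple /24 match (sketch only; a real
# implementation would need proper CIDR handling).
PREFIX="${REMOTE_NET%.*}."         # "192.168.0.0/24" -> "192.168.0."

# Walk all UDP flows with source port 5060, extract the original
# destination address, and delete entries inside the remote subnet.
# conntrack prints its summary line to stderr, hence 2>/dev/null.
conntrack -L -p udp --orig-port-src 5060 2>/dev/null | \
while read -r line; do
    dst="$(echo "${line}" | \
        awk '{ for (i = 1; i <= NF; i++)
                   if ($i ~ /^dst=/) { sub(/^dst=/, "", $i); print $i; exit } }')"
    [ -n "${dst}" ] || continue
    case "${dst}" in
        "${PREFIX}"*)
            conntrack -D -p udp --orig-port-src 5060 --dst "${dst}"
            ;;
    esac
done
```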
---

Michael,

I don't know if I did something wrong, but I just made the changes you suggested and rebooted the server (I needed to add another NIC, which I have not configured yet). Anyhow, I had to drop all of the SIP conntrack entries (43 of them) to get my remote extensions back up and working. Perhaps that was only a one-time problem? Some output that may help:

```
[root@ipfire ~]# grep -A 6 SIP /etc/init.d/firewall
        # SIP
        if [ "${CONNTRACK_SIP}" = "on" ]; then
                iptables -A CONNTRACK -m conntrack --ctstate RELATED \
                        -m helper --helper sip -j ACCEPT
                for proto in udp tcp; do
                        iptables -t raw -A CONNTRACK -p "${proto}" --dport 5060 -j CT --helper sip
                done
        fi

[root@ipfire ~]# cat /var/ipfire/optionsfw/settings
DROPNEWNOTSYN=on
DROPINPUT=on
DROPFORWARD=on
FWPOLICY=DROP
FWPOLICY1=DROP
FWPOLICY2=DROP
DROPPORTSCAN=on
DROPOUTGOING=on
DROPSAMBA=off
DROPPROXY=off
SHOWREMARK=on
SHOWCOLORS=on
SHOWTABLES=off
SHOWDROPDOWN=off
DROPWIRELESSINPUT=on
DROPWIRELESSFORWARD=on
CONNTRACK_SIP=off
```

---

Michael,

Any updates on this? I am really hitting my max when it comes to dealing with issues surrounding IPSec and connection tracking. Between TFTP, FTP, and SIP, it has become overwhelming to deal with. Given that all of the issues seem to revolve around connection tracking and its interaction with IPSec, I'm thinking that there must be some common link here that has been overlooked.

Tom

---

```
iptables -A CUSTOMFORWARD -d 10.9.0.0/16 -m policy --dir out --pol none -j REJECT
```

Michael: The iptables rule you proposed (see above) to prevent traffic destined for subnets on the far side of IPSec tunnels from being sent out the Red interface when the tunnel is down does seem to have resolved our issues. I have applied it to all subnets in our central office, and I plan to apply similar rules to all of the remote sites as well. I will continue testing and see what comes up, but this does seem to be a solution.

---

(In reply to Tom Rymes from comment #17)

> I will continue testing and see what comes up, but this does seem to be a
> solution.

Thanks for the feedback and for the nice phone call again. I will put this in a nice script, give it some sugar coating and integrate it into the firewall engine. Will send you all of this for testing soon.

---

http://lists.ipfire.org/pipermail/development/2015-October/001043.html

Please test. I won't merge this into next before I have really good feedback.

---

My apologies, but I am going to need more specific instructions for how to apply the proposed patch. I started to work it out on my own, but I cannot be confident that I am doing it correctly.

---

I am not certain whether this error is due to the fix implemented for this bug, but if one tries to ping one of the remote IPSec networks when that network is down, the result is a massive number of ICMP "Destination Net Unreachable" errors.
The output of the command looks like this:

```
From 10.100.0.1 icmp_seq=1 Destination Net Unreachable

[snip 40,000+ entries]

From 10.100.0.1 icmp_seq=1 Destination Net Unreachable
From 10.100.0.1 icmp_seq=1 Destination Net Unreachable
From 10.100.0.1 icmp_seq=1 Destination Net Unreachable
^CFrom 10.100.0.1 icmp_seq=1 Destination Net Unreachable

--- 10.253.0.1 ping statistics ---
0 packets transmitted, 0 received, +43909 errors
```

This all occurs in a matter of a second or two, certainly no more than five.

Tom

---

(In reply to Tom Rymes from comment #21)

> I am not certain whether this error is due to the fix implemented for this bug,
> but if one tries to ping one of the remote IPSec networks when that network
> is down, the result is a massive number of ICMP "Destination Net
> Unreachable" errors.
> [...]
> This all occurs in a matter of a second or two, certainly no more than five.

We could potentially limit the firewall to only respond to a certain number of packets per second and then just drop the rest. But actually, this is the desired outcome.

---

Michael,

I have no particular opinion here, other than to say it sure is jarring as a user. Thanks for your help getting this resolved.

Tom

---

(In reply to Tom Rymes from comment #23)

> I have no particular opinion here, other than to say it sure is jarring as a
> user.

You will only get one rejection per packet you send. That is actually quite nice, because the user gets an immediate notification that the connection is not up instead of a timeout.

---

One last thought on this one: the solution in place works well, but I thought of another possible solution, and I will leave it to others to determine whether it works and, if so, whether it is preferable for any reason. Instead of a firewall rule, add a static route for all IPSec subnets that points to the green IP. This route will have no effect when the tunnels are up, because IPSec policy routing rules seem to trump the static routes configured via the WUI.
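For illustration, the static-route variant might look like this from a shell (the WUI would be the normal place to add it). The 10.9.0.0/16 aggregate is reused from the REJECT rule above, and green0 is assumed to be the GREEN interface name; a device-scoped route is used here as a stand-in for "points to the green IP":

```bash
# Hypothetical static fallback route: keep the IPsec aggregate on GREEN
# instead of letting it follow the default gateway out RED. While a
# tunnel is up, the kernel's XFRM policies grab matching packets first,
# so this route only matters while the tunnel is down.
ip route add 10.9.0.0/16 dev green0
```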
Just a thought, but I figured I would throw it out there on the off-chance that someone might like the idea.
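As a closing illustration of the script Michael mentions integrating into the firewall engine, here is a minimal sketch that regenerates one REJECT rule per remote IPsec subnet. The subnet list file and its one-CIDR-per-line format are assumptions for the sketch, not IPFire's actual configuration format:

```bash
#!/bin/bash
# Hypothetical generator for the fix adopted in this bug: one REJECT
# rule per remote IPsec subnet, matching only traffic that would leave
# without an IPsec policy attached (i.e. while the tunnel is down).
SUBNET_LIST="/var/ipfire/vpn/subnets.txt"   # assumed: one CIDR per line

iptables -F CUSTOMFORWARD
while read -r subnet; do
    [ -n "${subnet}" ] || continue
    iptables -A CUSTOMFORWARD -d "${subnet}" \
        -m policy --dir out --pol none -j REJECT
done < "${SUBNET_LIST}"
```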