Bug 12725 - Some IPsec N2N connections become unstable after upgrading to Core Update 161 (testing)
Summary: Some IPsec N2N connections become unstable after upgrading to Core Update 161...
Status: CLOSED CANTFIX
Alias: None
Product: IPFire
Classification: Unclassified
Component: ---
Version: 2
Hardware: Unspecified
Severity: Minor Usability (will only affect a few users)
Assignee: Peter Müller
QA Contact: Michael Tremer
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-11-13 12:26 UTC by Peter Müller
Modified: 2021-12-18 15:47 UTC
CC List: 1 user

See Also:


Attachments
Output of "ipsec statusall" on IPFire (8.70 KB, text/plain)
2021-11-30 20:44 UTC, Peter Müller
Output of "ipsec statusall" on the remote system (2.19 KB, text/plain)
2021-11-30 20:45 UTC, Peter Müller
Relevant log excerpt on IPFire while reauthentication/reestablishing affected IPsec connection (3.44 KB, text/plain)
2021-11-30 20:45 UTC, Peter Müller
attachment-1111325-0.html (605 bytes, text/html)
2021-12-15 19:44 UTC, Michael Tremer

Description Peter Müller 2021-11-13 12:26:06 UTC
I am currently investigating this, suspecting a regression in C161.

Further information will be provided shortly.
Comment 1 Peter Müller 2021-11-14 11:19:50 UTC
Preliminary findings:

- Only observed on N2N connections where IPFire is set to "waiting" mode.
- Restarting flapping connections on the remote side (also strongSwan) solves the issue until IPFire reboots (see the sketch after this list).
- Corresponding strongSwan/charon log message: "schedule delete of duplicate IKE_SA for peer 'C=x, O=x, CN=x' due to uniqueness policy and suspected reauthentication"
- Flapping connections are generally usable, but are reauthenticated roughly every 10 seconds, which makes them flaky.
- The same remote machines also maintain IPsec connections to an IPFire machine running Core Update 159. Those connections are not affected.
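
For reference, a rough sketch of how to spot the churn and apply the temporary workaround described above - this assumes charon logs to /var/log/messages on IPFire, and "n2n-example" is a placeholder for the actual connection name on the remote side:

# on IPFire: count how often the uniqueness policy tears down the IKE_SA
grep -c "due to uniqueness policy and suspected reauthentication" /var/log/messages

# on the remote strongSwan: restart the flapping connection
ipsec down n2n-example
ipsec up n2n-example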

Remote:

$ ipsec --version
FreeBSD strongSwan U5.9.4/K13.0-STABLE-HBSD
University of Applied Sciences Rapperswil, Switzerland
See 'ipsec --copyright' for copyright information.

IPFire:

# ipsec --version
Linux strongSwan U5.9.4/K5.10.76-ipfire
University of Applied Sciences Rapperswil, Switzerland
See 'ipsec --copyright' for copyright information.

@Michael: Do you have an idea what causes this? It does not seem to be solely related to strongSwan.
Comment 2 Michael Tremer 2021-11-19 11:24:36 UTC
A forgotten change in the nightly build cleanup script removed the latest testing release from master, so I cannot install the latest version.

Arne has confirmed to me that his IPsec tunnels are stable as usual. Do you have any log files you can submit?
Comment 3 Michael Tremer 2021-11-23 09:50:46 UTC
I have now upgraded to the latest release and I cannot confirm this problem. My tunnels are as stable as they were before. Please post logs.
Comment 4 Peter Müller 2021-11-27 18:05:47 UTC
Happens on a second machine upgraded to Core Update 161 as well.

Will send logs next week.
Comment 5 Michael Tremer 2021-11-28 12:02:04 UTC
(In reply to Peter Müller from comment #4)
> Will send logs next week.

I have to state that I am very disappointed by how long it is taking to actually get to the bottom of this.

Why is it so complicated to get logs for a simple problem in over two weeks?

This problem has now gone into production; it is going to cause us a lot of pain with users and will put more stress on everyone.

In the meantime, the next Core Update is about to close and we do not have a fix. We will have introduced a problem into a release and will need months to actually fix it. I find this very unnecessary and disappointing.
Comment 6 Peter Müller 2021-11-28 12:27:41 UTC
In my defense, I am currently at a customer's site (IR, hence busy around the clock), with a lot of other things on my plate and no SSH access to my infrastructure.

Sorry to disappoint - I wish this had gone more smoothly as well.
Comment 7 Peter Müller 2021-11-30 20:44:22 UTC
Enclosed are the outputs of "ipsec statusall" on both IPFire and the remote machine of an affected IPsec connection, as well as the relevant log messages from IPFire - the remote log shows nothing during this time apart from the deletion and recreation of an IKE_SA and CHILD_SA.

Let me know if you need further information. Again, sorry for my tardy reply.
Comment 8 Peter Müller 2021-11-30 20:44:36 UTC
Created attachment 955 [details]
Output of "ipsec statusall" on IPFire
Comment 9 Peter Müller 2021-11-30 20:45:03 UTC
Created attachment 956 [details]
Output of "ipsec statusall" on the remote system
Comment 10 Peter Müller 2021-11-30 20:45:35 UTC
Created attachment 957 [details]
Relevant log excerpt on IPFire while reauthentication/reestablishing affected IPsec connection
Comment 11 Peter Müller 2021-11-30 20:47:26 UTC
At this time, Fireinfo reports 17.75% of all installations running on Core Update 161. Since I am unaware of any similar complaints, I am lowering the priority for this one.
Comment 12 Peter Müller 2021-12-15 09:28:32 UTC
*sigh*

Well, it was uniqueids=yes - apparently, this is now the default value in strongSwan. Setting it to "no" in /etc/ipsec.user.conf solved the problem on all affected machines and connections.

While uniqueids=yes generally makes sense, it causes IPsec connections to become unstable if participants are behind flaky lines, and they never recover from this state. I therefore believe it makes sense to set this to "no" globally.
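
For illustration, a minimal sketch of such an override in /etc/ipsec.user.conf (the exact way IPFire includes this file and the surrounding section layout may differ):

config setup
        uniqueids=no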

If you agree, I'll hand in a patch.

Sorry for bothering you in the first place. I _did_ set this to "no", but the configuration got lost while restoring a backup, so I never realised this before.
Comment 13 Michael Tremer 2021-12-15 11:13:24 UTC
I cannot find when this default value has changed. Did you find any note in a changelog?
Comment 14 Peter Müller 2021-12-15 17:30:52 UTC
I didn't find anything regarding "uniqueids" in the changelog either.

In https://github.com/strongswan/strongswan/releases/tag/5.9.4, "Several corner cases with reauthentication have been fixed (48fbe1d, 36161fe, 0d373e2)." sounds suspiciously related, but I am not familiar enough with strongSwan to tell if any of these commits causes the behaviour observed.

Anyway, setting "uniqueids=no" on both ends (IPFire and the remote servers) makes the connections stable again, but after reauthentication there might be up to four IPsec SAs for the same connection, which is unnecessary. Setting "uniqueids" back to "yes" on the remote systems, while keeping it set to "no" on IPFire (via /etc/ipsec.user.conf), solves this problem.

Right now, I am unsure whether we should interfere with the default. After all, it used to work perfectly fine for years until C161; on the other hand, "uniqueids=yes" makes sense in theory as well, and I have little insight into more complex or advanced IPsec setups built with IPFire.

What do you think?
Comment 15 Michael Tremer 2021-12-15 19:44:41 UTC
Created attachment 967 [details]
attachment-1111325-0.html

I don’t find anything here either.

However, "no" should be the default setting because "yes" breaks too much stuff.

We discovered that in IPFire 3. In general it would be a very good idea to have this set to "yes", but that only works in a very ideal world.
Comment 16 Peter Müller 2021-12-16 09:19:31 UTC
On two machines with six IPsec connections configured, I currently observe 8 and 10 tunnels actually in operation. All of them are stable, so this is merely a nuisance, but before I hand in a patch, I would like to give "uniqueids={replace,keep}" a try, just to see how things behave then.
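
As a rough way of comparing configured connections against what is actually up - assuming the standard strongSwan "ipsec status" output format - something like this could be used:

# rough count of IKE_SAs in state ESTABLISHED
ipsec status | grep -c ESTABLISHED

# rough count of installed CHILD_SAs (the actual tunnels)
ipsec status | grep -c INSTALLED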

"replace" though, does not seem to have any effect ("The daemon also accepts the value replace which is identical to yes" - O RLY?), but given the sketchy documentation at https://wiki.strongswan.org/projects/strongswan/wiki/ConfigSetupSectionc, it might make some difference anyways.

See also, by the way, https://wiki.strongswan.org/issues/463.
Comment 17 Michael Tremer 2021-12-16 11:03:51 UTC
Since uniqueids has been set to "yes" since 5.0.0 (and we manually configured it to that before), we should not change this.

I therefore believe this is a strongSwan regression introduced in the last release. Would you please open a ticket upstream and link it here?
Comment 18 Peter Müller 2021-12-18 15:47:37 UTC
See https://github.com/strongswan/strongswan/issues/823 for the upstream strongSwan bug report.

Closing this as CANTFIX, since we cannot fix anything here and will have to wait for this to be fixed upstream.