Bug 12488 - Update to core 149 gives 'invalid opcode' traps on ALIX (i586)
Status: CLOSED FIXED
Alias: None
Product: IPFire
Classification: Unclassified
Component: ---
Version: 2
Hardware: other Unspecified
Importance: Will only affect a few users, Crash
Assignee: Arne.F
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-09-18 21:16 UTC by Bernhard Bitsch
Modified: 2020-10-13 19:27 UTC

See Also:


Attachments
Documents and data on Intel NOPL instruction (5.22 MB, application/x-zip-compressed)
2020-09-30 14:02 UTC, bitbanger

Description Bernhard Bitsch 2020-09-18 21:16:34 UTC
After updating to core 149, there are many 'invalid opcode' traps. The system doesn't boot up.
Same behaviour for a clean core 149 install.

Reinstalled core 148 from the image file.
System runs.
After initial setup, I installed hostapd as the first addon (all systems in the household are connected by WLAN).
Same errors:
Sep 18 19:31:41 BitschCop kernel: traps: hostapd[8705] trap invalid opcode ip:8052300 sp:bfd07210 error:0 in hostapd[804e000+c2000]
Sep 18 19:31:44 BitschCop kernel: traps: hostapd[8761] trap invalid opcode ip:8052300 sp:bfcd1050 error:0 in hostapd[804e000+c2000]
Sep 18 19:34:11 BitschCop kernel: traps: wlanapctrl[9379] trap invalid opcode ip:8049420 sp:bfdd60b0 error:0 in wlanapctrl[8049000+1000]
Sep 18 19:37:56 BitschCop kernel: traps: hostapd[10326] trap invalid opcode ip:8052300 sp:bfe36280 error:0 in hostapd[804e000+c2000]
Sep 18 19:39:19 BitschCop kernel: traps: wlanapctrl[10656] trap invalid opcode ip:8049420 sp:bfefdc80 error:0 in wlanapctrl[8049000+1000]
Sep 18 19:40:11 BitschCop kernel: traps: wlanapctrl[10937] trap invalid opcode ip:8049420 sp:bfbccee0 error:0 in wlanapctrl[8049000+1000]
Sep 18 19:43:22 BitschCop kernel: traps: hostapd[1434] trap invalid opcode ip:8052300 sp:bfd1f490 error:0 in hostapd[804e000+c2000]
Sep 18 19:43:25 BitschCop kernel: traps: hostapd[1453] trap invalid opcode ip:8052300 sp:bfff0dc0 error:0 in hostapd[804e000+c2000]
Sep 18 19:53:02 BitschCop kernel: traps: wlanapctrl[4468] trap invalid opcode ip:8049420 sp:bf9bfcd0 error:0 in wlanapctrl[8049000+1000]
Sep 18 20:29:12 BitschCop kernel: traps: wlanapctrl[11299] trap invalid opcode ip:8049420 sp:bfee4ea0 error:0 in wlanapctrl[8049000+1000]
Sep 18 20:34:46 BitschCop kernel: traps: wlanapctrl[12454] trap invalid opcode ip:8049420 sp:bfc84c90 error:0 in wlanapctrl[8049000+1000]
Sep 18 20:50:59 BitschCop kernel: traps: joe[15626] trap invalid opcode ip:804c610 sp:bf9406f0 error:0 in joe[804a000+6b000]
Sep 18 20:51:40 BitschCop kernel: traps: joe[15758] trap invalid opcode ip:804c610 sp:bfe19b80 error:0 in joe[804a000+6b000]
Sep 18 21:34:32 BitschCop kernel: traps: hostapd[24261] trap invalid opcode ip:8052300 sp:bff4e660 error:0 in hostapd[804e000+c2000]
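
For anyone triaging similar logs, here is a minimal sketch (not part of the original report) that parses these `traps:` lines and computes how far the faulting instruction pointer lies from the start of the mapping shown in brackets. The function name `trap_offset` and the regular expression are illustrative assumptions based only on the line format visible above.

```python
import re

# Pattern for kernel "traps:" lines in the form shown in the log above.
TRAP_RE = re.compile(
    r"traps: (?P<prog>[^\[\s]+)\[(?P<pid>\d+)\] trap invalid opcode "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:\d+ "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def trap_offset(line):
    """Return (program, ip - mapping base) for a trap line, or None if no match."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    return m.group("prog"), int(m.group("ip"), 16) - int(m.group("base"), 16)


# Three lines copied from the log above, one per affected program.
SAMPLE = [
    "Sep 18 19:31:41 BitschCop kernel: traps: hostapd[8705] trap invalid opcode "
    "ip:8052300 sp:bfd07210 error:0 in hostapd[804e000+c2000]",
    "Sep 18 19:34:11 BitschCop kernel: traps: wlanapctrl[9379] trap invalid opcode "
    "ip:8049420 sp:bfdd60b0 error:0 in wlanapctrl[8049000+1000]",
    "Sep 18 20:50:59 BitschCop kernel: traps: joe[15626] trap invalid opcode "
    "ip:804c610 sp:bf9406f0 error:0 in joe[804a000+6b000]",
]

for line in SAMPLE:
    prog, off = trap_offset(line)
    print(f"{prog}: faulting ip is {off:#x} bytes into the mapping")
```

Each program traps at the same `ip` on every run (hostapd always at 8052300, for example), which points to a fixed instruction in the binary rather than a random crash.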

As you can see, I also tried installing joe.

I suppose the addons are compiled with the same settings as core 149, so none of them work on a system that has problems with core 149.

It should be possible to go back to the last core update with a fresh install of a core image and matching addons.

Installing hostapd (same version) from the core 141 repository did the job.
Comment 1 Michael Tremer 2020-09-21 09:20:21 UTC
Arne and I have been looking at this over the weekend.

It does not look like this is caused by any of the changes that we recently made to the CFLAGS. Arne verified that with various builds on an AMD Geode platform.

So we currently think that the problem lies with the update to GCC 10, which generates code that uses the NOPL instruction which is not supported by AMD Geode.
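
As a rough illustration of how one might look for such instructions, the sketch below scans raw machine-code bytes for the two-byte opcode `0F 1F` used by the multi-byte NOP family. This is a naive byte search, not a disassembler: it ignores instruction boundaries, so it can report false positives where the same bytes occur inside operands or data. The helper name and the hand-assembled sample bytes are assumptions made for the example, not taken from the IPFire build.

```python
# Opcode bytes of the multi-byte NOP family (NOPL/NOPW): 0F 1F /0.
LONG_NOP_OPCODE = b"\x0f\x1f"


def find_long_nop_candidates(code):
    """Return every offset at which the byte pair 0F 1F occurs in `code`.

    Hits are only candidates: without decoding instructions we cannot tell
    an opcode from identical bytes inside an immediate or a data table.
    """
    hits = []
    pos = code.find(LONG_NOP_OPCODE)
    while pos != -1:
        hits.append(pos)
        pos = code.find(LONG_NOP_OPCODE, pos + 1)
    return hits


# Hand-assembled 32-bit sample:
#   55            push %ebp
#   89 e5         mov  %esp,%ebp
#   0f 1f 40 00   nopl 0x0(%eax)
#   5d            pop  %ebp
#   c3            ret
SAMPLE = bytes.fromhex("55" "89e5" "0f1f4000" "5d" "c3")

print(find_long_nop_candidates(SAMPLE))
```

For a real binary, a disassembly listing (for example from `objdump -d`) is the more reliable source, since it respects instruction boundaries.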

We have pulled the update and we believe that the installation ISO image will not work on any platforms that are not fully i686-compatible, such as AMD Geode and some VIA C* processors. However, those are only a very small share of the systems used by IPFire users.
Comment 2 Bernhard Bitsch 2020-09-21 10:12:13 UTC
Thanks Arne and Michael for your quick reaction!

I know, there are not many systems like mine anymore. But I suppose, many of these systems are 'inherited' and maintained by people without very deep knowledge of Linux and the IPFire infrastructure.
Thus it would be desirable to know how one can manage the situation:
- how can one safely go back to core 148?
- are the addons compatible with i586?

I can do limited tests. I have an old CF card which can run a test system, and there are times when I'm the only user in my household.
Comment 3 Michael Tremer 2020-09-22 08:12:51 UTC
(In reply to Bernhard Bitsch from comment #2)
> I know, there are not many systems like mine anymore. But I suppose, many of
> these systems are 'inherited' and maintained by people without very deep
> knowledge of Linux and the IPFire infrastructure.

Indeed. And we have had a conversation about removing support for i586 for exactly that reason. As of today, only 1.2% of systems on fireinfo are running on i586.

Our time is very valuable and investing it in fixing some issues that are only being experienced by a very small number of users is a tough decision.

We have some ideas on how to find out what is generating those invalid instructions. This will probably take a couple of days, because I will have to run multiple builds to get to the bottom of this.

That however means that we might not be able to fix it, or that the fix comes at a high cost. We will make that decision when we get there.

> Thus it would be desirable to know how one can manage the situation:
> - how can one safely go back to core 148?

You will have to re-install the system from scratch from an installation image.

> - are the addons compatible with i586?

Arne rebuilt them and uploaded them so that they should be working now.

> I can do limited tests. I have an old CF card which can run a test system,
> and there are times when I'm the only user in my household.

I might need your help; if so, I will contact you here to test an image.
Comment 4 Michael Tremer 2020-09-24 10:36:11 UTC
Arne has traced the issue to a new "hardening" feature in glibc developed by Intel. It is designed for modern processors, but it was never checked for compatibility with i586, and so the build generated binary code that cannot be executed on those processors.

This again shows that we can no longer support outdated architectures. We have wasted the best part of this week on a bug that affected very few people, who run hardware that is more than 25 years old. We can no longer justify spending our time on this if we want to bring IPFire forward.

The core update will now be updated and can then be installed on i586 systems, too. We will not change the installation images.

https://git.ipfire.org/?p=ipfire-2.x.git;a=commitdiff;h=cf58f6593148f55b2016e2a18af75925271ef1b7
Comment 5 Bernhard Bitsch 2020-09-24 11:08:46 UTC
Many thanks to you both, Michael and Arne, for your big effort!

You are right, it is time to change from old i586 boards to 'modern' x64 machines.
But on the other hand, there are many tiny 'nice' boards in the wild which seem to be usable for little firewall appliances.
If I got the issue right, the compiler didn't produce code compatible with the chosen architecture. This is another issue. Can we really trust such a tool?
I know there is no easy alternative for compiling Linux. But gcc is designed to produce code for many processors/architectures; it is the 'standard' in the non-MS world. How can this issue arise? Neither i586 nor the long NOPs are brand-new, so one should expect no problems with those features.
Just my opinion.

Personally, I'll try to replace my very old Alix as soon as possible.
The system will remain as a test system for small IPFire installations.
I'll check the update to core 149 and the addons today.
Comment 6 Michael Tremer 2020-09-24 15:33:19 UTC
(In reply to Bernhard Bitsch from comment #5)
> You are right, it is time to change from old i586 boards to 'modern' x64
> machines.
> But on the other hand, there are many tiny 'nice' boards in the wild which
> seem to be usable for little firewall appliances.

No, there is plenty of hardware out there. Most of it is simply shit. Don't buy it just because it is "small" or "pretty".

> If I got the issue right, the compiler didn't produce code compatible with
> the chosen architecture. This is another issue. Can we really trust such
> a tool?
> I know there is no easy alternative for compiling Linux. But gcc is designed
> to produce code for many processors/architectures; it is the 'standard' in
> the non-MS world. How can this issue arise? Neither i586 nor the long NOPs
> are brand-new, so one should expect no problems with those features.
> Just my opinion.

This is not the place for discussions. But simply: Nobody builds for anything lower than i686 with SSE any more. We are on our own. The hardware has not been sold for 25 years now. I think it is fair that nobody tests on that any more.

> Personally, I'll try to replace my very old Alix as soon as possible.
> The system will remain as a test system for small IPFire installations.
> I'll check the update to core 149 and the addons today.

I can only recommend this: https://store.lightningwirelabs.com/products/IPFIRE-MINI-EU-R1
Comment 7 bitbanger 2020-09-29 22:09:50 UTC
> Indeed. And we have had a conversation about removing support for i586 for
> exactly that reason. As of today, only 1.2% of systems on fireinfo are
> running on i586.

If NOPL is in fact the problem then this affects i686 as well, so now we are talking about 25.43 + 1.18 = 26.61% of the systems in fireinfo. My nice HP t5745 with an Atom N280, 2GB RAM and I340-T2 NIC is one of those systems. The N280 was introduced 11.5 years ago. Output from lscpu supplied upon request.
Comment 8 bitbanger 2020-09-30 00:49:32 UTC
(In reply to bitbanger from comment #7)
> > Indeed. And we have had a conversation about removing support for i586 for
> > exactly that reason. As of today, only 1.2% of systems on fireinfo are
> > running on i586.
> 
> If NOPL is in fact the problem then this affects i686 as well, so now we are
> talking about 25.43 + 1.18 = 26.61% of the systems in fireinfo. My nice HP
> t5745 with an Atom N280, 2GB RAM and I340-T2 NIC is one of those systems.
> The N280 was introduced 11.5 years ago. Output from lscpu supplied upon
> request.

I dug deeper into this. NOPL is an undocumented i686 instruction. Most i686 CPUs implement NOPL. Atom i686 CPUs do not implement NOPL. Therefore this affects less than 26.61% of the systems in fireinfo. I don't know what the actual percentage is.

Good discussion here:

https://bugzilla.redhat.com/show_bug.cgi?id=579838
Comment 9 Michael Tremer 2020-09-30 09:51:27 UTC
(In reply to bitbanger from comment #7)
> > Indeed. And we have had a conversation about removing support for i586 for
> > exactly that reason. As of today, only 1.2% of systems on fireinfo are
> > running on i586.
> 
> If NOPL is in fact the problem then this affects i686 as well, so now we are
> talking about 25.43 + 1.18 = 26.61% of the systems in fireinfo. My nice HP
> t5745 with an Atom N280, 2GB RAM and I340-T2 NIC is one of those systems.
> The N280 was introduced 11.5 years ago. Output from lscpu supplied upon
> request.

No, *all* i686 processors implement NOPL. In fact the AMD/NSC Geode processors implement all i686 instructions apart from NOPL and therefore it is not a full i686 processor.

(In reply to bitbanger from comment #8)
> I dug deeper into this. NOPL is an undocumented i686 instruction.

Where do you even get this from? It is very well documented.

> Atom i686 CPUs do not implement NOPL.

Which ones? They all are i686 or x86_64 processors.

> Therefore this affects less than 26.61% of the systems in fireinfo.

This problem only affected the percentage of systems showing as i586. That is around one percent right now.
Comment 10 bitbanger 2020-09-30 14:02:58 UTC
Created attachment 780
Documents and data on Intel NOPL instruction
Comment 11 bitbanger 2020-09-30 14:04:52 UTC
(In reply to Michael Tremer from comment #9)
> (In reply to bitbanger from comment #7)
> > > Indeed. And we have had a conversation about removing support for i586 for
> > > exactly that reason. As of today, only 1.2% of systems on fireinfo are
> > > running on i586.
> > 
> > If NOPL is in fact the problem then this affects i686 as well, so now we are
> > talking about 25.43 + 1.18 = 26.61% of the systems in fireinfo. My nice HP
> > t5745 with an Atom N280, 2GB RAM and I340-T2 NIC is one of those systems.
> > The N280 was introduced 11.5 years ago. Output from lscpu supplied upon
> > request.
> 
> No, *all* i686 processors implement NOPL. In fact the AMD/NSC Geode
> processors implement all i686 instructions apart from NOPL and therefore it
> is not a full i686 processor.

You are telling someone who runs an i686 CPU (Atom N280) that my CPU has NOPL. It does not. See attached file Atom_N280_lscpu.txt. Compare to file Atom_D2500_lscpu.txt (x86_64).
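
To compare flag lists like the two attached lscpu outputs mechanically, something along these lines could be used. It is only a sketch: it assumes the relevant flag is spelled `nopl`, as in Linux's `/proc/cpuinfo`, and the two sample strings are shortened, made-up flag lists rather than the contents of the attachment. Note that the flag reflects what the running kernel reports, which is not necessarily the result of probing the hardware.

```python
def cpu_flags(cpuinfo_text):
    """Return the flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def reports_nopl(cpuinfo_text):
    """True if the kernel-reported flag list contains 'nopl' (assumed flag name)."""
    return "nopl" in cpu_flags(cpuinfo_text)


# Illustrative, shortened flag lists; NOT taken from the attached files.
SAMPLE_WITH = "processor\t: 0\nflags\t\t: fpu vme de pse tsc msr cmov nopl\n"
SAMPLE_WITHOUT = "processor\t: 0\nflags\t\t: fpu vme de pse tsc msr cmov\n"

print(reports_nopl(SAMPLE_WITH), reports_nopl(SAMPLE_WITHOUT))
```

On a live system the same functions can be fed the contents of `/proc/cpuinfo`, read with `open("/proc/cpuinfo").read()`.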

> (In reply to bitbanger from comment #8)
> > I dug deeper into this. NOPL is an undocumented i686 instruction.
> 
> Where do you even get this from? It is very well documented.

I get this from the linked post in Comment 8, which you obviously did not look at. NOPL was first introduced in the Pentium Pro, the first i686 CPU, released on November 1, 1995 (see attached Wikipedia page). In Intel's 1997 Instruction Set Reference (24319101.pdf, attached), which I got from the link you did not look at in Comment 8, NOPL is not documented. In 2011 Intel still hadn't documented it (253667.pdf, attached).

> > Atom i686 CPUs do not implement NOPL.
> 
> Which ones? They all are i686 or x86_64 processors.
 
See attached file Atom_N280_lscpu.txt. AFAIK the same is true of all Atom i686 CPUs.

> > Therefore this affects less than 26.61% of the systems in fireinfo.
> 
> This problem only affected the percentage of systems showing as i586. That
> is around one percent right now.

As the attached files prove, you are mistaken: you need to count Atom i686 CPUs.

If you wish to dispute anything I have said above please do your homework and include supporting documents.
Comment 12 Peter Müller 2020-10-13 19:27:46 UTC
If I got the discussion here right, this issue has been fixed in the meantime.

In case it is not, please reopen. Thank you.