Bug 12862 - Upgrade from Core 166 to 167 does not use RAID anymore
Summary: Upgrade from Core 166 to 167 does not use RAID anymore
Status: CLOSED FIXED
Alias: None
Product: IPFire
Classification: Unclassified
Component: ---
Version: 2
Hardware: x86_64 Linux
Importance: - Unknown - Major Usability
Assignee: Michael Tremer
QA Contact:
URL:
Keywords: Blocker
Depends on:
Blocks:
 
Reported: 2022-05-13 10:32 UTC by Dirk Sihling
Modified: 2022-06-13 14:25 UTC
CC List: 2 users

See Also:


Attachments
Three pictures showing the media information (202.59 KB, application/x-zip-compressed)
2022-05-13 10:32 UTC, Dirk Sihling
Bootlog from CU166 vm (28.18 KB, text/plain)
2022-05-14 16:17 UTC, Adolf Belka
Bootlog from CU167 vm (28.38 KB, text/plain)
2022-05-14 16:18 UTC, Adolf Belka
Bootlog from CU167 vm with rd.debug on kernel command line (29.18 KB, text/plain)
2022-05-14 16:18 UTC, Adolf Belka
Bootlog from CU167 with rd.debug (55.32 KB, text/plain)
2022-05-16 12:47 UTC, Dirk Sihling
Complete boot log from serial console with md.rd=1 on command line (111.64 KB, text/plain)
2022-05-16 13:33 UTC, Dirk Sihling
Bootlog from CU167 vm with rd.md=1 & rd.debug on kernel command line (29.19 KB, text/plain)
2022-05-16 13:34 UTC, Adolf Belka

Description Dirk Sihling 2022-05-13 10:32:11 UTC
Created attachment 1043 [details]
Three pictures showing the media information

Core 166 was installed on a RAID1 built on two disks (sda and sdb). After the upgrade to 167 the installation was on /dev/sda only, /dev/sdb is not used anymore and the RAID is broken.
Attached are media information pictures showing the change after the upgrade on May 11th 6 pm.
Comment 1 Dirk Sihling 2022-05-13 12:49:27 UTC
I tried to install the OS image created from the running Core 167 system on another system with two identical disks used as RAID1. The installation stalled when trying to install the boot loader. After interrupting and rebooting the system I found the installation on /dev/sda only, again.
Comment 2 Michael Tremer 2022-05-13 12:53:17 UTC
Looks like Dracut cannot initialise the RAID properly.

However what do you see on the second and third terminal when the bootloader cannot be installed?
Comment 3 Dirk Sihling 2022-05-13 13:54:36 UTC
Thanks for your quick response.

I tried it again, and this time the installation was completed (installing the boot loader took about 3 to 5 minutes, maybe I was not patient enough the first time).
The 2nd terminal showed:
Installing GRUB on /dev/sda...
Installing GRUB on /dev/sdb...

The 3rd terminal showed:
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

After installation completed and before the following reboot I could see
the RAID device /dev/md0 being resynced

md0: active raid1 sdb[1] sda[0]

Then I rebooted from disk, and the RAID is not used anymore, again:

[root@ipfirebkp ~]# cat /proc/mdstat
Personalities :
md127 : inactive sdb[1](S)
      488386568 blocks super 1.0

unused devices: <none>

I noticed the name of the device changed from md0 to md127.

[root@ipfirebkp ~]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0 465.8G  0 disk
├─sda1    8:1    0     4M  0 part
├─sda2    8:2    0   128M  0 part /boot
├─sda3    8:3    0    32M  0 part /boot/efi
├─sda4    8:4    0     1G  0 part [SWAP]
└─sda5    8:5    0 464.6G  0 part /
sdb       8:16   0 465.8G  0 disk
└─md127   9:127  0     0B  0 md

I will be happy to test anything you want me to on this backup system.
Comment 4 Adolf Belka 2022-05-13 16:36:58 UTC
I also see the same effect, but I had missed it on my CU167 VirtualBox vm testbed machine until I read this bug entry.



cat /proc/mdstat 
Personalities : 
md127 : inactive sdb[1](S)
      52428724 blocks super 1.0
       
unused devices: <none>


lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0    50G  0 disk 
├─sda1    8:1    0     4M  0 part 
├─sda2    8:2    0   128M  0 part /boot
├─sda3    8:3    0    32M  0 part /boot/efi
├─sda4    8:4    0 245.4M  0 part [SWAP]
└─sda5    8:5    0  49.6G  0 part /
sdb       8:16   0    50G  0 disk 
└─md127   9:127  0     0B  0 md   
sr0      11:0    1  1024M  0 rom
Comment 5 Michael Tremer 2022-05-13 20:06:41 UTC
(In reply to Dirk Sihling from comment #3)
> I tried it again, and this time the installation was completed (installing
> the boot loader took about 3 to 5 minutes, maybe I was not patient enough
> the first time).
> The 2nd terminal showed:
> Installing GRUB on /dev/sda...
> Installing GRUB on /dev/sdb...

This should not take that long. It should normally be done within 5 seconds.

Is there any chance that the storage device is incredibly slow?
> 
> After installation completed and before the following reboot I could see
> the RAID device /dev/md0 being resynced
> 
> md0: active raid1 sdb[1] sda[0]
> 
> Then I rebooted from disk, and the RAID is not used anymore, again:
> 
> [root@ipfirebkp ~]# cat /proc/mdstat
> Personalities :
> md127 : inactive sdb[1](S)
>       488386568 blocks super 1.0
> 
> unused devices: <none>
> 
> I noticed the name of the device changed from md0 to md127.

Hmm, this is extra weird. I would have expected that the kernel did not find any RAID device at all. That it finds it, and assembles the RAID is unusual. There shouldn't be any reason why that happens.

> [root@ipfirebkp ~]# lsblk
> NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
> sda       8:0    0 465.8G  0 disk
> ├─sda1    8:1    0     4M  0 part
> ├─sda2    8:2    0   128M  0 part /boot
> ├─sda3    8:3    0    32M  0 part /boot/efi
> ├─sda4    8:4    0     1G  0 part [SWAP]
> └─sda5    8:5    0 464.6G  0 part /
> sdb       8:16   0 465.8G  0 disk
> └─md127   9:127  0     0B  0 md
> 
> I will be happy to test anything you want me to on this backup system.

I will have a look at some source and get back to you.
Comment 6 Michael Tremer 2022-05-13 20:28:03 UTC
I tried to recreate this setup with Virtualbox, but sadly that won't allow me to capture much of the boot log.

Could you please add "rd.debug" to the kernel command line and post the entire boot log if you are able to capture it somehow?
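
For readers following along: a one-off kernel parameter such as rd.debug is usually added from the GRUB menu at boot. A minimal sketch (the exact linux line differs per installation; the values below are placeholders):

# At the GRUB menu, press 'e' on the boot entry, find the line starting with 'linux'
# and append the parameter, for example:
#   linux /vmlinuz-<version> root=UUID=<uuid> ro ... rd.debug
# then press Ctrl-x to boot once with that setting. The messages can afterwards be
# collected from /var/log/bootlog or captured on a serial console.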
Comment 7 Adolf Belka 2022-05-14 16:17:09 UTC
I have got three log files from my virtualbox vm

One is the bootlog from when the vm was on CU166 and the raid array was built, and there are several lines related to md127.

Another is the standard bootlog from after the vm was upgraded to CU167 and the raid array is not built. No mention of md127 at all.

The third is the bootlog from booting with rd.debug added to the kernel command line.
Comment 8 Adolf Belka 2022-05-14 16:17:52 UTC
Created attachment 1048 [details]
Bootlog from CU166 vm
Comment 9 Adolf Belka 2022-05-14 16:18:13 UTC
Created attachment 1049 [details]
Bootlog from CU167 vm
Comment 10 Adolf Belka 2022-05-14 16:18:51 UTC
Created attachment 1050 [details]
Bootlog from CU167 vm with rd.debug on kernel command line
Comment 11 Adolf Belka 2022-05-16 10:54:34 UTC
On my CU167 system the directory /dev/disk/by-uuid shows the following

drwxr-xr-x 2 root root 120 May 16 10:35 .
drwxr-xr-x 7 root root 140 May 16 10:35 ..
lrwxrwxrwx 1 root root  10 May 16 10:35 05b60374-d025-4f0b-a003-b16f49204e34 -> ../../sda5
lrwxrwxrwx 1 root root  10 May 16 10:35 2d65d8f0-0b11-4818-b03f-d7e9536e10a6 -> ../../sda2
lrwxrwxrwx 1 root root  10 May 16 10:35 8AAB-E7E8 -> ../../sda3
lrwxrwxrwx 1 root root  10 May 16 10:35 cdb826fa-6921-4208-ac8a-51f35bb9386e -> ../../sda4


mdadm --detail --scan
INACTIVE-ARRAY /dev/md127 metadata=1.0 name=ipfire:0 UUID=9f728045:9fea7c5b:c2582abf:290e7f13

mdadm --misc --detail /dev/md127 
/dev/md127:
           Version : 1.0
        Raid Level : raid1
     Total Devices : 1
       Persistence : Superblock is persistent

             State : inactive
   Working Devices : 1

              Name : ipfire:0
              UUID : 9f728045:9fea7c5b:c2582abf:290e7f13
            Events : 349

    Number   Major   Minor   RaidDevice

       -       8       16        -        /dev/sdb



On the CU167 system there is only this entry under dev related to an md device

brw-rw---- 1 root disk 9, 127 May 16 10:35 /dev/md127


On a system with working raid the following is present

drwxr-xr-x  2 root root         160 May 16 12:50 md
brw-rw----  1 root disk      9, 127 May 16 12:50 md127
brw-rw----  1 root disk    259,   0 May 16 12:50 md127p1
brw-rw----  1 root disk    259,   1 May 16 12:50 md127p2
brw-rw----  1 root disk    259,   2 May 16 12:50 md127p3
brw-rw----  1 root disk    259,   3 May 16 12:50 md127p4
brw-rw----  1 root disk    259,   4 May 16 12:50 md127p5


and the contents of the md directory are

drwxr-xr-x  2 root root  160 May 16 12:50 .
drwxr-xr-x 16 root root 3.5K May 16 12:50 ..
lrwxrwxrwx  1 root root    8 May 16 12:50 ipfire:0 -> ../md127
lrwxrwxrwx  1 root root   10 May 16 12:50 ipfire:0p1 -> ../md127p1
lrwxrwxrwx  1 root root   10 May 16 12:50 ipfire:0p2 -> ../md127p2
lrwxrwxrwx  1 root root   10 May 16 12:50 ipfire:0p3 -> ../md127p3
lrwxrwxrwx  1 root root   10 May 16 12:50 ipfire:0p4 -> ../md127p4
lrwxrwxrwx  1 root root   10 May 16 12:50 ipfire:0p5 -> ../md127p5


I am going to build a new CU166 vm and see if I can find what is happening when the upgrade to CU167 is occurring.
Comment 12 Dirk Sihling 2022-05-16 12:47:15 UTC
Created attachment 1052 [details]
Bootlog from CU167 with rd.debug

File /var/log/bootlog from my CU167 system with rd.debug on the kernel command line
Comment 13 Dirk Sihling 2022-05-16 12:51:16 UTC
(In reply to Michael Tremer from comment #6)
> I tried to recreate this setup with Virtualbox, but sadly that won't allow
> me to capture much of the boot log.
> 
> Could you please add "rd.debug" to the kernel command line and post the
> entire boot log if you are able to capture it somehow?

I added my bootlog from today with rd.debug on the command line.

Something strange I noticed: on one boot the installation on /dev/sdb was used:

[root@ipfirebkp log]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0 465.8G  0 disk
└─md127   9:127  0     0B  0 md
sdb       8:16   0 465.8G  0 disk
├─sdb1    8:17   0     4M  0 part
├─sdb2    8:18   0   128M  0 part /boot
├─sdb3    8:19   0    32M  0 part /boot/efi
├─sdb4    8:20   0     1G  0 part [SWAP]
└─sdb5    8:21   0 464.6G  0 part /

[root@ipfirebkp log]# blkid /dev/sda
/dev/sda: UUID="05b8873e-60a5-a6e2-8ac5-fcbabb83a5b7" UUID_SUB="9248c5ee-e150-fde0-305d-39c5fa4efa86" LABEL="ipfire:0" TYPE="linux_raid_member"
[root@ipfirebkp log]# blkid /dev/sdb
/dev/sdb: UUID="05b8873e-60a5-a6e2-8ac5-fcbabb83a5b7" UUID_SUB="88c9ba11-84ec-3990-1d64-2bb747d70d9a" LABEL="ipfire:0" TYPE="linux_raid_member"
[root@ipfirebkp log]# blkid /dev/md127

And the next time I booted, the installation on /dev/sda was used, which I then had to set up as well:

[root@ipfirebkp log]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0 465.8G  0 disk
├─sda1    8:1    0     4M  0 part
├─sda2    8:2    0   128M  0 part /boot
├─sda3    8:3    0    32M  0 part /boot/efi
├─sda4    8:4    0     1G  0 part [SWAP]
└─sda5    8:5    0 464.6G  0 part /
sdb       8:16   0 465.8G  0 disk
└─md127   9:127  0     0B  0 md

[root@ipfirebkp log]# blkid /dev/sda
/dev/sda: UUID="05b8873e-60a5-a6e2-8ac5-fcbabb83a5b7" UUID_SUB="9248c5ee-e150-fde0-305d-39c5fa4efa86" LABEL="ipfire:0" TYPE="linux_raid_member"
[root@ipfirebkp log]# blkid /dev/sdb
/dev/sdb: UUID="05b8873e-60a5-a6e2-8ac5-fcbabb83a5b7" UUID_SUB="88c9ba11-84ec-3990-1d64-2bb747d70d9a" LABEL="ipfire:0" TYPE="linux_raid_member"
Comment 14 Dirk Sihling 2022-05-16 12:52:46 UTC
> 
> This should not take that long. It should normally be done within 5 seconds.
> 
> Is there any chance that the storage device is incredibly slow?

No, I don't think so; when resyncing the RAID the write speed was 170M/s.
Comment 15 Michael Tremer 2022-05-16 13:08:37 UTC
Thanks for providing the logs. There is this line in it:

> [    0.732387] dracut: rd.md=0: removing MD RAID activation

That line only appears in Core Update 167 and it should not be there.

> https://git.ipfire.org/?p=thirdparty/dracut.git;a=blob;f=modules.d/90mdraid/parse-md.sh;hb=631d5f72a223288aa1f48bb8e8d0313e75947400#l11

This is where this happens in dracut. Could you please add rd.md=1 to the kernel command line and boot up with that? I suppose that should fix it and assemble the RAID correctly at boot time. Now we only need to find out why dracut thinks that it should not do this automatically.
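
For context, the check in dracut's 90mdraid/parse-md.sh (linked above) looks roughly like the following. This is a paraphrase from memory of dracut-056, not a verbatim copy, so treat the exact conditions as an assumption:

# Paraphrased sketch of modules.d/90mdraid/parse-md.sh, not verbatim upstream code
MD_UUID=$(getargs rd.md.uuid -d rd_MD_UUID=)

# RAID assembly is skipped unless a specific UUID was given or rd.auto is set,
# or if rd.md has been explicitly disabled.
if { [ -z "$MD_UUID" ] && ! getargbool 0 rd.auto; } || ! getargbool 1 rd.md -d -n rd_NO_MD; then
    info "rd.md=0: removing MD RAID activation"
    udevproperty rd_NO_MD=1
fi

If that paraphrase is accurate, it would also explain why setting rd.md=1 on its own changes nothing: the first half of the condition still triggers unless rd.md.uuid= or rd.auto is given, which matches the results reported in the following comments.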
Comment 17 Adolf Belka 2022-05-16 13:20:23 UTC
Added rd.md=1 to the kernel line but it didn't make any difference.

cat /proc/mdstat 
Personalities : 
md127 : inactive sda[0](S)
      52428724 blocks super 1.0
       
unused devices: <none>
Comment 18 Michael Tremer 2022-05-16 13:21:18 UTC
Could you send the entire log again? Did the line disappear?

I suppose that dracut is parsing the configuration incorrectly. So this was just an attempt to convince it otherwise.
Comment 19 Dirk Sihling 2022-05-16 13:27:02 UTC
I tried that, too, and I did not get the RAID back either.
But maybe both disks are not "similar" enough anymore after I booted and set up both systems on /dev/sda and /dev/sdb.
I'll try a fresh install and add the kernel command line before the reboot at the end of install.
Comment 20 Michael Tremer 2022-05-16 13:28:16 UTC
The RAID seems to be absolutely fine and healthy. Dracut just thinks it has been configured to not turn on any RAIDs.

That is my hypothesis at the moment.
Comment 21 Dirk Sihling 2022-05-16 13:30:06 UTC
(In reply to Michael Tremer from comment #18)
> Could you send the entire log again? Did the line disappear?
> 
> I suppose that dracut is parsing the configuration incorrectly. So this was
> just an attempt to convince differently.

I still have the line with rd.md=0:

dracut: //usr/lib/initrd-release@5(): VERSION='2.27 dracut-056'
udevd[339]: starting eudev-3.2.6
dracut: //usr/lib/initrd-release@6(): PRETTY_NAME='IPFire 2.27 (x86_64) - core167 dracut-056 (Initramfs)'
dracut: rd.md=0: removing MD RAID activation
Comment 22 Dirk Sihling 2022-05-16 13:33:25 UTC
Created attachment 1053 [details]
Complete boot log from serial console with md.rd=1 on command line

I added the complete boot log grabbed from the serial console with rd.debug and rd.md=1 on the kernel command line.
Comment 23 Adolf Belka 2022-05-16 13:34:37 UTC
Created attachment 1054 [details]
Bootlog from CU167 vm with rd.md=1 & rd.debug on kernel command line

Like Dirk, I also still have the same line.
Attached is the bootlog with rd.md=1 and rd.debug both set
Comment 24 Adolf Belka 2022-05-16 13:36:04 UTC
I will hold off on any further actions, as Dirk is working on it, so that we don't end up with too many duplicate entries.
Comment 25 Dirk Sihling 2022-05-16 13:40:32 UTC
(In reply to Adolf Belka from comment #24)
> I will leave any further actions as Dirk is working on it so that we don't
> end up with too many duplicate entries

Thanks for your help so far. :-)
Comment 26 Michael Tremer 2022-05-16 14:50:18 UTC
> https://lists.ipfire.org/pipermail/development/2022-May/013494.html

I believe that this is not a bug in dracut. It is just some unexpected behaviour on my part.

Could you please add "rd.auto" to /etc/default/grub as I did in the patch, then run "grub-mkconfig -o /boot/grub/grub.cfg" and reboot.

That should assemble the RAID correctly at boot time. It also fixed things on my system, but the result was the data of the device that wasn't mounted before (i.e. the older state).
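
A sketch of the steps just described, assuming the kernel command line lives in the usual GRUB_CMDLINE_LINUX variable of /etc/default/grub (the variable name is the stock GRUB convention and an assumption here; Michael's patch on the mailing list is the authoritative change):

# In /etc/default/grub, append rd.auto to the existing kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX="... rd.auto"   (the "..." stands for whatever is already there)
# then regenerate the bootloader configuration and reboot:
grub-mkconfig -o /boot/grub/grub.cfg
reboot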
Comment 27 Dirk Sihling 2022-05-16 15:34:22 UTC
Looks good, now the RAID is assembled again:

[root@ipfirebkp ~]# cat /proc/mdstat
Personalities : [raid1]
md127 : active raid1 sdb[1] sda[0]
      488386432 blocks super 1.0 [2/2] [UU]
      [===>.................]  resync = 19.0% (92943744/488386432) finish=51.7min speed=127278K/sec
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>
[root@ipfirebkp ~]# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda           8:0    0 465.8G  0 disk
└─md127       9:127  0 465.8G  0 raid1
  ├─md127p1 259:0    0     4M  0 part
  ├─md127p2 259:1    0   128M  0 part  /boot
  ├─md127p3 259:2    0    32M  0 part  /boot/efi
  ├─md127p4 259:3    0     1G  0 part  [SWAP]
  └─md127p5 259:4    0 464.6G  0 part  /
sdb           8:16   0 465.8G  0 disk
└─md127       9:127  0 465.8G  0 raid1
  ├─md127p1 259:0    0     4M  0 part
  ├─md127p2 259:1    0   128M  0 part  /boot
  ├─md127p3 259:2    0    32M  0 part  /boot/efi
  ├─md127p4 259:3    0     1G  0 part  [SWAP]
  └─md127p5 259:4    0 464.6G  0 part  /

Thank you and Adolf very much for your help, now I can plan for getting my production system back on a RAID, too.
Comment 28 Michael Tremer 2022-05-16 15:36:54 UTC
Thank you for confirming.

Getting things back up is now probably a difficult thing. The re-synced RAID will have the state from before the last boot into only a single device. That should generally be fine, but if a user now installs an update, those changes will be wiped when the RAID is resyncing.

This is a major headache and I am not quite sure what the best way forward would be. 

One of the options is to have people do a backup, reinstall and restore the backup.

Another option could be that we just update everything and hope for the best. But I generally do not like this strategy.
Comment 29 Dirk Sihling 2022-05-16 15:43:47 UTC
(In reply to Michael Tremer from comment #28)
> 
> One of the options is to have people do a backup, reinstall and restore the
> backup.
> 

That's what I had in mind. My CU167 production system is basically running fine, just not on a RAID. I'll do a backup and clean install, then restore the configuration. My backup (or test) system will take over in the meantime.

But I agree, it's not as simple as just doing an upgrade.

Thanks again.
Comment 30 Adolf Belka 2022-05-16 16:28:47 UTC
On my CU167 vm testbed machine the patch fixed the raid but then when the vm restarted unbound failed to start.

I then ran the unbound start command and got the following output; the message about not being able to open the include file was repeated many, many times.

/etc/init.d/unbound start
Starting Unbound DNS Proxy...
/etc/unbound/unbound.conf:64: error: cannot open include file '/etc/unbound/dhcp-leases.conf': Structure needs cleaning
read /etc/unbound/unbound.conf failed: 1 errors in configuration file
[1652716903] unbound[5823:0] fatal error: Could not read config file: /etc/unbound/unbound.conf. Maybe try unbound -dd, it stays on the commandline to see more errors, or unbound-ch[ FAIL ]
/etc/unbound/unbound.conf:64: error: cannot open include file '/etc/unbound/dhcp-leases.conf': Structure needs cleaning

I checked and there is no dhcp-leases.conf file in /etc/unbound
Not sure if this is a consequence of the rebuild of the raid array or not.

Any clues on how to recreate a dhcp-leases.conf file for unbound? I haven't been able to figure anything out yet.

If it comes to it I can create a backup from my CU167 vm and then re-install and restore. The vm with the above error messages was a clone of my standard CU167 vm.
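
A hedged aside on the unbound error: "Structure needs cleaning" is the ext4 error for on-disk corruption, so the missing include file is most likely a symptom of filesystem damage rather than an unbound problem (see the next comment). Once the filesystem is clean again, an empty include file should be enough for unbound to parse its configuration, for example:

# Recreate the include file that unbound.conf expects; an empty file is enough for the
# configuration to parse (assumption: IPFire's DHCP lease handling repopulates it later)
touch /etc/unbound/dhcp-leases.conf
/etc/init.d/unbound start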
Comment 31 Michael Tremer 2022-05-16 16:33:39 UTC
(In reply to Adolf Belka from comment #30)
> On my CU167 vm testbed machine the patch fixed the raid but then when the vm
> restarted unbound failed to start.

Can you try running a filesystem check?

I am not surprised that the RAID is broken. In a RAID-1 configuration, you cannot decide which one is the "correct" data because you have a 1:1 decision. There is no majority in it.
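
For reference, a filesystem check on the partitions of the array would look roughly like the sketch below; the device names follow the lsblk output earlier in this bug, and the root filesystem has to be unmounted (or the check run from a rescue/initramfs shell) for the repair to be safe:

# Force a check and repair of the root filesystem on the RAID device
# (run from a rescue environment; checking a mounted filesystem can make things worse)
e2fsck -f /dev/md127p5

# The /boot partition can be checked the same way
e2fsck -f /dev/md127p2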
Comment 32 Adolf Belka 2022-05-16 17:25:48 UTC
I tried running a filesystem check and it found a lot of errors, which were fixed.

When rebooting, the system stopped and said that errors had been found. These looked to be different from before.

Ran e2fsck on /dev/md127p5 and those were fixed. Rebooted and again it stopped with filesystem errors, but fewer than before.

After about six rounds of e2fsck and reboots I ended up with seven
INODE yyyyy ref counts x should be x-1 FIXED
messages. The message says these have been fixed but the system needs to reboot to finalise. It rebooted and another set of INODE ref count errors came up that have been fixed.

This is basically just continuing on and on, with different inode numbers each time, so I have decided to do a backup on my active CU167 system and then do a re-install of CU167 and restore.
Comment 35 Adolf Belka 2022-05-16 19:09:11 UTC
I did a fresh install of CU167 and when it rebooted after the initial install I edited the kernel command line to include rd.auto; after setup was completed I had a working raid array.

I then ran the grub-mkconfig command to make the change persistent.

After the sync had completed I restored my CU167 backup and rebooted. The raid array stayed working and I had my CU167 status as previously, but now on a working raid array.
Comment 36 Dirk Sihling 2022-05-17 08:31:44 UTC
That was my plan for the scheduled downtime tomorrow.
But currently I am trying to find a way to mark the "right" disk as faulty and force the rebuild from the running installation.
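
For reference, the textbook mdadm sequence for forcing a rebuild from a chosen disk looks roughly like the sketch below. It assumes an active, degraded array, which is not quite the situation here (md127 was inactive and the system had booted from the bare disk), so treat it as background rather than a recipe; device names are examples only:

# Drop the out-of-date member and re-add it so it resyncs from the remaining good member
mdadm --manage /dev/md127 --fail /dev/sda
mdadm --manage /dev/md127 --remove /dev/sda
mdadm --manage /dev/md127 --add /dev/sda

# Watch the resync progress
cat /proc/mdstat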
Comment 37 Michael Tremer 2022-05-17 11:17:29 UTC
(In reply to Dirk Sihling from comment #36)
> That was my plan for the scheduled downtime tomorrow.
> But currently I am trying to find a way to mark the "right" disk as faulty
> and force the rebuild from the running installation.

I would be very much interested in that. Until then we will have to block the release so that we do not entirely destroy people's installations.
Comment 38 Dirk Sihling 2022-05-17 18:47:59 UTC
Today I tried to erase the partition table of the disk that did not contain the running installation. My idea was to rebuild the RAID with the disk that has valid data after reboot with rd.auto. Unfortunately the RAID still thought it was complete and consistent. The requested file system check left my system without /boot.
Tomorrow I'll give it another try.
Comment 39 Dirk Sihling 2022-05-18 12:33:33 UTC
Today I tried to invalidate one disk with --zero-superblock, but that didn't work either. The system just booted off the "invalidated" disk anyway, and the other one was assigned to the inactive RAID.
I am afraid I have no more ideas what I could try. I'll use Adolf's description to get my production system going again later today.
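
The invalidation attempt described above was presumably something along these lines (a sketch, not taken from Dirk's session; the device name is an example):

# Remove the md superblock from the member that should be discarded
mdadm --zero-superblock /dev/sda
# A broader alternative is to wipe all filesystem and RAID signatures from the disk:
wipefs -a /dev/sda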
Comment 40 Michael Tremer 2022-05-19 08:57:56 UTC
> https://lists.ipfire.org/pipermail/development/2022-May/013508.html

I believe that this script works. Please see the header for details about how it works.

Thank you for giving me the clue. Does anyone have any systems that we could use to test this?
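
The authoritative version is the script in the linked patch. Purely as orientation, and explicitly not Michael's script, a repair of this general shape typically does something like the following (example device names, heavily simplified):

# 1. Stop the half-assembled, inactive array so its member device is released
mdadm --stop /dev/md127

# 2. Clear the stale signatures (RAID superblock, leftover partition table) from the
#    out-of-date disk so nothing else claims it
wipefs -a /dev/sda

# 3. Assemble the array degraded from the disk that carries the live data, then add
#    the cleared disk back so it resyncs
mdadm --assemble --run /dev/md127 /dev/sdb
mdadm --manage /dev/md127 --add /dev/sda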
Comment 41 Dirk Sihling 2022-05-19 09:13:58 UTC
I am working from home today, and I am not sure I will have the time tomorrow, but if I do I can test it with my broken CU167 RAID installation (I switched my backup system into the production one yesterday, which was faster and left me with a running system as backup in case something went wrong).
On Monday I will definitely find time for testing.
Comment 42 Adolf Belka 2022-05-19 09:15:50 UTC
I can create another CU166 vm and update it to CU167 and then use it for a variety of things, including changing what is active and what is not, and then I can try the patches out.

Hopefully I will be able to try later today.
Comment 43 Adolf Belka 2022-05-19 20:21:26 UTC
Not a good result I am afraid.

I ran the rd.auto patch and rd.auto was added to /etc/default/grub

Running grub-mkconfig -o /boot/grub/grub.cfg wasn't mentioned, but I had to run it previously to make rd.auto persistent, so I ran it here as well. Not sure if that was correct or not.

I ran the script and cat /proc/mdstat came back with no personalities.

I then rebooted; after stopping everything the screen went to restart, but then came up with the following message:
FATAL: No bootable medium found! System halted


I have a vm clone of the CU167 vm with the failed raid array, and with various changes done with only one disk connected, so I can clone from this vm and run whatever further changes are identified as being needed.
Comment 44 Dirk Sihling 2022-05-25 11:59:00 UTC
Today I tried the following: I put 166 into /opt/pakfire/db/core/mine and
did the upgrade again, because I thought the script would be executed at the end of the upgrade process or after reboot. That didn't happen. My system booted into
the /dev/sdb installation again.

How can I test your script?
Comment 45 Dirk Sihling 2022-05-25 13:43:03 UTC
Ok, I probably needed to upgrade to CU168 Testing (master/bbd4767f), which I did (twice, once for /dev/sdb, and also for /dev/sda).
Both times I ended up with a system running on one disk only. At least it seems to be running without any problems.
Current state:
[root@ipfire ~]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0 931.5G  0 disk
└─md127   9:127  0     0B  0 md
sdb       8:16   0 931.5G  0 disk
├─sdb1    8:17   0     4M  0 part
├─sdb2    8:18   0   128M  0 part /boot
├─sdb3    8:19   0    32M  0 part /boot/efi
├─sdb4    8:20   0     1G  0 part [SWAP]
└─sdb5    8:21   0 930.3G  0 part /
[root@ipfire ~]# cat /opt/pakfire/db/core/mine
168
Comment 46 Adolf Belka 2022-05-25 14:27:44 UTC
Hallo @Dirk. As @Michael was unsure whether the changes he had made would fix things or not, he did not put them into the CU168 Testing repository.

You have to run the changes manually on your system.
Comment 47 Michael Tremer 2022-05-25 15:28:45 UTC
(In reply to Adolf Belka from comment #46)
> Hallo @Dirk. As @Michael was unsure whether the changes he had made would fix
> things or not, he did not put them into the CU168 Testing repository.
> 
> You have to run the changes manually on your system.

Peter is away and I wasn't sure whether it would be better to wait for a little bit more feedback.

As soon as we think it is okay, I will merge it.
Comment 48 Dirk Sihling 2022-05-25 15:51:22 UTC
How can I get the script so I can test it on my system?
Comment 49 Adolf Belka 2022-05-25 16:00:57 UTC
(In reply to Dirk Sihling from comment #48)
> How can I get the script so I can test it on my system?

See link in comment 40 from @Michael.
Comment 50 Adolf Belka 2022-05-25 16:04:25 UTC
(In reply to Michael Tremer from comment #47)
> 
> Peter is away and I wasn't sure if it would not be a good idea to wait for a
> little bit more feedback.
> 
> As soon as we think it is okay, I will merge it.

Hi @Michael, in case you haven't seen my feedback from testing out the script on my vm: I ended up with a fatal error message in my test.

The feedback is in Comment 43.
Comment 51 Michael Tremer 2022-05-25 17:56:19 UTC
(In reply to Adolf Belka from comment #50)
> The feedback is in Comment 43.

Oh, I seem to have missed that.

Yes, you would need to update the bootloader configuration.

So, first you run the RAID repair script; then you add the "rd.auto" parameter and then rebuild the GRUB configuration. That definitely wasn't clear from my patch.

Do you have the chance to test again, or how can we get this tested any further?
Comment 52 Adolf Belka 2022-05-25 19:49:58 UTC
I retried the vm test, following the opposite order to what I did the first time.

So I upgraded CU167 to CU168, then ran the repair-mdraid script, then added rd.auto to /etc/default/grub, then ran grub-mkconfig and then rebooted, but got exactly the same message.

I am cloning another version of my CU167 right now and will then run the CU168 update and the same script etc., but not do the reboot, so I can check any files that you need info on.
Comment 53 Adolf Belka 2022-05-25 20:35:50 UTC
A quick check on the order of running the patches.

Should I run repair-mdraid, followed by the rd.auto fix and grub-mkconfig, after doing the CU168 upgrade, or should I run them on the CU167 version before doing the CU168 upgrade?

I have been doing the first sequence where the patch changes are applied after doing the CU168 upgrade but before rebooting.
Comment 54 Adolf Belka 2022-05-25 20:48:23 UTC
I created a new clone of my CU167 vm and updated it to CU168, then ran repair-mdraid, then added rd.auto and then ran grub-mkconfig.

Checking the disks with fdisk I got the following for sda and sdb, the two disks for the raid array.

Disk /dev/sda: 50 GiB, 53687091200 bytes, 104857600 sectors
Disk model: VBOX HARDDISK   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
GPT PMBR size mismatch (104857343 != 104857599) will be corrected by write.
The backup GPT table is not on the end of the device.


Disk /dev/sdb: 50 GiB, 53687091200 bytes, 104857600 sectors
Disk model: VBOX HARDDISK   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: B47388B7-F5D4-49D0-A44A-4F9E62EC3697

Device      Start       End   Sectors   Size Type
/dev/sdb1    2048     10239      8192     4M BIOS boot
/dev/sdb2   10240    272383    262144   128M Linux filesystem
/dev/sdb3  272384    337919     65536    32M EFI System
/dev/sdb4  337920    838109    500190 244.2M Linux swap
/dev/sdb5  838110 104851455 104013346  49.6G Linux filesystem


So for sda, which was the disk shown in the /proc/mdstat output with the line
md127 : inactive sda[0](S)

there are no partitions shown, but there is a GPT PMBR size mismatch message which is not shown for sdb.

Not sure if this is linked to the "FATAL: No bootable medium found! System halted" error message that I am still getting or not. It doesn't look like what I would have expected to see for sda.
Comment 55 Adolf Belka 2022-05-25 21:07:18 UTC
I also tried parted, in case that gave different results since the disks are gpt based, but it also doesn't recognise the partition table for sda; it shows unknown instead of gpt.

parted -l
Error: /dev/sda: unrecognised disk label
Model: ATA VBOX HARDDISK (scsi)                                           
Disk /dev/sda: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags: 

Warning: Not all of the space available to /dev/sdb appears to be used, you can
fix the GPT to use all of the space (an extra 256 blocks) or continue with the
current setting? 
Fix/Ignore? I                                                             
Model: ATA VBOX HARDDISK (scsi)
Disk /dev/sdb: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system     Name     Flags
 1      1049kB  5243kB  4194kB                  BOOTLDR  bios_grub
 2      5243kB  139MB   134MB   ext4            BOOT
 3      139MB   173MB   33.6MB  fat16           ESP      boot, esp
 4      173MB   429MB   256MB   linux-swap(v1)  SWAP     swap
 5      429MB   53.7GB  53.3GB  ext4            ROOT
Comment 56 Adolf Belka 2022-05-26 07:52:58 UTC
I will probably be able to run a few further tests on my vm setup today but after today my vm setup won't be available for a while so I won't be able to do any further testing or evaluation.
Comment 57 Adolf Belka 2022-05-26 11:25:45 UTC
I ran parted -l on a clone of the CU167 vm.
This one has the other disk drive missing from the raid array.

cat /proc/mdstat 
Personalities : 
md127 : inactive sdb[1](S)
      52428724 blocks super 1.0


parted -l gave

parted -l
Warning: Not all of the space available to /dev/sda appears to be used, you can
fix the GPT to use all of the space (an extra 256 blocks) or continue with the
current setting? 
Fix/Ignore? I                                                             
Model: ATA VBOX HARDDISK (scsi)
Disk /dev/sda: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system     Name     Flags
 1      1049kB  5243kB  4194kB                  BOOTLDR  bios_grub
 2      5243kB  139MB   134MB   ext4            BOOT
 3      139MB   173MB   33.6MB  fat16           ESP      boot, esp
 4      173MB   429MB   256MB   linux-swap(v1)  SWAP     swap
 5      429MB   53.7GB  53.3GB  ext4            ROOT


Warning: Not all of the space available to /dev/sdb appears to be used, you can fix the GPT to
use all of the space (an extra 256 blocks) or continue with the current setting? 
Fix/Ignore? I                                                             
Model: ATA VBOX HARDDISK (scsi)
Disk /dev/sdb: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system     Name     Flags
 1      1049kB  5243kB  4194kB                  BOOTLDR  bios_grub
 2      5243kB  139MB   134MB   ext4            BOOT
 3      139MB   173MB   33.6MB  fat16           ESP      boot, esp
 4      173MB   429MB   256MB   linux-swap(v1)  SWAP     swap
 5      429MB   53.7GB  53.3GB  ext4            ROOT

So both drives are still showing the partition tables as would be expected.

Then I ran the repair-mdraid script on this, ran parted -l again and got

parted -l
Warning: Not all of the space available to /dev/sda appears to be used, you can
fix the GPT to use all of the space (an extra 256 blocks) or continue with the
current setting? 
Fix/Ignore? I                                                             
Model: ATA VBOX HARDDISK (scsi)
Disk /dev/sda: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system     Name     Flags
 1      1049kB  5243kB  4194kB                  BOOTLDR  bios_grub
 2      5243kB  139MB   134MB   ext4            BOOT
 3      139MB   173MB   33.6MB  fat16           ESP      boot, esp
 4      173MB   429MB   256MB   linux-swap(v1)  SWAP     swap
 5      429MB   53.7GB  53.3GB  ext4            ROOT


Error: /dev/sdb: unrecognised disk label
Model: ATA VBOX HARDDISK (scsi)                                           
Disk /dev/sdb: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags: 

So it looks like running the repair-mdraid script causes one of the drives to end up with an unknown partition table, and it is the drive that was listed in mdstat as being part of the raid array but inactive.
Comment 58 Adolf Belka 2022-05-26 11:40:28 UTC
Having read through the repair-mdraid script a bit more closely, what I am finding with parted is probably to be expected, because part of the script removes the partition table from the bad device.

In the script there is a section raid-rebuild which should run right at the end of the script.

That looks to be searching for any raid elements in /dev/md/ and looking for ipfire:0, but the md directory does not exist under /dev/.

It could well be that I am not understanding well enough what the script should be doing.

I have not been able to figure out how to make the required drive show up as bootable to my vm, so I keep getting the "FATAL: No bootable medium found! System halted" message.
Comment 59 Peter Müller 2022-05-30 18:51:45 UTC
Hello all,

first, thanks for all the testing feedback and reporting back.

Based on what I understood, the RAID repair script has one important pitfall: if for some reason the machine cannot boot from the second SSD/HDD - which appears to be the case in at least one of Adolf's testing attempts - a reboot will cause the IPFire installation not to come back up again. Manual intervention is then required to fix this, with physical presence needed at worst.

Core Update 168 has been in testing for three weeks now, and contains some security-relevant items. Therefore, I would like to propose the following procedure:

(a) Michael's RAID repair script will go into the update and be executed automatically during the upgrade procedure, if the installation is found to run on a RAID.

(b) The release announcement for Core Update 168 will come with a very strong note regarding the situation, and urge IPFire users running their installations on a RAID to manually check things are going to be okay _before_ conducting the first reboot after having upgraded.

Does this read reasonable to you?

Thanks, and best regards,
Peter Müller
Comment 61 Dirk Sihling 2022-05-31 08:26:51 UTC
(In reply to Peter Müller from comment #59)
> Core Update 168 has been in testing for three weeks now, and contains some
> security-relevant items. Therefore, I would like to propose the following
> procedure:
> 
> (a) Michael's RAID repair script will go into the update and is executed
> automatically during the upgrade procedure, if the installation is found to
> run on a RAID.
> 
> (b) The release announcement for Core Update 168 will come with a very
> strong note regarding the situation, and urge IPFire users running their
> installations on a RAID to manually check things are going to be okay
> _before_ conducting the first reboot after having upgraded.
> 
> Does this read reasonable to you?

I am probably the least experienced one here, but for me this sounds ok and I can't think of any other applicable procedure.
My problem was solved by a fresh install and importing a backup of the configuration, which worked well.
Comment 62 Michael Tremer 2022-05-31 11:37:18 UTC
(In reply to Dirk Sihling from comment #61)
> (In reply to Peter Müller from comment #59)
> > Core Update 168 has been in testing for three weeks now, and contains some
> > security-relevant items. Therefore, I would like to propose the following
> > procedure:
> > 
> > (a) Michael's RAID repair script will go into the update and is executed
> > automatically during the upgrade procedure, if the installation is found to
> > run on a RAID.
> > 
> > (b) The release announcement for Core Update 168 will come with a very
> > strong note regarding the situation, and urge IPFire users running their
> > installations on a RAID to manually check things are going to be okay
> > _before_ conducting the first reboot after having upgraded.
> > 
> > Does this read reasonable to you?
> 
> I am probably the least experienced one here, but for me this sounds ok and
> I can't think of any other applicable procedure.

Skilled enough to find this in the first place :)

I would like to thank everyone who has been working on this. This is one of the nastiest bugs that we have had in a while, because it isn't very easy to correct.

The script is kind of best effort.

I absolutely would say that if your system is configured as a RAID but does not come up at all if the first hard drive fails, then it is configured incorrectly. However, I do not remember testing this extensively on my own systems either.

This is therefore something that we will have to highlight, since it will cause a problem even though our script has been working fine. That would be a shame.

> My problem was solved by a fresh install and importing a backup of the
> configuration, which worked well.

Yes, I would say that at this point we should consider all affected systems broken and the script is a bit of a shot in the dark with a certain probability to fix it for good. I suppose that probability is rather high, but the factors that might bring it further down are outside of our control. Those can only be handled/mitigated by the person right in front of the system, so we need to make them aware.