Created attachment 1043 [details]
Three pictures showing the media information

Core 166 was installed on a RAID1 built on two disks (sda and sdb). After the upgrade to 167 the installation was on /dev/sda only; /dev/sdb is not used anymore and the RAID is broken.

Attached are media information pictures showing the change after the upgrade on May 11th, 6 pm.
I tried to install the OS image created from the running Core 167 system on another system with two identical disks used as RAID1. The installation stalled when trying to install the boot loader. After interrupting and rebooting the system I found the installation on /dev/sda only, again.
Looks like Dracut cannot initialise the RAID properly. However, what do you see on the second and third terminals when the bootloader cannot be installed?
Thanks for your quick response. I tried it again, and this time the installation was completed (installing the boot loader took about 3 to 5 minutes, maybe I was not patient enough the first time).

The 2nd terminal showed:

Installing GRUB on /dev/sda...
Installing GRUB on /dev/sdb...

The 3rd terminal showed:

bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

After installation completed and before the following reboot I could see the RAID device /dev/md0 being resynced

md0: active raid1 sdb[1] sda[0]

Then I rebooted from disk, and the RAID is not used anymore, again:

[root@ipfirebkp ~]# cat /proc/mdstat
Personalities :
md127 : inactive sdb[1](S)
      488386568 blocks super 1.0

unused devices: <none>

I noticed the name of the device changed from md0 to md127.

[root@ipfirebkp ~]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0 465.8G  0 disk
├─sda1    8:1    0     4M  0 part
├─sda2    8:2    0   128M  0 part /boot
├─sda3    8:3    0    32M  0 part /boot/efi
├─sda4    8:4    0     1G  0 part [SWAP]
└─sda5    8:5    0 464.6G  0 part /
sdb       8:16   0 465.8G  0 disk
└─md127   9:127  0     0B  0 md

I will be happy to test anything you want me to on this backup system.
I also have the same effect but I had missed this on my CU167 VirtualBox vm testbed machine until I read this bug entry.

cat /proc/mdstat
Personalities :
md127 : inactive sdb[1](S)
      52428724 blocks super 1.0

unused devices: <none>

lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0    50G  0 disk
├─sda1    8:1    0     4M  0 part
├─sda2    8:2    0   128M  0 part /boot
├─sda3    8:3    0    32M  0 part /boot/efi
├─sda4    8:4    0 245.4M  0 part [SWAP]
└─sda5    8:5    0  49.6G  0 part /
sdb       8:16   0    50G  0 disk
└─md127   9:127  0     0B  0 md
sr0      11:0    1  1024M  0 rom
(In reply to Dirk Sihling from comment #3)
> I tried it again, and this time the installation was completed (installing
> the boot loader took about 3 to 5 minutes, maybe I was not patient enough
> the first time).
> The 2nd terminal showed:
> Installing GRUB on /dev/sda...
> Installing GRUB on /dev/sdb...

This should not take that long. It should normally be done within 5 seconds. Is there any chance that the storage device is incredibly slow?

> After installation completed and before the following reboot I could see
> the RAID device /dev/md0 being resynced
>
> md0: active raid1 sdb[1] sda[0]
>
> Then I rebooted from disk, and the RAID is not used anymore, again:
>
> [root@ipfirebkp ~]# cat /proc/mdstat
> Personalities :
> md127 : inactive sdb[1](S)
>       488386568 blocks super 1.0
>
> unused devices: <none>
>
> I noticed the name of the device changed from md0 to md127.

Hmm, this is extra weird. I would have expected that the kernel did not find any RAID device at all. That it finds it and assembles the RAID is unusual. There shouldn't be any reason why that happens.

> [root@ipfirebkp ~]# lsblk
> NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
> sda       8:0    0 465.8G  0 disk
> ├─sda1    8:1    0     4M  0 part
> ├─sda2    8:2    0   128M  0 part /boot
> ├─sda3    8:3    0    32M  0 part /boot/efi
> ├─sda4    8:4    0     1G  0 part [SWAP]
> └─sda5    8:5    0 464.6G  0 part /
> sdb       8:16   0 465.8G  0 disk
> └─md127   9:127  0     0B  0 md
>
> I will be happy to test anything you want me to on this backup system.

I will have a look at some source and get back to you.
I tried to recreate this setup with Virtualbox, but sadly that won't allow me to capture much of the boot log. Could you please add "rd.debug" to the kernel command line and post the entire boot log if you are able to capture it somehow?
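For reference, one way to do that (a sketch; the exact contents of IPFire's /etc/default/grub are assumed here) is either to press "e" on the GRUB menu entry and append rd.debug to the line that starts with "linux" for a one-off boot, or to make it persistent:

# /etc/default/grub - append rd.debug to the existing kernel parameters
# (the GRUB_CMDLINE_LINUX variable name is an assumption)
GRUB_CMDLINE_LINUX="... rd.debug"

# regenerate the bootloader configuration and reboot
grub-mkconfig -o /boot/grub/grub.cfg
reboot

The messages should then end up in dmesg and in /var/log/bootlog after the boot.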
I have got three log files from my virtualbox vm.

One is the bootlog from when the vm was on CU166 and the raid array was built, and there are several lines related to md127.

Another is the standard bootlog from after the vm was upgraded to CU167 and the raid array is not built. No mention of md127 at all.

The third is the bootlog from booting with rd.debug added to the kernel command line.
Created attachment 1048 [details] Bootlog from CU166 vm
Created attachment 1049 [details] Bootlog from CU167 vm
Created attachment 1050 [details] Bootlog from CU167 vm with rd.debug on kernel command line
On my CU167 system the directory /dev/disk/by-uuid shows the following:

drwxr-xr-x 2 root root 120 May 16 10:35 .
drwxr-xr-x 7 root root 140 May 16 10:35 ..
lrwxrwxrwx 1 root root  10 May 16 10:35 05b60374-d025-4f0b-a003-b16f49204e34 -> ../../sda5
lrwxrwxrwx 1 root root  10 May 16 10:35 2d65d8f0-0b11-4818-b03f-d7e9536e10a6 -> ../../sda2
lrwxrwxrwx 1 root root  10 May 16 10:35 8AAB-E7E8 -> ../../sda3
lrwxrwxrwx 1 root root  10 May 16 10:35 cdb826fa-6921-4208-ac8a-51f35bb9386e -> ../../sda4

mdadm --detail --scan
INACTIVE-ARRAY /dev/md127 metadata=1.0 name=ipfire:0 UUID=9f728045:9fea7c5b:c2582abf:290e7f13

mdadm --misc --detail /dev/md127
/dev/md127:
           Version : 1.0
        Raid Level : raid1
     Total Devices : 1
       Persistence : Superblock is persistent
             State : inactive
   Working Devices : 1
              Name : ipfire:0
              UUID : 9f728045:9fea7c5b:c2582abf:290e7f13
            Events : 349

    Number   Major   Minor   RaidDevice
       -       8       16        -        /dev/sdb

On the CU167 system there is only this entry under /dev related to an md device:

brw-rw---- 1 root disk 9, 127 May 16 10:35 /dev/md127

On a system with a working raid the following is present:

drwxr-xr-x 2 root root      160 May 16 12:50 md
brw-rw---- 1 root disk 9,   127 May 16 12:50 md127
brw-rw---- 1 root disk 259,   0 May 16 12:50 md127p1
brw-rw---- 1 root disk 259,   1 May 16 12:50 md127p2
brw-rw---- 1 root disk 259,   2 May 16 12:50 md127p3
brw-rw---- 1 root disk 259,   3 May 16 12:50 md127p4
brw-rw---- 1 root disk 259,   4 May 16 12:50 md127p5

and the contents of the md directory are:

drwxr-xr-x  2 root root  160 May 16 12:50 .
drwxr-xr-x 16 root root 3.5K May 16 12:50 ..
lrwxrwxrwx  1 root root    8 May 16 12:50 ipfire:0 -> ../md127
lrwxrwxrwx  1 root root   10 May 16 12:50 ipfire:0p1 -> ../md127p1
lrwxrwxrwx  1 root root   10 May 16 12:50 ipfire:0p2 -> ../md127p2
lrwxrwxrwx  1 root root   10 May 16 12:50 ipfire:0p3 -> ../md127p3
lrwxrwxrwx  1 root root   10 May 16 12:50 ipfire:0p4 -> ../md127p4
lrwxrwxrwx  1 root root   10 May 16 12:50 ipfire:0p5 -> ../md127p5

I am going to build a new CU166 vm and see if I can find what is happening when the upgrade to CU167 is occurring.
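For what it is worth, on a system stuck in this state it should be possible to poke at the array by hand with standard mdadm commands (a sketch only, nothing IPFire-specific, and it does not answer why the boot-time assembly fails):

# stop the half-assembled, inactive array
mdadm --stop /dev/md127

# try to re-assemble it from both members and check the result
mdadm --assemble --scan --verbose
cat /proc/mdstat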
Created attachment 1052 [details]
Bootlog from CU167 with rd.debug

File /var/log/bootlog from my CU167 system with rd.debug on the kernel command line.
(In reply to Michael Tremer from comment #6)
> I tried to recreate this setup with Virtualbox, but sadly that won't allow
> me to capture much of the boot log.
>
> Could you please add "rd.debug" to the kernel command line and post the
> entire boot log if you are able to capture it somehow?

I added my bootlog from today with rd.debug on the command line.

Something strange I noticed: on one boot the installation on /dev/sdb was used:

[root@ipfirebkp log]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0 465.8G  0 disk
└─md127   9:127  0     0B  0 md
sdb       8:16   0 465.8G  0 disk
├─sdb1    8:17   0     4M  0 part
├─sdb2    8:18   0   128M  0 part /boot
├─sdb3    8:19   0    32M  0 part /boot/efi
├─sdb4    8:20   0     1G  0 part [SWAP]
└─sdb5    8:21   0 464.6G  0 part /
[root@ipfirebkp log]# blkid /dev/sda
/dev/sda: UUID="05b8873e-60a5-a6e2-8ac5-fcbabb83a5b7" UUID_SUB="9248c5ee-e150-fde0-305d-39c5fa4efa86" LABEL="ipfire:0" TYPE="linux_raid_member"
[root@ipfirebkp log]# blkid /dev/sdb
/dev/sdb: UUID="05b8873e-60a5-a6e2-8ac5-fcbabb83a5b7" UUID_SUB="88c9ba11-84ec-3990-1d64-2bb747d70d9a" LABEL="ipfire:0" TYPE="linux_raid_member"
[root@ipfirebkp log]# blkid /dev/md127

And the next time I booted the installation on /dev/sda was used, which I then had to set up as well:

[root@ipfirebkp log]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0 465.8G  0 disk
├─sda1    8:1    0     4M  0 part
├─sda2    8:2    0   128M  0 part /boot
├─sda3    8:3    0    32M  0 part /boot/efi
├─sda4    8:4    0     1G  0 part [SWAP]
└─sda5    8:5    0 464.6G  0 part /
sdb       8:16   0 465.8G  0 disk
└─md127   9:127  0     0B  0 md
[root@ipfirebkp log]# blkid /dev/sda
/dev/sda: UUID="05b8873e-60a5-a6e2-8ac5-fcbabb83a5b7" UUID_SUB="9248c5ee-e150-fde0-305d-39c5fa4efa86" LABEL="ipfire:0" TYPE="linux_raid_member"
[root@ipfirebkp log]# blkid /dev/sdb
/dev/sdb: UUID="05b8873e-60a5-a6e2-8ac5-fcbabb83a5b7" UUID_SUB="88c9ba11-84ec-3990-1d64-2bb747d70d9a" LABEL="ipfire:0" TYPE="linux_raid_member"
> > This should not take that long. It should normally be done within 5 seconds.
> >
> > Is there any chance that the storage device is incredibly slow?

No, I don't think so; when resyncing the RAID the writing speed was 170M/s.
Thanks for providing the logs. There is this line in it:

> [    0.732387] dracut: rd.md=0: removing MD RAID activation

That line only appears on c167 and it should not be there.

> https://git.ipfire.org/?p=thirdparty/dracut.git;a=blob;f=modules.d/90mdraid/parse-md.sh;hb=631d5f72a223288aa1f48bb8e8d0313e75947400#l11

This is where this happens in dracut.

Could you please add rd.md=1 to the kernel command line and boot up with that? I suppose that should fix it and assemble the RAID correctly at boot time.

Now we only need to find out why dracut thinks that it should not do this automatically.
It might be another problem introduced by shellcheck:

> https://git.ipfire.org/?p=thirdparty/dracut.git;a=commitdiff;h=909961d048d383beb9f437cf3304ebbb602e4247#patch26
> https://git.ipfire.org/?p=thirdparty/dracut.git;a=commitdiff;h=77fa33017dec6834b971702ece817919348e2a7d
Added rd.md=1 to the kernel command line, but it didn't make any difference.

cat /proc/mdstat
Personalities :
md127 : inactive sda[0](S)
      52428724 blocks super 1.0

unused devices: <none>
Could you send the entire log again? Did the line disappear?

I suppose that dracut is parsing the configuration incorrectly. So this was just an attempt to convince it otherwise.
I tried that, too, and I did not get the RAID back either. But maybe both disks are not "similar" enough anymore after I booted and set up both systems on /dev/sda and /dev/sdb. I'll try a fresh install and add the kernel command line parameter before the reboot at the end of the install.
The RAID seems to be absolutely fine and healthy. Dracut just thinks it has been configured to not turn on any RAIDs. That is my hypothesis at the moment.
(In reply to Michael Tremer from comment #18)
> Could you send the entire log again? Did the line disappear?
>
> I suppose that dracut is parsing the configuration incorrectly. So this was
> just an attempt to convince it otherwise.

I still have the line with rd.md=0 (untangled from the interleaved console output):

dracut: //usr/lib/initrd-release@5(): VERSION='2.27 dracut-056'
udevd[339]: starting eudev-3.2.6
dracut: //usr/lib/initrd-release@6(): PRETTY_NAME='IPFire 2.27 (x86_64) - core167 dracut-056 (Initramfs)'
dracut: rd.md=0: removing MD RAID activation
Created attachment 1053 [details]
Complete boot log from serial console with md.rd=1 on command line

I added the complete boot log grabbed from the serial console with rd.debug and rd.md=1 on the kernel command line.
Created attachment 1054 [details]
Bootlog from CU167 vm with rd.md=1 & rd.debug on kernel command line

As for Dirk, I also still have the same line. Attached is the bootlog with rd.md=1 and rd.debug both set.
I will leave any further actions to Dirk, as he is working on it, so that we don't end up with too many duplicate entries.
(In reply to Adolf Belka from comment #24)
> I will leave any further actions to Dirk, as he is working on it, so that we
> don't end up with too many duplicate entries.

Thanks for your help so far. :-)
> https://lists.ipfire.org/pipermail/development/2022-May/013494.html

I believe that this is not a bug in dracut; it is just behaviour that I did not expect.

Could you please add "rd.auto" to /etc/default/grub as I did in the patch, then run "grub-mkconfig -o /boot/grub/grub.cfg" and reboot? That should assemble the RAID correctly at boot time.

It also fixed itself on my system, but that left the RAID with the data of the device that wasn't mounted before (i.e. the older state).
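For what it is worth, the rd.auto requirement also matches the check in the 90mdraid/parse-md.sh blob linked earlier; as far as I can tell it is roughly of this shape (a paraphrase, not the verbatim source):

# simplified paraphrase of modules.d/90mdraid/parse-md.sh
MD_UUID=$(getargs rd.md.uuid)

if { [ -z "$MD_UUID" ] && ! getargbool 0 rd.auto; } || ! getargbool 1 rd.md; then
    info "rd.md=0: removing MD RAID activation"
    udevproperty rd_NO_MD=1
fi

i.e. without rd.auto (or an explicit rd.md.uuid=) MD activation is removed even when rd.md=1 is set, which would explain the earlier test results. Spelled out, the steps above are (the GRUB_CMDLINE_LINUX variable name in /etc/default/grub is an assumption):

# /etc/default/grub - append rd.auto to the kernel parameters
GRUB_CMDLINE_LINUX="... rd.auto"

# rebuild the GRUB configuration and reboot
grub-mkconfig -o /boot/grub/grub.cfg
reboot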
Looks good, now the RAID is assembled again:

[root@ipfirebkp ~]# cat /proc/mdstat
Personalities : [raid1]
md127 : active raid1 sdb[1] sda[0]
      488386432 blocks super 1.0 [2/2] [UU]
      [===>.................]  resync = 19.0% (92943744/488386432) finish=51.7min speed=127278K/sec
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>

[root@ipfirebkp ~]# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda           8:0    0 465.8G  0 disk
└─md127       9:127  0 465.8G  0 raid1
  ├─md127p1 259:0    0     4M  0 part
  ├─md127p2 259:1    0   128M  0 part  /boot
  ├─md127p3 259:2    0    32M  0 part  /boot/efi
  ├─md127p4 259:3    0     1G  0 part  [SWAP]
  └─md127p5 259:4    0 464.6G  0 part  /
sdb           8:16   0 465.8G  0 disk
└─md127       9:127  0 465.8G  0 raid1
  ├─md127p1 259:0    0     4M  0 part
  ├─md127p2 259:1    0   128M  0 part  /boot
  ├─md127p3 259:2    0    32M  0 part  /boot/efi
  ├─md127p4 259:3    0     1G  0 part  [SWAP]
  └─md127p5 259:4    0 464.6G  0 part  /

Thank you and Adolf very much for your help, now I can plan for getting my production system back on a RAID, too.
Thank you for confirming.

Getting things back up is now probably a difficult thing. The re-synced RAID will have the state from before the last boot onto only a single device. That should generally be fine, but if a user now installs an update, those changes will be wiped when the RAID is resyncing.

This is a major headache and I am not quite sure what the best way forward would be. One of the options is to have people do a backup, reinstall and restore the backup. Another option could be that we just update everything and hope for the best, but I generally do not like this strategy.
(In reply to Michael Tremer from comment #28)
> One of the options is to have people do a backup, reinstall and restore the
> backup.

That's what I had in mind. My CU167 production system is basically running fine, just not on a RAID. I'll do a backup and clean install, then restore the configuration. My backup (or test) system will take over in the meantime.

But I agree, it's not as simple as just doing an upgrade. Thanks again.
On my CU167 vm testbed machine the patch fixed the raid but then when the vm restarted unbound failed to start.

I then ran the unbound start command and got the following output, with the message about not being able to open the include file repeated many, many times:

/etc/init.d/unbound start
Starting Unbound DNS Proxy...
/etc/unbound/unbound.conf:64: error: cannot open include file '/etc/unbound/dhcp-leases.conf': Structure needs cleaning
read /etc/unbound/unbound.conf failed: 1 errors in configuration file
[1652716903] unbound[5823:0] fatal error: Could not read config file: /etc/unbound/unbound.conf. Maybe try unbound -dd, it stays on the commandline to see more errors, or unbound-ch[ FAIL ]
/etc/unbound/unbound.conf:64: error: cannot open include file '/etc/unbound/dhcp-leases.conf': Structure needs cleaning

I checked and there is no dhcp-leases.conf file in /etc/unbound.

Not sure if this is a consequence of the rebuild of the raid array or not. Any clues how to recreate a dhcp-leases.conf file for unbound? I haven't been able to figure anything out yet.

If it comes to it I can create a backup from my CU167 vm and then re-install and restore. The vm with the above error messages was a clone of my standard CU167 vm.
(In reply to Adolf Belka from comment #30)
> On my CU167 vm testbed machine the patch fixed the raid but then when the vm
> restarted unbound failed to start.

Can you try running a filesystem check?

I am not surprised that the RAID is broken. In a RAID-1 configuration, you cannot decide which copy holds the "correct" data, because with only two members it is a 1:1 decision. There is no majority in it.
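Something along these lines should do, using the partition names from the earlier lsblk output (a sketch; the root filesystem check is best done from a rescue shell, or at least not while the filesystem is mounted read-write):

# check the /boot filesystem on the assembled array
umount /boot
e2fsck -f /dev/md127p2
mount /boot

# check the root filesystem
e2fsck -f /dev/md127p5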
I tried running a filesystem check and it found a lot of errors, which were fixed. When rebooting, the system then stopped and said that errors had been found. These looked to be different from before. Ran e2fsck on /dev/md127p5 and these were fixed. Rebooted and again stopped with filesystem errors, but fewer than previously.

After about 6 rounds of e2fsck and reboots on the system I ended up with seven messages of the form

INODE yyyyy ref counts x should be x-1 FIXED

The message is that these have been fixed but the system needs to reboot to finalise. It rebooted and another set of INODE ref count errors came up that had been fixed. This basically just continues on and on with different inode numbers each time, so I have decided to do a backup on my active CU167 system and then do a re-install of CU167 and restore.
https://patchwork.ipfire.org/project/ipfire/patch/20220516144814.4143999-1-michael.tremer@ipfire.org/
https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=1c1d9fd7bfdf5495069c3119982753a9ddc5fe24
I did a fresh install of CU167, and when it rebooted after the initial install I edited the kernel command line to include rd.auto; after setup was completed I had a working raid array. I then ran the grub-mkconfig command to make the change persistent.

After the sync had completed I restored my CU167 backup and rebooted. The raid array stayed working and I had my CU167 status as previously, but on a working raid array.
That was my plan for the scheduled downtime tomorrow. But currently I am trying to find a way to mark the "right" disk as faulty and force the rebuild from the running installation.
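The conventional mdadm way I know of (a sketch, with /dev/sdX standing in for whichever disk holds the stale copy) presumes the array is assembled and active, which is exactly what is not the case here:

# mark the stale member as failed and remove it from the array
mdadm --manage /dev/md127 --fail /dev/sdX
mdadm --manage /dev/md127 --remove /dev/sdX

# wipe its RAID superblock and add it back so it rebuilds from the good member
mdadm --zero-superblock /dev/sdX
mdadm --manage /dev/md127 --add /dev/sdX

# watch the rebuild
cat /proc/mdstat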
(In reply to Dirk Sihling from comment #36)
> That was my plan for the scheduled downtime tomorrow.
> But currently I am trying to find a way to mark the "right" disk as faulty
> and force the rebuild from the running installation.

I would be very much interested in that.

Until then we will have to block the release so that we do not entirely destroy people's installations.
Today I tried to erase the partition table of the disk that did not contain the running installation. My idea was to rebuild the RAID from the disk that has the valid data, after rebooting with rd.auto. Unfortunately the RAID still thought it was complete and consistent. The requested file system check left my system without /boot.

Tomorrow I'll give it another try.
Today I tried to invalidate one disk with --zero-superblock, but that didn't work either. The system just booted off the "invalidated" disk anyway, and the other one was assigned to the inactive RAID. I am afraid I have no more ideas what I could try. I'll use Adolf's description to get my production system going again later today.
> https://lists.ipfire.org/pipermail/development/2022-May/013508.html

I believe that this script works. Please see the header for details about how it works. Thank you for giving me the clue.

Does anyone have any systems that we could use to test this?
I am working from home today, and I am not sure I will have the time tomorrow, but if I do I can test it with my broken CU167 RAID installation (I switched my backup system into the production one yesterday, which was faster and left me with a running system as backup in case something went wrong). On Monday I will definitely find time for testing.
I can create another CU166 vm and update it to CU167, and then use it for a variety of things, including changing what is active and what is not, and then I can try the patches out. Hopefully I will be able to try later today.
Not a good result, I am afraid.

I ran the rd.auto patch and rd.auto was added to /etc/default/grub. Running grub-mkconfig -o /boot/grub/grub.cfg wasn't mentioned, but I had to run that previously to make rd.auto persistent, so I ran it here as well. Not sure if that was correct or not.

I ran the script and cat /proc/mdstat came back with no personalities. I then rebooted, and after stopping everything the screen went to restart but then came up with the following message:

FATAL: No bootable medium found! System halted

I have a vm clone of the CU167 vm with the failed raid array and with various changes done with only one disk connected, so I can clone from this vm and run whatever further changes are identified as being needed.
Today I tried the following: I put 166 into /opt/pakfire/db/core/mine and did the upgrade again, because I thought the script would be executed at the end of the upgrade process or after reboot. That didn't happen. My system booted into the /dev/sdb installation again. How can I test your script?
Ok, I probably needed to upgrade to CU168 Testing (master/bbd4767f), which I did (twice, once for /dev/sdb, and also for /dev/sda). Both times I ended up with a system running on one disk only. At least it seems to be running without any problems.

Current state:

[root@ipfire ~]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0 931.5G  0 disk
└─md127   9:127  0     0B  0 md
sdb       8:16   0 931.5G  0 disk
├─sdb1    8:17   0     4M  0 part
├─sdb2    8:18   0   128M  0 part /boot
├─sdb3    8:19   0    32M  0 part /boot/efi
├─sdb4    8:20   0     1G  0 part [SWAP]
└─sdb5    8:21   0 930.3G  0 part /

[root@ipfire ~]# cat /opt/pakfire/db/core/mine
168
Hallo @Dirk. As @Michael was unsure whether the changes he had made would fix things or not, he did not put them into the CU168 Testing repository.

You have to run the changes manually on your system.
(In reply to Adolf Belka from comment #46)
> Hallo @Dirk. As @Michael was unsure whether the changes he had made would fix
> things or not, he did not put them into the CU168 Testing repository.
>
> You have to run the changes manually on your system.

Peter is away, and I thought it might be better to wait for a little bit more feedback.

As soon as we think it is okay, I will merge it.
How can I get the script so I can test it on my system?
(In reply to Dirk Sihling from comment #48)
> How can I get the script so I can test it on my system?

See link in comment 40 from @Michael.
(In reply to Michael Tremer from comment #47)
> Peter is away, and I thought it might be better to wait for a little bit
> more feedback.
>
> As soon as we think it is okay, I will merge it.

Hi @Michael, in case you haven't seen my feedback from testing out the script on my vm: I ended up with a Fatal Error message in my test. The feedback is in Comment 43.
(In reply to Adolf Belka from comment #50)
> The feedback is in Comment 43.

Oh, I seem to have missed that. Yes, you would need to update the bootloader configuration.

So, first you run the RAID repair script; then you add the "rd.auto" parameter and then rebuild the GRUB configuration. That definitely wasn't clear from my patch.

Do you have the chance to test again, or how can we get this tested any further?
I retried the vm test, following the opposite order to what I did the first time: upgraded CU167 to CU168, then ran the repair-mdraid script, then added rd.auto to /etc/default/grub, then ran grub-mkconfig and then rebooted, but got exactly the same message.

I am cloning another version of my CU167 vm right now and will then run the CU168 update and the same script etc. but not do the reboot, so I can check any files etc. that you need info on.
A quick check on the order of running the patches: should I run repair-mdraid followed by the rd.auto fix and grub-mkconfig after doing the CU168 upgrade, or should I run them on the CU167 version before doing the CU168 upgrade? I have been doing the first sequence, where the patch changes are applied after doing the CU168 upgrade but before rebooting.
I created a new clone of my CU167 vm and updated to CU168, then ran repair-mdraid, then added rd.auto and then ran grub-mkconfig.

Checking the disks with fdisk I got the following for sda and sdb, the two disks for the raid array:

Disk /dev/sda: 50 GiB, 53687091200 bytes, 104857600 sectors
Disk model: VBOX HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
GPT PMBR size mismatch (104857343 != 104857599) will be corrected by write.
The backup GPT table is not on the end of the device.

Disk /dev/sdb: 50 GiB, 53687091200 bytes, 104857600 sectors
Disk model: VBOX HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: B47388B7-F5D4-49D0-A44A-4F9E62EC3697

Device      Start       End   Sectors   Size Type
/dev/sdb1    2048     10239      8192     4M BIOS boot
/dev/sdb2   10240    272383    262144   128M Linux filesystem
/dev/sdb3  272384    337919     65536    32M EFI System
/dev/sdb4  337920    838109    500190 244.2M Linux swap
/dev/sdb5  838110 104851455 104013346  49.6G Linux filesystem

So for sda, which was the disk being shown in the /proc/mdstat output with the line

md127 : inactive sda[0](S)

there are no partitions shown, but there is a GPT PMBR size mismatch message which is not shown for sdb.

Not sure if this is linked to the "FATAL: No bootable medium found! System halted" error message that I am still getting or not. It doesn't look like what I would have expected to see for sda.
I also tried parted, in case that gave different results as the disks are gpt based, but it also doesn't recognise the partition table for sda; it has unknown instead of gpt.

parted -l
Error: /dev/sda: unrecognised disk label
Model: ATA VBOX HARDDISK (scsi)
Disk /dev/sda: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags:

Warning: Not all of the space available to /dev/sdb appears to be used, you can fix the GPT to use all of the space (an extra 256 blocks) or continue with the current setting?
Fix/Ignore? I
Model: ATA VBOX HARDDISK (scsi)
Disk /dev/sdb: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system     Name     Flags
 1      1049kB  5243kB  4194kB                  BOOTLDR  bios_grub
 2      5243kB  139MB   134MB   ext4            BOOT
 3      139MB   173MB   33.6MB  fat16           ESP      boot, esp
 4      173MB   429MB   256MB   linux-swap(v1)  SWAP     swap
 5      429MB   53.7GB  53.3GB  ext4            ROOT
I will probably be able to run a few further tests on my vm setup today, but after today my vm setup won't be available for a while, so I won't be able to do any further testing or evaluation.
I ran parted -l on a clone of the CU167 vm. This clone has the other disk drive missing from the raid array:

cat /proc/mdstat
Personalities :
md127 : inactive sdb[1](S)
      52428724 blocks super 1.0

parted -l gave

parted -l
Warning: Not all of the space available to /dev/sda appears to be used, you can fix the GPT to use all of the space (an extra 256 blocks) or continue with the current setting?
Fix/Ignore? I
Model: ATA VBOX HARDDISK (scsi)
Disk /dev/sda: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system     Name     Flags
 1      1049kB  5243kB  4194kB                  BOOTLDR  bios_grub
 2      5243kB  139MB   134MB   ext4            BOOT
 3      139MB   173MB   33.6MB  fat16           ESP      boot, esp
 4      173MB   429MB   256MB   linux-swap(v1)  SWAP     swap
 5      429MB   53.7GB  53.3GB  ext4            ROOT

Warning: Not all of the space available to /dev/sdb appears to be used, you can fix the GPT to use all of the space (an extra 256 blocks) or continue with the current setting?
Fix/Ignore? I
Model: ATA VBOX HARDDISK (scsi)
Disk /dev/sdb: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system     Name     Flags
 1      1049kB  5243kB  4194kB                  BOOTLDR  bios_grub
 2      5243kB  139MB   134MB   ext4            BOOT
 3      139MB   173MB   33.6MB  fat16           ESP      boot, esp
 4      173MB   429MB   256MB   linux-swap(v1)  SWAP     swap
 5      429MB   53.7GB  53.3GB  ext4            ROOT

So both drives are still showing the partition tables as would be expected.

Then I ran the repair-mdraid script on this and ran parted -l again and got

parted -l
Warning: Not all of the space available to /dev/sda appears to be used, you can fix the GPT to use all of the space (an extra 256 blocks) or continue with the current setting?
Fix/Ignore? I
Model: ATA VBOX HARDDISK (scsi)
Disk /dev/sda: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system     Name     Flags
 1      1049kB  5243kB  4194kB                  BOOTLDR  bios_grub
 2      5243kB  139MB   134MB   ext4            BOOT
 3      139MB   173MB   33.6MB  fat16           ESP      boot, esp
 4      173MB   429MB   256MB   linux-swap(v1)  SWAP     swap
 5      429MB   53.7GB  53.3GB  ext4            ROOT

Error: /dev/sdb: unrecognised disk label
Model: ATA VBOX HARDDISK (scsi)
Disk /dev/sdb: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags:

So it looks like running the repair-mdraid script is causing one of the drives to end up with an unknown Partition Table, and it is the drive that was listed in mdstat as being part of the raid array but inactive.
Having read through the repair-mdraid script a bit more closely, what I am finding with parted probably is what would be expected, because part of the script is to remove the partition table from the bad device.

In the script there is a section raid-rebuild which should run right at the end of the script. That looks to be searching for any raid elements in /dev/md/ and looking for ipfire:0, but the directory md does not exist under /dev/. It could well be that I am not understanding well enough what the script should be doing.

I have not been able to figure out how to make the required drive show up as bootable to my vm, so I keep getting the "FATAL: No bootable medium found! System halted" message.
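In case it helps narrow that down: as far as I understand it, the /dev/md/ipfire:0 links are normally only created by the mdadm udev rules once an array has actually been assembled, so their absence here would be consistent with the array never coming up. Something like this (standard commands, nothing from the script itself) shows what is currently there:

ls -l /dev/md* /dev/md/ 2>/dev/null
mdadm --detail --scan
cat /proc/mdstat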
Hello all,

first, thanks for all the testing feedback and reporting back.

Based on what I understood, the RAID repair script has one important pitfall: If for some reason the machine cannot boot from the second SSD/HDD - which appears to be the case in at least one of Adolf's testing attempts -, a reboot will cause the IPFire installation not to come back up again. Manual interaction is then required to fix this, with physical presence at worst.

Core Update 168 has been in testing for three weeks now, and contains some security-relevant items. Therefore, I would like to propose the following procedure:

(a) Michael's RAID repair script will go into the update and is executed automatically during the upgrade procedure, if the installation is found to run on a RAID.

(b) The release announcement for Core Update 168 will come with a very strong note regarding the situation, and urge IPFire users running their installations on a RAID to manually check that things are going to be okay _before_ conducting the first reboot after having upgraded.

Does this sound reasonable to you?

Thanks, and best regards,
Peter Müller
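PS: for (b), the manual check before the first reboot could be as simple as the commands that have been used throughout this report (a sketch; device and partition names will differ per installation):

# the array should be assembled and (re)syncing, not "inactive"
cat /proc/mdstat

# check which devices the filesystems are actually mounted from
lsblk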
https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=69aac83da960bc89783aa8dc5373b907cccc60f8
https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=71d53192d37db0d86a9dc04b11aa40016ba09b47

@Michael: Are there any further steps required for this?
(In reply to Peter Müller from comment #59)
> Core Update 168 has been in testing for three weeks now, and contains some
> security-relevant items. Therefore, I would like to propose the following
> procedure:
>
> (a) Michael's RAID repair script will go into the update and is executed
> automatically during the upgrade procedure, if the installation is found to
> run on a RAID.
>
> (b) The release announcement for Core Update 168 will come with a very
> strong note regarding the situation, and urge IPFire users running their
> installations on a RAID to manually check that things are going to be okay
> _before_ conducting the first reboot after having upgraded.
>
> Does this sound reasonable to you?

I am probably the least experienced one here, but for me this sounds ok and I can't think of any other applicable procedure.

My problem was solved by a fresh install and importing a backup of the configuration, which worked well.
(In reply to Dirk Sihling from comment #61)
> (In reply to Peter Müller from comment #59)
> > Core Update 168 has been in testing for three weeks now, and contains some
> > security-relevant items. Therefore, I would like to propose the following
> > procedure:
> >
> > (a) Michael's RAID repair script will go into the update and is executed
> > automatically during the upgrade procedure, if the installation is found to
> > run on a RAID.
> >
> > (b) The release announcement for Core Update 168 will come with a very
> > strong note regarding the situation, and urge IPFire users running their
> > installations on a RAID to manually check that things are going to be okay
> > _before_ conducting the first reboot after having upgraded.
> >
> > Does this sound reasonable to you?
>
> I am probably the least experienced one here, but for me this sounds ok and
> I can't think of any other applicable procedure.

Skilled enough to find this in the first place :)

I would like to thank everyone who has been working on this. This is one of the nastiest bugs that we have had in a while, because it isn't very easy to correct. The script is kind of best effort.

I absolutely would say that if your system is configured as a RAID but does not come up at all if the first hard drive fails, then it is configured incorrectly. However, I do not remember testing this extensively on my own systems either. This is therefore something that we will have to highlight, since it will cause a problem even though our script has been working fine. That would be a shame.

> My problem was solved by a fresh install and importing a backup of the
> configuration, which worked well.

Yes, I would say that at this point we should consider all affected systems broken, and the script is a bit of a shot in the dark with a certain probability of fixing it for good. I suppose that probability is rather high, but the factors that might bring it further down are outside of our control. Those can only be handled/mitigated by the person right in front of the system, so we need to make them aware.
https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=69aac83da960bc89783aa8dc5373b907cccc60f8
https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=71d53192d37db0d86a9dc04b11aa40016ba09b47
https://blog.ipfire.org/post/ipfire-2-27-core-update-168-released