
Goal

I am trying to figure out why my file system has become read-only, so I can address any potential hardware or security issues (my main concern) and, ideally, fix the problem without reinstalling everything and migrating my files from backup (I might lose some data, but probably not much).

According to the manual of btrfs check:

Do not use --repair unless you are advised to do so by a developer or an experienced user, and then only after having accepted that no fsck can successfully repair all types of filesystem corruption. E.g. some other software or hardware bugs can fatally damage a volume.

I am thinking of trying the --repair option or btrfs scrub but want input from a more experienced user.

What I’ve tried

I first noticed the read-only file system when trying to update my system in the terminal. I was told: Cannot open log file: (30) - Read-only file system [/var/log/dnf5.log]

I have run basic checks (using at least three different programs) on my SSD without finding anything obviously wrong. The SSD and everything else in my computer are about six and a half years old, so maybe something is failing. Here is the SMART Data section of the output from sudo smartctl -a /dev/nvme0n1:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning: 0x00
Temperature: 31 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 1%
Data Units Read: 33,860,547 [17.3 TB]
Data Units Written: 31,419,841 [16.0 TB]
Host Read Commands: 365,150,063
Host Write Commands: 460,825,882
Controller Busy Time: 1,664
Power Cycles: 8,158
Power On Hours: 1,896
Unsafe Shutdowns: 407
Media and Data Integrity Errors: 0
Error Information Log Entries: 4,286
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 31 Celsius
Temperature Sensor 2: 30 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06, NSID 0xffffffff)
Self-test status: No self-test in progress
No Self-tests Logged
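
For what it's worth, the self-test log above shows that no self-test has ever been run on this drive. Newer smartmontools releases can trigger one on NVMe devices and report the result; a hedged suggestion, assuming a reasonably recent smartmontools, would be:

# start a long (extended) self-test on the NVMe drive
sudo smartctl -t long /dev/nvme0n1

# once it finishes, read back the self-test log
sudo smartctl -l selftest /dev/nvme0n1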

I tried the following, I think from a live disk: sudo mount -o remount,rw /mount/point, but that produced an error such as "cannot complete: read-only file system".
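
For context: when btrfs hits metadata corruption it forces the filesystem read-only, and a plain remount,rw is refused until the underlying error is dealt with. Whether that is what happened here can usually be confirmed with something like the following (the grep pattern is just a guess at the relevant messages):

# show the mount options actually in effect for /
findmnt -no OPTIONS /

# look for the kernel message explaining the forced read-only flip
sudo dmesg | grep -iE 'forced readonly|btrfs.*error'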

Both sudo btrfs device stats /home and sudo btrfs device stats / output:

[/dev/mapper/luks-7215db73-54d1-437e-875d-f82fae508b5d].write_io_errs 0
[/dev/mapper/luks-7215db73-54d1-437e-875d-f82fae508b5d].read_io_errs 0
[/dev/mapper/luks-7215db73-54d1-437e-875d-f82fae508b5d].flush_io_errs 0
[/dev/mapper/luks-7215db73-54d1-437e-875d-f82fae508b5d].corruption_errs 14
[/dev/mapper/luks-7215db73-54d1-437e-875d-f82fae508b5d].generation_errs 0

This seems to suggest that corruption is only in the /home directory.
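
One caveat: btrfs device stats reports per-device counters, and both commands above point at the same /dev/mapper device, so the 14 corruption errors are counted for the device as a whole. Whether / and /home are just separate subvolumes of that one filesystem could be confirmed with, for example:

# show which source device (and subvolume) backs each mount point
findmnt -no SOURCE,FSTYPE /
findmnt -no SOURCE,FSTYPE /home

# show subvolume details for /home
sudo btrfs subvolume show /home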

However, sudo btrfs check /dev/mapper/luks-7215db73-54d1-437e-875d-f82fae508b5d stops at [5/8] checking fs roots with the end of the output at the top of this image:

[screenshot: end of the output from sudo btrfs check /dev/mapper/luks-7215db73-54d1-437e-875d-f82fae508b5d]

Some of these files may be in the / directory, but I'm not sure without looking into it further.

sudo btrfs fi usage / provides:

[screenshot: output of sudo btrfs fi usage /]

I think the Data, single / Metadata, DUP / System, DUP profiles might mean the corruption can be repaired if it is only in the metadata or system chunks (which are duplicated), but not if it is in the actual file data, which has only a single copy. Might be something to explore further.
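
One way to find out where the corruption actually sits is a scrub: with DUP metadata, btrfs can usually repair metadata checksum errors from the second copy, while single-profile data cannot be auto-repaired without another copy. A read-only pass is the cautious starting point (assuming the kernel allows scrubbing a filesystem that is mounted read-only):

# read-only scrub: verify every checksum but do not attempt any repair
sudo btrfs scrub start -Br /

# if it was backgrounded instead, check progress and error counts later
sudo btrfs scrub status /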

Here is vi /etc/fstab:

[screenshot: contents of /etc/fstab]

sudo dmesg | grep -i "btrfs" states:

[screenshot: output of sudo dmesg | grep -i "btrfs"]

The file system is indeed unstable. Once, I wasn’t able to list any files in my /home directory, but I haven't run into this issue again across several reboots.

What I think might be causing this

I suspect that changing my username, hostname, and display name (shown on the login screen) recently may have caused problems because my file system became read-only about a week to a week and a half after doing so. I followed some tutorials online, but I noticed that many of my files still had the group and possibly user belonging to the old username. So I created a symbolic link at the top of my home directory pointing the old username to the new one, and it seemed like everything was fine until the read-only issue. There may have been more I did, but I don’t remember exactly as it’s been a few weeks now. I have a history of most or all of the commands I ran if it might be helpful.
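
For reference, whether files under the home directory are still owned by the old account could be checked (and, once the filesystem is writable again, fixed) with something like the following; newname is a placeholder, not the actual username:

# list anything under the home directory not owned by the new user or group
sudo find /home/newname \( ! -user newname -o ! -group newname \) -ls

# if leftovers turn up, reassign ownership to the new account
sudo chown -R newname:newname /home/newname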

I think it may be something hardware related, something I did, software bugs (maybe introduced by a recent update; I have a picture of the packages affected in my most recent dnf upgrade transaction, but I was unable to roll back or undo the upgrade because of the read-only file system), improper shutdowns (I may have done this while making the changes to the username, hostname, and display name), or a security issue.

  • 1
@horsey_guy just posted fstab near the end of the 'What I've tried' section. Thanks. Commented yesterday
  • 1
I just added a photo of dmesg filtered for "btrfs" near the end of the same section. Filtering for "error" turned up nothing else relevant AFAIK. Looking at journalctl -xb now. From what I've read so far, btrfs check --repair looks like a long shot to me as well Commented yesterday
  • 4
1. Please don't post images of text. Copy and paste the text itself into your question and format it as code by selecting it and pressing Ctrl-K or by using the editor's {} icon, or by adding a line containing three backticks before AND after the text. 2. You are almost certainly wrong about what you think is causing this. Changing the hostname etc. will not cause disk errors.
    – cas
    Commented yesterday
  • 2
    3. The most likely cause can be found in your statement "The SSD and everything else in my computer is about 6 and a half years old, so maybe something is failing." - 6 1/2 years is beyond the life expectancy of almost any SSD, certainly beyond that of typical consumer-grade drives. Try running smartctl -a on the SSD's real device node (not the /dev/mapper entry), and keep an eye out for the drive's age and lifetime and any FAILED/FAILING entries. maybe something like smartctl -a /dev/sdi | awk '$1 ~ /^(9|202)/ || /FAIL(ING|ED)/' which works for my ancient Crucial MX300 drives.
    – cas
    Commented yesterday
  • 1
smartctl -a doesn't show the attributes on an NVMe like it does for a SATA SSD - I assumed a SATA SSD was what you meant when you said "SSD" but didn't say "NVMe". I only realised you were talking about an NVMe when I saw the smartctl output. And, yeah, the info seems inconsistent and contradictory, and the power on hours seems completely wrong for a 6.5 year old drive. BTW, "Power On Hours" is exactly what the name implies - the count of hours where the drive has had power.
    – cas
    Commented 5 hours ago

2 Answers


software problem

If this is assumed to be a software problem (inconsistent filesystem on good hardware) then the quick, safe, and easy (compared to making a full backup) approach is to:

  1. set up device mapper manually for the block device to be checked,
  2. create an rw snapshot of the device (with CoW to RAM, to an empty "real" block device (if you have e.g. LVM with some free space in a VG), to a loop device backed by a file on a different filesystem, or to a USB stick), and
  3. run btrfs check --repair on the snapshot (a sketch follows below).
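
A minimal sketch of those three steps, assuming the volume is the LUKS mapping from the question, that it is not mounted (e.g. you are working from a live environment with the LUKS container unlocked), and that /mnt/other is some other writable filesystem with room for the copy-on-write data; every path and size below is an assumption, not a prescription:

# 1. sparse file plus loop device to hold the snapshot's copy-on-write data
sudo truncate -s 20G /mnt/other/btrfs-cow.img
COW=$(sudo losetup --find --show /mnt/other/btrfs-cow.img)

# 2. writable snapshot of the origin device (table sizes are in 512-byte sectors;
#    "N" means non-persistent CoW, "8" is the chunk size)
ORIGIN=/dev/mapper/luks-7215db73-54d1-437e-875d-f82fae508b5d
sudo dmsetup create btrfs-sandbox --table \
  "0 $(sudo blockdev --getsz $ORIGIN) snapshot $ORIGIN $COW N 8"

# 3. run the repair against the snapshot only; the real device stays untouched
sudo btrfs check --repair /dev/mapper/btrfs-sandbox

# inspect the result, then tear everything down
sudo dmsetup remove btrfs-sandbox
sudo losetup -d "$COW"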

If that works (i.e. the fsck does not rip your (virtual) btrfs volume to pieces) then you can safely repeat the command on the real device.

If this is something you want to try then I can help with the necessary commands.

hardware problem

If there is reason to assume that the hardware is failing then an image backup should be made first. The snapshot fsck approach can then be applied to the new hardware.

  • Yes, btrfs check --repair after a backup seems the most viable and sensible solution. I recommend the OP use something such as Clonezilla for backing up the whole partition or btrfs-clone for copying the filesystem (better option).
    – horsey_guy
    Commented yesterday

Your drive is ancient and dying - not quite dead yet but well on its way. It could fail completely AT ANY MOMENT.

Your only option is to replace it. ASAP. The urgency of replacing this drive cannot be overstated.

The easiest way will be to replace it with a new drive of the exact same size and use ddrescue to make a bit copy of the entire old drive to the new drive.
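
A sketch of that clone with GNU ddrescue, assuming the old drive is /dev/nvme0n1 and the new one shows up as /dev/nvme1n1 (verify the device names first; getting them backwards would destroy the data):

# first pass: copy everything that reads cleanly, skip the problem areas
sudo ddrescue -f -n /dev/nvme0n1 /dev/nvme1n1 rescue.map

# second pass: go back and retry the skipped/bad areas a few times
sudo ddrescue -f -r3 /dev/nvme0n1 /dev/nvme1n1 rescue.map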

However, since it's 6.5 years old, you may want to take advantage of the fact that newer SSDs come in far larger capacities for reasonable prices. In that case, it's probably easiest to start from scratch. The old SSD is still readable, so you should be able to copy the data off of it. Or start with a ddrescue clone of it and use gparted or something to expand the partitions & filesystems once you've copied it.

While you're at it, if you can physically fit another drive in the system, I recommend adding a second identical drive as a btrfs RAID-1 mirror drive so that you have at least some redundancy. One of the main points of using a filesystem like btrfs or zfs is error detection and correction. Without redundancy, you get error detection but not error correction. I don't know if you already have a 2nd drive in that btrfs pool or not because you haven't shown the underlying SSD(s), just the /dev/mapper nodes.
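
If a second drive is added later, converting the existing single-device filesystem to RAID-1 is roughly the following; the device path for the second drive (or its LUKS mapping, if it is also encrypted) is an assumption:

# add the new device to the mounted filesystem
sudo btrfs device add /dev/mapper/luks-newdrive /

# rewrite data and metadata so a copy exists on both drives
sudo btrfs balance start -dconvert=raid1 -mconvert=raid1 /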

NOTE: if you cannot replace the drive immediately, at least make a backup if you're not already making regular backups (which you should, because RAID and RAID-like filesystems are NOT a substitute for backups). At least you'll have a backup of your data while you're deciding what to buy and waiting for it to arrive.
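
Even a plain file-level copy counts. Since the filesystem still mounts (read-only), something like this would get /home onto an external drive; the destination path is an assumption:

# -aHAX preserves ownership, hard links, ACLs and extended attributes
sudo rsync -aHAX --info=progress2 /home/ /run/media/user/backup-drive/home-backup/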

Hardware is replaceable. Lost data is not.

  • 1
    The advice is just like the comment of this one too: unix.stackexchange.com/questions/754679/…
    – horsey_guy
    Commented yesterday
  • 2
    Is there any particular reason why you believe that a hardware issue is the likely root cause since I see none in the provided info? Unfortunately we have no SMART data (yet). The most pertinent device stats output shows no I/O error. Commented 14 hours ago
  • 1
I had a very similar corruption issue recently on a mirrored volume which I could replicate on either mirror. SMART data showed no issue, and a random simultaneous hardware issue on both drives seems extremely unlikely under those circumstances. I ended up sending the volume data onto a newly created volume and then resynced the new volume with its mirror. So far no more issues. Commented 14 hours ago
  • 1
@horsey_guy I also have errno=-5 IO failure from that link, thank you! It's at the bottom of the image corresponding to sudo dmesg | grep -i "btrfs" Commented 6 hours ago
  • 1
    The drive is 6.5 years old, well beyond its expected lifespan. It has started getting write errors, corruption, and remounting in read-only mode. These are signs that the drive is failing/has failed and needs to be replaced. The smartctl -a info that he added at my request is weird - it says 1896 power on hours (2.5 years which seems OK if it wasn't on 24/7) but 8158 power cycles which seem inconsistent with each other unless it gets power-cycled over 4 times per day. It's possible that it's just metadata corruption (it IS btrfs, after all) but that's very risky to assume.
    – cas
    Commented 6 hours ago
