
I agree. I don't buy spares, but when a drive fails, the first thing I do is run an incremental backup, so I know my data is safe while I wait for the replacement drive.

Also worth noting: I don't think I have ever experienced a hard failure. Usually the unrecoverable error count shoots up over more than one event, which tells me it's time to replace the drive, so I don't wait for the array to become degraded.

But I guess that's the important point: monitor your drives. Synology will do that for you, but you should monitor all your other drives too. I have a script that uploads the SMART data from all my drives across all my machines to a central location, so I can keep an eye on SSD wear levels, SSD bytes written (sometimes you get surprises), free disk space and SMART errors.

Do you have a link to your script? Mostly I'd love to have a good dashboard for that data.

Not the full script, but I can share some pointers.

I use smartctl to extract the SMART data, as it works very well.

Generally "smartctl -j --all -l devstat -l ssd /dev/sdXXX". You might need to add "-d sat" to capture certain devices on Linux (like a drive in an expansion unit on a Synology). By the way, Synology ships an ancient version of smartctl, but you can xcopy a newer build onto it. "-j" exports the output as JSON.
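
For anyone who wants a starting point, here is a minimal sketch of wrapping that command. It is in Python rather than the C# used further down, and it is not the script described above. The function names and the `runner` parameter are made up for illustration; only the smartctl flags come from the comment.

```python
import json
import subprocess


def build_smartctl_command(device, use_sat=False):
    """Build the smartctl invocation quoted above for one device."""
    cmd = ["smartctl", "-j", "--all", "-l", "devstat", "-l", "ssd"]
    if use_sat:
        # Some devices only respond when the SAT device type is forced.
        cmd += ["-d", "sat"]
    cmd.append(device)
    return cmd


def read_smart_json(device, use_sat=False, runner=subprocess.run):
    """Run smartctl and return (raw_text, parsed_dict).

    smartctl uses its exit status as a bit mask, so a non-zero status does
    not by itself mean the JSON is unusable; we parse whatever was printed.
    `runner` is a hypothetical hook so the call can be replaced in tests.
    """
    result = runner(
        build_smartctl_command(device, use_sat),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout, json.loads(result.stdout)
```

Something like `raw, data = read_smart_json("/dev/sda")` would then give you both the raw text to archive and the parsed dictionary. It usually needs root.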

Then you need to do a bit of magic to normalise the data. For example, some wear levels are expressed as health (starting at 100) and others as percent used (starting at 0). There are also different versions of the SMART data: "-l devstat" outputs a much more useful set of stats, but older SSDs won't support it.
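
A rough sketch of that normalisation, in Python, mapping everything to "percent used" (0 = new). The JSON keys follow what recent smartctl versions emit as far as I know, but the set of health-style attribute names is an assumption you would have to adjust for your own drives, and the function names are hypothetical.

```python
def find_devstat(data, page_name, stat_name):
    """Look up one value in the "-l devstat" section of smartctl's JSON."""
    for page in data.get("ata_device_statistics", {}).get("pages", []):
        if page.get("name") == page_name:
            for entry in page.get("table", []):
                if entry.get("name") == stat_name:
                    return entry.get("value")
    return None


# Assumption: for these attributes the normalised value counts down from 100.
HEALTH_STYLE_ATTRIBUTES = {
    "Wear_Leveling_Count",
    "Media_Wearout_Indicator",
    "SSD_Life_Left",
}


def percent_used(data):
    """Return wear as percent used (0 = new), or None if nothing is reported."""
    nvme = data.get("nvme_smart_health_information_log")
    if nvme and nvme.get("percentage_used") is not None:
        return nvme["percentage_used"]  # already counts up from 0
    devstat = find_devstat(
        data, "Solid State Device Statistics", "Percentage Used Endurance Indicator"
    )
    if devstat is not None:
        return devstat  # also counts up from 0
    for att in data.get("ata_smart_attributes", {}).get("table", []):
        if att.get("name") in HEALTH_STYLE_ATTRIBUTES:
            return 100 - att["value"]  # health counts down, so flip it
    return None
```

The order of the checks mirrors the preference stated above: use the newer, better-defined statistics when the drive has them, and fall back to vendor attributes only when it does not.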

Host writes are probably the messiest part, because sometimes they are expressed in blocks, or units of 32MB, or something else. My logic is:

  if (nvme_smart_health_information_log != null)
  {
   // NVMe data units are thousands of 512-byte units, whatever the logical block size
   return nvme_smart_health_information_log.data_units_written * 512 * 1000;
  }
  if (scsi_error_counter_log?.write != null)
  {
   // smartctl reports this column in units of 10^9 bytes
   return (long)(double.Parse(scsi_error_counter_log.write.gigabytes_processed) * 1000 * 1000 * 1000);
  }
  var devstat = GetAtaDeviceStat("General Statistics", "Logical Sectors Written");
  if (devstat != null)
  {
   return devstat.value * logical_block_size;
  }
  if (ata_smart_attributes?.table != null)
  {
   foreach (var att in ata_smart_attributes.table)
   {
    var name = att.name;
    if (name == "Host_Writes_32MiB")
    {
     return att.raw.value * 32 * 1024 * 1024;
    }
    if (name == "Host_Writes_GiB" || name == "Total_Writes_GB" || name == "Total_Writes_GiB")
    {
     return att.raw.value * 1024 * 1024 * 1024;
    }
    if (name == "Host_Writes_MiB")
    {
     return att.raw.value * 1024 * 1024;
    }
    if (name == "Total Host Writes")
    {
     return att.raw.value;
    }
    if (name == "Total LBAs Written" || name == "Total_LBAs_Written" || name == "Cumulative Host Sectors Written")
    {
     return att.raw.value * logical_block_size;
    }
   }

  }

And even that fails in some cases where the logical block size is 4096.

I think you need to test it against your own drive estate. My advice: just store the raw JSON output from smartctl centrally, and re-parse it as you improve your logic for all these edge cases based on your own drives.
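
To make that concrete, a minimal Python sketch of "store raw, re-parse later". The directory layout (root/host/device/timestamp.json) and both function names are assumptions for illustration, not how the script described above works.

```python
import json
import time
from pathlib import Path


def store_raw(root, host, device, raw_json, timestamp=None):
    """Save one untouched smartctl JSON document under root/host/device/."""
    stamp = int(time.time() if timestamp is None else timestamp)
    safe_device = device.strip("/").replace("/", "_")  # "/dev/sda" -> "dev_sda"
    folder = Path(root) / host / safe_device
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{stamp}.json"
    path.write_text(raw_json, encoding="utf-8")
    return path


def reparse_all(root, parse):
    """Re-run `parse` over every stored snapshot, oldest first per device."""
    paths = sorted(
        Path(root).glob("*/*/*.json"),
        key=lambda p: (p.parts[-3], p.parts[-2], int(p.stem)),
    )
    results = []
    for path in paths:
        data = json.loads(path.read_text(encoding="utf-8"))
        results.append((path.parts[-3], path.parts[-2], int(path.stem), parse(data)))
    return results
```

Because the stored files are never modified, fixing a unit-conversion bug just means changing `parse` and calling `reparse_all` again over the whole history.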



