SSD Trouble - Replacement of a tired unit
Sunday, August 31. 2025
Trouble
Operating multiple physical computers is a chore. Things happen, especially when you least expect any trouble. On a random Saturday morning, an email sent by a system daemon during the early hours looks something like this:
The following warning/error was logged by the smartd daemon:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Device info:
SAMSUNG MZ7PC128HAFU-000L1, S/N:S0U8NSAC900712, FW:CXM06L1Q, 128 GB
For details see host's SYSLOG.
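For the record, an alert like this doesn't appear out of thin air; it takes a monitoring directive in smartd.conf. A minimal sketch of one (the config path and mail target reflect my setup's assumptions, distros vary):
# /etc/smartd.conf
# Monitor all detected drives, track all attributes,
# run a short self-test nightly at 02:00, and mail root on trouble:
DEVICESCAN -a -s (S/../.././02) -m root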
Aw crap! I'm about to lose data unless rapid action is taken.
Details of the Trouble
Details from journalctl -u smartd:
Aug 30 00:27:40 smartd[1258]: Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Aug 30 00:27:40 smartd[1258]: Sending warning via /usr/libexec/smartmontools/smartdnotify to root ...
Aug 30 00:27:40 smartd[1258]: Warning via /usr/libexec/smartmontools/smartdnotify to root: successful
Then it hit me: my M.2 SSD is a WD. What is this Samsung I'm getting alerted about? It's this one:

Oh. THAT one! It's just a 2.5" S-ATA SSD used for testing stuff. I think I have a Windows VM running on it. If you look closely, the words "FRU P/N" are printed in block letters. Also, under the barcode there are "Lenovo PN" and "Lenovo C PN". Right, this unit, manufactured in September 2012, was liberated from a laptop needing more capacity. It then ran one Linux box for a while, and after I upgraded that box, the drive ended up gathering dust on one of my shelves. Later I popped it into another server and used it for testing.
It all starts coming back to me.
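A mental note for the next time memory fails me: no photography needed, the kernel will happily map a device name to the physical hardware:
# Model and serial are baked into the by-id symlink names:
ls -l /dev/disk/by-id/ | grep sda
# Or as a table:
lsblk -o NAME,MODEL,SERIAL /dev/sda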
More details with parted /dev/sda print:
Model: ATA SAMSUNG MZ7PC128 (scsi)
Disk /dev/sda: 128GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 
Number  Start   End    Size    File system  Name                          Flags
 1      1049kB  106MB  105MB   fat32        EFI system partition          boot, esp, no_automount
 2      106MB   123MB  16.8MB               Microsoft reserved partition  msftres, no_automount
 3      123MB   127GB  127GB   ntfs         Basic data partition          msftdata, no_automount
 4      127GB   128GB  633MB   ntfs                                       hidden, diag, no_automount
Oh yes, definitely a Windows drive. Further troubleshooting with smartctl /dev/sda -x:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  9 Power_On_Hours          -O--CK   090   090   000    -    47985
 12 Power_Cycle_Count       -O--CK   095   095   000    -    4057
177 Wear_Leveling_Count     PO--C-   017   017   017    NOW  2998
178 Used_Rsvd_Blk_Cnt_Chip  PO--C-   093   093   010    -    126
179 Used_Rsvd_Blk_Cnt_Tot   PO--C-   094   094   010    -    244
180 Unused_Rsvd_Blk_Cnt_Tot PO--C-   094   094   010    -    3788
190 Airflow_Temperature_Cel -O--CK   073   039   000    -    27
195 Hardware_ECC_Recovered  -O-RC-   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   253   253   000    -    0
233 Media_Wearout_Indicator -O-RCK   198   198   000    -    195
Just to keep this blog post brief, the above is a shortened list of the good bits; running the command spits out ~150 lines of information on the drive. Walking through what we see:
- Power-on hours: ~48,000, which is roughly 5.5 years (quick math in the sketch after this list).
	- Since the unit was manufactured in September 2012, it has been powered on for over 40% of the time.
	- Thank you for your service!
 
- Power cycle count: ~4000, well ... that's a few
- Wear leveling count: raw value ~3000, or 17 normalized. I have no idea what the unit of the raw reading would be, but the normalized value is telling: it has dropped to its threshold of 17, and the FAIL column reads NOW. That is what tripped the SMART self-check.
- Reserve blocks: 126 used on the worst chip, 244 used in total, still 3788 unused.
	- That's good. The drive's internal diagnostics have found unreliable storage and moved my precious data out of it into the reserve area.
	- There is still plenty of reserve remaining.
	- The worrying bit is obvious: bad blocks do exist in the drive.
 
- ECC & CRC errors: 0. Reading and writing still work, no hiccups there.
- Media wearout indicator: 195. Again, no idea of the unit or the meaning. Maybe a downwards counter?
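Since I ended up doing that arithmetic by hand, here is a throwaway one-liner version of it. A sketch assuming the classic smartctl -A column layout, where the raw value is the tenth field:
smartctl -A /dev/sda | awk '
  $2 == "Power_On_Hours"          { printf "powered on ~%.1f years\n", $10 / 8760 }
  $2 == "Wear_Leveling_Count"     { print "wear leveling raw:", $10 }
  $2 == "Unused_Rsvd_Blk_Cnt_Tot" { print "reserve blocks left:", $10 }
'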
Replacement
Yeah. Let's state the obvious. Going for the cheapest available unit is perfectly ok in this scenario. The data I'm about to lose won't be the most precious one. However, every single time I lose data, that's a tiny chunk stripped directly from my soul. I don't want any of that to happen.
Data Recovery
A simple transfer with time dd if=/dev/sda of=/dev/sdd:
250069680+0 records in
250069680+0 records out
128035676160 bytes (128 GB, 119 GiB) copied, 4586.76 s, 27.9 MB/s
real    76m26.771s
user    4m30.605s
sys     14m49.729s
An hour and 16 minutes later, my Windows image was on the new drive. An I/O speed below 30 MB/second isn't much; with M.2 I'm used to whole different readings. Do note, the replacement drive has twice the capacity. As it stands, 120 GB is plenty for the use case.
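In hindsight the run could have been friendlier. Plain dd defaults to 512-byte blocks, which wastes time on syscall overhead, and on a drive that is actively failing GNU ddrescue is the purpose-built tool, since it retries and maps unreadable areas. A sketch of what I'd reach for next time (same device names; the mapfile path is my own pick):
# dd with a sane block size, progress output, and zero-fill on read errors:
dd if=/dev/sda of=/dev/sdd bs=1M conv=noerror,sync status=progress

# Or ddrescue, which logs bad areas into a mapfile and can resume:
ddrescue /dev/sda /dev/sdd /root/sda.map

# Verify the clone byte-for-byte up to the source drive's size:
cmp -n "$(blockdev --getsize64 /dev/sda)" /dev/sda /dev/sdd && echo clone OK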
Going Mechanical
Some assembly with the Fractal case:

Four Phillips screws into the bottom of the drive, cables plugged back in. That's a solid 10-minute job. Then closing the side cover of the case and booting the server to validate that everything still works as expected.
New SMART
Doing a second round of smartctl /dev/sda -x on the new drive:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     -O--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    1
 12 Power_Cycle_Count       -O--CK   100   100   000    -    4
148 Unknown_Attribute       ------   100   100   000    -    0
149 Unknown_Attribute       ------   100   100   000    -    0
167 Write_Protect_Mode      ------   100   100   000    -    0
168 SATA_Phy_Error_Count    -O--C-   100   100   000    -    0
169 Bad_Block_Rate          ------   100   100   000    -    54
170 Bad_Blk_Ct_Lat/Erl      ------   100   100   010    -    0/47
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 MaxAvgErase_Ct          ------   100   100   000    -    2 (Average 1)
181 Program_Fail_Count      -O--CK   100   100   000    -    0
182 Erase_Fail_Count        ------   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
192 Unsafe_Shutdown_Count   -O--C-   100   100   000    -    3
194 Temperature_Celsius     -O---K   026   035   000    -    26 (Min/Max 23/35)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
199 SATA_CRC_Error_Count    -O--CK   100   100   000    -    131093
218 CRC_Error_Count         -O--CK   100   100   000    -    0
231 SSD_Life_Left           ------   099   099   000    -    99
233 Flash_Writes_GiB        -O--CK   100   100   000    -    173
241 Lifetime_Writes_GiB     -O--CK   100   100   000    -    119
242 Lifetime_Reads_GiB      -O--CK   100   100   000    -    1
244 Average_Erase_Count     ------   100   100   000    -    1
245 Max_Erase_Count         ------   100   100   000    -    2
246 Total_Erase_Count       ------   100   100   000    -    10512
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
Whoa! That's the fourth power-on for a drive just unboxed from retail packaging. Three of them must have happened at the manufacturing plant. Power-on hours reads 1; that's not much. SSD life left is 99 (I'm guessing %).
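Before trusting the newcomer with anything, it can be made to prove itself. Kicking off an extended self-test costs nothing and gives a baseline:
# Start a long self-test; the drive reports an estimated completion time:
smartctl -t long /dev/sda
# Check the verdict once it's done:
smartctl -l selftest /dev/sda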
Finally
All's well. No data lost. Just my stress level jumping up.
My thinking is: if this new drive survives the next three years running a Windows on top of a Linux, it will have served its purpose.


