SSD Trouble - Replacement of a tired unit
Sunday, August 31. 2025
Trouble
Operating multiple physical computers is a chore. Things do happen, especially at times when you don't expect any trouble. On a random Saturday morning, an email sent by a system daemon during early hours would look something like this:
The following warning/error was logged by the smartd daemon:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Device info:
SAMSUNG MZ7PC128HAFU-000L1, S/N:S0U8NSAC900712, FW:CXM06L1Q, 128 GB
For details see host's SYSLOG.
Aow crap! I'm about to lose data unless rapid action is taken.
Details of the trouble
Details from journalctl -u smartd:
Aug 30 00:27:40 smartd[1258]: Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Aug 30 00:27:40 smartd[1258]: Sending warning via /usr/libexec/smartmontools/smartdnotify to root ...
Aug 30 00:27:40 smartd[1258]: Warning via /usr/libexec/smartmontools/smartdnotify to root: successful
Then it hit me: My M.2 SSD is a WD. What is this Samsung I'm getting alerted about? Its this one:
Oh. THAT one! It's just a 2.5" S-ATA SSD used for testing stuff. I think I have a Windows VM running on it. If you look closely, there is word "FRU P/N" written in block letters. Also under the barcode there is "Lenovo PN" and "Lenovo C PN". Right, this unit manufactured in September 2012 was liberated from a Laptop needing more capacity. Then it ran one Linux box for a while and after I upgraded that box, drive ended up gathering dust to one of my shelves. Then I popped it back into another server and used it for testing.
It all starts coming back to me.
More details with parted /dev/sda print:
Model: ATA SAMSUNG MZ7PC128 (scsi)
Disk /dev/sda: 128GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1049kB 106MB 105MB fat32 EFI system partition boot, esp, no_automount
2 106MB 123MB 16.8MB Microsoft reserved partition msftres, no_automount
3 123MB 127GB 127GB ntfs Basic data partition msftdata, no_automount
4 127GB 128GB 633MB ntfs hidden, diag, no_automount
Oh yes, Definitely a Windows-drive. Further troubleshooting with smartctl /dev/sda -x:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
9 Power_On_Hours -O--CK 090 090 000 - 47985
12 Power_Cycle_Count -O--CK 095 095 000 - 4057
177 Wear_Leveling_Count PO--C- 017 017 017 NOW 2998
178 Used_Rsvd_Blk_Cnt_Chip PO--C- 093 093 010 - 126
179 Used_Rsvd_Blk_Cnt_Tot PO--C- 094 094 010 - 244
180 Unused_Rsvd_Blk_Cnt_Tot PO--C- 094 094 010 - 3788
190 Airflow_Temperature_Cel -O--CK 073 039 000 - 27
195 Hardware_ECC_Recovered -O-RC- 200 200 000 - 0
198 Offline_Uncorrectable ----CK 100 100 000 - 0
199 UDMA_CRC_Error_Count -OSRCK 253 253 000 - 0
233 Media_Wearout_Indicator -O-RCK 198 198 000 - 195
Just to keep this blog post brief, above is a shortened list of the good bits. Running the command spits out ~150 lines of information on the drive. Walking through what we see:
- Power on hours: ~48.000 is roughly 5,5 years.
- Since the unit manufacture of Sep -12 it has been powered on for over 40% of the time.
- Thank you for your service!
- Power cycle count: ~4000, well ... that's a few
- Wear level: ~3000. Or when processed 17. I have no idea what the unit of this would be or the meaning of this reading.
- Reserve blocks: 126 reserve used, still 3788 unused.
- That's good. Drive's internal diagnostics has found unreliable storage and moved my precious data out of it into reserve area.
- There is still plenty of reserve remaining.
- The worrying bit is obvious: bad blocks do exist in the drive.
- ECC & CRC errors: 0. Reading and writing still works, no hiccups there.
- Media wear: 195. Again, no idea of the unit nor meaning. Maybe a downwards counter?
Replacement
Yeah. Let's state the obvious. Going for the cheapest available unit is perfectly ok in this scenario. The data I'm about to lose won't be the most precious one. However, every single time I lose data, that's a tiny chunk stripped directly from my soul. I don't want any of that to happen.
Data Recovery
A simple transfer time dd if=/dev/sda of=/dev/sdd:
250069680+0 records in
250069680+0 records out
128035676160 bytes (128 GB, 119 GiB) copied, 4586.76 s, 27.9 MB/s
real 76m26.771s
user 4m30.605s
sys 14m49.729s
Hour and 16 minutes later my Windows-image was on a new drive. I/O-speed of 30 MB/second isn't much. With M.2 I'm used to a whole different readings. Do note, the replacement drive has twice the capacity. As it stands, 120 GB is plenty for the use-ase.
Going Mechanical
Some assembly with Fractal case:
Four phillips screws to the bottom of the drive. Plugging cables back. That's a solid 10 minute job. Closing the side cover of the case and booting the server to validate everything still working as expected.
New SMART
Doing a 2nd round of smartctl /dev/sda -x on the new drive:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate -O--CK 100 100 000 - 0
9 Power_On_Hours -O--CK 100 100 000 - 1
12 Power_Cycle_Count -O--CK 100 100 000 - 4
148 Unknown_Attribute ------ 100 100 000 - 0
149 Unknown_Attribute ------ 100 100 000 - 0
167 Write_Protect_Mode ------ 100 100 000 - 0
168 SATA_Phy_Error_Count -O--C- 100 100 000 - 0
169 Bad_Block_Rate ------ 100 100 000 - 54
170 Bad_Blk_Ct_Lat/Erl ------ 100 100 010 - 0/47
172 Erase_Fail_Count -O--CK 100 100 000 - 0
173 MaxAvgErase_Ct ------ 100 100 000 - 2 (Average 1)
181 Program_Fail_Count -O--CK 100 100 000 - 0
182 Erase_Fail_Count ------ 100 100 000 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
192 Unsafe_Shutdown_Count -O--C- 100 100 000 - 3
194 Temperature_Celsius -O---K 026 035 000 - 26 (Min/Max 23/35)
196 Reallocated_Event_Count -O--CK 100 100 000 - 0
199 SATA_CRC_Error_Count -O--CK 100 100 000 - 131093
218 CRC_Error_Count -O--CK 100 100 000 - 0
231 SSD_Life_Left ------ 099 099 000 - 99
233 Flash_Writes_GiB -O--CK 100 100 000 - 173
241 Lifetime_Writes_GiB -O--CK 100 100 000 - 119
242 Lifetime_Reads_GiB -O--CK 100 100 000 - 1
244 Average_Erase_Count ------ 100 100 000 - 1
245 Max_Erase_Count ------ 100 100 000 - 2
246 Total_Erase_Count ------ 100 100 000 - 10512
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
Whoa! That's the fourth power on to a drive unboxed from a retail packaking. Three of them had to be in the manufacturing plant. Power on hours reads 1, that's not much. SSD life left 99 (I'm guessing %).
Finally
All's well. No data lost. Just my stress level jumping up.
My thinking is: If that new drive survives next 3 years running a Windows on top of a Linux, then it has served its purpose.