When Backups Fail

Whenever someone tells me, either personally or professionally, that they’re getting a new computer, my first words are “What about backup?” I liked to think that I took my own advice and that I was well-protected against hard drive failure. This is the story of how I was wrong.

Almost.

The bold beginning

Let’s start with the backup arrangement I had used for the past year, ever since I had a major file loss.

[Image: Slide1]

Here’s the breakdown of that picture:

  • Macintosh HD – a 3TB Fusion Drive (combined solid-state drive (SSD) and hard drive); the main hard drive of my iMac.
  • Data – a 3TB hard drive inside an Apple Airport Express; my Time Machine backup.
  • Dropbox – a cloud service remote backup for my /Users/seligman/Dropbox folder.
  • Backblaze – a second cloud service remote backup of Macintosh HD.

Three layers of backup coupled with fast SSDHD (solid-state drive/hard drive). What could go wrong?

Signs of trouble

Occasionally over the past few years I’d get a warning from Time Machine that the Data drive in the Time Capsule needed to be erased. I didn’t think much about it, because of the way Time Machine works: It duplicates the contents of the main drive, and saves any replaced files for as long as it can. But if the size of a file it wants to copy is greater than the amount of free space remaining on Time Machine, the backup process can get into trouble. As I used up more space on the main Macintosh HD drive, there was less space on the Time Machine backup to work with.

At this point, you may ask: The main drive was 3TB, as was the Time Machine backup. 3TB is a lot of storage for a personal computer. What was taking up so much space, with such large files?

Answer:

[Image: Disc Collection]

In other words, most of it consisted of digital media files from the DVDs and Blu-Rays I collected over the past twenty years. Not shown is a massive collection of CDs going back forty years (which I keep in an otherwise-unused huge drawer at work). Also not shown are digital media that I downloaded over the years, mostly purchased via iTunes.

The biggest remaining chunks are video files associated with creating YouTube clips for the Nevis Labs channel.

Much of it, at least in theory, I could recreate. In practice, it would mean a massive amount of effort. Spread over decades, it wasn’t much. To build it all up again seemed too painful to contemplate.

In 2019, I got at least two warnings about needing to rebuild the Time Machine drive. I clicked on the “go ahead” button and let Time Machine do its thing; that’s supposed to be the big advantage of Time Machine over other backup methods.

It occurred to me that this might mean the Time Machine hard drive was failing. I ran what tests I could; a drive in an Apple Airport Express is not directly visible to programs like Disk Utility. But those tests said that the drive was fine.

Still, I had a notion in the back of my mind that the Airport Express drive was 8-10 years old and it might be time to replace it. I had a thought about using an external drive instead, but I didn’t follow up on it… then.

The last time I received a warning that the Time Machine drive needed to be erased and rebuilt was in December of 2019. I automatically clicked “go ahead” and went blithely along.

Main drive failure

Early in January 2020 my iMac became slow and unresponsive. If I clicked on some user widget on the screen, it might take up to a minute to respond. It was a big shift from the usual fast speed from just a couple of weeks before.

At first I thought this was a font issue. The last time I saw a sudden slowdown of my Mac it turned out to be fonts that were the problem. At that time, about ten years ago, I used FontExplorer X Pro to deal with them. When I looked this time, I saw that I had more than a thousand fonts installed.

Of course, I don’t need a thousand fonts; I’m not a graphic designer. These fonts were the accumulation of a couple of decades of software installs: multiple versions of Microsoft Office, graphics programs, Adobe products, and so forth.

So I tried to remove fonts that I didn’t need. It made things worse than before; did you know that a Mac system won’t function unless all its Arabic fonts are installed? I had to reinstall the OS twice to recover… and still everything was slow.

I checked my memory use, and according to Activity Monitor I had plenty of free RAM.

I finally decided to go to my area of expertise. I started to use the Terminal instead of fancy graphics tools to solve the problem. You already know it turned out to be a hard drive problem. Specifically, it was the SSD portion of the Fusion Drive.

Diagnosis

Here are the technical details of how I came to this conclusion. If you don’t like to deal with sysadmin stuff, skip this section.

First, I wanted to check how much memory I was using. The command-line program for this in Mac OS X Darwin is vm_stat. Here’s the result on my iMac just now, after the problem was solved:

# vm_stat
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                              114622.
Pages active:                           1039899.
Pages inactive:                         1083290.
Pages speculative:                        14319.
Pages throttled:                              0.
Pages wired down:                        652636.
Pages purgeable:                          22198.
"Translation faults":                 782461912.
Pages copy-on-write:                    6341973.
Pages zero filled:                    457651989.
Pages reactivated:                     17660655.
Pages purged:                           1912200.
File-backed pages:                       510632.
Anonymous pages:                        1626876.
Pages stored in compressor:             3224421.
Pages occupied by compressor:           1289005.
Decompressions:                        13590655.
Compressions:                          19332473.
Pageins:                              484431874.
Pageouts:                                136862.
Swapins:                                      0.
Swapouts:                                     0.

Except that, when I executed this command on my busted Mac, the last two values (“Swapins” and “Swapouts”) were 9-digit numbers. At 4096 bytes per page, that works out to roughly a TB of memory swapped in and out.

Modern operating systems structure their memory in “pages”, chunks of memory that are handled as a unit. When all the physical chunks of memory in the computer are used up, pages are written out to disk. My iMac has 16GB RAM, so the number of pages to swap in and out should be zero, or at least very low. In fact, the only time I’ve ever seen those numbers non-zero is that one time I described above.
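To put numbers on that, here is a minimal sketch that turns the two swap counters into sizes. It assumes the vm_stat layout shown above (page size in the header line, counters ending in a period); the function name swap_traffic is made up for this example.

```shell
#!/bin/sh
# Sum up swap traffic from vm_stat output read on stdin.
# The page size is taken from the header line rather than assumed.
swap_traffic() {
    awk '
        /page size of/ { page = $8 + 0 }   # "... (page size of 4096 bytes)"
        /^Swapins:/    { ins  = $2 + 0 }   # "+ 0" drops the trailing "."
        /^Swapouts:/   { outs = $2 + 0 }
        END {
            gib = 1024 * 1024 * 1024
            printf "swapped in:  %.1f GiB\n", ins  * page / gib
            printf "swapped out: %.1f GiB\n", outs * page / gib
        }'
}

# Usage on a Mac: vm_stat | swap_traffic
```

As an illustration with an invented nine-digit count, 250,000,000 pages of 4096 bytes each comes to a little under 1 TiB, which is the order of magnitude I saw.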

What was swapping all those memory pages in and out doing to my Fusion Drive? I used smartmontools to find this out. (This utility is widely available for UNIX-like systems, but it’s not part of the normal Mac or Windows OS. I strongly recommend installing it.) When I use it on the SSD part of the Fusion Drive, there’s a lot of output that ends in this:

/usr/local/sbin/smartctl -a /dev/disk0 -s on
[...]
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x000f   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       9027
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       98098
169 Unknown_Apple_Attrib    0x0022   100   100   010    Old_age   Always       -       751793342176
173 Wear_Leveling_Count     0x0022   148   148   100    Old_age   Always       -       5871324366268
174 Host_Reads_MiB          0x0030   100   100   000    Old_age   Offline      -       131266311
175 Host_Writes_MiB         0x0030   100   100   000    Old_age   Offline      -       84080554
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       96
194 Temperature_Celsius     0x0022   062   062   000    Old_age   Always       -       38 (Min/Max 21/88)
197 Current_Pending_Sector  0x0032   000   000   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
244 Unknown_Attribute       0x0002   000   000   000    Old_age   Always       -       0

I identified which drive to query (the /dev/disk0 in the command line) by using the Terminal command diskutil list.

Compare the above result with the similar output on the SSD in my SSDHD on my computer at work:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       2
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       58608
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       320
177 Wear_Leveling_Count     0x0013   092   092   000    Pre-fail  Always       -       430
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   099   099   010    Pre-fail  Always       -       2
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   099   099   010    Pre-fail  Always       -       2
187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       3290
190 Airflow_Temperature_Cel 0x0032   066   060   000    Old_age   Always       -       34
195 ECC_Error_Rate          0x001a   199   199   000    Old_age   Always       -       3290
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       44
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       117217270543

The most obvious thing to note is that drives from different manufacturers can return different attributes.

Fortunately, both of these SSDs return the amount of data written to the drive, though in different units. The SSD on my iMac has the values:

174 Host_Reads_MiB  =   131266311
175 Host_Writes_MiB =    84080554

The unit “MiB” (which I believe is pronounced “mebibyte”) refers to exactly 1024*1024 = 1,048,576 bytes, as opposed to a MB (megabyte) which is 1,000,000 bytes. For the purposes of this discussion, it’s sufficient to treat both these units as the same.

So during the seven years I used the SSD in the Fusion Drive, the OS read 131TB and wrote 84TB. That latter number seemed a bit high to me. It’s a 3TB drive, so that value says I wrote to the drive 28 times its total size (and the SSD is only a small part of that 3TB). Considering that I never did much serious video editing or any other activity with a lot of output, it seemed like something was wrong.
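The arithmetic behind those figures can be sketched as follows, taking a MiB as 1024 × 1024 bytes. The function names mib_to_tb and drive_writes are hypothetical helpers, not part of smartmontools.

```shell
# Convert a count of MiB (1024 * 1024 bytes each) to decimal terabytes.
mib_to_tb() {
    awk -v mib="$1" 'BEGIN { printf "%.1f\n", mib * 1048576 / 1e12 }'
}

# How many times over a drive of the given size (in TB) was written.
drive_writes() {
    awk -v tb="$1" -v size="$2" 'BEGIN { printf "%.0f\n", tb / size }'
}

mib_to_tb 84080554    # Host_Writes_MiB from the iMac SSD; prints 88.2
drive_writes 84 3     # 84 TB written against a 3 TB drive; prints 28
```

The exact conversion lands about 5% above the rounded 84TB, which doesn’t change the conclusion.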

My SSD at work reports the total written in “LBAs” (LBA = Logical Block Address), which are 512 byte sectors. So we have:

241 Total_LBAs_Written =       117217270543

This comes to about 54.5 TB over roughly the same seven-year period. This is more plausible, since over the past five years I’ve done a lot of video downloads, conversions, and editing.
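A minimal sketch of the LBA conversion, assuming 512-byte sectors as stated above and reporting the result in binary terabytes (TiB, 2^40 bytes). The helper names are made up for this example.

```shell
# Convert a Total_LBAs_Written count to bytes, assuming 512-byte sectors.
lbas_to_bytes() {
    awk -v lbas="$1" 'BEGIN { printf "%.0f\n", lbas * 512 }'
}

# Express a byte count in binary terabytes (TiB, 2^40 bytes).
bytes_to_tib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024 * 1024) }'
}

# Work SSD: Total_LBAs_Written from the table above.
bytes_to_tib "$(lbas_to_bytes 117217270543)"
```

Counted in decimal terabytes (10^12 bytes) the same total is about 60 TB, so the choice of unit matters more here than the rounding does.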

The real kicker is the “wear leveling” parameter. A given sector of an SSD can only be written to a finite number of times. To prevent any given sector from wearing out, the hardware in the SSD automatically distributes sector writes across the entire range of the SSD. A typical SSD, even after many years of use, might have a wear level in the 90%-95% range. If an SSD gets to 50%, it’s worn out and needs to be replaced.

For my SSD in the Fusion Drive at home, I have:

173 Wear_Leveling_Count =   148

For the one at work:

177 Wear_Leveling_Count =   092

Again, for different manufacturers the way this value is displayed can vary. If the value is greater than 100, then you have to subtract 100 to get the leveling as a percent. So the SSD at home is at a 48% wear level, while the one at work is at an expected 92% wear level.
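Here is that rule of thumb as a small sketch. The subtract-100 reading is my interpretation of these two drives, not something smartctl defines, and wear_percent is a made-up name.

```shell
# Apply the rule of thumb above to a normalized Wear_Leveling_Count:
# values above 100 have 100 subtracted; others are read as a percentage as-is.
# awk does the arithmetic because a leading zero such as "092" would be
# rejected as invalid octal by shell arithmetic expansion.
wear_percent() {
    awk -v v="$1" 'BEGIN { v += 0; if (v > 100) v -= 100; print v }'
}

wear_percent 148   # iMac SSD, prints 48
wear_percent 092   # work SSD, prints 92
```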

Something wore out the SSD on my iMac. That’s why the Fusion Drive was so slow. I’m not going to copy-n-paste the values, but you can see that the power-on hours for the home SSD are much lower than those of the work SSD, further adding to the conclusion that something anomalous happened with my home computer.

That’s the analysis of the SSD part of the Fusion Drive. What about the hard drive part?

For my computer at home:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   090   074   006    Pre-fail  Always       -       227218413
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   037   037   020    Old_age   Always       -       65535
  5 Reallocated_Sector_Ct   0x0033   074   051   036    Pre-fail  Always       -       32432
  7 Seek_Error_Rate         0x000f   056   056   030    Pre-fail  Always       -       1808362289149
  9 Power_On_Hours          0x0032   061   061   000    Old_age   Always       -       34995
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   057   037   020    Old_age   Always       -       44830
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       2138
188 Command_Timeout         0x0032   100   036   000    Old_age   Always       -       332 332 407
189 High_Fly_Writes         0x003a   087   087   000    Old_age   Always       -       13
190 Airflow_Temperature_Cel 0x0022   049   043   045    Old_age   Always   In_the_past 51 (Min/Max 43/52 #5)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   098   098   000    Old_age   Always       -       5029
193 Load_Cycle_Count        0x0032   045   045   000    Old_age   Always       -       110367
194 Temperature_Celsius     0x0022   051   057   000    Old_age   Always       -       51 (0 13 0 0 0)
197 Current_Pending_Sector  0x0012   001   001   000    Old_age   Always       -       32712
198 Offline_Uncorrectable   0x0010   001   001   000    Old_age   Offline      -       32712
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       39388h+11m+48.323s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       46269392914
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       163923486796

What I’ve not shown is a bunch of error messages from smartctl that basically said there was an error in reading some of the parameters.

Other things to note:

  • Total_LBAs_Written = 46269392914 means that 21.5 TB were written to the hard drive. Compare that with 84 TB written to the SSD. The SSD was definitely bearing the brunt of sector refreshes.
  • Offline_Uncorrectable = 32712 means the drive is definitely going bad. Either the drive has 32712 bad sectors, or the count is wonky because it’s counting backwards from 32767. It’s bad either way.
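Picking the worrying rows out of a table like that by eye gets tedious, so here is a rough filter. It assumes the ten-column layout shown above. The choice of attributes 5, 197, and 198 as the ones to watch is this sketch’s assumption, and smart_flags is a hypothetical name.

```shell
# Read smartctl attribute rows on stdin and print the ones worth a second look:
#   - WHEN_FAILED (column 9) is anything other than "-", or
#   - the raw value (column 10) is nonzero for attributes 5, 197 or 198.
smart_flags() {
    awk '
        $1 !~ /^[0-9]+$/ { next }                 # skip headers and other text
        $9 != "-"        { print $2 ": " $9; next }
        ($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0 {
            print $2 ": raw " $10
        }'
}

# Usage: smartctl -A /dev/disk1 | smart_flags
```

Run against the hard drive table above, this would single out the reallocated, pending, and uncorrectable sector counts along with the temperature row that failed in the past.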

I won’t bore you (at least, not more than you are already) with the disk report on my computer at work, partly because its list of attributes doesn’t point to any direct problems with the drive.

It’s worth noting that if Apple’s utilities included some kind of SMART test and warning, I would have known there was a problem weeks or months ago. Instead, I had to find out about these problems the hard way.

What caused this? I’ll never know for sure. Here’s my guess:

  • Some process or program went wonky on my iMac. Possibly this was related to problems in the hard drive portion of the Fusion Drive.
  • That process started consuming massive amounts of RAM. It did so in such a way that the Activity Monitor couldn’t detect it, but vm_stat could.
  • The memory use overflowed the RAM and memory pages started being swapped out to disk. In this case, this was the Fusion Drive.
  • The excess memory use just kept on growing, memory pages kept being written to the Fusion Drive, and hence to the SSD.
  • This kept up through reboots of the computer and upgrades of the operating system.
  • Finally, the various sectors of the SSD had been accessed so frequently that the wear level got extremely low. That made the effective speed of the system glacially slow.
  • Once the SSD wore out, there was nothing I could do to restore it, except for completely replacing the drive. We’ll get to that below.

The last gasp of a dying drive

I could still work on my iMac, after a fashion. Right after (yet another) OS reinstall or a reboot, the system would respond just well enough to run Terminal, Thunderbird, and (maybe) Firefox. But it was becoming clear, especially after running the diagnostics I describe above, that a new hard drive was in my future.

The straw that broke the camel’s back and caused me to take action was a dialog box that popped up on my screen: “studentd quit unexpectedly”. I didn’t know what that was and didn’t care, so I hit the “OK” button (or whatever acknowledgement there was). Seconds later it showed up again. And again. And again.

It didn’t matter how many times I acknowledged that studentd had quit. The dialog box would helpfully inform me of studentd’s non-functioning status.

I looked up studentd. It had to do with an Apple service called Classroom. I never used this service, didn’t want it, didn’t need it. But the arrangement of Apple’s daemon services was such that this process always automatically launched and was automatically rerun if it wasn’t running, even if it wasn’t needed.

Whatever studentd was supposed to do, it was clear that my failing Fusion Drive wouldn’t let it run anymore. Once again I reinstalled the OS, but I still kept getting the “studentd quit unexpectedly” dialog. I finally just left the dialog box on my screen unacknowledged.

It was time for an action plan.

I thought I had a backup. Does that count?

My first idea was to get the hard drive in my iMac replaced. My Applecare Plus warranty had long since expired, so my guesstimate was this would cost $500-$600, given the labor involved and the premium prices Apple charges for its hard drives. Also, Apple would only replace the Fusion Drive with the same model the iMac originally came with; if I wanted to contemplate a 4TB SSD I’d be out of luck.

I contacted a friend of mine who used to work at the Apple Genius Bar. He suggested an alternative: Use an external drive and boot from it instead. That way, I could be in control of the drives I used. If I ever felt the need to purchase a new iMac, I could plug the external drive into the new computer and not worry about copying or reconfiguring anything. (The only issue is whether a new Mac could run Mac OS 10.14 Mojave, since I don’t want to upgrade to the latest, Catalina; I may write another blog post someday about this.)

Either approach required me to restore from backup. So I examined my options:

  • I had a Dropbox backup, but it only included files in my ~/Dropbox folder. My media files were not part of it. Neither were some personal files that I badly wanted to preserve.
  • The Backblaze backup was current as of a few days prior to the massive drive slowdown. It was only then I discovered that Backblaze only copies a Mac’s /Users directory; the files in /Library and /Applications are not included, for example. This lack of knowledge was my fault for not reading the Backblaze web site carefully enough.
  • The Time Machine backup. It must be complete by now, right?

When I checked the status of my Time Machine backup, I saw it was still not complete. Remember how I refreshed my Time Machine backup in December of 2019? Now it was January 2020, with a failing hard drive. As I watched Time Machine’s progress, I saw it would take at least 10 days to finish.

Time to get practical:

  • I knew I’d need some kind of complete backup, so the priority was to finish the Time Machine backup as quickly as possible.
  • To that end, I rebooted my iMac in Safe Mode, to suppress any background processes that were accessing the Macintosh HD drive and slowing things down. In particular, Dropbox and Backblaze were running at glacial speeds that would take years to complete, and Adobe Creative Suite kept attempting to repair itself unsuccessfully.
  • When in Safe Mode, I still got the repeated “studentd quit unexpectedly” dialog. A bit of research showed that the following Terminal command would prevent the execution of studentd entirely:
    sudo mv /System/Library/LaunchAgents/com.apple.macos.studentd.plist \
            /System/Library/LaunchAgents/com.apple.macos.studentd.plist.bak
    

    … and reboot to Safe Mode again.

  • Turn off desktop background and screen savers. There was no reason to waste the CPU cycles or cause the system to access the hard drive more than necessary.
  • Make sure Spotlight was off, again to minimize any drive access.
  • Order the complete restore from Backblaze. Since that was about 2TB of files, they had to send me a USB drive. They charged me $189 to create the drive’s contents and ship it to me, but that was refundable if I sent the drive back to them.
  • Preserve my personal files:
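    # Move Documents into the Dropbox folder and leave a symlink at the old path.
    mv ~/Documents ~/Dropbox/
    ln -sf ~/Dropbox/Documents ~/Documents
    # Stream a copy to the Dropbox folder on a remote computer over ssh.
    (cd ~/Dropbox; tar -cf - ./Documents) | \
        ssh remote-computer "(cd ~/Dropbox; tar -xvf -)"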
    

    The net effect of the above commands was to place my Documents directory into my Dropbox area, leave a symbolic link at ~/Documents so that any program referring to something there would still find it, and copy the Documents folder to a remote computer that was also running Dropbox. That gave me a copy of my personal files in my Dropbox area without having to run Dropbox on my failing computer; the Dropbox process on the remote computer would sync the Documents folder instead.

  • Renew my license for the latest version of Carbon Copy Cloner, since I planned to make more backups. As we’ll see below, CCC proved to be an even more useful tool than I planned.
  • At work, use Diskmaker X to make a Mojave USB-key installer. While I thought I might use it to reinstall the Mac OS on Macintosh HD from external media, this also proved to be a more useful tool than I planned.
  • With permission, borrow a laptop from work so I could continue to work from home while my iMac focused on making the Time Machine backup.
  • Order 4TB hard drives and a two-drive disk enclosure. I’ll get into this below.
  • Now, with all background processes halted and nothing else to do, set my iMac in Safe Mode to copy from Macintosh HD to my Airport Express.
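The tar-over-ssh copy in the list above can be rehearsed locally before trusting it with real files. This is a sketch with a local subshell standing in for the remote computer; tar_copy and the scratch paths are made up for the example.

```shell
# Copy directory $2 (a name inside $1) into directory $3 through a tar pipe.
# Same pattern as the ssh version, with a local subshell as the "remote" end.
tar_copy() {
    (cd "$1" && tar -cf - "./$2") | (cd "$3" && tar -xf -)
}

# Rehearsal in a scratch directory
work=$(mktemp -d)
mkdir -p "$work/src/Documents/notes" "$work/dst"
echo "hello" > "$work/src/Documents/notes/a.txt"
tar_copy "$work/src" Documents "$work/dst"
diff -r "$work/src/Documents" "$work/dst/Documents" && echo "copies match"
rm -rf "$work"
```

Using && after cd, rather than a semicolon, means tar never runs if the directory change fails, which seems a safer habit on a machine that is already misbehaving.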

Here was my goal:

[Image: Slide2]

Alpha would be my new main drive. Beta would be a nightly clone of Alpha made with CCC. Gamma would be my new Time Machine backup.

As I waited for this new hardware to arrive, I saw the Time Machine backup was running more quickly since there were no background processes accessing the slow Macintosh HD. It was going so fast that I cancelled the Backblaze restore drive after a couple of days, before it was shipped to me.

By this time the hardware had arrived. I had an old OWC miniStack that I purchased in 2008. It could connect to a Mac using USB 3.0. I installed a 4TB drive into the miniStack, and used the laptop and that USB Mac OS installer to install Mojave on the drive. I named the drive Beta. I then set that hard drive aside, but it was comforting to know that I had a system drive I could potentially use to boot my iMac.

After four days, the Time Machine backup was scheduled to complete. I watched the progress, got to when the last MBs should have been copied… and it wouldn’t terminate. In the progress message of “Copying XXX MB of YYY MB”, both XXX and YYY would increment.

Researching the web showed a number of potential fixes to this problem, none of which worked for me. I found a trick to look at the Time Machine log:

printf '\e[3J' && log show --predicate 'subsystem == "com.apple.TimeMachine"' --info --last 6h | grep -F 'eMac' | grep -Fv 'etat' | awk -F']' '{print substr($0,1,19), $NF}' 

It seemed like the number of files to be backed up was monotonically increasing, even though the iMac was in Safe Mode and I wasn’t doing anything with it. Was this a problem with Macintosh HD or the Time Machine backup drive? Probably the former, based on what happened later.

So I did not have a single coherent backup of my hard drive.

Let’s try again

I installed a fresh 4TB drive into the miniStack, plugged it into my iMac, named it Gamma, and started a Time Machine backup onto it.

I calculated it would take 10-14 days to complete. I used the time to reorder the Backblaze USB drive again, just in case. It should have taken another five days to make, but they shipped it to me in two. I speculate that they never stopped creating the initial drive; perhaps they’ve grown used to fools like me changing their minds about the need for a remote drive.

I waited, watching the Time Machine progress onto Gamma. After about 10 days, it was almost at the end… then it went into the “Copying XXX MB of YYY MB” loop again.

I couldn’t make a finalized Time Machine backup at all.

The two-drive enclosure I purchased was an OWC Mercury Elite Pro Dual with Thunderbolt 2. I put Alpha and Beta into the enclosure, installed Carbon Copy Cloner on Beta, and booted the iMac from Beta.

Wow! An iMac working at normal speed!

I used the Mac OS USB install drive to install an OS on Alpha, then as part of that process tried to initialize Alpha from the Time Machine backup. No go. I got the message that the Time Machine “could not be used.”

OK, Time Machine was no longer an answer. How about Carbon Copy Cloner? While running the OS on Beta, I ran CCC to clone from Macintosh HD to Alpha.

Again, no go. After about two days, CCC terminated due to too many read errors from the bad drive.

Backblaze to the rescue

So I couldn’t make a backup of all of Macintosh HD.

By this time the Backblaze restore USB drive had arrived, with everything from the /Users directory on down. So I set up another Carbon Copy Cloner task, but this one would copy everything except /Users from Macintosh HD to Alpha. That copy took only about 12 hours to run, 6 of which were just comparing the files on the two drives to only copy over the new files. There were some drive errors, but not enough to stop the process.

The CCC task ran to completion. I had the contents of /Library, /Applications, /opt, and so on copied. Then I copied /Users from the Backblaze drive to /Users on Alpha. At last, I had my complete hard drive.

Well, not quite…

When I first booted from Alpha, I got a repeated dialog box that stated “macOS needs to repair your Library” and required me to enter my password. As soon as I hit “Use password…” and typed in my computer’s password, the dialog box would pop up again.

Eventually I ran the Mac OS USB installer again. That resolved the issue. Finally, I could reboot my computer into a working OS again.

And so the saga was over… NOT!

The reason I went through the exercise of restoring all the non-/Users files is that I wanted to preserve all my applications and their settings. This mostly worked. A couple of apps gave me problems, but they were minor by comparison:

  • I had to completely reinstall Microsoft Office. Even so, there was something wonky associated with permissions to access the template files. I never use templates, but MS-Word insisted on displaying an error message anyway. Fortunately I found a fix for this problem.
  • Adobe Creative Suite insisted on being reinstalled. No big deal, since I use it infrequently. I just let it download in the background.
  • Time Machine, Spotlight, and Backblaze were giving me problems. For the first two, I saw they were trying to archive 6TB of files! I finally figured it out: These utilities were scanning and backing up Macintosh HD and Beta in addition to Alpha. I fixed that in System Preferences… mostly.

That last “mostly” was due to Backblaze, which made it hard to exclude Macintosh HD from its list of drives to back up. This makes sense in general; most users want backups of their main internal drives and Backblaze wanted to be a thorough backup. But aside from being unnecessary at this point, any attempt to access Macintosh HD would slow down the computer. In addition to Backblaze’s performance, this was evident in Open/Save dialogs that might have to scan all drives.

I lived with Macintosh HD’s performance for a couple more weeks, just to make sure there weren’t any lingering files to copy. There weren’t, apart from some forgotten and unused podcast files that I mostly copied so I’d have a “complete” restore. Then I found a way to make sure that Macintosh HD wouldn’t even be mounted when I rebooted the computer:

Create the file /etc/fstab if it doesn’t already exist (you’ll have to use sudo), and edit it to add the following line:

LABEL=Macintosh\040HD none apfs rw,noauto

Now it’s as if Macintosh HD doesn’t exist. I could still mount it using Disk Utility if I had to, but so far I’ve never had to.
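The only fiddly part of that fstab line is the space in the volume name, which has to be written as the octal escape \040. A small sketch that builds the line for any label; noauto_entry is a made-up helper, and the filesystem type defaults to apfs only because that is what my drive used.

```shell
# Print an fstab entry that keeps the named volume from mounting at boot.
# Spaces in the label are written as the octal escape \040.
noauto_entry() {
    label=$(printf '%s' "$1" | sed 's/ /\\040/g')
    printf 'LABEL=%s none %s rw,noauto\n' "$label" "${2:-apfs}"
}

noauto_entry "Macintosh HD"
# prints: LABEL=Macintosh\040HD none apfs rw,noauto
```

On a Mac, sudo vifs is the usual way to edit /etc/fstab safely, so the printed line would be pasted in there rather than appended blindly.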

I mailed back the Backblaze USB drive. It had done its work admirably. As promised, I got the money refunded by Backblaze.

Are we done yet? Close, but not quite.

OWC Mercury Elite Pro Dual mini-review

I purchased this in haste, as I watched my existing hard drive fall into the long, dark twilight. My funds were low after the party the previous month, so I needed something inexpensive. Sometimes you get what you pay for.

The OWC Mercury Elite Pro Dual has two major flaws:

  • It is noisy. It’s not so much the fan, although that’s a bit louder than most drive enclosures I’ve seen. It’s that there’s almost no dampening of vibration or sounds coming from the hard drives. I had to listen to constant clickety-clacking as the drives’ heads moved over the drives’ disks.

    In comparison, the OWC miniStack I purchased in 2008 is almost noiseless. It’s from the same manufacturer, and was made 9 years earlier, but it emitted less noise than its newer dual-drive cousin.

  • It has the worst hardware RAID controller I’ve ever seen.

    If you asked yourself why I didn’t configure the two drives in the Mercury Dual into a RAID1, the controller was the reason. The Dual Elite hardware requires that for a RAID1 (mirroring) or RAID0 (striping), the two drives must be the exact same model with the exact same firmware. What happens five years from now when one of the drives breaks down and that exact model is no longer available?

    It also removes one of the fun things you can do with a RAID1: Replace one drive with a better one, wait for the two drives to sync, then replace the other one; in other words, incremental upgrades to the RAID1. Other hardware RAIDs I’ve worked with (including the two-drive Synology NAS and various low-end 3ware and LSI cards) have this ability.

    In fact, I would have bought the Synology instead, but it can only be used for network-attached storage, not as a boot drive for a computer.

I purchased the Dual Elite because I liked my old miniStack, I trusted OWC as a company, and I wanted Thunderbolt 2 (this was probably unnecessary, since at the max speed of the Dual Elite, USB 3.0 and Thunderbolt 2 would have the same performance). I would still buy a miniStack from them, but I’ll never again purchase an OWC multi-drive enclosure.

If you’re wondering why I didn’t buy a four-drive enclosure and put Gamma in the same box (saving both table space and a power outlet), it’s because I wanted to protect myself in case of a hardware failure. If either the miniStack or the Dual Elite fails, at least I’ll have something to fall back on and get running quickly.

Backblaze to the un-rescue

I listened to the clicking and clacking of Alpha for a month. During that time, I also became impatient with the speed of the 4TB hard drive compared to the Fusion Drive I used to have.

I could only find one model of 4TB SSDHD. While it was still available, the last one had been manufactured five years ago. In light of my difficulties, this did not seem a wise purchase.

So I bit the bullet and got a 4TB SSD. A month later, my finances weren’t quite so dire and I could afford it.

The procedure for the switch was straightforward:

  • Do one last duplicate of Alpha to Beta using Carbon Copy Cloner;
  • remove Alpha from the drive enclosure and replace it with the SSD;
  • boot from Beta;
  • use CCC to clone back from Beta to Alpha.

Simple, right?

I had my quieter and faster drive, but the procedure confused the heck out of Backblaze. Cloning drives can cause Backblaze to treat a drive as brand-new and never-before copied. Alpha was a clone of a clone. Backblaze wanted me to pay a new annual licensing fee to maintain a backup of a second drive.

I went through a mini-saga of consulting websites, fiddling with the /.bzvol directory on both Alpha and Beta, and even reinstalling Backblaze twice. Finally Backblaze would allow me to click on the “inherit previous backup” button without giving an error message.

So it’s over, right?

I dunno. In the last hour as I’ve typed this post, I’ve only been able to type about a dozen characters before there’s a long pause. It only affects this particular WordPress composition page.

So I can’t even write about this problem before a new one crops up somewhere.

For now, I’ve got a working computer. We’ll see what tomorrow brings.

One thought on “When Backups Fail”

  1. Nice in-depth troubleshooting! Turning off spotlight was a great idea to buy yourself a bit of time. Reading this makes me realize that I haven’t given enough thought to backups on my own personal computers.
