Hardware WILL Fail
Throughout our IT careers, we see hardware failures. The right approach is to identify the fault quickly and replace the part without hesitation. Hardware is hardware, and it WILL fail. There is no need for worry or a guilty conscience over this.
I once had a fire alarm in my home that suddenly went off, and this continued over a period of time. An incompetent alarm contractor (the senior person) said the air-conditioning grille was too close and was setting it off, so he adjusted the A/C fins. Yet all the other apartments followed the same rules for the distance between alarm and A/C grille. I said it would be too cold having the air directed at me on my lounge sofa. He said, “Move the sofa.” I asked the contractors to leave and lodged a complaint saying the fellow was never to enter my apartment again.
We got a new alarm. This alarm went off as well. We replaced it and it was fine. It turned out the new alarm was part of a known faulty batch, so there was no hesitation from the supplier. The confusion around all this was considerable, but you see the point: the issue was entirely hardware. Needless to say, I did not move my sofa.
In the late 1990s, the sales team I supported brought a million-dollar piece of equipment to demo to an Australian client. It failed, as did the sale. During testing, I noticed the Unix “cat” command returned strange characters on files. We replaced the hard disk, and the SCSI cable just in case. That seemed to fix the problem. Well, no. When the software was recompiled to the latest version, the developer had left out a module. It took a lot of analysis to find and fix.
On Amazon AWS, I have had two hard disk failures. The first time, I was in a terminal shell; as I used “cd ..” to move up to a parent directory, the sub-directories vanished, and by the time I reached the root directory, all was gone. The console log showed a hard disk failure. The second failure began with websites falling over. I found that memcached kept stopping; the error said it could not read a directory file, yet that file did exist. I went to the root directory and ran the “find” command, and it froze, unable to navigate sub-directories. Again, there was an error in the console log.
On Akamai/Linode, hardware errors kept scrolling down my shell terminal screen; I rebuilt the instance to solve the problem. The second situation was less obvious. Some websites were slower than they should have been, I had varying timeouts when using the wget command from Amazon to Linode, and there were DNS resolution errors. I rebuilt on a new Linode with a new IP address, and the problem went away.
I give these details because hardware failure is real, regardless of which vendor we use, and it can be tricky to identify. The issues do not resolve until we move to another classroom in the school building, so to speak.
If we make a disaster recovery snapshot, or daily backups, while errors are present, we have a problem. One way to test a system prior to a snapshot backup is to run commands such as these and ensure they do not freeze:
cd /
find . -name wp-config.php
ls -lRa
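The checks above can be wrapped in a small script with a time limit, so a hung filesystem fails fast instead of blocking the backup job. A minimal sketch, where the directory, the timeout value, and the reliance on GNU coreutils' `timeout` are all assumptions:

```shell
#!/bin/sh
# Pre-snapshot sanity check - a sketch, not a definitive implementation.
# DIR and TIMEOUT are example values; "timeout" assumes GNU coreutils.
DIR="${1:-$(mktemp -d)}"   # demo fallback; normally the web root, e.g. /var/www
TIMEOUT=60                 # seconds before a command is declared hung

check() {
    if timeout "$TIMEOUT" "$@" > /dev/null 2>&1; then
        echo "OK: $*"
    else
        echo "FAILED or froze: $*" >&2
        exit 1
    fi
}

check ls -lRa "$DIR"                     # recursive listing must complete
check find "$DIR" -name wp-config.php    # directory walk must not freeze
echo "Filesystem checks passed - safe to snapshot."
```

If either command hangs past the limit, the script exits non-zero, which a cron job or backup wrapper can treat as "do not snapshot, raise the alarm".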
Another indicator of failure is a “tar cvf” command failing even when there is enough disk space. Of course, as with the other Unix/Linux commands, the output should also be correct.
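One way to exercise that tar check without filling the disk is to write the archive to /dev/null, which still reads every file end-to-end. A sketch, where the temp-directory fallback is a stand-in for your real web root:

```shell
#!/bin/sh
# tar health check - a sketch. Archiving to /dev/null reads every file
# end-to-end without consuming disk space; a non-zero exit means trouble.
SRC="${SRC:-$(mktemp -d)}"            # demo fallback; normally your web root
echo "sample" > "$SRC/sample.txt"     # ensure at least one file to archive

if tar cf /dev/null -C "$SRC" . 2> /dev/null; then
    RESULT=pass
    echo "tar read test passed"
else
    RESULT=fail
    echo "tar read test FAILED - investigate before trusting backups" >&2
fi
```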
What about software?
We can also scan files for viruses and run a database-checking command. Even though older backups are definitely out of date, we should retain some of them.
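As a sketch, such a sweep might pair ClamAV's `clamscan` with `mysqlcheck`. Both tool choices are assumptions, so substitute whatever scanner and database your host actually runs; the guard below simply skips anything not installed:

```shell
#!/bin/sh
# Integrity sweep sketch - clamscan (ClamAV) and mysqlcheck are assumed
# tools; the guard skips any that are not installed on this host.
run_if_present() {
    cmd="$1"; shift
    if command -v "$cmd" > /dev/null 2>&1; then
        "$cmd" "$@"
    else
        echo "skipped: $cmd not installed" >&2
    fi
}

run_if_present clamscan -ri /var/www                     # report infected files only
run_if_present mysqlcheck --all-databases --auto-repair  # check/repair tables
```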
We can also keep copies of WordPress .xml exports, the CSS styling and menu/widget details, plugin exports, and WordPress .tar files on a PC or USB disk.
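A sketch of bundling the site files into a single .tar for transfer off the server; every path here is an example, not a fixed convention:

```shell
#!/bin/sh
# Offsite-copy sketch: bundle site files into one .tar for a PC or USB disk.
# All paths are examples; SITE_DIR falls back to a temp dir for demo purposes.
SITE_DIR="${SITE_DIR:-$(mktemp -d)}"       # normally e.g. /var/www/html
STAMP=$(date +%Y%m%d)
ARCHIVE="$(mktemp -d)/wordpress-$STAMP.tar"

tar cf "$ARCHIVE" -C "$SITE_DIR" .
ls -l "$ARCHIVE"
# From the PC, pull the archive down, e.g.:
#   scp user@server:/path/to/wordpress-YYYYMMDD.tar ~/backups/
```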
A number of WordPress live theme editors do not provide a way to drop down a level and copy the underlying page code. One would need to plan and test how to restore a site, including CSS changes, with any WordPress theme. Most folks use a theme they like, but without prior recovery testing it is risky to do anything other than hope the website carries on well when upgrades are done. In other words, supporting the software requires a certain technical understanding of it, rather than merely using it.
Third-party WordPress applications/plugins can introduce memory overload or leaks, thereby impacting a system. While not hardware, they can grind performance down or exhaust memory. People introduce a new service to a website, then call for help without mentioning what the previous changes were, so we discover things like plugin error messages, or repeated entries we had not seen before in the Apache or Nginx access logs. As an example, one security plugin expected access over IPv6 even though it was disabled, which used up all the system’s memory. Another service kept hammering a website even though it was disabled in the plugin, because it was still active on the service’s own side.
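When a plugin or external service is suspected of hammering a site, a quick tally of the access log often shows it. A sketch; the log path is an example (Apache/Nginx locations vary by distro), and demo log lines are generated so the pipeline can be tried anywhere:

```shell
#!/bin/sh
# Count the most frequent client/URL pairs in an access log - a sketch.
LOG="${LOG:-access.log.sample}"
if [ ! -f "$LOG" ]; then
    # Demo data in combined log format, generated only if no log is present.
    printf '%s\n' \
      '10.0.0.1 - - [01/Jan/2025:00:00:00 +0000] "GET /wp-login.php HTTP/1.1" 200 1' \
      '10.0.0.1 - - [01/Jan/2025:00:00:01 +0000] "GET /wp-login.php HTTP/1.1" 200 1' \
      '10.0.0.2 - - [01/Jan/2025:00:00:02 +0000] "GET /index.php HTTP/1.1" 200 1' \
      > "$LOG"
fi
# Field 1 is the client IP, field 7 the request path (combined log format).
TOP=$(awk '{print $1, $7}' "$LOG" | sort | uniq -c | sort -rn | head -5)
echo "$TOP"
```

A plugin endpoint or a single remote IP dominating this list is the kind of repeated entry worth chasing before blaming the hardware.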
Vendor Hardware Changes
A hosting provider in Canberra had a really good service. Websites were fast, and people were friendly. One day, things got slower. Maybe WordPress had become more complicated? That is possible, but this degradation came from something else behind the scenes. Perhaps more people were using the service on the same shared hardware.
The company was then bought out in a takeover by a well-known group. Okay. Fine. Then, later, website failures became chronic. I logged into a shell terminal and saw a different operating system and different hardware. I called the Help Desk, and they angrily said I was not allowed to have a shell terminal login. They admitted the hardware had been downgraded for the takeover, and that my only option was to upgrade my plan.
Another provider advertised a low-cost VPS service, but it ran out of Singapore. It seemed buggy, and connections to databases often dropped out. That service was removed!
We have no control over behind-the-scenes activities from vendors, and we are not able to spend lots of money on higher levels of redundancy, so we plan for disaster recovery instead. Hardware fault detection and recovery has been designed and built into IT systems for a long time, but it is not a guarantee at all. Marketing tells us all is good.




