How to Address Infant Mortality Equipment Failure

This article contains insights from the book “Maintenance Control” by James Borowski, a maintenance professional and UpKeep customer. If you want additional help predicting and preventing equipment failures, you can download two chapters of the book for free that deal with equipment failure and reliability.

Infant mortality is an equipment failure pattern in which the probability of failure is highest when the equipment is first started and decreases as time goes on, eventually leveling off.
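A falling failure rate like this is often modeled with a Weibull distribution whose shape parameter is less than 1. That model does not come from the book; it is a common reliability convention used here for illustration. The sketch below uses made-up shape and scale values to show the failure rate dropping quickly at first and then flattening.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape, and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)

# Hypothetical parameters: a shape below 1 gives a failure rate that falls over time.
SHAPE = 0.5
SCALE = 1000.0  # operating hours, illustrative only

hours = (10, 100, 1000, 5000)
rates = [weibull_hazard(t, SHAPE, SCALE) for t in hours]
for t, h in zip(hours, rates):
    print(f"t = {t:>5} h  failure rate = {h:.6f} per hour")
```

Fitting the shape and scale to real failure history would require your own work order data.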

Assuming some basic conditions, equipment that is brand new and recently put into service has the highest resistance to operational stresses. But in some instances this is a false picture and infant mortality arises, usually for one of the following reasons:

  • Equipment that is manufactured, assembled, and/or installed improperly can fail shortly after being put into service.
  • Maintenance that opens up a piece of equipment to install new parts with different tolerances, or that introduces dirt into an otherwise clean system, can cause failure shortly after the equipment is restarted.
  • Equipment that is poorly designed or equipment that is restarted improperly can fail early in its life.

These examples show how equipment, old or new, can fail soon after being manufactured, remanufactured, overhauled, or serviced in some way.

Some experts believe that infant mortality is one of the most common causes of equipment failure throughout the industry.

Graph of infant mortality equipment failure mode

Causes of early-life equipment failure

The work order is the basic instrument for determining the types and patterns of work in a maintenance organization. As work is documented, codes are assigned and information is gathered to highlight the pattern of infant mortality. This is not always a simple task: someone has to review the data and determine what the documentation is indicating.
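One way to give that reviewer a head start is to flag any failure work order opened shortly after service on the same asset. The sketch below is only an illustration: the WorkOrder fields, the "service" and "failure" codes, and the 30-day window are all assumptions, not part of any particular maintenance management system.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class WorkOrder:
    order_id: str
    asset_id: str
    opened: date
    kind: str  # hypothetical codes: "service" (rebuild, PM, part swap) or "failure"

def flag_early_failures(orders, window_days=30):
    """Return failure work orders opened within window_days of service on the same asset."""
    window = timedelta(days=window_days)
    last_service = {}  # asset_id -> date of the most recent service
    flagged = []
    for wo in sorted(orders, key=lambda w: w.opened):
        if wo.kind == "service":
            last_service[wo.asset_id] = wo.opened
        elif wo.kind == "failure":
            serviced = last_service.get(wo.asset_id)
            if serviced is not None and wo.opened - serviced <= window:
                flagged.append(wo)
    return flagged

# Made-up records for illustration.
orders = [
    WorkOrder("WO-1", "PUMP-7", date(2024, 1, 10), "service"),
    WorkOrder("WO-2", "PUMP-7", date(2024, 1, 18), "failure"),
    WorkOrder("WO-3", "FAN-2", date(2024, 1, 5), "service"),
    WorkOrder("WO-4", "FAN-2", date(2024, 6, 1), "failure"),
]
candidates = flag_early_failures(orders)
print([wo.order_id for wo in candidates])
```

The flagged orders are candidates only. A knowledgeable person still has to confirm whether each one is truly infant mortality.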

In the following sections, we’ll review different causes of infant mortality and solutions you can implement to prevent early life equipment failure.

Cause: Rebuild by trade/craft personnel

The rebuilt machine may be a large gear train, line-shaft assembly, steam turbine, or high-horsepower electric motor, and several crafts may have been involved in the work. Some time after the rebuild, the machine goes down for a reason that is not readily apparent. After disassembly and analysis, faulty workmanship is determined to be the root cause.

The labor and material to rework the machine are charged to a follow-up work order linked to the original work order. The poor workmanship may stem from a lack of standards, from procedures not being followed, or from poor-quality procedures being followed.

For all the above examples, the overall job is to recognize these failures as infant mortality and track them for analysis purposes. This is done after the fact by a knowledgeable person within the maintenance organization.
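Linking follow-up work orders to the original job makes the cost of rework easy to total later. The sketch below assumes a hypothetical record layout in which each follow-up order carries a parent_id pointing at the original order. The field names, cause codes, and dollar amounts are invented for illustration.

```python
from collections import defaultdict

# Hypothetical records: parent_id links a follow-up order to the original job.
work_orders = [
    {"id": "WO-100", "parent_id": None, "labor": 4200.0, "material": 9800.0, "cause": None},
    {"id": "WO-117", "parent_id": "WO-100", "labor": 1600.0, "material": 2300.0,
     "cause": "procedure_not_followed"},
    {"id": "WO-131", "parent_id": "WO-100", "labor": 900.0, "material": 0.0,
     "cause": "no_standard"},
    {"id": "WO-140", "parent_id": None, "labor": 700.0, "material": 150.0, "cause": None},
]

def rework_by_original(orders):
    """Sum the labor and material of follow-up orders under each original order."""
    totals = defaultdict(float)
    for wo in orders:
        if wo["parent_id"] is not None:
            totals[wo["parent_id"]] += wo["labor"] + wo["material"]
    return dict(totals)

rework = rework_by_original(work_orders)
print(rework)
```

Totals like these show which original jobs generated the most rework and which causes recur.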

Solution

The countermeasure or corrective action for the above example failures may be to ensure standards are followed, improve or create new procedures, and/or train individual craftsmen or an entire maintenance staff with the expectation that the knowledge gained is used on the job.

Cause: Failed component exchanged for new/remanufactured component

When a part or component assembly fails in service, it is naturally replaced with a new or rebuilt assembly. Recently installed gearboxes and electric motors are good examples. The new assemblies may be functionally compatible but may not fit just right. Or the mechanical coupling between rotating elements may be an unfamiliar design requiring closer alignment, which makes a correct installation a challenge for a maintenance crew.

In some cases, when a worn, intermeshing moving part is interchanged with a fresh replacement, the difference in running clearances can be troubling. Adding a new gear to a gear train or line-shaft assembly can cause problems because of this mismatch. Changing just the worn or failed gear is not enough in some applications to get the machine back on the reliability track. All or several meshing parts may need to be changed.

Many repair parts used by a maintenance organization are rebuilt. These items are typically expensive, so every attempt is made to rebuild them repeatedly as a reasonable cost-saving measure. Yet the quality of the rebuild may itself be the cause of infant mortality. Vendors who provide low-quality rebuilds at minimal cost are a detriment to reliability, and they must be identified so reliability can be improved. It is very frustrating to repair a machine using a rebuilt component, only to have that item fail in a relatively short time. As part of the analysis for infant mortality, the failure of rebuilt items must be documented and tagged as infant mortality.
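Once rebuilt-item failures are tagged, a simple per-vendor tally can point to the rebuilders whose parts fail early. The sketch below is illustrative only: the vendor names, the install history, and the 90-day threshold for an "early" failure are all assumptions.

```python
from collections import Counter

# Hypothetical install history for rebuilt components:
# (vendor, days in service before failure, or None if still running)
installs = [
    ("Vendor A", 12), ("Vendor A", 400), ("Vendor A", 25), ("Vendor A", None),
    ("Vendor B", 610), ("Vendor B", None), ("Vendor B", 380), ("Vendor B", None),
]

def early_failure_rate(records, threshold_days=90):
    """Fraction of each vendor's installed rebuilds that failed within threshold_days."""
    installed = Counter()
    early = Counter()
    for vendor, days_to_failure in records:
        installed[vendor] += 1
        if days_to_failure is not None and days_to_failure <= threshold_days:
            early[vendor] += 1
    return {vendor: early[vendor] / installed[vendor] for vendor in installed}

vendor_rates = early_failure_rate(installs)
for vendor, rate in vendor_rates.items():
    print(f"{vendor}: {rate:.0%} of rebuilds failed early")
```

With only a handful of installs per vendor, a rate like this is a prompt for investigation rather than proof of poor quality.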

In a related area, purchasing may acquire a low-cost item that does not live up to expectations, failing early and often. Once again, these situations must be documented as infant mortality and resolved with the purchasing department.

Another common example in this area is the rework required after a piece of equipment is repaired on an emergency basis. With time constraints and production pressures dominating the repair atmosphere, getting equipment back in service does not always correlate with quality work. The result is that the machine gets a Band-Aid to keep it working, but fails again, and maybe again and again. This is probably the most common form of infant mortality. These activities must be documented, tracked, and analyzed.

Does this situation sound familiar?

Solution

The corrective action for this failure is similar to the previous example. That is, the maintenance organization must ensure that procedures are followed, or they must improve existing procedures or create new procedures.

Training of craft personnel is another beneficial factor as long as it comes with an expectation to use the knowledge on the job. In addition, the organization needs to ensure the purchase of quality parts whether new or rebuilt.

In the case of emergency repairs, management must commit to performing the best-quality repairs that are practical, even under breakdown conditions. This is a better approach for the long term.

Cause: Service opens or invades a closed system

When maintenance personnel doing preventive maintenance “break into” a system that appears to be functioning properly, the work can do more harm than good.

A high-pressure hydraulic system with servo valves and a mandate for an extremely high fluid cleanliness level is compromised any time a hydraulic filter is changed, a fluid sample is taken, and certainly any time a system component is changed. The actions of the most diligent craft personnel may cause considerable contamination to be added to a closed system. It is not uncommon in these situations that after service is provided, the equipment becomes erratic—at least for a little while. If, as a result of the service, the machine goes down, it should be tagged with an infant mortality code and tracked in a database or maintenance management system.

For computer control systems with newly installed circuit boards or sensitive electronic components, the act of installing the board may disturb the connections of neighboring components, causing erratic behavior or eventual equipment failure.

In each of these cases, a knowledgeable person surveys the situation and recognizes the machine failure as infant mortality.

Solution

Once again, the corrective action for a failure of this type is the availability of proper procedures, accountability for following them, and training of craft personnel along with an expectation to use the knowledge on the job. It does not involve increasing the frequency of an existing preventive maintenance (PM) task or conducting a reliability-centered maintenance (RCM) analysis.

The payoff of identifying and fixing early-life failures

After capturing significant data and with a number of work orders in your database or maintenance management system, let’s assume that patterns of work have been established for your organization as indicated in the table below:

Example of percentage of equipment failure types with infant mortality being the highest

After a review of the table, a reasonable person may say that the first area to attack to improve reliability is infant mortality. In this example, infant mortality accounts for 46% of failures, so an organization that did nothing but eliminate every instance of it would see 46% fewer equipment failures. This is without any investment in technological solutions or expensive consultants.

Another way to look at it: If your maintenance organization had good-quality procedures, a management expectation that those procedures are followed, a trained, knowledgeable maintenance crew, and high-quality repair parts, equipment failures could be reduced by as much as 46%. Not bad.
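The arithmetic behind that figure can be sketched with a small breakdown of failure counts. Only the 46% share for infant mortality comes from the example above. The other categories and all of the counts are hypothetical placeholders, chosen so the shares add up.

```python
# Hypothetical failure counts by type. Only the 46% infant mortality share
# is taken from the article's example; everything else is a placeholder.
failure_counts = {
    "infant_mortality": 92,
    "wear_out": 48,
    "random": 36,
    "operator_error": 24,
}

total = sum(failure_counts.values())
shares = {kind: count / total for kind, count in failure_counts.items()}

# Failures left if every infant mortality failure were eliminated.
remaining = total - failure_counts["infant_mortality"]
reduction = (total - remaining) / total

for kind, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{kind:<18} {share:.0%}")
print(f"Failures before: {total}, after: {remaining}, reduction: {reduction:.0%}")
```

In practice, eliminating every instance is an upper bound, so the achievable reduction is likely to be smaller.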

For any existing reliability program that wants to improve, the first step is to determine the amount of infant mortality making up the organization’s work pattern. This may be the biggest piece of the reliability puzzle that can be reasonably addressed with available resources.