KEPCO, INC.: REDUNDANCY AND HOT SWAPPING

Redundancy And Hot Swapping Keep Systems Running

If You Plan To Remain Viable In Markets Where Downtime Is Unacceptable, High-Availability Systems Are A Prerequisite.

BY Paul O'Boyle and Steve Kugler

The need for systems to tolerate failures without interrupting service has grown in step with the rapid deployment of electronic systems in applications such as data processing, financial and equities services, telecommunications, and transportation. Many of these systems have on-line, real-time, and, in some cases, safety-related requirements. Along with our dependency on these systems comes a rising intolerance for downtime that would impact customer service, sales, and, ultimately, corporate profits.

Regardless of the reliability and backup contingencies built into the system, if power is not available, the system won't operate. N+1 redundancy and hot swapping offer a means of ensuring uninterrupted operation if a power fault occurs. Fault-tolerant power systems are now becoming standard for many systems and networks where even short periods of downtime can significantly affect processes or services (Fig. 1).

Before a reliable N+1 scheme can be implemented, a number of critical questions must be answered:

How much power do I need?
Do I need current sharing?
How does current sharing work?
How do I detect failures?

In addition, basic terminology, such as the difference between OR-ing, blocking, isolation, and steering diodes, can often lead to confusion.

A critical link in all electrical/electronic industrial systems is the need to convert source power, typically from the ac mains, to isolated, regulated, and conditioned low-voltage dc power. N+1 redundancy ensures continuity of this power through the use of a power conversion system that can tolerate a fault in one component without losing system functionality. An extension of this, an N+2 system, can tolerate faults in two components. Closely linked to redundancy is the concept of hot swapping, the ability to remove and replace a failed unit without removing power to the system or compromising system functionality. This feature becomes important once a failure occurs in the N+1 system. The benefits of N+1 redundancy are greatly reduced if the system must be powered down in order to locate and replace the failed unit.

How Much Power?

To evaluate how much power is needed for an N+1 redundancy system, first establish the power required for the load. Then, determine the minimum number of units, N, that can supply that power. Add one for an N+1 system, add two for an N+2 system, and so on. For example, if you're operating a 48-V rail that must maintain a minimum of 2 A to its load, you might use two 2-A units in parallel. This arrangement provides N+1 redundancy, since the system could still supply 2 A to the load if one unit failed. Three 1-A units in parallel also provide N+1 redundancy, since any two of them can still supply 2 A. Three 2-A units provide N+2 redundancy: the system tolerates the simultaneous failure of two units without affecting system operation.
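The sizing rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch assumes identical units whose full rated current is available to the load.

```python
import math


def minimum_units(load_amps: float, unit_amps: float) -> int:
    """N: the smallest number of identical units that can carry the load."""
    if load_amps <= 0 or unit_amps <= 0:
        raise ValueError("load and unit ratings must be positive")
    return math.ceil(load_amps / unit_amps)


def units_to_install(load_amps: float, unit_amps: float, spares: int = 1) -> int:
    """Total units for an N+spares system (spares=1 gives N+1)."""
    return minimum_units(load_amps, unit_amps) + spares


def redundancy_level(load_amps: float, unit_amps: float, installed: int) -> int:
    """How many installed units may fail while the load is still supported.

    A negative result means the system cannot carry the load at all.
    """
    return installed - minimum_units(load_amps, unit_amps)


# The three cases from the text, all feeding a 2-A load:
print(redundancy_level(2.0, 2.0, 2))  # two 2-A units   -> 1 (N+1)
print(redundancy_level(2.0, 1.0, 3))  # three 1-A units -> 1 (N+1)
print(redundancy_level(2.0, 2.0, 3))  # three 2-A units -> 2 (N+2)
```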

Do I Need Current Sharing?

An example of a simple N+1 redundant system might be three 1-A power supplies connected in parallel to supply a load with 2 A. If current sharing is not implemented, the nature of voltage sources dictates that the power supply with the highest voltage (one will always be slightly higher than the others) will supply all the current to the load until it reaches its current limit. It then becomes a current stabilizer, and its voltage drops to the level of the next-highest supply, which assumes the role of voltage stabilizer. Only after a unit goes into current mode will the next unit assume part of the load, and so on. The last unit, the one with the lowest voltage setting, ends up controlling the voltage and provides the least current.
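The cascade described above can be illustrated with a small idealized model. It assumes perfect voltage sources with hard current limits and no wiring resistance; the function name, setpoints, and the 1.5-A load are hypothetical values chosen so that one unit sits in current limit while another regulates.

```python
def unshared_currents(setpoints_v, limits_a, load_a):
    """Distribute load among paralleled supplies with no current sharing.

    Supplies are loaded in order of descending voltage setpoint. Each one
    carries current up to its limit before the next one contributes.
    Returns (currents, index of the unit left regulating the bus voltage).
    The index is None if every unit is in current limit.
    """
    order = sorted(range(len(setpoints_v)),
                   key=lambda i: setpoints_v[i], reverse=True)
    currents = [0.0] * len(setpoints_v)
    remaining = load_a
    regulating = None
    for i in order:
        take = min(limits_a[i], remaining)
        currents[i] = take
        remaining -= take
        if take < limits_a[i]:
            # This unit is below its limit, so it sets the bus voltage.
            regulating = i
            break
    return currents, regulating


# Three 1-A supplies with slightly different setpoints, 1.5-A load.
currents, regulating = unshared_currents(
    [48.00, 48.10, 48.05], [1.0, 1.0, 1.0], 1.5)
print(currents)    # [0.0, 1.0, 0.5]
print(regulating)  # 2
```

The highest-set unit runs at its limit, the next one carries the remainder and sets the bus voltage, and the lowest-set unit idles, which is the unequal loading the text warns about.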

The principal disadvantage of this method is that the power supplies will be loaded unequally, leading to increased power-supply failure rates. A further disadvantage is that the voltage seen by the load will change slightly as each unit goes into its current limit and passes voltage control to the next-lowest unit, leading to degraded transient recovery in the event of a power module failure. To improve reliability and transient recovery, modern power supplies designed for N+1 redundancy must have either a passive or active means of forced current sharing so that all units share the current load equally (Fig. 2). Current sharing results in lower operating temperatures and reduced failure rates, as well as improved response time, while the cost and complexity of the actual circuitry are low.

How Does Current Sharing Work?

Passive current sharing can be implemented in applications where poor load regulation can be tolerated. If two poorly regulated supplies are connected in parallel, the output voltage decreases as the output current through the power supply set at the higher voltage increases. At some point, the second power supply will begin to supply more of the current, resulting in automatic load sharing. This design requires no additional circuitry and is fail-safe, but depends on poor regulation to operate properly.
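A rough numerical sketch of this effect follows. It models each supply as an ideal source behind a series resistance (a stand-in for its load regulation) that cannot sink current; the model, the function name, and all values are illustrative assumptions, not figures from any particular product.

```python
def droop_share(no_load_v, droop_ohms, load_a):
    """Bus voltage and per-supply currents for passively shared supplies.

    Each supply is modeled as a source no_load_v[i] in series with
    droop_ohms[i]. Supplies whose no-load voltage is below the bus
    voltage are treated as delivering zero current.
    """
    active = list(range(len(no_load_v)))
    while True:
        conductance = sum(1.0 / droop_ohms[i] for i in active)
        bus_v = (sum(no_load_v[i] / droop_ohms[i] for i in active)
                 - load_a) / conductance
        # Small tolerance guards against floating-point rounding.
        idle = [i for i in active if no_load_v[i] < bus_v - 1e-9]
        if not idle:
            break
        active = [i for i in active if i not in idle]
    currents = [(no_load_v[i] - bus_v) / droop_ohms[i] if i in active else 0.0
                for i in range(len(no_load_v))]
    return bus_v, currents


# Poorly regulated supplies (0.5 ohm each) share a 2-A load fairly evenly,
# at the cost of a bus voltage that sags with load.
print(droop_share([48.2, 48.0], [0.5, 0.5], 2.0))

# Well-regulated supplies (0.01 ohm each) do not share: the higher-set
# unit carries essentially the whole load.
print(droop_share([48.2, 48.0], [0.01, 0.01], 2.0))
```

With 0.5-ohm droop the two units carry roughly 1.2 A and 0.8 A; with 0.01-ohm droop the lower-set unit delivers nothing.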

Active or forced current sharing is a form of master-slave operation in which the current supplied by the master is measured and the other units are controlled to match it, ensuring that they are all equally loaded (Fig. 3). The load-share or current-share bus signal line represents the highest current output of any one supply. This line is fed back to each power supply in the system and compared to the actual current being supplied by that unit. If a unit is carrying less than its share, the current-share-bus (CSB) voltage will be higher than that unit's own current signal, and the unit is driven to supply more current. The nature of the circuit dictates that it will stabilize when all units are supplying an equal share of the current. A requirement of N+1 redundant systems employing load sharing is that every module in the system can be preset to the voltage being supplied by the system.
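The way such a loop settles can be shown with a simple discrete-time simulation. This is a sketch under stated assumptions, not a model of any actual control circuit: supplies are ideal sources with a small equal output resistance and ideal (zero-drop) blocking diodes, current limits are ignored, and each unit nudges its own voltage upward in proportion to the gap between the share bus and its own current. All names and values are hypothetical.

```python
def solve_bus(emf_v, r_out, load_a):
    """Bus voltage and currents for diode-OR-ed sources of equal resistance."""
    active = list(range(len(emf_v)))
    while True:
        n = len(active)
        bus_v = sum(emf_v[i] for i in active) / n - load_a * r_out / n
        # A source below the bus voltage is blocked by its diode.
        blocked = [i for i in active if emf_v[i] < bus_v - 1e-9]
        if not blocked:
            break
        active = [i for i in active if i not in blocked]
    currents = [(emf_v[i] - bus_v) / r_out if i in active else 0.0
                for i in range(len(emf_v))]
    return bus_v, currents


def simulate_forced_sharing(setpoints_v, load_a,
                            r_out=0.05, gain=0.025, steps=100):
    """Iterate the share loop and return the final (bus_v, currents)."""
    trims = [0.0] * len(setpoints_v)
    for _ in range(steps):
        emf = [v + t for v, t in zip(setpoints_v, trims)]
        _, currents = solve_bus(emf, r_out, load_a)
        share_bus = max(currents)  # bus follows the highest unit current
        # Units below the bus level raise their output. The unit that
        # sets the bus sees zero error and leaves its trim unchanged.
        trims = [t + gain * (share_bus - i) for t, i in zip(trims, currents)]
    emf = [v + t for v, t in zip(setpoints_v, trims)]
    return solve_bus(emf, r_out, load_a)


bus_v, currents = simulate_forced_sharing([48.10, 48.05, 48.00], load_a=2.0)
print([round(i, 3) for i in currents])
```

Starting from unequal setpoints, the three units converge on an equal share of the 2-A load, about 0.667 A each. The gain is deliberately chosen at half the output resistance so that each step closes half of the remaining gap.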

Detecting Failure

N+1 redundancy offers an enormous increase in system reliability, since two units would have to fail simultaneously for the power system to fail. If, for example, the failure rate of one unit is 1 x 10^-5 per hour (equivalent to a mean time between failures, or MTBF, of 100,000 hours), the failure rate of the N+1 system is approximately (1 x 10^-5) x (1 x 10^-5), or 1 x 10^-10, equivalent to an MTBF of 10 billion hours.

Obviously, for the N+1 redundant system to work, it must recognize that a failure has occurred and be capable of indicating which component or unit failed without shutting down the system. The detection scheme should accurately and consistently identify and localize a failure to a specific replaceable module. Without built-in circuits to identify and localize the fault, it is impossible to determine which unit needs servicing while maintaining system power. The same circuits that monitor power-supply outputs in order to detect fault conditions can also be used to indicate that a fault exists in a particular module. Relay contacts can be a convenient way of providing electrically isolated fault indications.
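The reliability estimate given earlier in this section can be reproduced in a few lines. The sketch simply follows the article's approximation of multiplying the two unit failure rates; it is not a full reliability model (it ignores repair time, common-cause failures, and the failure of shared components), and the function names are hypothetical.

```python
def mtbf_hours(failure_rate_per_hour: float) -> float:
    """MTBF is the reciprocal of a constant failure rate."""
    return 1.0 / failure_rate_per_hour


def approx_n_plus_1_rate(unit_rate_per_hour: float) -> float:
    """The article's approximation: both of two units must fail together."""
    return unit_rate_per_hour * unit_rate_per_hour


unit_rate = 1e-5  # one unit: MTBF of about 100,000 hours
system_rate = approx_n_plus_1_rate(unit_rate)

print(f"unit MTBF:   {mtbf_hours(unit_rate):.3g} hours")
print(f"system MTBF: {mtbf_hours(system_rate):.3g} hours")
```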

Fault detection goes hand in hand with fault isolation, which isolates the power system from any adverse effects of a failure. These functions are the basic elements of any fault-tolerant power system. The following examples are offered to illustrate the difficulties involved.

Output Low Faults - Consider the method for detecting an output low (undervoltage) failure. For a single power supply, the fault detector need only monitor output voltage (or current, in current-stabilized applications) to determine if the output is operating within specification. Any power-bus fault must be caused by the failure of the one and only power supply. In the case of the simplest N+1 system, that of two output-paralleled power supplies, the task becomes much more complex. If the output of one of the power supplies fails low, the other power supply will continue to support the load. The fault detector must be capable of determining that a fault has occurred, and which of the two power converters is defective so that the power system can be serviced. The problem intensifies when three or more power converters comprise the power system.

Several schemes can be implemented to address this problem. The most direct method is to insert a diode in series with each output between the power converter output and the power bus, and to monitor the output of the power converter itself. Variously called OR-ing, blocking, isolation, or steering diodes, these diodes behave the same way whatever the name, and they perform all four functions. The diodes isolate (block) the output of each unit from the power bus, allowing the fault detector to report a failure, while OR-ing (or steering) the current output of each unit to be applied to the load. Diodes also isolate the current-share bus of each unit so that a failure of any unit doesn't affect overall functioning of the N+1 system, while at the same time performing an OR function to allow the voltage representing the power supply providing maximum current to control the current-share bus (Fig. 3, again).

In the event of an output low fault, in this example, the diode blocks the power-bus voltage from forcing the power converter output high, and the fault detector of the defective converter can now measure and report the output failure.

There are problems with this approach, however. Since all of the load current drawn by the power bus flows through these diodes, they are normally quite large and expensive, and in most applications require some amount of heatsinking. These diodes are essential only for hot-swapping applications. If redundancy and fault indication are needed, but not on-line replacement (hot swapping), efficiency can be greatly improved by eliminating the blocking diodes. Furthermore, if the blocking diode fails shorted, as is most common in these applications, the fault detector cannot detect low-output failures unless additional circuitry is employed to monitor the voltage drop across the diode. This circuitry must be capable of distinguishing voltage drops of the same order of magnitude as the output ripple voltage, and in applications where power bus load current varies significantly, this technique can be very inconsistent.

A solution that avoids most of the above problems is to use blocking diodes and monitor both output voltage and current of each module (Fig. 4). In the event of a power module failure in which the output fails low, the fault detector will sense that no current is supplied to the load and indicate a fault, even while the bus voltage remains high. This form of fault detection also works with a shorted blocking diode, since module current, not voltage, is the key parameter being monitored. The only requirement is that forced load sharing be used, since the load-share signal forms the basis for the operation of the module current detector.
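The decision logic of such a detector might look like the following sketch. It assumes forced load sharing, so that every healthy module should carry roughly the same current as the most heavily loaded one. The data structure, the threshold values, and the 20% share fraction are hypothetical choices for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ModuleReading:
    name: str
    output_v: float   # converter output voltage as seen by the detector
    current_a: float  # current the module delivers to the bus


def find_low_output_faults(readings, min_output_v, min_share_fraction=0.2):
    """Return the names of modules that appear to have failed low.

    A module is flagged if its output voltage is below min_output_v, or
    if it delivers far less current than the most heavily loaded module.
    The current check still works when a shorted blocking diode makes
    the module's output voltage read the same as the bus voltage.
    """
    share_level = max(r.current_a for r in readings)
    faults = []
    for r in readings:
        low_voltage = r.output_v < min_output_v
        low_current = r.current_a < min_share_fraction * share_level
        if low_voltage or low_current:
            faults.append(r.name)
    return faults


readings = [
    ModuleReading("A", 48.0, 1.0),
    # Module B has failed, but its blocking diode is shorted, so the
    # detector sees the healthy bus voltage at its output.
    ModuleReading("B", 48.0, 0.0),
    ModuleReading("C", 48.0, 1.0),
]
print(find_low_output_faults(readings, min_output_v=46.0))  # ['B']
```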

Output High Faults - Similar problems exist for output-high (overvoltage) failures. Consider the same two power converters operating in N+1 redundancy, now with output-blocking diodes installed. If one converter fails output high, the second converter senses an overvoltage condition and stops delivering output power. This feature (often called "selective overvoltage") avoids the pitfall of having all of the output-paralleled power converters follow the defective module into overvoltage, but now both power supplies will show an output fault. If the power supply with the output-high fault shuts down, the second converter will recover and the fault signal will be valid. However, if overvoltage shutdown does not occur, it is impossible to determine which power converter failed.

A better way is for the fault detector to monitor both the power bus voltage and the current delivered by each power supply to determine whether or not each power converter is operating properly. This technique improves the accuracy of the detector circuit while negating the need for blocking diodes.

This method is not entirely foolproof, however, since it cannot detect shorted blocking diodes, nor does it eliminate their need in hot-swap applications. Nevertheless, it is the most complete and accurate method presently available to determine operating status of output-paralleled power converters while on-line, with only a modest increase in circuit complexity.

Compensating For Load Failures

In a conventional, single-power-supply configuration, the maximum overload current delivered to the power bus in the event of a load failure is determined by the power rating of the power supply and/or the maximum current limit setting. The use of high-redundancy power systems (N+2, N+3, etc.) creates special problems, especially in telecommunications applications where the power supply must operate in both voltage- and current-stabilized output regulation modes. Excess capacity can be dangerous if the power bus is shorted, and all of the power supplies deliver their maximum output current through the system's load wiring. Thermal damage and even insulation fires are possible.

Solutions include distributed-load-protection devices (fuses, circuit breakers, thermistors, etc.), and sizing of load wiring based on maximum possible current delivery of the power system. Many power supply designs include either fixed or optional time-out circuits as part of the overcurrent-protection circuitry. These circuits shut down the power supply after 10 to 30 seconds on the assumption that long-term overloads represent major load problems that have already compromised the system. Be aware that this is not a viable option for power supplies supporting battery-based power buses, such as in many telecommunication applications, where long-term current-stabilized operation is a normal operating condition.
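The wiring check implied above amounts to comparing the sum of all current limits against the wiring's rating. The following sketch makes that explicit; the function names and the example ratings are hypothetical, and it assumes every installed supply can deliver its full current limit into a shorted bus at the same time.

```python
def worst_case_bus_current(current_limits_a):
    """Maximum current the power system can push into a shorted bus."""
    return sum(current_limits_a)


def wiring_is_adequate(current_limits_a, wire_ampacity_a):
    """True if the load wiring can carry the worst-case fault current."""
    return worst_case_bus_current(current_limits_a) <= wire_ampacity_a


# Hypothetical N+2 system: two units carry the 50-A load, two are spares,
# and each unit limits at 25 A.
limits = [25.0, 25.0, 25.0, 25.0]
print(worst_case_bus_current(limits))    # 100.0
print(wiring_is_adequate(limits, 60.0))  # False: sized for the load only
```

Wiring sized only for the normal load would fail this check, which is why the added capacity of N+2 and higher systems calls for either heavier wiring or distributed protection devices.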

Other Considerations

Connector design must be carefully considered to permit removal and insertion of power modules with power applied. The connectors used for hot-swapping must incorporate protection against arcing, since nonenergized contacts will come into contact with energized components. Reliable connector mating also must be a consideration, often calling for guide bars, guide pins, or blind-mate connectors to ensure easy installation and removal.

A properly designed hot-swap system should include protection against the possibility of installing the wrong module, for example, one with a different output voltage. Most importantly, the power bus must be protected from excessive transients due to the charging of the output power capacitor, which can occur during insertion or extraction of the power supply module.

True fault-tolerant power systems should address loss of source power as well as loss of power conversion. Indeed, many fault-tolerant power systems require separately generated and protected power sources for each of the multiple power converters used to supply the dc power bus. Others use either on-line or off-line uninterruptible power sources (UPSs) with battery or generator backup in the event of primary power loss. Still others, most notably telecommunication systems, use a distributed power architecture consisting of a combination of all of the above applied to both source and load circuits.

The burden of these additional protective functions adds significant life-cycle costs, which must be weighed against performance requirements. For instance, using on-line UPSs for source-power redundancy means accounting for the inrush start-up current of the power converters, while specifying an off-line UPS requires knowledge of the correct relationship between output ride-through time and UPS transfer time to preserve power-bus integrity. Batteries create their own overhead burdens in the form of maintenance, charging requirements, and environmental considerations.

Available Options

Power supplies incorporating many of the features discussed above are available from several manufacturers, among them HC Power (HC1010 Series), Lambda/Qualidyne (MPS Series), Antec, Switching Power Inc., International Power Source, and Kepco Inc. (HSP and HSF Series). All represent products specifically designed for fault-tolerant power systems used in the international marketplace. They include such features as wide-range (universal) input with power factor correction, internally-mounted output isolation diodes, forced load-sharing circuitry, blind-mate connectors and fault detector circuitry with both visual and electrical indicators. The Kepco HSP Series logical fault detector with selective overvoltage shutdown provides accurate fault detection and fault isolation both with and without the optional isolation diode.

Paul O'Boyle is a senior design engineer at Kepco and specializes in new product development. He holds a BSEE from the Polytechnic Institute of Brooklyn, N.Y. His previous design experience includes military and industrial power converters. He is currently engineering group leader for switch-mode power-supply development.

Steve Kugler is supervisor of technical writing/webmaster at Kepco and has a BA in English from Lehman College. He spent 25+ years working in military, commercial, and marketing technical documentation.

Fig. 1. - HSPs in RA 60 rack adapter. The HSP Series of 1000-W power supplies complies with many requirements of the fault-tolerant systems now becoming pervasive in data processing, telecom, financial, and transportation systems. These requirements include N+1 redundancy with current sharing, hot swappability, and universal input-voltage ranges.

Fig. 2. - Three power supplies paralleled with forced current sharing. Blocking diodes D1, D2, and D3 keep the supplies isolated from one another, allowing the system to continue operating if one power supply fails.

Fig. 3. - This simplified schematic diagram of a current-sharing circuit shows blocking diodes. The load-share or current-share bus signal line represents the highest current output of any one supply. This is fed back to each power supply in the system and compared to the actual current being supplied by that unit.

Fig. 4. - Simplified fault-detection circuit that monitors output voltage and current. The circuit detects faults even if the blocking diode shorts out. The load-sharing signal, which essentially determines which power supplies are providing current to the load, and how much, would be used for current monitoring in real-life N+1 systems.

KEPCO, INC. • 131-38 SANFORD AVENUE • FLUSHING, NY. 11355 U.S.A.
TEL (718) 461-7000 • FAX (718) 767-1102
www.kepcopower.com • email: hq@kepcopower.com
