Safety engineering is an applied science strongly related to systems engineering and the subset System Safety Engineering. Safety engineering assures that a life-critical system behaves as needed even when pieces fail.
Ideally, safety engineers take an early design of a system, analyze it to find what faults can occur, and then propose safety requirements in design specifications up front, along with changes to existing systems, to make the system safer. In an early design stage, often a fail-safe system can be made acceptably safe with a few sensors and some software to read them. Probabilistic fault-tolerant systems can often be made by using more, but smaller and less-expensive pieces of equipment.
Far too often, rather than actually influencing the design, safety engineers are assigned to prove that an existing, completed design is safe. If a safety engineer then discovers significant safety problems late in the design process, correcting them can be very expensive, and large sums of money can be wasted.
The exception to this conventional approach is the way some large government agencies approach safety engineering from a more proactive and proven process perspective, known as "system safety". The system safety philosophy is applied to complex and critical systems, such as commercial airliners, complex weapon systems, spacecraft, rail and transportation systems, air traffic control systems, and other complex and safety-critical industrial systems. Its proven methods and techniques aim to prevent, eliminate, and control hazards and risks through design influence exerted by a collaboration of key engineering disciplines and product teams. Software safety is a fast-growing field because the functionality of modern systems is increasingly placed under the control of software. The whole concept of system safety and software safety, as a subset of systems engineering, is to influence safety-critical system designs by conducting several types of hazard analyses to identify risks, and to specify design safety features and procedures that mitigate risk to acceptable levels before the system is certified.
Additionally, failure mitigation can go beyond design recommendations, particularly in the area of maintenance. There is an entire realm of safety and reliability engineering known as Reliability Centered Maintenance (RCM), which is a discipline that is a direct result of analyzing potential failures within a system and determining maintenance actions that can mitigate the risk of failure. This methodology is used extensively on aircraft and involves understanding the failure modes of the serviceable replaceable assemblies in addition to the means to detect or predict an impending failure. Every automobile owner is familiar with this concept when they take in their car to have the oil changed or brakes checked. Even filling up one's car with fuel is a simple example of a failure mode (failure due to fuel exhaustion), a means of detection (fuel gauge), and a maintenance action (filling the car's fuel tank).
For large-scale complex systems, hundreds if not thousands of maintenance actions can result from the failure analysis. These maintenance actions are based on conditions (e.g., gauge reading or leaky valve), hard conditions (e.g., a component is known to fail after 100 hrs of operation with 95% certainty), or require inspection to determine the maintenance action (e.g., metal fatigue). The RCM concept then analyzes each individual maintenance item for its risk contribution to safety, mission, operational readiness, or cost to repair if a failure does occur. All the maintenance actions are then bundled into maintenance intervals so that maintenance does not occur around the clock, but rather at regular intervals. This bundling process introduces further complexity: it may stretch some maintenance cycles, thereby increasing risk, and shorten others, thereby potentially reducing risk. The end result is a comprehensive maintenance schedule, purpose-built to reduce operational risk and ensure acceptable levels of operational readiness and availability.
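The bundling step can be illustrated with a minimal sketch. It assumes each maintenance action has an ideal interval in operating hours and is simply assigned to the closest of a few scheduled check intervals; the task names, interval values, and the function name `bundle_tasks` are hypothetical, and a real RCM program would weigh the risk contribution of each action rather than distance alone.

```python
from collections import defaultdict


def bundle_tasks(tasks, check_intervals):
    """Assign each task to the closest scheduled check interval.

    tasks: dict mapping task name -> ideal interval in operating hours.
    check_intervals: candidate bundle intervals in hours (hypothetical values).
    Returns a dict mapping check interval -> list of (task, change) tuples,
    where change is 'stretched', 'shortened', or 'unchanged'.
    """
    schedule = defaultdict(list)
    for name, ideal in tasks.items():
        # On a tie, prefer the shorter interval (the more conservative choice).
        chosen = min(check_intervals, key=lambda c: (abs(c - ideal), c))
        if chosen > ideal:
            change = "stretched"   # maintained less often than ideal: more risk
        elif chosen < ideal:
            change = "shortened"   # maintained more often than ideal: less risk
        else:
            change = "unchanged"
        schedule[chosen].append((name, change))
    return dict(schedule)


# Illustrative data only; none of these figures come from a real system.
tasks = {
    "oil change": 120,
    "brake inspection": 450,
    "filter swap": 100,
    "fatigue inspection": 900,
}
schedule = bundle_tasks(tasks, [100, 500, 1000])
for interval, items in sorted(schedule.items()):
    print(interval, items)
```

Tasks marked as stretched are the ones whose added risk would need to be reviewed before the schedule is accepted.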
The two most common fault modeling techniques are failure mode and effects analysis and fault tree analysis. These techniques are ways of finding problems and of making plans to cope with failures, as in probabilistic risk assessment. One of the earliest complete studies using probabilistic risk assessment on a commercial nuclear plant was the WASH-1400 study, also known as the Reactor Safety Study or the Rasmussen Report.
Failure modes and effects analysis
- Main article: Failure mode and effects analysis
Failure Mode and Effects Analysis (FMEA) is a bottom-up, inductive analytical method which may be performed at either the functional or piece-part level. For functional FMEA, failure modes are identified for each function in a system or equipment item, usually with the help of a functional block diagram. For piece-part FMEA, failure modes are identified for each piece-part component (such as a valve, connector, resistor, or diode). The effects of the failure mode are described, and assigned a probability based on the failure rate and failure mode ratio of the function or component.
Failure modes with identical effects can be combined and summarized in a Failure Mode Effects Summary. When combined with criticality analysis, FMEA is known as Failure Mode, Effects, and Criticality Analysis or FMECA, pronounced "fuh-MEE-kuh".
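As a rough illustration of the bookkeeping described above, the sketch below multiplies each component failure rate by its failure mode ratio and sums the results for failure modes with identical effects. The multiplication is a common convention and is assumed here; the component names, rates, and ratios are made up for the example, and `failure_mode_effects_summary` is a hypothetical helper name.

```python
from collections import defaultdict


def failure_mode_effects_summary(rows):
    """Sum failure mode rates for failure modes that share the same effect.

    rows: iterable of (component, failure_mode, failure_rate, mode_ratio, effect)
    failure_rate is failures per hour for the component; mode_ratio is the
    fraction of those failures that occur in the given failure mode.
    """
    summary = defaultdict(float)
    for component, mode, failure_rate, mode_ratio, effect in rows:
        summary[effect] += failure_rate * mode_ratio
    return dict(summary)


# Illustrative piece-part FMEA worksheet (all values are assumptions).
rows = [
    ("valve V1", "stuck closed", 2e-6, 0.5, "loss of coolant flow"),
    ("valve V1", "leaks", 2e-6, 0.5, "reduced coolant flow"),
    ("pump P1", "fails to run", 4e-6, 0.75, "loss of coolant flow"),
    ("pump P1", "degraded output", 4e-6, 0.25, "reduced coolant flow"),
]
summary = failure_mode_effects_summary(rows)
for effect, rate in summary.items():
    print(f"{effect}: {rate:.2e} per hour")
```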
Fault tree analysis
- Main article: Fault tree analysis
Fault tree analysis (FTA) is a top-down, deductive analytical method. In FTA, initiating primary events such as component failures, human errors, and external events are traced through Boolean logic gates to an undesired top event such as an aircraft crash or nuclear reactor core melt. The intent is to identify ways to make top events less probable, and verify that safety goals have been achieved.
FTA may be qualitative or quantitative. When failure and event probabilities are unknown, qualitative fault trees may be analyzed for minimal cut sets. For example, if any minimal cut set contains a single base event, then the top event may be caused by a single failure. Quantitative FTA is used to compute top event probability, and usually requires computer software such as CAFTA from the Electric Power Research Institute or SAPHIRE from the Idaho National Laboratory.
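Both kinds of analysis can be sketched on a toy tree. The code below is not how CAFTA or SAPHIRE work; it is a brute-force illustration that assumes a tree built only from AND and OR gates, statistically independent basic events, and few enough events to enumerate every combination. The tree, event names, and probabilities are hypothetical.

```python
from itertools import product


def cut_sets(node):
    """Return the cut sets of a node as a set of frozensets of basic events.

    A node is either a basic event name (str) or a (gate, children) tuple
    where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":   # any child's cut set triggers the gate
        return set().union(*child_sets)
    if gate == "AND":  # one cut set from every child is needed
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")


def minimal_cut_sets(node):
    """Drop every cut set that strictly contains another cut set."""
    sets = cut_sets(node)
    return {s for s in sets if not any(other < s for other in sets)}


def top_event_probability(node, p):
    """Exact top event probability by enumerating all basic event states.

    p: dict mapping every basic event in the tree to its probability.
    """
    mcs = minimal_cut_sets(node)
    events = sorted(p)
    total = 0.0
    for states in product([False, True], repeat=len(events)):
        failed = {e for e, s in zip(events, states) if s}
        if any(cs <= failed for cs in mcs):
            prob = 1.0
            for e, s in zip(events, states):
                prob *= p[e] if s else 1.0 - p[e]
            total += prob
    return total


# Hypothetical tree: cooling is lost if power is lost OR both pumps fail.
tree = ("OR", ["power_loss", ("AND", ["pump_a", "pump_b"])])
mcs = minimal_cut_sets(tree)
single_points = sorted(e for cs in mcs if len(cs) == 1 for e in cs)
p_top = top_event_probability(
    tree, {"power_loss": 0.001, "pump_a": 0.01, "pump_b": 0.01}
)
print(mcs, single_points, p_top)
```

In this example `single_points` lists the base events that can cause the top event on their own, which is the qualitative result, while `p_top` is the quantitative one.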
Some industries use both fault trees and event trees. An event tree starts from an undesired initiator (loss of critical supply, component failure etc) and follows possible further system events through to a series of final consequences. As each new event is considered, a new node on the tree is added with a split of probabilities of taking either branch. The probabilities of a range of "top events" arising from the initial event can then be seen.
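A minimal event tree can be sketched as follows. It assumes every node is a binary split (a safety function either works or fails), that the branches are independent, and that every path is followed to the end; practical event trees usually prune branches that cannot matter. The initiator frequency, branch names, and failure probabilities are hypothetical.

```python
from itertools import product


def event_tree(initiator_frequency, branches):
    """Enumerate every path through a binary event tree.

    initiator_frequency: frequency of the initiating event (e.g. per year).
    branches: ordered list of (name, failure_probability) tuples.
    Returns a dict mapping a path label (tuple of str) to its frequency.
    """
    outcomes = {}
    for path in product([True, False], repeat=len(branches)):
        frequency = initiator_frequency
        label = []
        for (name, p_fail), works in zip(branches, path):
            # Each node splits the incoming frequency between two branches.
            frequency *= (1.0 - p_fail) if works else p_fail
            label.append(f"{name}: {'ok' if works else 'fails'}")
        outcomes[tuple(label)] = frequency
    return outcomes


# Hypothetical initiator: loss of critical supply, 0.1 times per year.
outcomes = event_tree(0.1, [("backup power", 0.01), ("emergency cooling", 0.05)])
for label, frequency in outcomes.items():
    print(label, frequency)
```

The frequencies of all paths sum back to the initiator frequency, which is a useful sanity check on the branch probabilities.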
Usually a failure in safety-certified systems is acceptable if, on average, less than one life per 10^9 hours of continuous operation is lost to failure. Most Western nuclear reactors, medical equipment, and commercial aircraft are certified to this level. The cost versus loss of lives has been considered appropriate at this level (by the FAA for aircraft under Federal Aviation Regulations).
Probabilistic fault tolerance: adding redundancy to equipment and systems
Once a failure mode is identified, it can usually be prevented entirely by adding extra equipment to the system. For example, nuclear reactors contain dangerous radiation, and nuclear reactions can cause so much heat that no substance might contain them. Therefore reactors have emergency core cooling systems to keep the temperature down, shielding to contain the radiation, and engineered barriers (usually several, nested, surmounted by a containment building) to prevent accidental leakage.
Most biological organisms have a certain amount of redundancy: multiple organs, multiple limbs, etc.
For any given failure, a fail-over or redundancy can almost always be designed and incorporated into a system.
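The effect of redundancy on failure probability can be estimated with a short calculation. The sketch below assumes identical units that fail independently of one another, which is optimistic: common-cause failures such as a shared power supply or a shared software fault break that assumption. The function name and the example numbers are illustrative only.

```python
from math import comb


def system_failure_probability(p_unit, n_units, k_required):
    """Probability that fewer than k_required of n_units units are working.

    p_unit: failure probability of a single unit (units assumed independent).
    """
    p_work = 1.0 - p_unit
    # Sum the binomial probabilities of exactly w working units, w < k_required.
    return sum(
        comb(n_units, w) * p_work**w * p_unit ** (n_units - w)
        for w in range(k_required)
    )


single = system_failure_probability(0.01, 1, 1)       # one unit, no redundancy
duplex = system_failure_probability(0.01, 2, 1)       # either of two units suffices
two_of_three = system_failure_probability(0.01, 3, 2) # voting arrangement
print(single, duplex, two_of_three)
```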
When does safety stop, where does reliability begin?
Assume there is a new design for a submarine. In the first case, as the prototype of the submarine is being moved to the testing tank, the main hatch falls off. This would be easily defined as an unreliable hatch. Now the submarine is submerged to 10,000 feet, whereupon the hatch falls off again, and all on board are killed. The failure is the same in both cases, but in the second case it becomes a safety issue. Most people tend to judge risk on the basis of the likelihood of occurrence. Other people judge risk on the basis of their magnitude of regret, and are likely unwilling to accept risk no matter how unlikely the event. The former make good reliability engineers, the latter make good safety engineers.
Now let us say there is a need to design a Humvee with a rocket launcher attached. The reliability engineer could make a good case for installing launch switches all over the vehicle, making it very likely someone can reach one and launch the rocket. The safety engineer could make an equally compelling case for putting only two switches at opposite ends of the vehicle, both of which must be thrown to launch the rocket, thus ensuring that the likelihood of an inadvertent launch is small. An additional irony is that it is unlikely that the two engineers can reconcile their differences, in which case a manager who doesn't understand the technology could choose one design over the other based on other criteria, like cost of manufacturing.
Inherent fail-safe design
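The trade-off between the two designs can be put in numbers with a simple model. It assumes each switch independently has a probability of failing to operate when commanded and a separate probability of closing spuriously; the values used, the switch counts, and the function names are hypothetical and chosen only to show the direction of the effect.

```python
def any_switch_design(n, p_fail_to_operate, p_spurious):
    """Launch occurs if ANY of n switches closes (the reliability-oriented design)."""
    p_no_launch_on_command = p_fail_to_operate**n
    p_inadvertent_launch = 1.0 - (1.0 - p_spurious) ** n
    return p_no_launch_on_command, p_inadvertent_launch


def all_switches_design(n, p_fail_to_operate, p_spurious):
    """Launch occurs only if ALL n switches close (the safety-oriented design)."""
    p_no_launch_on_command = 1.0 - (1.0 - p_fail_to_operate) ** n
    p_inadvertent_launch = p_spurious**n
    return p_no_launch_on_command, p_inadvertent_launch


# Assumed per-switch probabilities, for illustration only.
reliable = any_switch_design(6, p_fail_to_operate=0.01, p_spurious=0.001)
safe = all_switches_design(2, p_fail_to_operate=0.01, p_spurious=0.001)
print("any of six switches: ", reliable)
print("both of two switches:", safe)
```

Under these assumptions the first design almost never fails to launch on command but is far more prone to inadvertent launch, and the second design reverses the two, which is the disagreement described above.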
When adding equipment is impractical (usually because of expense), then the least expensive form of design is often "inherently fail-safe". The typical approach is to arrange the system so that ordinary single failures cause the mechanism to shut down in a safe way (for nuclear power plants, this is termed a passively safe design, although more than ordinary failures are covered).
One of the most common fail-safe systems is the overflow tube in baths and kitchen sinks. If the valve sticks open, the water drains away through the overflow tube rather than spilling over the rim and causing damage.
Inherent fail-safes are common in medical equipment, traffic and railway signals, communications equipment, and safety equipment.
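In software, the same idea is often expressed by making the safe state the default, so that a missing or implausible input leads to shutdown rather than continued operation. The sketch below is a hypothetical temperature interlock; the limit, the state names, and the function `next_state` are assumptions made for the example and do not describe any particular system.

```python
import math
from typing import Optional

SAFE_STATE = "shutdown"


def next_state(temperature_c: Optional[float], limit_c: float = 80.0) -> str:
    """Return "run" only when a valid reading is at or below the limit.

    Every other case, including a missing or non-numeric reading, falls
    through to the safe state.
    """
    if temperature_c is None:        # sensor disconnected or no data received
        return SAFE_STATE
    if math.isnan(temperature_c):    # sensor returned an invalid value
        return SAFE_STATE
    if temperature_c > limit_c:      # reading is valid but out of range
        return SAFE_STATE
    return "run"


print(next_state(65.0))
print(next_state(None))
print(next_state(float("nan")))
print(next_state(95.0))
```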
It is also common practice to plan for the failure of safety systems through containment and isolation methods. The use of isolating valves, also known as the block and bleed manifold, is very common in isolating pumps, tanks, and control valves that may fail or need routine maintenance. In addition, nearly all tanks containing oil or other hazardous chemicals are required to have containment barriers set up around them to contain 100% of the volume of the tank in the event of a catastrophic tank failure. Similarly, long pipelines have remote-closing valves periodically installed in the line so that in the event of failure, the entire pipeline is not lost. The goal of all such containment systems is to provide means of limiting the damage done by a failure to a small localized area.
- Earthquake engineering
- Effective Safety Training
- Forensic engineering
- Hazard and operability study
- Nuclear safety
- Process Safety Management
- Risk assessment
- Risk management
- Safety life cycle
- Workplace safety
General references
- Lees, Frank (2005). Loss Prevention in the Process Industries (3 ed.). Elsevier. ISBN 9780750675550.
- Kletz, Trevor (1984). Cheaper, safer plants, or wealth and safety at work: notes on inherently safer and simpler plants. I.Chem.E.. ISBN 0852951671.
- Kletz, Trevor (2001). An Engineer’s View of Human Error (3 ed.). I.Chem.E.. ISBN 0852954301.
- Kletz, Trevor (1999). HAZOP and HAZAN (4 ed.). Taylor & Francis. ISBN 0852954212.
- Lutz, Robyn R. (2000). Software Engineering for Safety: A Roadmap. The Future of Software Engineering. ACM Press. ISBN 1581132530. http://www.cs.ucl.ac.uk/staff/A.Finkelstein/fose/finallutz.pdf. Retrieved 31 August 2006.
- Grunske, Lars; Kaiser, Bernhard; Reussner, Ralf H. (2005). Specification and Evaluation of Safety Properties in a Component-based Software Engineering Process. Springer. http://se.informatik.uni-oldenburg.de/pubdb_files/pdf/gr06s.pdf. Retrieved 31 August 2006.
- US DOD (10 February 2000). Standard Practice for System Safety. Washington, DC: US DOD. MIL-STD-882D. http://www.faa.gov/library/manuals/aviation/risk_management/ss_handbook/media/app_h_1200.PDF. Retrieved 31 August 2006.
- US FAA (30 December 2000). System Safety Handbook. Washington, DC: US FAA. http://www.faa.gov/library/manuals/aviation/risk_management/ss_handbook/. Retrieved 31 August 2006.
- NASA (16 December 2008). Agency Risk Management Procedural Requirements. NASA. NPR 8000.4A. http://nodis3.gsfc.nasa.gov/displayDir.cfm?Internal_ID=N_PR_8000_004A_.
- IT 16 (september 2016) http://www.ecsdev.org/index.php/proceedings
- American Society of Safety Engineers (official website)
- Board of Certified Safety Professionals (official website)
- System Safety Society (official website)
- The Safety and Reliability Society (official website)