Robert Schwarz, James Kuchar, Daniel Hastings, John Deyst
Department of Aeronautics and Astronautics
Massachusetts Institute of Technology
Cambridge, MA USA
Stephan Kolitz
The Charles Stark Draper Laboratory
Cambridge, MA USA
ABSTRACT
A probabilistic model of satellite systems has been developed to examine the impact of task automation on cost and reliability. The model (called SOCRATES) has been implemented using Markov modeling techniques and includes a Graphical User Interface. SOCRATES facilitates the interactive construction of satellite system architectures and the rapid analysis of different automation levels. Satellite systems can be stored and modified and can include multiple satellites, ground control stations, functions, and payloads. Automation cost and reliability models are provided as inputs to the model, which then outputs expected life cycle cost and system reliability as functions of time. An example application of SOCRATES to a simplified geostationary communications satellite is provided. Based on the cost and reliability models that are used in this example, the expected revenues for different types of automation are determined and compared.
The degree of automation can vary along a continuum ranging from fully-manual human control, to human supervisory control, to a fully automated (no human) system [6]. In general, as the level of automation is increased, fewer tasks need to be performed by the human operator and operational costs decrease [7]. In addition, when properly applied, automation can result in a reduction in the rate of human-caused errors (e.g., mistyped commands). However, because an automated system may not be as flexible as a human operator in managing unanticipated situations, reliability could also decrease. Finally, the development costs associated with a highly automated system may outweigh the operational cost benefits. Thus, the appropriate degree of automation requires a careful trade study between cost and reliability. In order to address these issues, tools are needed which can predict the effect that automation has on costs and system reliability. Such tools will enable system engineers to identify those functions that should be automated and those that should employ a higher level of human involvement. In order to develop these tools, a fundamental methodology for investigating the tradeoffs between cost and reliability is required.
As shown in Figure 1B, the system reliability may decrease, increase, or be unaffected by the level of automation used depending on the function being automated. For functions that are simple, well understood, or periodic, such as routine stationkeeping on a geostationary satellite, the reliability may increase with increasing automation (Function A in Figure 1B). This trend results when human errors are more likely than software errors and the impact of unanticipated situations is expected to be negligible. For other functions, where events are complex, rare, or unexpected, the system reliability may decrease as humans are removed from the loop (Function C in Figure 1B). This occurs when automation generates problems or is unable to resolve problems that could have been resolved by a human. Alternatively, there may be some functions for which the reliability is nearly independent of the level of automation (Function B).
An increase in system reliability translates directly into a decrease in operating costs because replacement or loss of system functionality becomes less likely. The potential for failure (or "unreliability") is therefore captured as an opportunity cost: the revenues forgone as a result of increased system down-time. Determining opportunity costs requires additional data, such as the relationship between a particular function and revenue. The curves from Figures 1A and 1B are combined in Figure 1C, which shows the overall life cycle cost. In this example, there exists an optimal level of automation at which life cycle cost is minimized. The primary focus of the methodology presented here is to quantify the curves in Figure 1.
For commercial systems, functions associated with the spacecraft payload also generate revenue for the system. These functions may not be associated with any single subsystem or satellite but may be carried out only through the coordination of a constellation of satellite payloads. Because these functions are associated with the entire system, they are termed Mission Objectives. Note that this definition of Mission Objectives serves to focus the functional decomposition and emphasize which functions are associated with revenues, and hence opportunity costs. Otherwise, Mission Objectives are treated in the same manner as other functions.
An event is defined to occur when a function exceeds its performance envelope. Events can represent unforeseen occurrences due to hardware failure or anomalous phenomena as well as routine variations in operation such as periodic drifting that requires small thrust corrections. Once an event occurs, some action is required to return the system to a normal operating condition. This action is predicated on a series of tasks that must be performed. Once the event has been detected, it must be communicated to a decision maker. This decision maker must determine the actions that must be taken to bring the system back into its performance envelope and then implement this action. Events occur with some probability, and there is some probability that they are remedied by the decision maker. The combination of these probabilities leads to a measure of the reliability of the system and ultimately to its operational cost.
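To make the composition of these probabilities concrete, the minimal sketch below multiplies per-stage success probabilities for the task sequence described above (detection, communication, solution determination, solution implementation) to estimate the probability that a single event is resolved. The stage names and the numerical values are illustrative assumptions, not values from the model.

```python
# Minimal sketch: probability that a single event is resolved, assuming
# the recovery stages succeed independently (illustrative values only).
stage_success = {
    "detect": 0.9996,       # event is detected
    "communicate": 0.9999,  # event is reported to the decision maker
    "determine": 0.9991,    # corrective action is determined
    "implement": 0.9999,    # corrective action is carried out
}

p_resolve = 1.0
for stage, p in stage_success.items():
    p_resolve *= p

print(f"P(event resolved) = {p_resolve:.5f}")
```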
Markov models are based on the premise that the physical condition of the system can be broken down into discrete, mutually exclusive states each of which can represent the system at a given point in time. The system transitions from state to state with time and the probability that the system will be in some state at some future time can be determined. Probabilities are assigned to describe the likelihood of transitioning from any state to another in a given time interval. These transition probabilities can be determined from historical data, or can be based on another model describing the satellite function. Note also that in general the transition probabilities need not be constant with time [9].
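The discrete-time propagation underlying such a model can be sketched in a few lines. Here a hypothetical three-state function (operational, in recovery, failed) is advanced one time step at a time by multiplying the state-probability vector by a transition matrix; the transition values are placeholders, not numbers from the paper.

```python
import numpy as np

# Rows: current state; columns: next state (each row sums to 1).
# States: 0 = operational, 1 = in recovery, 2 = failed (placeholder values).
P = np.array([
    [0.995, 0.004, 0.001],   # operational -> {operational, recovery, failed}
    [0.90,  0.08,  0.02],    # recovery    -> usually repaired each step
    [0.0,   0.0,   1.0],     # failed is absorbing
])

state = np.array([1.0, 0.0, 0.0])   # start fully operational
for _ in range(100):                # propagate 100 time steps
    state = state @ P

print("P(operational after 100 steps) =", state[0])
```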
Under the methodology outlined here, a separate Markov model is created for each function of the system. For example, there could be one Markov model to describe roll attitude error, another for pitch attitude error, and a third to describe the state of the power system (e.g., operational or failed). In general, these models may be linked because the probability of resolving an attitude error, for example, may require that the power and communications functions are operational. Each functional Markov model contains states that describe fully operational and degraded modes as well as how that function is repaired following an event. The probability of an event (or failure) can be approximated by using statistical failure data for similar functions or can be determined from a fault tree model of the reliability of the function's components. The recovery process is generally more complicated and involves several intermediate states, termed Recovery Processes, as shown in Figure 2.
The probability that an event occurs and the transition probabilities from one state to the next in the Recovery Process are dependent on the level of automation. For example, whether or not a human is in the loop will affect the probability that a solution is determined. In general, each transition in the process requires a processor element and, if automated, software or hardware. Depending on the level of automation, the transition probabilities are determined using statistical data on human reliability and fault trees to determine the probability that all the required components are functioning.
As described earlier, some transitions may require that several functions are operating in order to occur. For example, the probability of recovering from an attitude error may depend on the probability that both the power and communications functions are operational. To simplify the computations, the Markov models are run in parallel using the assumption that the operational probabilities are independent between functions. For example, if the overall probability that power is operational is 0.9, and the overall probability that the communications function is operational is 0.9, the methodology here assumes that the probability that both are operational simultaneously is 0.9 x 0.9 = 0.81. This independence in general will not be the case because the state of the communications system may be conditional on the state of the power system. That is, the probability that communications are operational may be 1 if power is operational but 0 if power is not operational. In that case, the probability that both power and communications are operational is 0.9, not 0.81. Therefore, the calculations performed by the parallel Markov models, which assume independence, are not exact. However, for functions that are weakly coupled, the conditional and overall probabilities are similar, and the error has been found to be relatively small.
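The independence approximation described above amounts to multiplying the marginal operational probabilities reported by each parallel model. A sketch using the numbers from the text, contrasting the approximation with the perfectly coupled case:

```python
# Parallel Markov models: each function reports its own marginal
# operational probability; joint availability is approximated by a product.
p_power = 0.9
p_comms = 0.9

p_both_independent = p_power * p_comms  # 0.81, the parallel-model approximation

# If communications were operational exactly when power is (full dependence),
# the true joint probability would instead equal the power probability:
p_both_dependent = p_power              # 0.9 in this perfectly coupled case

print(p_both_independent, p_both_dependent)
```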
In general, the control of a remotely-operated vehicle such as a satellite may be accomplished by automating the vehicle itself or by automating the ground control center. Many combinations of LOA are therefore possible. Table 1 shows one example set of LOAs.
Table 1. An example set of Levels of Automation.

| LOA | Description |
|---|---|
| Fully Automated | Satellite performs the function with no communication to the ground. No ability to recover if the satellite processor stalls. |
| Paging | Satellite performs the function, but the ground segment is notified if the processor stalls. In such a case, a ground controller resolves the problem using data filtered from the satellite. |
| Supervision | Satellite performs the function, but the ground segment supervises recovery activities. The ground is aware of the satellite's progress (through filtered data from the satellite) and can override any actions, but is not required to intervene unless a stall occurs. |
| Cueing | Satellite performs the function and suggests possible solutions. The ground segment must verify all solutions (using filtered data) before they can be implemented. |
| Data Filtering | Satellite downlinks raw telemetry data to the ground. The ground segment filters and processes the data, and a human controller performs the function. |
| No Automation | Satellite downlinks raw telemetry data to the ground. A human controller performs the function without any filtering or processing. |
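One way to encode Table 1 for analysis is to record, for each LOA, which processor (on-board software, ground software, or human) executes each recovery stage, and to look up a reliability for that processor type. The mapping and the reliability values below are simplifying assumptions for illustration; the assignments used in the example itself were derived from the fault trees and human-reliability data described later.

```python
# Illustrative mapping: who performs each recovery stage at each LOA.
# "sw_space"/"sw_ground" = software; "human" = ground controller.
STAGE_EXECUTOR = {
    "Fully Automated": {"detect": "sw_space", "determine": "sw_space",
                        "implement": "sw_space"},
    "Data Filtering":  {"detect": "sw_ground", "determine": "human",
                        "implement": "human"},
    "No Automation":   {"detect": "human", "determine": "human",
                        "implement": "human"},
}

# Placeholder per-event reliabilities by executor type (assumed values).
EXECUTOR_RELIABILITY = {"sw_space": 0.9995, "sw_ground": 0.9995, "human": 0.995}

def recovery_reliability(loa: str) -> float:
    """Probability that every stage of the recovery process succeeds."""
    p = 1.0
    for executor in STAGE_EXECUTOR[loa].values():
        p *= EXECUTOR_RELIABILITY[executor]
    return p

for loa in STAGE_EXECUTOR:
    print(f"{loa}: {recovery_reliability(loa):.5f}")
```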
Operating costs will also change. As automation increases, the overall workload of human operators will decrease. Each operator will be responsible for more functions or satellites, and fewer operators will be needed to maintain the constellation. This can result in a significant reduction in operating costs. Not only is the number of operators reduced, but there is also a reduction in support staff and overhead associated with these operators. However, if the software that has been implemented is unreliable, humans will be needed to resolve processor stalls.
The opportunity costs are linked to reliability. For commercial ventures, down time on a satellite results in a loss of revenues. This loss is an opportunity cost, and since automation affects reliability and repair times, automation also has an effect on this opportunity cost. Often the relationship between cost and reliability is not so clear. For scientific missions, the "revenue" is actually science, and it is unclear how to convert scientific discovery into dollars that can be traded against other costs. Also, there are further costs which are attributed to excessive down time. These costs may take the form of a loss of customer satisfaction, or a loss of public support for the program. Thus, there may be subjective factors that require that the system reliability be higher than would be optimal in a cost-only sense.
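For a commercial system, the link between reliability and opportunity cost can be made concrete by weighting the revenue rate by the payload's operational probability over time. The sketch below assumes a constant revenue rate; the rate, time grid, and reliability curve are all hypothetical.

```python
import numpy as np

# Hypothetical inputs: revenue accrues only while the payload is operational.
revenue_rate = 50e6 / 365.0                  # dollars per day (assumed)
t = np.arange(0, 10 * 365)                   # daily steps over a 10-year life
p_operational = 0.98 * np.exp(-t / 20000.0)  # placeholder reliability curve

expected_revenue = np.sum(revenue_rate * p_operational)  # realized in expectation
ideal_revenue = revenue_rate * len(t)                    # if never down
opportunity_cost = ideal_revenue - expected_revenue      # revenues forgone

print(f"Opportunity cost over life: ${opportunity_cost/1e6:.1f}M")
```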
Figure 4 shows a summary of the development costs (white) and operating costs (black) for varying levels of automation. Operating costs decrease as the LOA increases because fewer personnel are required as the amount of human intervention decreases.
The development costs have a more complex relationship with LOA. Although the satellite hardware and launch costs are nearly constant, the ground station development costs and software costs vary considerably. The software costs were obtained by applying a cost of $190 per source line of code (sloc) for ground software, and $375 per sloc for space software [10]. For Fully Automated, Paging, Supervision, and Cueing LOAs, the fault detection and correction software resides on the spacecraft; thus this software costs $375 per sloc. For Data Filtering and No Automation, the software resides on the ground ($190 per sloc). In addition to the fault detection and recovery software, there is also a need for software to perform automation-related tasks. For example, with Data Filtering, software is required to perform the data reduction task independent of the fault detection and recovery tasks. Ground station facilities and equipment costs were determined as suggested in Wertz & Larson with the assumption that these costs are proportional to the total ground software cost [10].
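The software portion of the development cost in Figure 4 follows directly from the per-line rates cited above. A sketch is shown below; the sloc allocations per LOA are assumed placeholders, not the paper's values.

```python
# Cost-per-line rates from Wertz & Larson [10].
COST_PER_SLOC = {"ground": 190.0, "space": 375.0}

# Assumed sloc allocations per LOA (placeholders for illustration).
SOFTWARE_SLOC = {
    "Fully Automated": {"space": 20000, "ground": 2000},
    "Cueing":          {"space": 15000, "ground": 6000},
    "Data Filtering":  {"space": 1000,  "ground": 18000},
    "No Automation":   {"space": 1000,  "ground": 4000},
}

def software_cost(loa: str) -> float:
    """Total software development cost for one LOA."""
    return sum(COST_PER_SLOC[segment] * sloc
               for segment, sloc in SOFTWARE_SLOC[loa].items())

for loa in SOFTWARE_SLOC:
    print(f"{loa}: ${software_cost(loa)/1e6:.2f}M")
```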
The combination of software development and ground facility costs results in a development cost curve that increases from the No Automation to the Supervision LOA, and then decreases slightly toward the Fully Automated LOA. This behavior occurs because, as modeled here, software costs are low relative to facilities costs. When development costs are coupled with operating costs, Figure 4 shows that in this example it is more cost-effective to move to a Fully Automated system. Whether a change in reliability under a Fully Automated system offsets these reduced development and operating costs is examined in the next section.
An example set of transition probabilities for a Fully Automated attitude control function is shown in Table 3. A complete set of transition probabilities for all functions at all LOAs is available from the authors [11]. Software reliabilities were determined using several assumptions because existing software reliability models are relatively immature [12]. Most assume that software errors are evenly distributed throughout the code. For the purpose of this example, it is assumed that space-based software has a baseline reliability of 0.9999 for each 1000 sloc. It is also assumed that the code is executed once each event, although all portions (subroutines) of the code may not be run. Code that is twice as long will contain twice as many errors: a 2000-line section of code has a reliability of 0.9998. Hardware reliabilities for basic satellite components were also taken from Wertz & Larson [10]. For simplicity, the details of the reliability and cost models are not given here, since the example is meant to demonstrate the methodology rather than serve as an exhaustive case study. A human reliability model was also constructed based on observed human error probabilities [13].
Table 3. Example transition probabilities for a Fully Automated attitude control function.

| State Transition | Transition Probability |
|---|---|
| Failure | 0.086 per 10 years |
| Failure Detection | 0.99964 per event |
| Failure Communication | 0.99993 per event |
| Solution Determination | 0.99914 per event |
| Solution Implementation | 0.99986 per event |
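Two of the modeling assumptions above can be made explicit in code. First, the failure rate in Table 3 (0.086 per 10 years) must be converted into a per-time-step transition probability for the Markov engine; a constant-rate (exponential) conversion is one plausible choice, shown here as an assumption since the paper does not specify the conversion. Second, the software reliability model scales the per-event failure probability linearly with code length from the 0.9999-per-1000-sloc baseline.

```python
import math

def per_step_failure_probability(rate_per_year: float, dt_days: float) -> float:
    """Convert a constant failure rate into a per-time-step probability
    (exponential assumption; the conversion used by SOCRATES is not stated)."""
    return 1.0 - math.exp(-rate_per_year * dt_days / 365.0)

def software_reliability(sloc: int, r_per_1000: float = 0.9999) -> float:
    """Per-event reliability, with failure probability linear in code length."""
    return 1.0 - (sloc / 1000.0) * (1.0 - r_per_1000)

print(per_step_failure_probability(0.086 / 10.0, dt_days=1.0))  # Table 3 rate
print(software_reliability(2000))                               # -> 0.9998
```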
Figure 5 shows the payload's operational probability over the lifetime of the system as a function of the LOA. The curves are not perfectly smooth due to the time steps used in the Markov engine. As shown in Figure 5, the data are grouped into three main categories, determined by the assigned primary processor. For the Fully Automated LOA, the reliability is determined by the on-board processor alone. In this example, the reliability of full automation is high; thus the operational probability of the Fully Automated system is higher than that at other LOAs. For Paging, Supervision, and Cueing, a common level of processor reliability is assumed, so reliability is similar across these LOAs. However, at these LOAs the human operator performs some tasks (with a relatively low reliability); this introduces additional failure modes, and the reliability decreases from the Fully Automated level. For Data Filtering and No Automation, the processing shifts entirely to the generally less-reliable human.
Note that processor stalls were not considered in this example; that is, the automation's robustness to unexpected situations has been overestimated. In reality, at all LOAs except Fully Automated, humans can play some role in decision making and can resolve deficiencies in the automation. Thus, the reliabilities shown for the Fully Automated LOA are generally higher than would result from a more detailed analysis.
Figure 7 shows the result of combining the development and operating costs with the expected revenues from Figure 6. Because a Fully Automated LOA is both more reliable (Fig. 5) and has lower cost (Fig. 4), it results in the largest expected profit. Paging and Supervision LOAs result in less expected profit due to higher operating costs and lower reliability as human operators become a larger part of the process. Cueing LOA performs better than Supervision and Paging due to lower development costs and slightly higher reliability because the human operator and software must both agree to each action that is performed. Because of the much larger operating costs and lower reliabilities associated with Data Filtering and No Automation LOAs, a human-intensive system design appears to be the least effective choice in this example.
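The comparison in Figure 7 reduces to a simple bookkeeping step once the cost and reliability curves are available: expected profit per LOA is expected revenue minus development and operating costs. A sketch with placeholder totals (the ordering mirrors the qualitative result above, but none of the figures are from the paper):

```python
# Placeholder totals per LOA: (expected revenue, development cost, operating cost).
cases = {
    "Fully Automated": (480e6, 60e6, 20e6),
    "Supervision":     (455e6, 65e6, 45e6),
    "No Automation":   (420e6, 40e6, 90e6),
}

profit = {loa: rev - dev - ops for loa, (rev, dev, ops) in cases.items()}
for loa, p in profit.items():
    print(f"{loa}: expected profit ${p/1e6:.0f}M")
print("Best LOA under these assumptions:", max(profit, key=profit.get))
```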
The example above demonstrates the methodology used to determine the degree of automation that leads to the minimum life cycle cost for a given system. The example was simplified to ease understanding of the methodology; for an actual study, stall behavior would have to be accounted for. In addition, the cost model would have to be re-evaluated to ensure that its assumptions are valid for highly-autonomous systems. This example also examined only a few of the many possible combinations of LOA; a more complete study would require analysis over a wider range of automation conditions.
A software tool (SOCRATES) has been written to allow system engineers to enter cost and reliability data and to specify a fault tree representation of the system architecture through a graphical user interface. The tool allows the user to interactively vary the levels of automation for each function, and resulting costs and reliabilities can be plotted for comparison. As more advanced human and software reliability and cost models are made available, they can be incorporated into SOCRATES since the tool has been written in a modular manner. This will allow the tool to mature during later stages of refinement.
Although some functional interdependence has been accounted for, such as the fact that the attitude control system requires the power system to be operational, a deeper level of dependence is not yet modeled. Updates are made to the operational probabilities of the separate functions with time, but there is no calculation of the conditional operational probabilities. In order to capture the conditional probabilities, the models would have to be merged into a single joint model, which greatly increases the model's complexity and the computer resources required to run it. This makes modeling even a moderately-sized system unmanageable. However, SOCRATES could be linked to a dedicated Markov modeling tool, such as Draper Laboratory's CAME tool, which is able to handle large interdependent systems [14].
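The growth in complexity can be seen directly: the joint state space of two functions is the Cartesian product of their individual state spaces. Under independence the joint transition matrix is simply the Kronecker product of the individual matrices, but capturing conditional dependence would require specifying every entry of the rapidly growing joint matrix by hand. A sketch of the size explosion, with placeholder matrices:

```python
import numpy as np

# Two 3-state functions with placeholder (row-stochastic) transition matrices.
P_power = np.array([[0.98, 0.01, 0.01],
                    [0.30, 0.60, 0.10],
                    [0.00, 0.00, 1.00]])
P_comms = np.array([[0.97, 0.02, 0.01],
                    [0.25, 0.70, 0.05],
                    [0.00, 0.00, 1.00]])

P_joint = np.kron(P_power, P_comms)  # valid only under independence
print(P_joint.shape)                 # (9, 9); n such functions -> 3**n states
```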
The methodology will also be enhanced to aid in performing parametric sensitivity studies. The sensitivity of system reliability or revenue can be determined as functions of inputs such as cost, component failure rates, and software / human errors. Thus, efforts can be focused on those parameters that have the greatest impact on cost and reliability even when exact values are uncertain.
[1] Anderson, Christine M. (Chief, Satellite Control & Simulation Division). USAF Phillips Laboratory Presentation.
[2] Farmer, Mike, and Culver, Randy. "The Challenges of Low-Cost Automated Satellite Operations." Loral Federal Services Corp. Colorado Springs, CO, 1996.
[3] Hornstein, Rhoda Shalter. "Reengineering the Space Operations Infrastructure: A Progress Report From NASA's Cost Less Team for Mission Operations." NASA, Washington, DC, 1995.
[4] Hovanessian, S.A., Raghavan, S.H., and Taggart, D.A. "Lifeline, A Concept for Automated Satellite Supervision." Aerospace Report TOR-93(3516)-1, Aerospace Corp., El Segundo, CA, August, 1993.
[5] Smith, Dan. "Operations Innovations for 48-Satellite Globalstar Constellation." Globalstar Satellite Operations Control Center, Loral AeroSys, Seabrook, MD, 1996.
[6] Sheridan, Thomas, Telerobotics, Automation, and Human Supervisory Control, MIT Press, Cambridge, MA 1992.
[7] Hornstein, Rhoda Shalter. "On-Board Autonomous Systems: Cost Remedy for Small Satellites or Sacred Cow?" 46th International Astronautical Congress, Oslo, Norway, Oct. 2-6, 1995.
[8] Babcock, Philip S. IV. "An Introduction to Reliability Modeling of Fault-Tolerant Systems." CSDL-R-1899, The Charles Stark Draper Laboratory, Cambridge, MA, 1986.
[9] Scheaffer, Richard L. Introduction to Probability and Its Applications. PWS-Kent Publishing Co., Boston, MA, 1990.
[10] Wertz, James R., and Larson, Wiley J. Space Mission Analysis and Design. Second Edition. Microcosm, Torrance CA, and Kluwer Academic Publishers, Boston, MA, 1992.
[11] Schwarz, Robert E., "SOCRATES Demonstration Packet", MIT Department of Aeronautics and Astronautics Report, May, 1996.
[12] Misra, Krishna B., New Trends in System Reliability Evaluation, Elsevier, NY, 1993.
[13] Park, Kyung S. Advances in Human Factors / Ergonomics Vol. 7: Human Reliability. Elsevier, NY, 1987.
[14] Babcock, P.S., Rosch, G., and Zinchuk, J.J. "An Automated Environment for Optimizing Fault-Tolerant Systems Designs." Reliability and Maintainability Symposium, Orlando, FL, January, 1991.