A PROBABILISTIC MODEL FOR THE DETERMINATION OF THE EFFECTS OF AUTOMATION OF SATELLITE OPERATIONS ON LIFE CYCLE COSTS

Robert Schwarz, James Kuchar, Daniel Hastings, John Deyst
Department of Aeronautics and Astronautics
Massachusetts Institute of Technology
Cambridge, MA USA

Stephan Kolitz
The Charles Stark Draper Laboratory
Cambridge, MA USA

ABSTRACT

A probabilistic model of satellite systems has been developed to examine the impact of task automation on cost and reliability. The model (called SOCRATES) has been implemented using Markov modeling techniques and includes a Graphical User Interface. SOCRATES facilitates the interactive construction of satellite system architectures and the rapid analysis of different automation levels. Satellite systems can be stored and modified and can include multiple satellites, ground control stations, functions, and payloads. Automation cost and reliability models are provided as inputs to the model which then outputs expected life cycle cost and system reliability as functions of time. An example application of SOCRATES to a simplified geostationary communications satellite is provided. Based on the cost and reliability models that are used in this example, the expected revenues for different types of automation are determined and compared.

INTRODUCTION

Current levels of automation in satellite systems reflect an incremental evolution that is based on a high level of human involvement. Even for a single satellite, operations costs can represent a significant portion of life cycle costs. In addition, human error continues to be a major cause of spacecraft anomalies and failures. With the introduction of large constellations or clusters of satellites, some automation of operations will be required to reduce costs while maintaining reliability [1-5]. It is often not clear, however, what level of automation is appropriate or which tasks should be automated.

The degree of automation can vary along a continuum ranging from fully-manual human control, to human supervisory control, to a fully automated (no human) system [6]. In general, as the level of automation is increased, fewer tasks need to be performed by the human operator and operational costs decrease [7]. In addition, when properly applied, automation can result in a reduction in the rate of human-caused errors (e.g., mistyped commands). However, because an automated system may not be as flexible as a human operator in managing unanticipated situations, reliability could also decrease. Finally, the development costs associated with a highly automated system may outweigh the operational cost benefits. Thus, the appropriate degree of automation requires a careful trade study between cost and reliability. In order to address these issues, tools are needed which can predict the effect that automation has on costs and system reliability. Such tools will enable system engineers to identify those functions that should be automated and those that should employ a higher level of human involvement. In order to develop these tools, a fundamental methodology for investigating the tradeoffs between cost and reliability is required.

PROBLEM STATEMENT

Figure 1 qualitatively represents the cost and reliability characteristics with respect to an increasing level of automation. Initially, as low levels of automation are introduced into the system, the operating costs decrease, principally due to a decrease in the number of human operators (Fig. 1A). At some point, however, the increase in design and development costs (due to software development) begins to outweigh the decrease in operation costs.

As shown in Figure 1B, the system reliability may decrease, increase, or be unaffected by the level of automation used depending on the function being automated. For functions that are simple, well understood, or periodic, such as routine stationkeeping on a geostationary satellite, the reliability may increase with increasing automation (Function A in Figure 1B). This trend results when human errors are more likely than software errors and the impact of unanticipated situations is expected to be negligible. For other functions, where events are complex, rare, or unexpected, the system reliability may decrease as humans are removed from the loop (Function C in Figure 1B). This occurs when automation generates problems or is unable to resolve problems that could have been resolved by a human. Alternatively, there may be some functions for which the reliability is nearly independent of the level of automation (Function B).


Figure 1: Effect of Automation on Cost and Reliability

An increase in system reliability translates directly into a decrease in operating costs due to the lower likelihood of replacement or loss of system functionality. Thus, the potential for failure (or "unreliability") is captured as an opportunity cost: the revenues forgone as a result of increased system down-time. Determination of opportunity costs requires additional data, such as the relationship between a particular function and revenue. The combination of the two curves from Figures 1A and 1B is shown in Figure 1C, giving the overall life cycle cost. In this example, there exists an optimal level of automation at which life cycle cost is minimized. The primary focus of the methodology presented here is to quantify the curves in Figure 1.

METHODOLOGY

The methodology that follows is based on a functional decomposition of the system. Typical functions include stationkeeping, attitude control, payload operation, power, and so on. Functions can also be complex, requiring several satellites working in combination (e.g., Global Positioning System). Generally, each function is required to operate within a certain performance envelope. For example, a satellite may be required to maintain attitude within a given error bound.

For commercial systems, functions associated with the spacecraft payload also generate revenue for the system. These functions may not be associated with any single subsystem or satellite but may be carried out only through the coordination of a constellation of satellite payloads. Because these functions are associated with the entire system, they are termed Mission Objectives. Note that this definition of Mission Objectives serves to focus the functional decomposition and emphasize which functions are associated with revenues, and hence opportunity costs. Otherwise, Mission Objectives are treated in the same manner as other functions.

An event is defined to occur when a function exceeds its performance envelope. Events can represent unforeseen occurrences due to hardware failure or anomalous phenomena as well as routine variations in operation such as periodic drifting that requires small thrust corrections. Once an event occurs, some action is required to return the system to a normal operating condition. This action is predicated on a series of tasks that must be performed. Once the event has been detected, it must be communicated to a decision maker. This decision maker must determine the actions that must be taken to bring the system back into its performance envelope and then implement this action. Events occur with some probability, and there is some probability that they are remedied by the decision maker. The combination of these probabilities leads to a measure of the reliability of the system and ultimately to its operational cost.

MARKOV MODELING

Satellite failures and recovery procedures are probabilistic by nature. In addition, satellite systems are often complex, containing many components, functions, and failure modes. Therefore, it is reasonable to represent these systems with a stochastic model. Since satellites are typically designed to be very reliable and contain redundant systems, it is desirable to choose a model that captures the characteristics of all probable failure modes without requiring extensive computation. Markov models not only capture the behavior of highly reliable systems, but can do so with significant savings in computational time compared to other modeling techniques such as Monte Carlo Simulations [8]. Run times can be large, however, if the system model requires many states. Additionally, Markov models are well suited to problems in which events occur sequentially (e.g., a failure event, followed by recovery steps to return to normal operation).

Markov models are based on the premise that the physical condition of the system can be broken down into discrete, mutually exclusive states each of which can represent the system at a given point in time. The system transitions from state to state with time and the probability that the system will be in some state at some future time can be determined. Probabilities are assigned to describe the likelihood of transitioning from any state to another in a given time interval. These transition probabilities can be determined from historical data, or can be based on another model describing the satellite function. Note also that in general the transition probabilities need not be constant with time [9].
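As a concrete sketch of this state-propagation idea, the following example advances a state-probability vector through a constant transition matrix for a single two-state function. The transition probabilities are illustrative assumptions, not values from the models in this paper.

```python
# Sketch of discrete-time Markov propagation for a single function.
# States: 0 = operational, 1 = failed. The probabilities below are
# illustrative assumptions.
P_FAIL = 0.001    # probability of failing in one time step
P_REPAIR = 0.10   # probability of being repaired in one time step

# Row-stochastic transition matrix: T[i][j] = P(next state j | state i)
T = [
    [1.0 - P_FAIL, P_FAIL],
    [P_REPAIR, 1.0 - P_REPAIR],
]

def propagate(state, steps):
    """Propagate the state-probability vector forward by `steps` transitions."""
    for _ in range(steps):
        state = [
            sum(state[i] * T[i][j] for i in range(len(T)))
            for j in range(len(T[0]))
        ]
    return state

# Start fully operational; the probabilities stay normalized at every step
# and approach the steady-state distribution.
p = propagate([1.0, 0.0], 1000)
print(p)
```

Time-varying transition probabilities would simply replace the constant matrix `T` with a function of the step index.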

Under the methodology outlined here, a separate Markov model is created for each function of the system. For example, there could be one Markov model to describe roll attitude error, another for pitch attitude error, and a third to describe the state of the power system (e.g., operational or failed). In general, these models may be linked because the probability of resolving an attitude error, for example, may require that the power and communications functions are operational. Each functional Markov model contains states that describe fully operational and degraded modes as well as how that function is repaired following an event. The probability of an event (or failure) can be approximated by using statistical failure data for similar functions or can be determined from a fault tree model of the reliability of the function's components. The recovery process is generally more complicated and involves several intermediate states, termed Recovery Processes, as shown in Figure 2.


Figure 2: Event-Recovery Process

The probability that an event occurs and the transition probabilities from one state to the next in the Recovery Process are dependent on the level of automation. For example, whether or not a human is in the loop will affect the probability that a solution is determined. In general, each transition in the process requires a processor element and, if automated, software or hardware. Depending on the level of automation, the transition probabilities are determined using statistical data on human reliability and fault trees to determine the probability that all the required components are functioning.
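As a minimal sketch of how a transition probability might be assembled from a fault tree, the example below combines assumed component reliabilities through AND and OR gates. The component values and tree structure are illustrative, not drawn from the paper's models.

```python
# Sketch of deriving a transition probability from a simple fault tree:
# the transition succeeds only if all required components work (AND),
# and a redundant pair works if either unit works (OR). All component
# reliabilities are illustrative assumptions.

def and_gate(*ps):
    """All inputs must work."""
    out = 1.0
    for p in ps:
        out *= p
    return out

def or_gate(*ps):
    """At least one input must work."""
    out = 1.0
    for p in ps:
        out *= (1.0 - p)
    return 1.0 - out

# Example: the transition requires the processor AND at least one of two
# redundant transmitters.
p_processor = 0.999
p_transmitter = 0.99  # each of two redundant units
p_transition = and_gate(p_processor, or_gate(p_transmitter, p_transmitter))
print(p_transition)
```

A human-performed transition would substitute a human error probability for one of the component reliabilities in the same tree.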

As described earlier, some transitions may require that several functions are operating in order to occur. For example, the probability of recovering from an attitude error may depend on the probability that both the power and communications functions are operational. To simplify the computations, the Markov models are run in parallel using the assumption that the operational probabilities are independent between functions. For example, if the overall probability that power is operational is 0.9, and the overall probability that the communications function is operational is 0.9, the methodology here assumes that the probability that both are operational simultaneously is 0.9 x 0.9 = 0.81. This independence in general will not be the case because the state of the communications system may be conditional on the state of the power system. That is, the probability that communications are operational may be 1 if power is operational but 0 if power is not operational. In that case, the probability that both power and communications are operational is 0.9, not 0.81. Therefore, the calculations performed by the parallel Markov models, which assume independence, are not exact. However, for functions that are weakly coupled, the conditional and overall probabilities are similar, and the error has been found to be relatively small.
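The independence approximation can be illustrated directly with the numbers from the text: the parallel models multiply the overall operational probabilities, while a fully coupled system would behave quite differently.

```python
# Illustration of the independence assumption used by the parallel
# Markov models (numbers taken from the example in the text).
p_power = 0.9  # overall probability that the power function is operational
p_comm = 0.9   # overall probability that communications are operational

# Parallel-model approximation: treat the functions as independent.
p_both_independent = p_power * p_comm  # 0.9 x 0.9 = 0.81

# Fully coupled extreme: communications work if and only if power works,
# i.e. P(comm | power) = 1 and P(comm | no power) = 0.
p_both_conditional = p_power * 1.0     # 0.9

print(p_both_independent, p_both_conditional)
```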

LEVELS OF AUTOMATION

In general, a human operator interacts with a ground control station which uplinks commands to a processor on the satellite. The satellite processor implements commands and downlinks data to the ground station which may then present the data to the human operator. Thus the human, the ground station, and the satellite can each be considered as separate information processors. The interactions between the human and ground station and between the ground station and satellite can each be automated to varying degrees. The level of automation (LOA) changes the manner in which an event is resolved. Specifically, the LOA determines whether the primary decision maker is the satellite's on board processor, the ground computer, or the human operator. Depending on the LOA, a secondary or tertiary processor may also take over in the event that the primary processor fails to complete the recovery process. For example, if the satellite processor is unable to resolve a problem, it may cue a human operator who then performs the necessary actions. This inability of the primary processor to resolve an event is referred to as a processor stall. When an event occurs, each processor has some probability of successfully resolving the problem or of stalling. Additionally, each processor has an associated mean time to completion (mttc) for the task. The mttc affects the time it takes to resolve the problem, and therefore impacts down-time and revenues.
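The stall-and-handoff behavior can be sketched as follows. The per-processor success probabilities and mean times to completion are assumed values for illustration, and a stalled processor is assumed to consume its full mttc before handing off.

```python
# Sketch of event resolution with a chain of fallback processors. Each
# processor either resolves the event (with probability p_success) or
# stalls and hands the event to the next processor in the chain; the
# recovery fails if the last processor stalls. All numbers are assumed.

def resolve(chain):
    """Return (overall success probability, expected time to resolution
    conditioned on eventual success). `chain` is a list of
    (p_success, mttc_hours) tuples, primary processor first."""
    p_reach = 1.0     # probability the event reaches this processor
    p_total = 0.0     # overall probability of successful resolution
    t_weighted = 0.0  # success-probability-weighted completion time
    elapsed = 0.0     # time consumed by stalled processors so far
    for p_success, mttc in chain:
        p_here = p_reach * p_success
        p_total += p_here
        t_weighted += p_here * (elapsed + mttc)
        elapsed += mttc                # a stall still consumes the mttc
        p_reach *= (1.0 - p_success)   # remaining probability stalls onward
    return p_total, t_weighted / p_total

# Example: fast satellite processor backed up by a slower human operator,
# roughly corresponding to the Paging LOA.
p, t = resolve([(0.95, 0.1), (0.99, 4.0)])
print(p, t)
```

Under these assumptions the fallback raises the success probability above either processor alone, at the cost of a longer expected resolution time whenever the primary stalls.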

In general, the control of a remotely-operated vehicle such as a satellite may be accomplished by automating the vehicle itself or by automating the ground control center. Many combinations of LOA are therefore possible. Table 1 shows one example set of LOAs.

Table 1: Example Levels of Automation
LOA | Description
Fully Automated | Satellite performs function with no communication to ground. No ability to recover if satellite processor stalls.
Paging | Satellite performs function, but ground segment is notified if the processor stalls. In such a case, a ground controller then resolves the problem using data filtered from the satellite.
Supervision | Satellite performs function, but ground segment supervises recovery activities. Ground is aware of the satellite's progress (through filtered data from the satellite) and can override any actions, but is not required to intervene unless a stall occurs.
Cueing | Satellite performs function and suggests possible solutions. Ground segment must verify all solutions (using filtered data) before they can be implemented.
Data Filtering | Satellite downlinks raw telemetry data to ground. Ground segment then filters and processes the data, and a human controller performs the function.
No Automation | Satellite downlinks raw telemetry data to ground. A human controller then performs the function without any filtering or processing.

COST FACTORS

The life cycle costs attributed to automation are separated into two main branches: development / operating costs, and opportunity costs. The former results from direct expenses while the latter is related to lost revenue due to system down-time caused by failures. The development and operating costs are further separated into three components: software, hardware, and personnel. As automation is introduced, there is a need for additional software. This software cost, along with the increased processing and power requirements will result in an increase in development costs. There may also be a need for additional hardware as a result of automation. Conversely, automation could eliminate the need for some hardware such as computer interfaces for human operators.

Operating costs will also change. As automation increases, the overall workload of human operators will decrease. Each operator will be responsible for more functions or satellites, and fewer operators will be needed to maintain the constellation. This can result in a significant reduction in operating costs. Not only is the number of operators reduced, but there is also a reduction in support staff and overhead associated with these operators. However, if the software that has been implemented is unreliable, humans will be needed to resolve processor stalls.

The opportunity costs are linked to reliability. For commercial ventures, down time on a satellite results in a loss of revenues. This loss is an opportunity cost, and since automation affects reliability and repair times, automation also has an effect on this opportunity cost. Often the relationship between cost and reliability is not so clear. For scientific missions, the "revenue" is actually science, and it is unclear how to convert scientific discovery into dollars that can be traded against other costs. Also, there are further costs which are attributed to excessive down time. These costs may take the form of a loss of customer satisfaction, or a loss of public support for the program. Thus, there may be subjective factors that require that the system reliability be higher than would be optimal in a cost-only sense.

SOCRATES

SOCRATES (Satellite Operations Cost and Reliability Analysis Toolkit for the Evaluation of Systems) is a software tool developed by MIT and the Charles Stark Draper Laboratory that links a graphical user interface with a Markov modeling approach. The user is able to interactively define the system in detail, perform the functional decomposition, and specify the levels of automation for each function. Cost, revenue, and reliability data for each function (based on its level of automation) are required as inputs. The tool then generates data files that are used by the Markov engine to generate the cost and reliability outputs by which candidate systems are evaluated. SOCRATES is modular in that subsystems and satellites can be saved or modified and used in future studies. Currently, the tool is implemented on a Sun workstation using C++ for the Markov engine, and tcl/tk to run the graphical user interface.

EXAMPLE APPLICATION OF SOCRATES

This section illustrates the methodology through an analysis of a generic GEO communication satellite. SOCRATES was used to create the functional model, generate the cost and Markov models, and to perform the analysis. It is assumed that the satellite is automated such that the level of automation is the same among all of its functions using the categories from Table 1. This assumption is a simplification, but it allows the methodology to be easily demonstrated.

FUNCTIONAL DIAGRAM

Figure 3 shows the functional decomposition of the system. Note that the Mission Objective (the communications payload) is drawn directly from the system level and not from the satellite as are all other space-based functions. This reflects the fact that both satellite and ground segments are required for the payload to provide revenue. Also, for simplicity, the subsystems are represented by single functions; in general, each subsystem would have several functions.


Figure 3: GEO Communications Satellite

COST MODEL

The cost model is based on models given in Wertz & Larson [10], including software code sizes, throughput, and costs. This model is a version of the U.S. Air Force's cost model for unmanned space systems. A complete description of the cost model for this example is available from the authors [11].

Figure 4 shows a summary of the development (white) and operating (black) costs for varying levels of automation. The operating costs decrease as the LOA increases because fewer personnel are required as the amount of necessary human intervention decreases.


Figure 4: Development & Operating Cost vs. LOA (Present Worth Value (PWV): i=12%)

The development costs have a more complex relationship with LOA. Although the satellite hardware and launch costs are nearly constant, the ground station development costs and software costs vary considerably. The software costs were obtained by applying a cost of $190 per source line of code (sloc) for ground software, and $375 per sloc for space software [10]. For Fully Automated, Paging, Supervision, and Cueing LOAs, the fault detection and correction software resides on the spacecraft; thus this software costs $375 per sloc. For Data Filtering and No Automation, the software resides on the ground ($190 per sloc). In addition to the fault detection and recovery software, there is also a need for software to perform automation-related tasks. For example, with Data Filtering, software is required to perform the data reduction task independent of the fault detection and recovery tasks. Ground station facilities and equipment costs were determined as suggested in Wertz & Larson with the assumption that these costs are proportional to the total ground software cost [10].

The combination of software development and ground facility costs results in a development cost curve that increases from No Automation to Supervision LOA, and then decreases slightly towards Fully Automated LOA. This behavior is due to the fact that, as modeled here, software costs are low relative to facilities costs. When development costs are coupled with operating costs, Figure 4 shows that in this example it is more cost-effective to move to a Fully Automated system. Whether a change in reliability with a Fully Automated system will offset the reduced development and operating costs is shown in the next section.

RELIABILITY MODEL

Although the primary focus of the methodology is to minimize life cycle costs (and thus maximize return on investment), reliability must also be considered. For commercial systems, satellite failures that temporarily bring the system down result in periods during which revenue cannot be generated. This results in an opportunity cost which must be considered when calculating the life cycle cost of the system. For the purposes of this example, the profit generated over the life cycle accounts for lost revenue directly.

An example set of transition probabilities for a Fully Automated attitude control function is shown in Table 3. A complete set of transition probabilities for all functions at all LOA is available from the authors [11]. Software reliabilities were determined using several assumptions because existing models in software reliability are relatively immature [12]. Most are formed on the basis that software errors are evenly distributed throughout the code. For the purpose of this example, it is assumed that space-based software has a baseline reliability of 0.9999 for each 1000 sloc. It is also assumed that the code is executed once each event, although all portions (subroutines) of the code may not be run. Code which is twice as long will contain twice as many errors: a 2000-line section of code has a reliability of 0.9998. Also taken from Wertz & Larson were hardware reliabilities for basic satellite components. For simplicity, the details of the reliability and cost models are not given here since the example is meant to demonstrate the methodology, and not to be an exhaustive case study. A human reliability model was also constructed based on observed human error probabilities [13].
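The sloc-based software reliability assumption can be written compactly as an exponential scaling in code size; only the 0.9999-per-1000-sloc baseline comes from the text.

```python
# Software reliability under the even-error-distribution assumption used
# in the example: each 1000 sloc has a per-event reliability of 0.9999,
# so a module of n sloc has reliability 0.9999 ** (n / 1000).
BASELINE_RELIABILITY = 0.9999  # per 1000 sloc, per event

def software_reliability(sloc):
    return BASELINE_RELIABILITY ** (sloc / 1000.0)

# A 2000-line module, as in the text, comes out to about 0.9998.
print(round(software_reliability(2000), 4))
```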

Table 3: Example Transition Probabilities: Fully Automated Attitude Control Function
State Transition | Transition Probability
Failure | 0.086 per 10 years
Failure Detection | 0.99964 per event
Failure Communication | 0.99993 per event
Solution Determination | 0.99914 per event
Solution Implementation | 0.99986 per event

Figure 5 shows the payload's operational probability over the lifetime of the system as a function of the LOA. The curves are not perfectly smooth due to the time steps used in the Markov engine. As shown in Figure 5, the data are grouped into three main categories. These categories are determined by the assigned primary processor. For the Fully Automated LOA, the reliability is determined by the on-board processor alone. In this example, the reliability of Full Automation is high; thus the operational probability of the Fully Automated system is higher than that at other LOAs. For Paging, Supervision, and Cueing, a common level of processor reliability is assumed: reliability is therefore similar across these LOA. However, for Paging, Supervision, and Cueing, the human operator performs some tasks (with a relatively low reliability). This introduces additional failure modes, and the reliability decreases from the Fully Automated level. For Data Filtering and No Automation, the processing shifts entirely to the generally less-reliable human.


Figure 5: Payload Operational Probability vs. Time

Note that in this example, processor stall was not considered; that is, robustness to unexpected situations has been overestimated. In reality, at all LOA except Fully Automated, humans can play some role in decision making and can resolve deficiencies in the automation. Thus, the reliabilities shown here for the Fully Automated LOA are generally higher than would result from a more detailed analysis.

EXPECTED REVENUE AND PROFIT

A revenue of $35 million per month was arbitrarily assumed for a fully operational satellite. When combined with the reliability curve in Figure 5, the total expected revenue that results is shown in Figure 6.
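A sketch of how the expected revenue follows from the operational probability is given below. The $35 million per month figure is from the text; the probability series is an assumed stand-in for the Figure 5 data.

```python
# Sketch of the expected-revenue calculation: a fully operational
# satellite earns $35 million per month (from the text), scaled by the
# payload's operational probability in each month. The probability
# series below is an illustrative assumption, not the Figure 5 data.
REVENUE_PER_MONTH = 35e6  # dollars

def expected_revenue(operational_probability_by_month):
    return sum(p * REVENUE_PER_MONTH for p in operational_probability_by_month)

# Assume a payload whose operational probability decays slightly each
# month over a 10-year mission.
probs = [0.99 * (0.999 ** m) for m in range(120)]
print(expected_revenue(probs) / 1e9)  # expected revenue in $ billions
```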


Figure 6: System Revenues vs. Time

Figure 7 shows the result of combining the development and operating costs with the expected revenues from Figure 6. Because a Fully Automated LOA is both more reliable (Fig. 5) and has lower cost (Fig. 4), it results in the largest expected profit. Paging and Supervision LOAs result in less expected profit due to higher operating costs and lower reliability as human operators become a larger part of the process. Cueing LOA performs better than Supervision and Paging due to lower development costs and slightly higher reliability because the human operator and software must both agree to each action that is performed. Because of the much larger operating costs and lower reliabilities associated with Data Filtering and No Automation LOAs, a human-intensive system design appears to be the least effective choice in this example.


Figure 7: System Profit vs. Time

The example above demonstrates the methodology used in determining the degree of automation that leads to the minimal life cycle costs for a given system. This example was simplified to ease understanding of the methodology. For an actual study, stall behavior would have to be accounted for. In addition, the cost model must be re-evaluated to ensure that the assumptions made are valid for highly-autonomous systems. This example also only examined a few of many possible combinations of LOA. A more complete study would require analysis over a wider range of automation conditions.

CONCLUSIONS

PROBLEM STATEMENT & SOLUTION / TOOL

The methodology presented in this paper quantifies the effects of automation through a reliability-based approach that captures both direct and opportunity costs in the calculation of life cycle costs. The end-to-end system is decomposed into functions, where each function's event-recovery process is represented with a Markov model. Each transition in the Markov model is linked to the processor reliabilities and mean times to completion, as well as the net reliability of the hardware components and other functions which are determined through the use of fault trees. The individual reliabilities of the functions are calculated while maintaining some degree of interdependence among functions. From these reliabilities, the overall system reliability is determined, thus producing the opportunity costs which must be added to production, development, and operating costs.

A software tool (SOCRATES) has been written to allow system engineers to enter cost and reliability data and to specify a fault tree representation of the system architecture through a graphical user interface. The tool allows the user to interactively vary the levels of automation for each function, and resulting costs and reliabilities can be plotted for comparison. As more advanced human and software reliability and cost models are made available, they can be incorporated into SOCRATES since the tool has been written in a modular manner. This will allow the tool to mature during later stages of refinement.

CAPABILITIES AND LIMITATIONS

As stated above, once mature, the methodology above will allow system engineers to estimate the effects that automating any function will have on the overall system reliability and life cycle cost. This in turn will lead to more confidence in making decisions in an industry that is traditionally conservative as a result of the high risks and costs involved. Future versions of the software tool will make extensive use of databases to allow baseline systems to be modified with little additional effort. In addition, the software tool has been written in a modular manner that allows for modification of the cost model as well as human and software reliability models. This will allow the basic backbone of the tool to be re-used and adapted as higher-fidelity models become available.

Although some functional interdependence has been accounted for, such as the fact that the attitude control system requires the power system to be operational, a deeper level of dependence is not yet modeled. Updates are made to the operational probabilities of the separate functions with time, but the conditional operational probabilities are not calculated. Capturing the conditional probabilities would require merging the functional models into a single larger model, which greatly increases the model's complexity and the computer resources required to run it; this makes modeling even a moderately-sized system unmanageable. However, SOCRATES could be linked to a dedicated Markov modeling tool, such as Draper Laboratory's CAME tool, that is able to handle large interdependent systems [14].

FUTURE WORK

Future work is underway to develop methods that manage conditional reliabilities with significantly higher fidelity at only a minor increase in required computational resources. In addition, the cost model will be enhanced so that it is applicable to highly autonomous systems. Models will also be developed to better represent software and human reliabilities. Work will also be performed to generate databases of anomaly frequencies as a function of mission and orbit, as well as an outline for the generation of generic satellite bus and ground station architectures. SOCRATES will also be linked to databases that can be used to generate baseline satellite or control center architectures, thereby reducing the effort needed to perform future trade studies.

The methodology will also be enhanced to aid in performing parametric sensitivity studies. The sensitivity of system reliability or revenue can be determined as functions of inputs such as cost, component failure rates, and software / human errors. Thus, efforts can be focused on those parameters that have the greatest impact on cost and reliability even when exact values are uncertain.
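Such a sensitivity can be approximated by finite differences around a baseline parameter set. The profit model below is a hypothetical stand-in for the full SOCRATES output; only the finite-difference machinery is the point of the sketch.

```python
# Sketch of a finite-difference parametric sensitivity study.
# `profit_model` is a hypothetical stand-in for the SOCRATES
# cost/reliability output.

def profit_model(params):
    # Toy linear model: profit rises with reliability, falls with
    # component failure rate. Coefficients are arbitrary.
    return 100.0 * params["reliability"] - 500.0 * params["failure_rate"]

def sensitivity(model, params, name, rel_step=1e-6):
    """Approximate d(model)/d(params[name]) with a central difference."""
    h = max(abs(params[name]) * rel_step, rel_step)
    hi = dict(params, **{name: params[name] + h})
    lo = dict(params, **{name: params[name] - h})
    return (model(hi) - model(lo)) / (2.0 * h)

base = {"reliability": 0.99, "failure_rate": 0.01}
print(sensitivity(profit_model, base, "reliability"))   # ~ 100
print(sensitivity(profit_model, base, "failure_rate"))  # ~ -500
```

Ranking the parameters by the magnitude of these derivatives indicates where modeling effort and better data would pay off most.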

REFERENCES

[1] Anderson, Christine M. USAF Phillips Lab Presentation. Chief Satellite Control & Simulation Division.

[2] Farmer, Mike, and Culver, Randy. "The Challenges of Low-Cost Automated Satellite Operations." Loral Federal Services Corp. Colorado Springs, CO, 1996.

[3] Hornstein, Rhoda Shalter. "Reengineering the Space Operations Infrastructure: A Progress Report From NASA's Cost Less Team for Mission Operations." NASA, Washington, DC, 1995.

[4] Hovanessian, S.A., Raghavan, S.H., and Taggart, D.A. "Lifeline, A Concept for Automated Satellite Supervision." Aerospace Report TOR-93(3516)-1 Aerospace Corp., El Segundo, CA, August, 1993.

[5] Smith, Dan. "Operations Innovations for 48-Satellite Globalstar Constellation." Globalstar Satellite Operations Control Center, Loral AeroSys, Seabrook, MD, 1996.

[6] Sheridan, Thomas, Telerobotics, Automation, and Human Supervisory Control, MIT Press, Cambridge, MA 1992.

[7] Hornstein, Rhoda Shalter. "On-Board Autonomous Systems: Cost Remedy for Small Satellites or Sacred Cow?" 46th International Astronautical Congress, Oslo, Norway, Oct. 2-6, 1995.

[8] Babcock, Philip S. IV. "An Introduction to Reliability Modeling of Fault-Tolerant Systems." CSDL-R-1899, The Charles Stark Draper Laboratory, Cambridge, MA, 1986.

[9] Scheafer, Richard L. Introduction to Probability and Its Applications. PWS Kent Publishing Co., Boston, MA, 1990.

[10] Wertz, James R., and Larson, Wiley J. Space Mission Analysis and Design. Second Edition. Microcosm, Torrance CA, and Kluwer Academic Publishers, Boston, MA, 1992.

[11] Schwarz, Robert E., "SOCRATES Demonstration Packet", MIT Department of Aeronautics and Astronautics Report, May, 1996.

[12] Misra, Krishna B., New Trends in System Reliability Evaluation, Elsevier, NY, 1993.

[13] Park, Kyung S. Advances in Human Factors / Ergonomics Vol. 7: Human Reliability. Elsevier, NY, 1987.

[14] Babcock, P.S., Rosch, G., and Zinchuk, J.J. "An Automated Environment for Optimizing Fault-Tolerant Systems Designs." Reliability and Maintainability Symposium, Orlando, FL, January, 1991.