

Fault tolerance in large scale systems: Hybrid and distributed approaches


Marta Capiluppi


The increasing complexity of modern dynamical systems calls for new methods to address the problems that arise in their study. The aim of this thesis is to review existing methods and introduce new ones for the diagnosis of complex systems. A system can be defined as complex when it can be decomposed into many subsystems that have different tasks and cooperate to achieve a main global goal. Moreover, a complex system generally involves communication policies among its subsystems and a structure that depends on how these subsystems are connected. Examples of complex systems are found in many different fields, from automotive to process engineering, and from aeronautical applications to manufacturing plants. Since the term “complex” is fairly general, this work introduces several more specific classes, with a growing level of complexity.
A basic representation of a class of complex systems is given by distributed systems. These are systems in which the main function is distributed among subsystems, called nodes, interconnected through a network. Each subsystem has its own function, and the way the subsystems are connected is related to the global function the system has to perform. The notion of distributed systems is known in both the computer field and the control field. The main difference concerns the nature of the nodes. In distributed computing the nodes are computers, hence the system is generally cast in the discrete event setting. In distributed control systems the nodes are dynamical systems, so their continuous dynamics can be taken into account. In a more general framework, the study of distributed dynamical systems can be related to the study of hybrid systems, i.e. systems in which continuous and discrete dynamics interact. In this thesis fault tolerance in distributed systems is studied at different levels of depth.
First, an architecture for the decentralised representation of distributed systems is recalled. Following the distributed nature of the systems, the architecture is designed in a modular and hierarchical way. To this end, a functional decomposition of the system is performed to identify the nodes that perform the different sub-tasks (also called functions or functionalities) fundamental to the fulfilment of the main goal of the system. The next step is to further decompose the nodes in order to decentralise the tasks of fault detection and reconfiguration. This is obtained by applying the same functional policy together with structural analysis of the subsystems. This step is useful in large scale systems because local diagnosis can detect the causes of a failure at a lower level, and the counteraction time can be reduced in comparison with a centralised technique. A method is presented in which detectable losses of functions (called failure modes) are obtained using structural analysis and a functional graph. The failures, i.e. losses of functions, are detected using standard residual analysis tools; in this case, however, the residual signals are not themselves indicators of the occurrence of a physical fault, but are combined to create indicators of the occurrence of a failure. Following the same considerations, the reconfiguration strategy is distributed throughout the system, creating local specifications for the new modes of operation (working modes) in case of failure. The specifications are designed considering only the discrete transitions due to the fault events. To this end, the theory of supervision of discrete event systems is used. As stated above, a deeper analysis of the distributed system reveals the continuous dynamics of the nodes. The interaction between these dynamics and the discrete dynamics given by the communication policies and the working mode changes calls for a more fitting representation.
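The idea of combining residual signals into indicators of a lost function can be illustrated with a minimal Python sketch. It is not the method of the thesis: the residual names, thresholds, function names and the "any associated residual fires" combination rule are all assumptions made for illustration.

```python
from typing import Dict, List


def fired_residuals(residuals: Dict[str, float],
                    thresholds: Dict[str, float]) -> Dict[str, bool]:
    """Standard residual evaluation: a residual fires when its
    magnitude exceeds its threshold."""
    return {name: abs(value) > thresholds[name]
            for name, value in residuals.items()}


def failure_indicators(fired: Dict[str, bool],
                       function_residuals: Dict[str, List[str]]) -> Dict[str, bool]:
    """Combine residuals per function. Assumed rule: a function is
    flagged as lost if any residual associated with it has fired."""
    return {function: any(fired[r] for r in names)
            for function, names in function_residuals.items()}


# Hypothetical residual values, thresholds and function-to-residual map.
residuals = {"r1": 0.02, "r2": 0.90, "r3": 0.01}
thresholds = {"r1": 0.1, "r2": 0.5, "r3": 0.1}
function_residuals = {
    "measure_level": ["r1", "r3"],
    "regulate_flow": ["r2", "r3"],
}

fired = fired_residuals(residuals, thresholds)
indicators = failure_indicators(fired, function_residuals)
print(indicators)
```

In this sketch only `r2` exceeds its threshold, so only the hypothetical function `regulate_flow` is flagged as lost. A real design would derive the function-to-residual map from the structural analysis and the functional graph rather than writing it by hand.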
Hybrid systems provide a representation that captures the interaction between continuous and discrete dynamics. Unfortunately, few diagnostic methods for this kind of system have been presented in the literature. The last part of this thesis recalls some of them and introduces the modelling of faults in different classes of hybrid systems. The methods for fault diagnosis are compared, and some possible extensions and improvements are mentioned. The final level of complexity considered in this work is given by distributed hybrid systems. A distributed system is, in its general representation, hybrid. When the nodes of a distributed system are themselves hybrid systems, however, the result is a distributed hybrid system. One of its possible configurations is a network of hybrid systems. Some attempts to deal with this kind of representation have been made in recent years. Here a method to handle faults in the connections of networks of hybrid systems is introduced. The method uses the qualitative representation of hybrid systems and the automata representation of communication channels. The diagnostic task is decentralised through the use of fault detection units inside each node and a supervisor for each pair of communicating nodes. The main idea is to distinguish between faults in the nodes and faults on the connections by comparing the results of the fault detection units with the results of the supervisor.
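The comparison between the fault detection units and the pair supervisor can be sketched in Python as follows. This is only a simplified illustration under stated assumptions, not the procedure of the thesis: the supervisor is reduced to a check that the events sent by one node match those received by the other, and the decision rule and all names are hypothetical.

```python
from enum import Enum
from typing import Sequence


class Diagnosis(Enum):
    NO_FAULT = "no fault"
    NODE_FAULT = "node fault"
    CONNECTION_FAULT = "connection fault"


def supervisor_alarm(sent: Sequence[str], received: Sequence[str]) -> bool:
    """Hypothetical pair supervisor: raises an alarm when the events
    received by one node differ from the events sent by the other."""
    return list(sent) != list(received)


def diagnose_pair(fdu_a: bool, fdu_b: bool, alarm: bool) -> Diagnosis:
    """Assumed decision rule for one pair of communicating nodes.

    fdu_a, fdu_b: outputs of the fault detection units inside each node.
    alarm: output of the supervisor watching the connection.
    """
    if fdu_a or fdu_b:
        # A local unit reports a fault, so the fault is placed in a node.
        return Diagnosis.NODE_FAULT
    if alarm:
        # Both nodes look healthy but the exchange is inconsistent.
        return Diagnosis.CONNECTION_FAULT
    return Diagnosis.NO_FAULT


# Hypothetical scenario: node A sends two events, node B receives one.
alarm = supervisor_alarm(["open", "close"], ["open"])
result = diagnose_pair(fdu_a=False, fdu_b=False, alarm=alarm)
print(result.value)
```

Here both local units report no fault while the supervisor sees a mismatch, so the sketch attributes the fault to the connection. The thesis works with qualitative models of the nodes and automata models of the channels, which this boolean rule does not attempt to reproduce.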


Type of Publication:

(03)Ph.D. Thesis

% Autogenerated BibTeX entry
@PhDThesis { Xxx:2007:IFA_2800,
    author={Marta Capiluppi},
    title={{Fault tolerance in large scale systems: Hybrid and
	  distributed approaches}},