While the SimGrid project finds its roots in the Grid research community, it mainly centered on distributed application scheduling; hence, computation and communication models have been favored. Yet the generic "Grid" term also encompasses DataGrids, whose main focus is on storage and large data transfers. The main objective of this task is to extend SimGrid with the missing models and APIs for storage resources. The work done in this task will be driven by two practical use cases.
Loosely coupled large distributed systems such as Peer-to-Peer (P2P) networks or Volunteer Computing (VC) systems are characterized by their heterogeneity, their dynamicity and their low connectivity. Designing efficient solutions under such conditions often requires decentralized algorithms exploiting any locality present in the underlying physical infrastructure. We propose to study such systems through two different case studies:
This task aims at adding new interfaces to SimGrid to enable the simulation of applications in IaaS (Infrastructure as a Service) clouds. First, SimGrid must allow conducting experiments on cloud resource management from the cloud-provider side. Virtualization has recently raised great challenges, as it opens a wide range of possibilities in the management of virtual resources with respect to physical ones. The expected benefit of this work is to enable the comparison of different resource management strategies. Examples of issues we could help study are: the effect of anticipated provisioning of virtual machines (VMs) to face peak demands, allocation algorithms that map VMs to physical hosts, or the placement of VM images with respect to storage capabilities and its impact on VM startup or migration. Simulation will make it possible to evaluate these complex settings in the light of different metrics, e.g., performance, energy consumption, and resource usage, that are critical to the performance and profitability of IaaS platforms. Second, SimGrid must allow the simulation of a completely autonomous cloud platform in order to conduct experiments from the cloud-client side. Cloud resources are increasingly used by applications, but the problem of dimensioning and selecting the right resources or billing model is often solved in an empirical and sub-optimal way. We expect simulation to be a valuable approach for an IaaS client to evaluate and compare different resource renting strategies, using metrics such as performance and economic cost.
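As an illustration of the kind of allocation strategy such simulations could help compare, the sketch below implements a first-fit decreasing placement of VMs onto physical hosts. All names, demands, and capacities are hypothetical, and this is a standalone sketch rather than an actual SimGrid interface; in a real study, such a strategy would be plugged into the simulator and evaluated against alternatives.

```python
def place_vms(vm_demands, host_capacities):
    """First-fit decreasing placement (illustrative sketch).

    vm_demands: resource demand of each VM (single scalar dimension).
    host_capacities: capacity of each physical host.
    Returns a dict mapping each VM index to a host index,
    or to None when no host has room left.
    """
    free = list(host_capacities)  # remaining capacity per host
    placement = {}
    # Consider the largest VMs first: "decreasing" part of the heuristic.
    for vm in sorted(range(len(vm_demands)), key=lambda i: -vm_demands[i]):
        for host, capacity in enumerate(free):
            if vm_demands[vm] <= capacity:  # first host that fits wins
                free[host] -= vm_demands[vm]
                placement[vm] = host
                break
        else:
            placement[vm] = None  # rejected: no host can accommodate it
    return placement
```

A simulator would replace the scalar demands with multi-dimensional ones (CPU, memory, I/O) and compare this heuristic against, e.g., best-fit or load-balancing policies under the metrics listed above.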
The evaluation of proposals for IaaS solutions through real experiments raises the same issues (scalability, reproducibility, etc.) as the studies previously carried out on grids. It is even more difficult when we have no knowledge of, or control over, the physical resources. Furthermore, experiments on commercial clouds entail a billing process that is difficult to cope with from a practical viewpoint. For all these reasons, the simulation of IaaS clouds can be largely beneficial to the scientific community.
The work started in USS SimGrid made it possible to predict the performance of classical benchmarks on small-scale commodity clusters. This work mainly concerned the design and validation of accurate network models and simple performance models for a small set of regular applications. This task will be mainly devoted to pushing this work further to take into account the specific constraints of high performance computing. The objective is to design and implement new models able to simulate the behavior of complex applications running on modern high performance platforms. Such systems include multicore nodes with complex memory hierarchies, high performance networks, and potentially I/O subsystems. These tools will be very useful to the high performance computing community, either to validate applications on very specific platforms or to tackle algorithmic issues (such as task scheduling) that are very sensitive to the characteristics of the underlying platform. Our goal is to better understand the impact of the platform on the performance of complex applications, which will prove valuable both for improving these applications and for capacity planning. Note that the expected outcome of this task is to provide specific models that will help the developers of complex high performance applications validate their applications and understand the key issues in reaching "good" performance.
The atypical scale of targeted applications induces severe requirements on the performance of the simulation kernel, while the diversity of addressed domains requires a high level of modularity. Addressing both challenges jointly results in difficult software design choices, which will be proposed and assessed in this task.
As explained in the previous sections, the first subtasks of all domain tasks will be devoted to the development of ad hoc models. Yet, many domains share common characteristics and models from one domain are likely to be reused in another domain and even improved if needed. The goal of Task 6 is thus twofold.
First, it will support the domain tasks with the experience of model development for SimGrid. Modeling is the art of trade-off. A sound model obviously needs to reflect the phenomenon under study. To this end, it is tempting to design extremely detailed and precise models. Yet such "microscopic" models often prove very slow and ill-adapted to large systems. Furthermore, the more complex a model, the harder it is to instantiate, and thus the more likely the study will be compromised. Sound models need to be developed with both the intuition of experts from the domains and the experience of model developers from the simulation community. Moreover, the validation and implementation of models require several steps that experts from application domains may not be familiar with.
Second, many domains have common interests in the modeling of specific resources. Task 6 will be responsible for keeping track of the experience and developments made in each domain. When applying a model from one domain to another, several adjustments will probably have to be made. Even though the code may not be reusable as such, knowing about common pitfalls, or for which workload a model was designed, will prove crucial to reusing experience from other domains.
Any performance study relies on a set of assumptions that are often neither precisely stated nor assessed. For example, when studying the throughput of a system, one assumes it has entered steady state. When evaluating a data-driven scheduling algorithm, one assumes the corresponding input workload puts heavy load on the network. When evaluating the fairness of a resource sharing algorithm, one expects the workload to be such that users truly interfere with each other. When using an input trace from a real system, one may implicitly expect it to be truly random, whereas a parasitic, purely deterministic behavior may pollute it and thus compromise higher-level hypothesis testing mechanisms. It is also generally assumed that the system (or the simulator) behaves as expected and complies with its specifications.
Most of these hypotheses are difficult to state formally, and a visual analysis to check them and guide intuition is thus of the utmost importance. If such analysis shows that the system behavior goes against common sense, this discrepancy has to be investigated, either to check for modeling flaws or to confirm that an unexpected phenomenon has been discovered.
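As a concrete example of such a sanity check, the sketch below applies a Wald-Wolfowitz runs test to a trace that is assumed random: a value sequence with deterministic structure (alternation, trend) produces a z-score far from zero, flagging the assumption as suspect. This standalone snippet is illustrative only and not part of SimGrid; it assumes the trace contains values on both sides of its median.

```python
import math
from statistics import median

def runs_test_z(trace):
    """Wald-Wolfowitz runs test on a numeric trace.

    Values are dichotomized around the median, and the number of runs
    (maximal blocks of identical symbols) is compared to what a truly
    random sequence would produce. Returns a z-score: |z| well above 2
    suggests non-random (e.g., deterministic) structure.
    """
    signs = [1 if v > median(trace) else 0 for v in trace]
    n1 = sum(signs)               # values above the median
    n2 = len(signs) - n1          # values at or below the median
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - expected) / math.sqrt(variance)
```

A perfectly alternating trace yields a strongly positive z (too many runs), while a monotone trend yields a strongly negative one (too few); either way, the "truly random" hypothesis should be revisited before higher-level conclusions are drawn.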
Since simulation-based studies are much cheaper than real experiments, most studies rely on campaigns comprising thousands of simulations. Usually, the rationale behind such a large number of experiments is to ensure some statistical confidence in the estimation of the quality of a given approach. Yet performing such a study raises several issues that all need to be addressed to make it sound.
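For instance, once replications of a simulation have been collected, a confidence interval on the metric of interest indicates whether their number suffices. The sketch below is a hypothetical helper, not a SimGrid API; it uses a normal approximation (adequate for large replication counts) both to compute the interval and to estimate how many replications are needed for a target precision.

```python
import math
from statistics import mean, stdev

def confidence_interval(samples, z=1.96):
    """Mean and half-width of an approximate 95% confidence interval.

    Uses the normal quantile z; for small sample sizes a Student-t
    quantile would be more appropriate.
    """
    half_width = z * stdev(samples) / math.sqrt(len(samples))
    return mean(samples), half_width

def replications_needed(sd_estimate, target_half_width, z=1.96):
    """Replication count needed to shrink the half-width to the target,
    given a pilot estimate of the standard deviation."""
    return math.ceil((z * sd_estimate / target_half_width) ** 2)
```

A typical campaign would run a pilot batch, estimate the standard deviation, then use the second helper to size the full campaign, which directly exposes the trade-off between statistical confidence and simulation cost.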