Project Description

Application Domains

  • WP1 [Data]Grid
  • WP2 Peer-to-Peer and Volunteer Computing
  • WP3 Clouds
  • WP4 High Performance Computing

Simulation Pillars

  • WP5 Efficient Simulation Kernel
  • WP6 Concepts and Models
  • WP7 Analysis and Visualization
  • WP8 Support to Experimental Methodology

WP 1: [Data]Grid

While the SimGrid project finds its roots in the Grid research community, it has mainly centered on distributed application scheduling. Hence, computation and communication models have been favored. Yet the generic "Grid" term also encompasses DataGrids, in which the main focus is on storage and large data transfers. The main objective of this task will be to extend SimGrid with the missing models and APIs for storage resources. The work done in this task will be driven by two practical use cases.

  • Distributed Data-Management Infrastructure for LHC Experiment
The distributed data management system designed by CERN for the ATLAS experiment backs up the Large Hadron Collider (LHC). This system handles petabytes of data made available to a large number of scientists, who use them daily. Conducting any experiment on the infrastructure itself almost inevitably results in unacceptable service degradation for users, so resorting to simulation is mandatory. In this use case, storage is seen at a very coarse grain. Each grid site represents a storage element that stores large collections of files, and the problem is then to replicate and move these collections among the sites.
  • Scheduling File Requests on a Hierarchical Storage System
Within a given grid site, storage is arranged in a hierarchical structure, ranging from tapes through mass storage to parallel file systems. To achieve good performance, it is necessary to optimize the file requests made to the hierarchical storage system. The benefit of simulation in this context is twofold: it helps in the design of improved request scheduling algorithms, and it allows different solutions to be evaluated through performance prediction. Such a simulator requires modeling the storage components at each level of the hierarchy at a finer grain.

WP 2: Peer-to-Peer and Volunteer Computing

Loosely coupled large distributed systems such as Peer-to-Peer (P2P) networks or Volunteer Computing (VC) systems are characterized by their heterogeneity, their dynamicity and their low connectivity. Designing efficient solutions under such conditions often requires decentralized algorithms exploiting any locality present in the underlying physical infrastructure. We propose to study such systems through two different case studies:

  • Replica Placement in P2P systems
We consider the problem of data management, with a specific focus on VoD applications, where the dynamic nature of the storage units (P2P nodes), the connectivity artifacts (firewalls and NATs) and the high network latencies require the careful placement of replicas and the possibility for the application to quickly obtain a replica, even in the presence of node failures. This requires the design and validation of simple platform models: we cannot assume that the topology is known in this context, yet locality plays an important role in performance.
  • Exploiting Affinities in Volunteer Computing Systems
BOINC is the most popular VC infrastructure today, with over 580,000 hosts that deliver over 2.3 petaflops as a daily average. BOINC projects usually have hundreds of thousands of independent tasks and are interested in maximizing the overall throughput. Each project has its own server, which is responsible for distributing work units to clients, recovering results and validating them. The BOINC scheduling algorithms are complex and have been used for many years now. Their efficiency and fairness have been assessed in the context of throughput-oriented projects. Yet, the affinity of particular projects with specific clients (e.g., because of the characteristics of their hardware or of their availability patterns) is rarely taken into account from a multi-project exploitation perspective. Techniques inspired by game theory could be used to improve the overall efficiency of such systems.

WP 3: IaaS Clouds

This task aims at adding new interfaces to SimGrid in order to enable the simulation of applications in IaaS (Infrastructure as a Service) clouds. First, SimGrid must allow conducting experiments on cloud resource management from the cloud-provider side. Virtualization has recently raised great challenges: it opens a wide range of possibilities in the management of virtual resources with respect to physical ones. The expected benefit of this work is to enable the comparison of different resource management strategies. Examples of issues we could help study are: the effects of anticipated provisioning of virtual machines (VMs) to face peak demands, allocation algorithms to map VMs to physical hosts, or the placement of VM images with respect to storage capabilities and its impact on VM startup or migration. Simulation will make it possible to evaluate these complex settings in the light of different metrics (e.g., performance, energy consumption, resource usage) that are critical to the performance and profitability of IaaS platforms.

Second, SimGrid must allow the simulation of a completely autonomous cloud platform in order to conduct experiments from the cloud-client side. Cloud resources are increasingly used by applications, but the problem of dimensioning and selecting the right resources or billing model is often solved in an empirical and sub-optimal way. We expect simulation to be a valuable approach for an IaaS client to evaluate and compare different strategies of resource renting, using metrics such as performance and economic cost.

The evaluation of proposals for IaaS solutions through real experiments shares the same issues (scalability, reproducibility, etc.) as studies carried out previously on grids. It is even more difficult in situations where we have neither knowledge of nor control over the physical resources. Further, experiments on commercial clouds involve a billing process that is difficult to cope with from a practical viewpoint. For all these reasons, the simulation of IaaS clouds can be largely beneficial to the scientific community.
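
To make the idea of comparing allocation strategies concrete, here is a minimal sketch in plain Python. It is not SimGrid's interface: the function names, the single-resource host model, and the two strategies (first-fit and first-fit on VMs sorted by decreasing size) are assumptions chosen for illustration only. The metric compared is the number of physical hosts needed.

```python
def first_fit(vm_sizes, host_capacity):
    """Map each VM to the first host that still has room for it.

    Hypothetical model: every host has the same capacity and a VM is
    described by a single integer size. Returns a list where entry i is
    the index of the host chosen for the i-th VM in vm_sizes.
    """
    free = []       # remaining capacity of each host opened so far
    placement = []  # host index chosen for each VM, in input order
    for size in vm_sizes:
        if size > host_capacity:
            raise ValueError("VM does not fit on any host")
        for host, remaining in enumerate(free):
            if size <= remaining:
                free[host] -= size
                placement.append(host)
                break
        else:
            # No opened host can take this VM: open a new one.
            free.append(host_capacity - size)
            placement.append(len(free) - 1)
    return placement


def hosts_used(placement):
    """Number of distinct physical hosts used by a placement."""
    return len(set(placement))


def compare_strategies(vm_sizes, host_capacity):
    """Return the host count for arrival-order and decreasing-size first-fit."""
    in_order = hosts_used(first_fit(vm_sizes, host_capacity))
    decreasing = hosts_used(
        first_fit(sorted(vm_sizes, reverse=True), host_capacity)
    )
    return {"first_fit": in_order, "first_fit_decreasing": decreasing}


if __name__ == "__main__":
    print(compare_strategies([5, 6, 4, 5], host_capacity=10))
```

On this small input, first-fit in arrival order opens three hosts while the decreasing-size variant needs only two. A simulator would extend such a comparison with the other metrics mentioned above, such as energy consumption or VM startup time.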

WP 4: High Performance Computing

The work started in USS SimGrid made it possible to predict the performance of classical benchmarks on small-scale commodity clusters. This work was mainly related to the design and validation of accurate network models and simple performance models for a small set of regular applications. This task will be mainly devoted to pushing this work further to take into account the specific constraints of high performance computing. The objective is to design and implement new models able to simulate the behavior of complex applications running on top of modern high performance platforms. Such systems include multicore nodes with complex memory hierarchies, high performance networks, and potentially I/O subsystems. These tools will be very useful to the high performance community, either to validate applications on very specific platforms or to tackle algorithmic issues (such as task scheduling) that are very sensitive to the characteristics of the underlying platform. Our goal is to better understand the impact of the platform on the performance of complex applications, which will prove valuable both for improving these applications and for capacity planning. Note that the expected outcome of this task is a set of specific models that will help the developers of complex high performance applications to validate them and to understand the key issues in reaching "good" performance.

WP 5: Simulation Kernel

The atypical scale of the targeted applications places severe requirements on the performance of the simulation kernel, while the diversity of the addressed domains requires a high level of modularity. Addressing both challenges jointly leads to difficult software design choices, which will be proposed and assessed in this task.

WP 6: Concepts and Models

As explained in the previous sections, the first subtasks of all domain tasks will be devoted to the development of ad hoc models. Yet, many domains share common characteristics and models from one domain are likely to be reused in another domain and even improved if needed. The goal of Task 6 is thus twofold.

First, it will support the domain tasks with the experience of model development for SimGrid. Modeling is the art of trade-off. A sound model obviously needs to reflect the phenomenon under study. To this end, it is tempting to design extremely detailed and precise models. Yet, such "microscopic" models often prove very slow and ill-suited to large systems. Furthermore, the more complex a model, the harder it is to instantiate, and thus the more likely the study is to be compromised. Sound models need to be developed with the intuition of experts from the domains and the experience of model developers from the simulation community. Furthermore, the validation and the implementation of models require several steps that experts from application domains may not be familiar with.

Second, many domains have common interests in the modeling of specific resources. Task 6 will be responsible for keeping track of the experience and developments made in each domain. When trying to apply a model from one domain to another, several adjustments will probably have to be made. Even though the code may not be reusable as such, knowing about common pitfalls, or for which workload a given model was designed, will prove crucial for reusing experience from other domains.

WP 7: Analysis and Visualization

Any performance study relies on a set of assumptions that are often neither precisely stated nor assessed. For example, when studying the throughput of a system, one assumes it has entered steady state. When evaluating a data-driven scheduling algorithm, one assumes the corresponding input workload will put heavy load on the network. When evaluating the fairness of a resource-sharing algorithm, one expects the workload to be such that users truly interfere with each other. When using an input trace from a real system, one may implicitly expect it to be truly random, whereas parasitic, purely deterministic behavior may pollute it and thus compromise other higher-level hypothesis testing mechanisms. It is also generally assumed that the system (or the simulator) behaves as expected and complies with its specifications.

Most of these hypotheses are difficult to state, so performing a visual analysis to check them and to guide intuition is of the utmost importance. If such an analysis shows that the system behavior goes against common sense, then this discrepancy has to be investigated, either to check for modeling flaws or to confirm that an unexpected phenomenon has been discovered.

WP 8: Support to Experimental Methodology

Since simulation-based studies are much cheaper than real experiments, most studies rely on campaigns comprising thousands of simulations. Usually, the rationale behind a large number of experiments is to ensure some statistical confidence in the estimation of the quality of a given approach. Yet, performing such a campaign raises several issues, all of which need to be addressed for the study to be sound.
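
As an illustration of the statistical confidence mentioned above, the following sketch computes a normal-approximation 95% confidence interval on the mean of a metric measured over a campaign. Everything in it is an assumption made for the example: `run_simulation` is a hypothetical stand-in that draws a synthetic makespan instead of invoking a real simulator, and the normal approximation (z = 1.96) is only reasonable for campaigns of at least a few dozen independent runs.

```python
import math
import random
import statistics


def run_simulation(seed):
    """Hypothetical stand-in for one simulation run.

    Returns a synthetic makespan drawn around 100 with standard deviation 10;
    a real campaign would launch the simulator with this seed instead.
    """
    rng = random.Random(seed)
    return 100.0 + rng.gauss(0.0, 10.0)


def run_campaign(num_runs):
    """Run num_runs independent simulations, one per seed."""
    return [run_simulation(seed) for seed in range(num_runs)]


def confidence_interval(samples, z=1.96):
    """Return (mean, half_width) of an approximate confidence interval.

    Uses the normal approximation: half_width = z * s / sqrt(n), where s is
    the sample standard deviation and n the number of samples.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, z * std_err


if __name__ == "__main__":
    for num_runs in (10, 100, 1000):
        mean, half_width = confidence_interval(run_campaign(num_runs))
        print(f"{num_runs:5d} runs: mean = {mean:.2f} +/- {half_width:.2f}")
```

The half-width shrinks roughly as the inverse square root of the number of runs, which is one reason campaigns grow to thousands of simulations. Note that the interval says nothing about whether the runs are truly independent or whether the model itself is valid.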
