In the previous post I described a typical workflow for data-intensive science. Sizeable IT infrastructure is required to handle the data rates and data volumes along that workflow, from data acquisition and fast feedback via iterative analysis to the archive. The required infrastructure can be provided by high-performance computing (HPC), cloud computing, or a hybrid approach. In this blog series I focus on HPC-based data-intensive science, because I have more experience with HPC than with cloud computing. So, how does HPC support data-intensive science?
High-performance computing (HPC) describes an IT architecture and an operating model for providing IT resources to a broad range of scientific applications. In the past HPC systems were mostly used for compute-intensive simulations, but nowadays they are morphing into superfacilities that can handle both compute-intensive simulations and data-intensive analysis of acquired data. Theory and experiment go hand in hand: simulation results influence the planning and preparation of experiments, and the analysis results of acquired data influence the theory behind the models used for simulation.
HPC systems include sizeable compute, storage and network resources. They are capable of storing the incoming acquired data on central storage and connecting it to a compute cluster for iterative analysis, as illustrated by the sketch below. To succeed in data-intensive science, scientists need to plan for the data workflow, for instance by integrating HPC resources into the experiment. By now there are many proven examples that demonstrate how HPC systems improve the outcomes of data-intensive science.
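To make the idea of iterative analysis on central storage a bit more concrete, here is a minimal sketch of an analysis loop that picks up newly acquired files as they land on the HPC system's shared storage. The directory paths, the `.h5` file pattern, the polling approach and the `analyze()` routine are hypothetical placeholders for illustration, not part of any specific project; a real setup would use the site's scheduler and parallel filesystem conventions.

```python
# Minimal sketch: poll central storage for newly acquired data files and
# run an analysis step on each one. All paths and the analyze() routine
# are assumptions made for illustration only.
import time
from pathlib import Path

DATA_DIR = Path("/central-storage/experiment/raw")   # assumed ingest location
DONE_DIR = Path("/central-storage/experiment/done")  # assumed processed-file location

def analyze(path: Path) -> None:
    # Placeholder for the domain-specific analysis of one acquired file.
    print(f"analyzing {path.name} ({path.stat().st_size} bytes)")

def main() -> None:
    DONE_DIR.mkdir(parents=True, exist_ok=True)
    while True:
        for raw in sorted(DATA_DIR.glob("*.h5")):
            analyze(raw)
            # Move processed files out of the ingest area so they are not re-analyzed.
            raw.rename(DONE_DIR / raw.name)
        time.sleep(10)  # wait before checking for newly acquired data

if __name__ == "__main__":
    main()
```

In practice such a loop would run as a job on the compute cluster, and the fast feedback comes from the analysis results being available while the experiment is still running.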
In 2014 and 2015 I worked with DESY to integrate HPC resources into their data-intensive science workflows. A paper and a video are available that describe the requirements and the value of integrating HPC resources into the experiments from the scientists' point of view: DESY published a paper about this project, and my employer, IBM, created a case study with a nice video.
Scientists and HPC professionals must work together to effectively integrate HPC resources into data-intensive scientific experiments. Scientists need to understand HPC concepts, techniques and best practices in order to see how HPC can improve the analysis of acquired data. And HPC professionals need to understand that the analysis of acquired data puts a different workload on an HPC system than traditional compute-intensive simulation.
So, in the first four posts I motivated the need for HPC in data-intensive science. In the next set of posts I will introduce basic HPC concepts, and later I will discuss various aspects of how to improve data-intensive science using HPC.