In the previous post I characterized data-intensive science as research and engineering efforts where the storage, management and analysis of acquired data require special considerations to enable the overall scientific effort. So, what does a typical workflow in data-intensive science look like?
The figure below depicts a typical workflow in data-intensive science. The data flows from left to right, and the processing of the data can be divided into three phases:
- Data acquisition and fast feedback during experiment execution,
- Analysis of the acquired data,
- Publication and preservation of results.
The experiment setup includes one or more instruments that produce measured data. The data rates, and how bursty they are, vary from instrument to instrument. The process of reading the data from the instruments and feeding it into an IT system is referred to as data acquisition. In data-intensive science the IT infrastructure must be able to cope with high data rates, which add up to large data volumes.
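As a rough illustration, a minimal acquisition loop might look like the sketch below. The `read_frame` call, the ingest path and the chunk size are hypothetical placeholders for whatever interface and layout a real instrument and setup would use:

```python
import time
from pathlib import Path

INGEST_BUFFER = Path("/ingest/run_0421")   # hypothetical ingest location
CHUNK_FRAMES = 1000                        # frames per chunk file, an assumption

def acquire(instrument, run_seconds=60):
    """Read frames from the instrument and append them to chunk files."""
    INGEST_BUFFER.mkdir(parents=True, exist_ok=True)
    chunk, chunk_id = [], 0
    deadline = time.monotonic() + run_seconds
    while time.monotonic() < deadline:
        frame = instrument.read_frame()    # blocking read, placeholder API
        chunk.append(frame)
        if len(chunk) >= CHUNK_FRAMES:
            # write a closed chunk so downstream tools can pick it up
            (INGEST_BUFFER / f"chunk_{chunk_id:06d}.bin").write_bytes(b"".join(chunk))
            chunk, chunk_id = [], chunk_id + 1
```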
Scientific experiments run in real time. In many cases the setup and execution of a scientific experiment is expensive, and some experiments cannot be repeated at all. Therefore, the IT infrastructure for data-intensive science must be able to cope with peak workloads. Whatever else is happening in the IT infrastructure, data acquisition should have the highest priority, and other activities should never cause data acquisition to degrade or fail.
Scientists need immediate access to the acquired data for fast feedback. A typical requirement is to visualize a subset of the acquired data to ensure that the experiment is running as expected and that useful data is being acquired. Fast feedback runs in parallel with data acquisition, in real time or near real time, and the IT infrastructure must be able to handle both concurrently.
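Such a fast-feedback loop can run as a separate process that only samples the newest data. A minimal sketch, reusing the hypothetical chunk layout from the acquisition sketch above and printing a quick quality metric in place of a real visualization:

```python
import time
import numpy as np
from pathlib import Path

INGEST_BUFFER = Path("/ingest/run_0421")   # same hypothetical location as above

def fast_feedback(poll_seconds=5):
    """Periodically sample the newest chunk and report a quick quality metric."""
    seen = None
    while True:
        chunks = sorted(INGEST_BUFFER.glob("chunk_*.bin"))
        if chunks and chunks[-1] != seen:
            seen = chunks[-1]
            # assume 16-bit detector values; a real setup knows its data format
            data = np.frombuffer(seen.read_bytes(), dtype=np.uint16)
            sample = data[::100]           # only look at every 100th value
            print(f"{seen.name}: mean={sample.mean():.1f} max={sample.max()}")
        time.sleep(poll_seconds)
```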
The setups of scientific experiments are located outside the data center, because nobody runs a genome sequencer or a test vehicle for autonomous driving inside the data center. That means the acquired data needs to be transferred from the instrument where it is generated to the data center where it is analyzed and archived. Some experiment setups envision real-time streaming of the acquired data from the experiment hall to the data center or cloud, though most setups that I have seen have an ingest buffer close to the instruments.
To gain insight from the acquired data, it needs to be transferred to a data center or public cloud, either by copying it from the ingest buffer or by streaming it directly from the instruments. The IT infrastructure in the data center includes central storage for the huge amounts of acquired data and sizeable compute resources to analyze it.
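In the common case of an ingest buffer, the transfer often boils down to repeatedly synchronizing closed files to central storage. A minimal sketch using rsync, where the paths and host name are placeholders and a real deployment would add checksumming, retries and monitoring:

```python
import subprocess

def transfer_to_datacenter():
    """Copy chunk files from the ingest buffer to central storage."""
    # -a preserves metadata, --partial allows resuming interrupted transfers
    subprocess.run(
        ["rsync", "-a", "--partial",
         "/ingest/run_0421/",                        # hypothetical ingest buffer
         "dtn.example.org:/central/raw/run_0421/"],  # hypothetical central storage
        check=True,
    )
```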
The analysis of acquired data is an iterative process including data exploration, visualization, modelling and large-scale batch processing. Some analysis results must be available within a few minutes or hours to enable scientists to plan the next steps for currently running experiments. But most of the analysis will be done after the experiment is completed. For instance, it takes hours to days to compute all steps of the genomics pipeline that processes the raw data produced by a genome sequencer. For other efforts it can take months or years to publish a paper, develop a new drug or develop a model that drives a car autonomously.
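The batch part of such an analysis is typically a chain of steps in which each step consumes the output of the previous one. The sketch below shows the idea with purely hypothetical commands and file names; a real genomics pipeline would run actual aligners and variant callers, usually under a workflow manager and a batch scheduler:

```python
import subprocess
from pathlib import Path

# Hypothetical pipeline: each step is (output file, command producing it).
STEPS = [
    (Path("aligned.bam"),  ["align_reads",   "raw_reads.fastq"]),
    (Path("dedup.bam"),    ["mark_dups",     "aligned.bam"]),
    (Path("variants.vcf"), ["call_variants", "dedup.bam"]),
]

def run_pipeline():
    """Run each step in order, skipping steps whose output already exists."""
    for output, command in STEPS:
        if output.exists():
            continue                      # allows cheap re-runs while iterating
        subprocess.run(command, check=True)
```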
Once the analysis is completed, the acquired data and derived results need to be archived. It is a scientific best practice to keep all data that supports a scientific publication for ten years. Legal requirements may mandate keeping data much longer. Archived data is typically no longer accessed.
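Operationally, archiving often means moving a dataset to a cheaper storage tier and recording when it may be deleted. A small sketch of that bookkeeping, using the ten-year default from above and purely hypothetical paths:

```python
import json
import shutil
from datetime import date, timedelta
from pathlib import Path

RETENTION_YEARS = 10                      # best-practice default mentioned above

def archive_dataset(dataset: Path, archive_root: Path):
    """Move a finished dataset to the archive tier and record its retention date."""
    target = archive_root / dataset.name
    shutil.move(str(dataset), str(target))
    manifest = {
        "archived_on": date.today().isoformat(),
        # rough retention date; ignores leap days for simplicity
        "keep_until": (date.today() + timedelta(days=365 * RETENTION_YEARS)).isoformat(),
    }
    (target / "retention.json").write_text(json.dumps(manifest, indent=2))
```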
So, IT systems for data-intensive science need to support the end-to-end workflow from real-time data acquisition and fast feedback, through iterative analysis, to the long-term archive. In the next post I will discuss how high-performance computing (HPC) can support data-intensive science.