In the previous post I claimed that data-intensive science forces organizations to adopt high-performance computing (HPC). I heard the phrase “data-intensive science” for the first time in 2012, when I worked with a research institution on large-scale file services for scientists. Since then I have heard it from multiple clients, even though in 2019 “data-intensive science” still has fewer than 100,000 Google hits and Wikipedia redirects the term to “data-intensive computing”. So, what is data-intensive science?
I use the term data-intensive science to describe research and engineering efforts in which storing, managing and analyzing the acquired data requires special consideration to enable the scientific work. Standard IT equipment like laptops and single workstations is not enough to handle the amount of acquired data. The whole research or engineering effort depends on the availability of sizeable storage, compute and network resources, and it will suffer or fail without IT infrastructure that provides the required capacity, performance and stability.
Some scientific fields have been data-intensive for decades. Examples include high-energy physics, astronomy and oil and gas exploration. The Worldwide LHC Computing Grid (WLCG) provides globally distributed IT resources to store, distribute, manage and analyze the data acquired at the Large Hadron Collider (LHC) at CERN. Like supercomputers, these IT infrastructures are architected, operated and used by teams with long and deep expertise in data-intensive science.
Advancements in cameras and sensors, together with new devices such as genome sequencers, super microscopes and mobile devices, increase the amount and the variety of data sources in science. An extreme example is the Large Pixel Detector (LPD) at the European XFEL, which generates 4.5 million high-resolution images per second. Teams with long and deep experience in data-intensive science have the skills to build and use IT infrastructure that can handle the continuously increasing data rates and volumes.
The sprawl of data-intensive instruments turns whole scientific fields into data-intensive sciences. The Human Genome Project (HGP) took 13 years, 3 billion dollars in funding and a global collaboration to complete the sequencing of the first human genome in 2003. Since 2015 it has cost about 1,000 dollars to sequence a human genome, and it takes only a couple of days including the sequencing process and all subsequent calculations. Nowadays many scientific groups in academia and industry have access to genome sequencers and need to store and analyze the acquired data.
Genome sequencers produce a data set of a few hundred gigabytes for each sequenced human genome. Scientific groups need to analyze and store thousands of such data sets to advance science, to develop new drugs or to apply personalized treatment. With the adoption and commoditization of genome sequencers, biologists, pharmacists and doctors require sizeable IT infrastructure just to do their job. In contrast to physicists, these occupational groups have only limited experience and skills in handling huge amounts of data.
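A rough back-of-envelope sketch makes the scale concrete. The per-genome size and the number of genomes below are illustrative assumptions only, chosen to match the “few hundred gigabytes” and “thousands of data sets” mentioned above:

```python
# Back-of-envelope estimate of genome storage needs.
# Both numbers are illustrative assumptions, not measured values.
GB_PER_GENOME = 300          # "a few hundred gigabytes" per sequenced genome
NUM_GENOMES = 5_000          # "thousands of such data sets"

total_gb = GB_PER_GENOME * NUM_GENOMES
total_pb = total_gb / 1_000_000  # decimal units: 1 PB = 1,000,000 GB

print(f"Raw data: {total_gb:,} GB ≈ {total_pb:.1f} PB")
# -> Raw data: 1,500,000 GB ≈ 1.5 PB
```

Even with these modest assumptions, the raw data alone lands in the petabyte range, before backups, intermediate results and derived data are counted.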
Right now, a similar change is happening in the automotive industry. Autonomous vehicles are equipped with cameras, radar, lidar and other sensors that generate huge amounts of data. Car manufacturers and suppliers capture this data in test vehicles, and it needs to be transferred to central data centers. Each test vehicle generates several tens of terabytes per day. The data centers for autonomous driving development need to be equipped with IT resources that can process the data generated by a fleet of tens or hundreds of test vehicles. Car manufacturers know how to design and build cars at scale. Now they must learn to acquire, store, analyze and manage huge amounts of data at scale as well.
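A similarly rough sketch shows what this means for the receiving data centers. The per-vehicle volume and fleet size below are illustrative assumptions, in line with the “several tens of terabytes per day” and “tens or hundreds of test vehicles” above:

```python
# Back-of-envelope estimate of daily data ingest for a test fleet.
# Per-vehicle volume and fleet size are illustrative assumptions.
TB_PER_VEHICLE_PER_DAY = 20   # "several tens of terabytes per day"
FLEET_SIZE = 100              # "tens or hundreds of test vehicles"

daily_tb = TB_PER_VEHICLE_PER_DAY * FLEET_SIZE
daily_pb = daily_tb / 1_000   # decimal units: 1 PB = 1,000 TB

# Sustained bandwidth needed to move one day's data within 24 hours
sustained_gbit_per_s = daily_tb * 8_000 / (24 * 3600)  # 1 TB = 8,000 Gbit

print(f"Daily ingest: {daily_tb:,} TB ≈ {daily_pb:.0f} PB")
print(f"Sustained transfer rate: {sustained_gbit_per_s:.0f} Gbit/s")
# -> Daily ingest: 2,000 TB ≈ 2 PB
# -> Sustained transfer rate: 185 Gbit/s
```

Under these assumptions a fleet produces petabytes per day, and just moving that data into the data center requires sustained bandwidth in the hundreds of gigabits per second, before any processing starts.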
In the next post I will describe a typical data flow in data-intensive science and show how it relates to high-performance computing (HPC).