DataStates: Scalable Data Management for AI/HPC

New approach reduces the complexity of data management, taking advantage of heterogeneous storage at large scale to improve performance.

The ability of data to be FAIR (findable, accessible, interoperable, and reusable) has been emphasized as one of the grand challenges of scientific data management. The DataStates project is exploring a new data model centered around the notion of data states, which are intermediate representations of datasets that are automatically recorded into a lineage when tagged by applications with hints, constraints, and persistency semantics (e.g., expiration date).
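To make the data model concrete, below is a minimal, self-contained Python sketch of the idea. It is not the actual DataStates API; all names (DataState, Lineage, capture, evict_expired) are hypothetical and chosen only to illustrate how a data state might be recorded into a lineage together with application-supplied hints and an expiration date.

    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from typing import Any, Optional

    # A minimal illustration of the data-state concept; this is NOT the
    # actual DataStates API -- all names here are hypothetical.
    @dataclass
    class DataState:
        data: Any                      # intermediate representation of a dataset
        parent: Optional["DataState"]  # lineage link to the state it evolved from
        hints: dict                    # application-supplied hints and constraints
        expires: Optional[datetime]    # persistency semantics, e.g. expiration date

    class Lineage:
        def __init__(self):
            self.states = []

        def capture(self, data, parent=None, expires_in=None, **hints):
            """Record a new data state; a real runtime could use the hints
            to place the state on a suitable tier of heterogeneous storage."""
            expires = datetime.now() + expires_in if expires_in else None
            state = DataState(data, parent, hints, expires)
            self.states.append(state)
            return state

        def evict_expired(self):
            """Drop states whose expiration date has passed."""
            now = datetime.now()
            self.states = [s for s in self.states
                           if s.expires is None or s.expires > now]

    # Usage: tag intermediate datasets with a hint and an expiration date.
    lineage = Lineage()
    raw = lineage.capture([1, 2, 3], access_pattern="write_once_read_many")
    derived = lineage.capture([x * 2 for x in raw.data], parent=raw,
                              expires_in=timedelta(days=30))

Because each state carries its parent link, the lineage itself records how datasets evolved, which is what relieves the application of manual data-management bookkeeping.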

Such an approach effectively reduces the complexity of data management, while enabling an optimized runtime to take advantage of heterogeneous storage to achieve high performance and scalability of I/O in a portable fashion on a variety of leadership-class supercomputing platforms.

DataStates is particularly useful for developing systematic AI approaches that manipulate learning models, which can be viewed as data states that evolve continuously as more training samples are presented to them. In this setting, many alternatives can be explored by capturing a snapshot of a model during training and forking into alternative directions (e.g., by changing the structure of the model or the training data presented to it), with the history of each fork recorded. This capability forms the core of many techniques for discovering the optimal architecture and hyperparameters of a learning model and for explaining its learned patterns, potential for generalization, and robustness through sensitivity analysis.
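The sketch below illustrates this snapshot-and-fork pattern in plain Python. It is an assumption-laden toy, not DataStates itself: the one-weight model, the capture/train helpers, and the lineage list are all invented for illustration. A snapshot taken mid-training becomes the parent of two forks, one continuing with a different learning rate and one with different training data, so the ancestry of every variant is preserved.

    import copy
    import random

    # Illustrative only: a toy one-weight model trained by gradient descent,
    # with mid-training snapshots forked into variants whose ancestry is
    # recorded in a lineage. None of these names come from DataStates.
    history = []  # lineage entries: (snapshot_id, parent_id, tag, weights)

    def capture(weights, parent=None, tag=""):
        snap_id = len(history)
        history.append((snap_id, parent, tag, copy.deepcopy(weights)))
        return snap_id

    def train(weights, lr, steps, data):
        for _ in range(steps):
            x, y = random.choice(data)
            pred = weights["w"] * x
            weights["w"] -= lr * 2 * (pred - y) * x  # gradient of squared error
        return weights

    data = [(x, 3.0 * x) for x in range(1, 6)]  # target: w = 3
    model = {"w": 0.0}
    train(model, lr=0.01, steps=50, data=data)
    base = capture(model, tag="after 50 steps")

    # Fork A: continue from the snapshot with a smaller learning rate.
    fork_a = train(copy.deepcopy(model), lr=0.001, steps=50, data=data)
    capture(fork_a, parent=base, tag="fork: lr=0.001")

    # Fork B: continue from the snapshot with different training data.
    fork_b = train(copy.deepcopy(model), lr=0.01, steps=50,
                   data=[(x, 3.0 * x) for x in range(5, 11)])
    capture(fork_b, parent=base, tag="fork: new data")

    for snap_id, parent, tag, w in history:
        print(snap_id, "parent:", parent, tag, "w =", round(w["w"], 3))

Walking the parent links of such a lineage is exactly what architecture and hyperparameter searches need: every candidate can be traced back to the training state it branched from.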