Researchers from national laboratories and universities will demonstrate new tools and technologies for accelerating data transfer, improving application performance, and increasing energy efficiency in a series of demos scheduled throughout SC24.
MONDAY, NOV. 18
Station 1
7:00 P.M. EST
Speaker:
Eli Dart
IRI Fusion Pathfinder Multi-Facility Demo
Associated Organizations: Lawrence Berkeley Laboratory, Argonne National Laboratory
» Abstract
The DOE’s Integrated Research Infrastructure aims to empower researchers to meld DOE’s world-class research tools, infrastructure, and user facilities seamlessly and securely in novel ways to radically accelerate discovery and innovation. This demo will showcase the Fusion Pathfinder Project, which uses multiple HPC facilities, ALCF and NERSC, to analyze data from a fusion experiment at the DIII-D National Fusion Facility.
8:00 P.M. EST
Speaker:
Lois Curfman McInnes
PESO: Partnering for Scientific Software Ecosystem Stewardship Opportunities
Associated Organizations: Argonne National Laboratory, Brookhaven National Laboratory, Los Alamos National Laboratory, Lawrence Berkeley Laboratory, Lawrence Livermore National Laboratory, Oak Ridge National Laboratory, Pacific Northwest National Laboratory, Sandia National Laboratories
» Abstract
This presentation will introduce the newly established PESO project (https://pesoproject.org), which supports software-ecosystem stewardship and advancement, collaborating through the Consortium for the Advancement of Scientific Software (CASS). PESO’s vision is that investments by the U.S. Department of Energy (DOE) in software have maximum impact through a sustainable scientific software ecosystem consisting of high-quality libraries and tools that deliver the latest high-performance algorithms and capabilities to serve application needs at DOE and beyond. Key PESO goals are (1) enabling applications to leverage robust, curated scientific libraries and tools, especially in pursuit of improvement in high-end capabilities and energy efficiency by leveraging accelerator (GPU) devices, and (2) emphasizing software product quality, the continued fostering of software product communities, and the delivery of products, while advancing workforce inclusivity and sustainable career paths. PESO delivers and supports software products via Spack and E4S, and PESO provides porting and testing platforms leveraged across product teams to ensure code stability and portability. PESO also facilitates the delivery of other products, such as AI/ML libraries, as needed by the HPC community. PESO collaborates with CASS to transform independently developed products into a portfolio whose total is much more than the sum of its parts—establishing a trusted software ecosystem essential to DOE’s mission.
Station 2
7:00 P.M. EST
Speaker:
Sutanay Choudhury
AI-guided Hypothesis Generation and Design of Catalysts with Complex Morphologies and Reaction Networks
Associated Organizations: Pacific Northwest National Laboratory
» Abstract
We present an AI-driven framework for catalyst discovery, combining linguistic reasoning with quantum chemistry feedback. Our approach uses large language models (LLMs) to generate hypotheses and graph neural networks (GNNs) to evaluate 3D atomistic structures. The iterative process incorporates structural evaluation, reaction pathways, and stability assessments. Automated planning methods guide the exploration, rivaling expert-driven approaches. This integration of language-guided reasoning and computational chemistry feedback accelerates trustworthy catalyst discovery for sustainable chemical processes.
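For readers who want to see the shape of the loop described above, here is a minimal, hypothetical Python sketch: stand-in functions play the roles of the LLM hypothesis generator and the GNN evaluator, and the scoring is random. It illustrates only the generate-evaluate-refine pattern, not the actual framework.

```python
# Minimal sketch of an iterative hypothesis-generation loop like the one
# described above. The LLM and GNN components are replaced by stand-in
# functions; names and scoring are hypothetical, for illustration only.
import random

def generate_hypotheses(feedback, n=4):
    """Stand-in for LLM-driven hypothesis generation (e.g., candidate alloys)."""
    metals = ["Pt", "Pd", "Cu", "Ni", "Zn", "Ru"]
    return [tuple(random.sample(metals, 2)) for _ in range(n)]

def score_candidate(candidate):
    """Stand-in for GNN evaluation of a 3D atomistic structure
    (e.g., a predicted adsorption energy); here just a random number."""
    return random.uniform(-2.0, 0.0)

best, feedback = None, "start with bimetallic candidates"
for step in range(5):                      # iterative refinement loop
    candidates = generate_hypotheses(feedback)
    scored = sorted((score_candidate(c), c) for c in candidates)
    energy, top = scored[0]                # keep the most stable candidate
    if best is None or energy < best[0]:
        best = (energy, top)
    feedback = f"best so far {top} at {energy:.2f} eV; propose variations"

print("selected candidate:", best)
```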
8:00 P.M. EST
Speaker:
Paul Lin
Accurate in-situ in-transit analysis of particle diffusion for large-scale tokamak simulations
Associated Organizations: Lawrence Berkeley Laboratory, Oak Ridge National Laboratory, Princeton Plasma Physics Laboratory
» Abstract
In turbulent and stochastic magnetically confined fusion plasmas, transport analysis based on grid quantities often misses important physics, such as subgrid phenomena and particle auto-correlation. Traditional in-line analysis significantly increases the memory footprint and slows down the main simulation. An accurate in-situ in-transit workflow is presented here to demonstrate a streaming analysis method focused on the large-scale data generated from plasma particles in magnetic confinement fusion simulations, specifically using XGC. This workflow streams the computational data from simulation nodes to dedicated analysis nodes using asynchronous data movement. It thus enables parallelization of the data analysis and the main simulation without affecting the memory footprint or speed of the main simulation, offering insights into fusion plasma behavior with unparalleled detail and efficiency in near real time.
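The decoupling idea at the heart of this workflow can be illustrated with a small, generic Python sketch: a “simulation” process pushes particle batches into a bounded asynchronous queue and keeps computing while a separate “analysis” process consumes them. The data, the diagnostic, and the queue are stand-ins, not the actual XGC or analysis-node tooling.

```python
# Generic sketch of the in-transit pattern described above: the simulation
# hands particle batches to an asynchronous channel and keeps computing,
# while a separate analysis process consumes them. Illustration only.
import multiprocessing as mp
import numpy as np

def simulation(queue, steps=5, particles=100_000):
    for step in range(steps):
        batch = np.random.randn(particles, 3)   # stand-in for particle data
        queue.put((step, batch))                # asynchronous hand-off
        # ... simulation continues without waiting for analysis ...
    queue.put(None)                             # end-of-stream marker

def analysis(queue):
    while (item := queue.get()) is not None:
        step, batch = item
        # stand-in diffusion diagnostic: mean squared displacement per step
        msd = float(np.mean(np.sum(batch**2, axis=1)))
        print(f"step {step}: MSD estimate {msd:.3f}")

if __name__ == "__main__":
    q = mp.Queue(maxsize=4)                     # bounded buffer between processes
    procs = [mp.Process(target=simulation, args=(q,)),
             mp.Process(target=analysis, args=(q,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```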
TUESDAY, NOV. 19
Station 1
10:00 A.M. EST
Speaker:
Lois Curfman McInnes
PESO: Partnering for Scientific Software Ecosystem Stewardship Opportunities
Associated Organizations: Argonne National Laboratory, Brookhaven National Laboratory, Los Alamos National Laboratory, Lawrence Berkeley Laboratory, Lawrence Livermore National Laboratory, Oak Ridge National Laboratory, Pacific Northwest National Laboratory, Sandia National Laboratories
» Abstract
This presentation will introduce the newly established PESO project (https://pesoproject.org), which supports software-ecosystem stewardship and advancement, collaborating through the Consortium for the Advancement of Scientific Software (CASS). PESO’s vision is that investments by the U.S. Department of Energy (DOE) in software have maximum impact through a sustainable scientific software ecosystem consisting of high-quality libraries and tools that deliver the latest high-performance algorithms and capabilities to serve application needs at DOE and beyond. Key PESO goals are (1) enabling applications to leverage robust, curated scientific libraries and tools, especially in pursuit of improvement in high-end capabilities and energy efficiency by leveraging accelerator (GPU) devices, and (2) emphasizing software product quality, the continued fostering of software product communities, and the delivery of products, while advancing workforce inclusivity and sustainable career paths. PESO delivers and supports software products via Spack and E4S, and PESO provides porting and testing platforms leveraged across product teams to ensure code stability and portability. PESO also facilitates the delivery of other products, such as AI/ML libraries, as needed by the HPC community. PESO collaborates with CASS to transform independently developed products into a portfolio whose total is much more than the sum of its parts—establishing a trusted software ecosystem essential to DOE’s mission.
11:00 A.M. EST
Speaker:
Vardan Gyurjyan
(Amitoj Singh)
Practical Hardware Accelerated Real-Time Multi-facility Streaming Workflow
Associated Organizations: Jefferson Laboratory
» Abstract
The CLAS12 experiment at Jefferson Lab produces immense volumes of raw data that demand immediate processing to enable prompt physics analysis. We have developed an innovative approach that seamlessly streams this data in real time across the ESnet6 network backbone to multiple high-performance computing (HPC) facilities, including the Perlmutter supercomputer at NERSC, the Defiant cluster at Oak Ridge National Laboratory, and additional computational resources in the NSF FABRIC testbed. These distributed resources collaboratively process the data stream and return results to Jefferson Lab in real time for validation, persistence, and final analysis—all accomplished without buffering, temporary storage, data loss, or latency issues.
This achievement is underpinned by three cutting-edge technologies developed by ESnet and Jefferson Lab. (1) EJFAT (ESnet JLab FPGA Accelerated Transport) is a high-speed data transport and load-balancing mechanism that utilizes FPGA acceleration to optimize real-time data transmission over ESnet6. (2) JIRIAF (JLab Integrated Research Infrastructure Across Facilities) provides a framework that streamlines resource management and optimizes HPC workloads across heterogeneous environments by leveraging Kubernetes and Virtual Kubelet to manage resources within user space dynamically. (3) ERSAP (Environment for Real-Time Streaming, Acquisition and Processing) is a reactive, actor-model, flow-based programming framework that decomposes data processing applications into small, monofunctional actors. This decomposition allows for independent scaling and optimization of each actor and context-aware data processing, facilitating operation in heterogeneous environments and utilizing diverse accelerators.
Our demonstration confirms that real-time remote data stream processing over high-speed networks without intermediate storage is feasible and highly efficient. This approach represents a significant advancement in data analysis workflows for large-scale physics experiments, offering a scalable and resilient solution for real-time scientific computing.
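The actor-style decomposition that ERSAP applies can be pictured with a toy Python pipeline: each stage is a small, single-purpose worker connected to its neighbors by queues, so stages can be scaled or swapped independently. Stage names and payloads here are hypothetical and do not reflect the ERSAP API.

```python
# Toy illustration of the flow-based, actor-style decomposition described
# above: each stage is a small, single-purpose worker connected by queues.
import queue, threading

def actor(name, fn, inbox, outbox):
    while (item := inbox.get()) is not None:
        result = fn(item)
        if outbox is not None:
            outbox.put(result)
    if outbox is not None:
        outbox.put(None)                  # propagate end-of-stream

decode    = lambda raw: raw.strip()                    # stand-in "decoder"
calibrate = lambda evt: f"{evt}:calibrated"            # stand-in "calibration"
persist   = lambda evt: print("stored", evt)           # stand-in "writer"

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
stages = [threading.Thread(target=actor, args=("decode", decode, q1, q2)),
          threading.Thread(target=actor, args=("calib", calibrate, q2, q3)),
          threading.Thread(target=actor, args=("persist", persist, q3, None))]
for t in stages:
    t.start()

for raw in ["evt-001 \n", "evt-002 \n"]:   # stand-in for the incoming stream
    q1.put(raw)
q1.put(None)                               # close the stream
for t in stages:
    t.join()
```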
12:00 P.M. EST
Speaker:
Sameer Shende
E4S: Extreme-scale Scientific Software Stack
Associated Organizations: Argonne National Laboratory, University of Oregon
» Abstract
The Extreme-scale Scientific Software Stack (E4S) [https://e4s.io] is a curated, Spack-based software distribution of 100+ HPC, EDA, and AI/ML packages. The Spack package manager is a core component of E4S, which serves as a platform for integrating and deploying performance evaluation tools such as TAU, HPCToolkit, DyninstAPI, and PAPI, and supports both bare-metal and containerized deployment on CPU and GPU platforms. E4S provides a Spack binary cache and a set of base and full-featured container images with support for GPUs from NVIDIA, AMD, and Intel. E4S is a community effort to provide open-source software packages for developing, deploying, and running scientific applications and tools on HPC platforms. It has built a comprehensive, coherent software stack that enables application developers to productively develop highly parallel applications that effectively target diverse exascale architectures. E4S supports commercial cloud platforms, including AWS, Azure, and GCP, with a Remote Desktop that simplifies performance tool integration and deployment. It also includes a container launch tool (e4s-cl) that allows binary distribution of applications by substituting the MPI in the containerized application with the system MPI, and e4s-alc, a tool to customize the base container images provided by E4S by adding packages using Spack and OS package managers. These containers are available for download from the E4S website and DockerHub. This talk will describe performance tool and runtime system integration issues with GPU runtimes in E4S and the instrumentation APIs exposed for performance evaluation tools. To meet the needs of computational scientists evaluating the performance of their parallel scientific applications, we will focus on the use of E4S and the TAU Performance System(R) [http://tau.uoregon.edu] for performance data collection, analysis, and optimization. The talk will cover the runtime systems that TAU supports to transparently insert instrumentation hooks in the application to observe CPU and GPU performance, and it will provide updates on the latest features of TAU and E4S. Both TAU and E4S are released under liberal open-source licenses and are supported by the U.S. Department of Energy’s Exascale Computing Project [https://www.exascaleproject.org] and the PESO Project [https://pesoproject.org].
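As a rough illustration of the tool usage described above, the hedged sketch below drives the Spack and e4s-cl command-line tools from Python. The package name and the exact e4s-cl invocation are assumptions that may differ across E4S releases; consult the E4S documentation for authoritative usage.

```python
# Hedged sketch only: example invocations of the Spack and e4s-cl tools named
# above, driven from Python via subprocess. Exact package names and flags are
# assumptions that may vary by E4S release.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Install a performance tool from the Spack/E4S ecosystem (package name assumed).
run(["spack", "install", "tau"])

# Launch a containerized MPI application, letting e4s-cl substitute the
# container's MPI with the host MPI (invocation pattern assumed).
run(["e4s-cl", "launch", "mpirun", "-np", "4", "./my_app"])
```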
1:00 P.M. EST
Speaker:
David Rogers
IRI Early Technologies and Applications Demos
Associated Organizations: Argonne National Laboratory, Lawrence Berkeley Laboratory, Oak Ridge National Laboratory, SLAC National Accelerator Laboratory, ESnet
» Abstract
The DOE’s Integrated Research Infrastructure aims to empower researchers to meld DOE’s world-class research tools, infrastructure, and user facilities seamlessly and securely in novel ways to radically accelerate discovery and innovation. The tools and technology emerging from this effort are already beginning to open new avenues for how we use experimental and computational scientific instruments together. This session will show video demos of technologies and applications that define the current state of practice for time-sensitive, data-integration-intensive, and long-term campaign patterns. As entry points into this new design space, they offer a glimpse of what could be achieved if we coordinate to meet the difficult challenges of deploying IRI.
2:00 P.M. EST
Speaker:
Seongmin Kim
Distributed Quantum Approximate Optimization Algorithm for Large-Scale Optimization
Associated Organizations: Oak Ridge National Laboratory
» Abstract
The Quantum Approximate Optimization Algorithm (QAOA) has demonstrated potential for solving combinatorial optimization problems on near-term quantum computing systems. However, QAOA encounters challenges when handling high-dimensional problems, primarily due to the large number of qubits required and the complexity of deep circuits, which constrain its scalability for practical applications. In this study, we introduce a distributed QAOA (DQAOA) that utilizes a high-performance computing-quantum computing (HPC-QC) integrated system. DQAOA employs distributed computing techniques to break down large tasks into smaller sub-tasks, which are then processed on the HPC-QC system. The global solution is iteratively refined by aggregating sub-solutions obtained from DQAOA, facilitating convergence towards the optimal solution. We demonstrate that DQAOA is capable of handling large-scale optimization problems (e.g., 1,000-bit problems), achieving a high approximation ratio (~99%) with a short time-to-solution (~276 s). To extend the applicability of this algorithm to material science, we have further developed an active learning algorithm integrated with DQAOA (AL-DQAOA), which combines machine learning, DQAOA, and active data production in an iterative process. Using AL-DQAOA, we successfully optimize photonic structures, demonstrating that our approach makes it feasible to solve real-world optimization problems using gate-based quantum computing. We anticipate that DQAOA will be applicable to a wide range of optimization challenges, with AL-DQAOA finding broader applications in material science.
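The decompose-solve-aggregate idea behind DQAOA can be sketched classically. In the hypothetical Python example below, a brute-force solver stands in for the QAOA sub-problem solver that would run on quantum hardware, and a random QUBO stands in for a real optimization instance; the block size and sweep count are illustrative.

```python
# Sketch of the decompose-solve-aggregate idea behind DQAOA on a random QUBO.
# A brute-force solver stands in for the QAOA sub-problem solver.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, block = 24, 6
Q = rng.normal(size=(n, n)); Q = (Q + Q.T) / 2        # random symmetric QUBO

def energy(Q, x):
    return float(x @ Q @ x)

def solve_subproblem(Qsub):
    """Stand-in for a QAOA call: exhaustively minimize the small sub-QUBO."""
    best = min(itertools.product([0, 1], repeat=Qsub.shape[0]),
               key=lambda bits: energy(Qsub, np.array(bits)))
    return np.array(best)

x = rng.integers(0, 2, size=n)                        # initial global solution
for sweep in range(3):                                # iterative refinement
    for start in range(0, n, block):
        idx = np.arange(start, start + block)
        rest = np.setdiff1d(np.arange(n), idx)
        # Condition the sub-QUBO on the rest of the current solution:
        # the coupling to fixed bits becomes a linear (diagonal) term.
        Qsub = Q[np.ix_(idx, idx)].copy()
        linear = 2 * Q[np.ix_(idx, rest)] @ x[rest]
        Qsub[np.diag_indices_from(Qsub)] += linear
        x[idx] = solve_subproblem(Qsub)               # aggregate sub-solution
    print(f"sweep {sweep}: energy {energy(Q, x):.3f}")
```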
3:00 P.M. EST
Speaker:
Sameer Shende
E4S: Extreme-scale Scientific Software Stack
Associated Organizations: Argonne National Laboratory, University of Oregon
» Abstract
The Extreme-scale Scientific Software Stack (E4S) [https://e4s.io] is a curated, Spack-based software distribution of 100+ HPC, EDA, and AI/ML packages. The Spack package manager is a core component of E4S, which serves as a platform for integrating and deploying performance evaluation tools such as TAU, HPCToolkit, DyninstAPI, and PAPI, and supports both bare-metal and containerized deployment on CPU and GPU platforms. E4S provides a Spack binary cache and a set of base and full-featured container images with support for GPUs from NVIDIA, AMD, and Intel. E4S is a community effort to provide open-source software packages for developing, deploying, and running scientific applications and tools on HPC platforms. It has built a comprehensive, coherent software stack that enables application developers to productively develop highly parallel applications that effectively target diverse exascale architectures. E4S supports commercial cloud platforms, including AWS, Azure, and GCP, with a Remote Desktop that simplifies performance tool integration and deployment. It also includes a container launch tool (e4s-cl) that allows binary distribution of applications by substituting the MPI in the containerized application with the system MPI, and e4s-alc, a tool to customize the base container images provided by E4S by adding packages using Spack and OS package managers. These containers are available for download from the E4S website and DockerHub. This talk will describe performance tool and runtime system integration issues with GPU runtimes in E4S and the instrumentation APIs exposed for performance evaluation tools. To meet the needs of computational scientists evaluating the performance of their parallel scientific applications, we will focus on the use of E4S and the TAU Performance System(R) [http://tau.uoregon.edu] for performance data collection, analysis, and optimization. The talk will cover the runtime systems that TAU supports to transparently insert instrumentation hooks in the application to observe CPU and GPU performance, and it will provide updates on the latest features of TAU and E4S. Both TAU and E4S are released under liberal open-source licenses and are supported by the U.S. Department of Energy’s Exascale Computing Project [https://www.exascaleproject.org] and the PESO Project [https://pesoproject.org].
4:00 P.M. EST
Speaker:
Amir Shehata
HPC/QC Integration Framework
Associated Organizations and partnerships: Oak Ridge National Laboratory
» Abstract
In recent years, quantum computing has demonstrated the potential to revolutionize specific algorithms and applications by solving problems exponentially faster than classical computers. However, its widespread adoption for general computing remains a future prospect. This demo discusses the integration of quantum computing within High-Performance Computing (HPC) environments, focusing on a resource management framework designed to streamline the use of quantum simulators and enhance runtime performance and efficiency.
5:00 P.M. EST
Speaker:
Jean Luca Bez
Drishti: I/O Insights for All
Associated Organizations and partnerships: Lawrence Berkeley Laboratory
» Abstract
The complexity of the HPC I/O stack, combined with gaps in state-of-the-art profiling tools, creates a barrier that keeps end users and scientific application developers from solving the I/O performance problems they encounter. Closing this gap requires cross-layer analysis that combines multiple metrics and, when appropriate, drills down to the source code. Drishti is a multi-source interactive analysis framework for I/O that visualizes traces, highlights bottlenecks, and helps users understand the behavior of applications. The framework contains an interactive I/O trace analysis component that lets end users visually inspect their applications’ I/O behavior, focus on areas of interest, and get a clear picture of common root causes of I/O performance bottlenecks. Drishti automatically detects common, well-known I/O performance bottlenecks using heuristics and maps them to solution recommendations that users can implement.
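A toy version of this heuristic approach reads aggregate I/O counters and maps suspicious patterns to recommendations. The counter names and thresholds below are hypothetical and are not Drishti’s actual rules or interfaces.

```python
# Toy heuristic bottleneck detection in the spirit described above.
# Counter names and thresholds are hypothetical, not Drishti's actual rules.
def analyze(counters):
    findings = []
    small_frac = counters["small_writes"] / max(counters["total_writes"], 1)
    if small_frac > 0.5:
        findings.append(("HIGH", "many small writes",
                         "consider collective buffering or larger request sizes"))
    if counters["unaligned_requests"] > 0.2 * counters["total_requests"]:
        findings.append(("MEDIUM", "unaligned accesses",
                         "align requests with the file system stripe size"))
    if counters["metadata_time"] > 0.3 * counters["io_time"]:
        findings.append(("HIGH", "metadata-heavy workload",
                         "reduce file opens/creates or use a shared file layout"))
    return findings

sample = {"small_writes": 9000, "total_writes": 10000,
          "unaligned_requests": 500, "total_requests": 1000,
          "metadata_time": 12.0, "io_time": 30.0}
for severity, issue, recommendation in analyze(sample):
    print(f"[{severity}] {issue}: {recommendation}")
```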
Station 2
10:00 A.M. EST
Speaker:
Sai Munikoti
Next-Generation AI Tools for Environmental Review and Permitting Efficiency
Associated Organizations: Pacific Northwest National Laboratory
» Abstract
The National Environmental Policy Act (NEPA) of 1969 is a bedrock and enduring environmental law in the United States with the express intent of fostering a productive harmony between humans and the environment for present and future generations. The NEPA statute and the implementing regulations of the Council on Environmental Quality establish procedures requiring all federal agencies to consider and communicate to the public the environmental effects of their planning and decisions. Thus, agencies conduct environmental reviews and prepare written documents disclosing the potential environmental impact of proposed actions. Additionally, the scientific community conducts various studies to measure the environmental impact of actions and projects, which in turn support NEPA review. Environmental data serves as a fundamental building block in streamlining NEPA reviews: the rich information contained in historical review and scientific documents could enable us to efficiently retrieve, analyze, and find patterns that can inform future NEPA reviews. Currently, these documents are distributed across several agencies, and their raw (PDF) form keeps them from being searchable or coupled with AI applications. We present two AI-ready data platforms, SearchNEPA and WindAssist, that offer seamless access to past review and scientific documents. SearchNEPA is a cloud-driven, AI-ready search tool that enables fast retrieval of policy-relevant information from over 28k environmental review documents. We developed several data standardization and augmentation techniques to improve the quality of and access to the data records, along with a NEPA ontology that includes data objects such as project, process, document, public involvement, comments, and GIS. SearchNEPA runs on (i) a single low-cost cloud data store that accommodates diverse data types, (ii) open, standardized, AI-compatible storage formats that facilitate comprehensive and fast data consumption, and (iii) end-to-end streaming that automates PDF data ingestion, metadata enrichment, and application endpoints. SearchNEPA opens a wide range of applications for both insight and foresight on NEPA performance and risk, including deep searches at multiple hierarchy levels, chatting with a collection of NEPA documents using Retrieval Augmented Generation (RAG) techniques, analytics from historical reviews that can inform future studies, and geo-visualization of NEPA projects using GIS information extracted from their documents. More recently, we publicly released the text corpus of more than 28k NEPA documents that powers SearchNEPA, the National Environmental Policy Act Text Corpus (NEPATEC1.0). WindAssist is a multimodal, AI-driven data and knowledge discovery platform for wind energy-related documents collected from the Tethys database, consisting of scientific articles, technical reports, and NEPA reviews. It curates and unifies diverse data modalities into a structured, AI-ready form via multimodal vectorization that facilitates efficient information access. Specifically, we leverage large language model-based parsing to extract text, images, and tables from 2,709 PDF documents, comprising 125k text chunks and 30k images. The platform offers a multimodal search tool, WindAssist-Search, which independently retrieves text and images that are semantically relevant to the query.
In addition to search, the multimodal conversation assistant tool, WindAssist-Chat, compiles the retrieved documents and images to generate responses to user queries. Unlike traditional keyword-based search and chat assistant tools, our database ensures comprehensive data retrieval at a granular level, down to individual pages or text within images, and the tools make effective use of this information to craft high-quality responses. SearchNEPA and WindAssist align with DOE efforts to transform the department’s vast repository of data into a high-quality, AI-ready form and to accelerate the deployment of clean energy by streamlining siting and permitting processes, providing a one-stop platform for various federal agencies (via OneID, login.gov, etc.) to (i) manage NEPA review and scientific documents, (ii) search for relevant information, and (iii) surface hidden insights that together support agency-specific NEPA workflows.
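The retrieval step underlying RAG-style search over such document collections can be sketched with toy bag-of-words vectors standing in for a real embedding model; the document snippets and scoring below are illustrative only.

```python
# Simplified retrieval step behind a RAG-style search like the platforms
# described above. Toy bag-of-words vectors stand in for a real embedding model.
import numpy as np

docs = ["wetland mitigation measures for transmission line corridors",
        "noise impact assessment for offshore wind turbine installation",
        "cultural resource survey along proposed pipeline route"]

vocab = sorted({w for d in docs for w in d.split()})

def embed(text):
    v = np.array([text.split().count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

doc_vecs = np.stack([embed(d) for d in docs])

def retrieve(query, k=2):
    scores = doc_vecs @ embed(query)          # cosine similarity of unit vectors
    top = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), docs[i]) for i in top]

for score, text in retrieve("impact of wind turbine noise"):
    print(f"{score:.2f}  {text}")
```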
11:00 A.M. EST
Speaker:
Joaquin Chung
(+ Flavio Castro)
SciStream: Enabling Data Streaming between Science Instruments and HPC Nodes
Associated Organizations: Argonne National Laboratory
» Abstract
Memory-to-memory data streaming between scientific instruments and remote high-performance computing (HPC) nodes has emerged as a key requirement for online processing of high-volume and high-velocity data for feature detection, experiment steering, and other purposes. In contrast to file transfer between scientific facilities, for which a well-defined architecture exists in the form of the Science DMZ, data transfer nodes (DTNs), and the associated tools, there is no well-defined infrastructure to enable efficient and secure memory-to-memory data streaming between scientific instruments and HPC nodes. This gap is especially significant because both scientific instruments and HPC nodes typically lack direct external network connectivity. SciStream establishes a well-defined architecture and control protocols, with an open-source implementation, to enable distributed scientific workflows to use their choice of data streaming tools to move data from scientific instruments’ memory to HPC nodes’ memory. In this demo, we will describe the architecture and protocols that SciStream uses to establish authenticated and transparent connections between producers and consumers, and show how to stream data through SciStream. We will also discuss our experience integrating and running real-world scientific applications with SciStream.
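The bridging idea can be pictured with a toy Python relay: a gateway with connectivity to both sides forwards a memory-to-memory byte stream between endpoints that cannot reach each other directly. Ports, framing, and threading here are hypothetical and do not represent SciStream’s control protocol or security model.

```python
# Toy relay illustrating the bridging idea described above: a gateway forwards
# a byte stream between a producer and a consumer that have no direct path.
import socket, threading

def relay(listen_port, target_host, target_port, ready):
    srv = socket.socket(); srv.bind(("127.0.0.1", listen_port)); srv.listen(1)
    ready.set()
    conn, _ = srv.accept()                        # producer-side connection
    dst = socket.create_connection((target_host, target_port))  # consumer side
    while chunk := conn.recv(65536):              # forward the byte stream
        dst.sendall(chunk)
    dst.close(); conn.close(); srv.close()

def consumer(port, ready):
    srv = socket.socket(); srv.bind(("127.0.0.1", port)); srv.listen(1)
    ready.set()
    conn, _ = srv.accept()
    total = 0
    while chunk := conn.recv(65536):
        total += len(chunk)
    print("consumer received", total, "bytes")
    conn.close(); srv.close()

c_ready, r_ready = threading.Event(), threading.Event()
threading.Thread(target=consumer, args=(9001, c_ready)).start()
r = threading.Thread(target=relay, args=(9000, "127.0.0.1", 9001, r_ready))
r.start()
c_ready.wait(); r_ready.wait()
prod = socket.create_connection(("127.0.0.1", 9000))   # "instrument" producer
prod.sendall(b"x" * 1_000_000); prod.close()
r.join()
```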
12:00 P.M. EST
Speaker:
Valentine Anantharaj
A Digital Twin Toward Monitoring and Forecasting of Severe Weather Events in the Future
Associated Organizations: Oak Ridge National Laboratory
» Abstract
Significant improvements in numerical weather prediction (NWP) can be attributed to modern satellite observing systems and the data assimilation techniques that enable the incorporation of satellite observations in NWP models. When new instruments are developed, it is essential to understand their usefulness for weather prediction and climate monitoring, especially for extreme events, in the context of existing observations. The value added by a new satellite instrument can be evaluated using a digital twin surrogate of the instrument and by performing Observing System Simulation Experiments. The next-generation NOAA Geostationary Extended Observations (GeoXO) satellites are planned to be operational in the early 2030s. The GeoXO payload will include an atmospheric sounder instrument (GXS) that will enable timely and improved forecasting of hurricanes and severe weather. We will demonstrate and discuss the potential advantages of GXS, using a digital twin prototype for the GXS, toward forecasting a hypothetical severe weather event. The prototype GXS digital twin was developed at the University of Wisconsin-Madison Space Science and Engineering Center, using data from a global 1-km seasonal simulation completed by the European Centre for Medium-Range Weather Forecasts at the Oak Ridge Leadership Computing Facility via a DOE ASCR INCITE award.
1:00 P.M. EST
Speaker:
Derek Mariscal
Sidekick System: AI-enabled High Repetition Rate Laser experiment software
Associated Organizations: Lawrence Livermore National Laboratory
» Abstract
State-of-the-art physics drivers, such as Lawrence Livermore National Laboratory’s National Ignition Facility and ITER, along with multi-million-dollar university-scale facilities, are not optimal platforms for pioneering wide-ranging explorations in digital infrastructure and closed-loop control schemes using AI, as they are engaged in critical scientific research with expensive equipment. To address this challenge, we are developing modular, flexible small-scale surrogate facilities, termed “sidekick facilities,” which replicate the complex non-physics aspects of closed-loop autonomous operations and enhance data generation and acquisition rates.
2:00 P.M. EST
Speaker:
Benjamin Mintz
Interconnected Science Ecosystem (INTERSECT) for Autonomous Laboratories
Associated Organizations: Oak Ridge National Laboratory
» Abstract
The convergence of recent advancements in artificial intelligence, robotics, advanced instrumentation, edge and high-performance computing, and high-throughput networks is ushering in a transformative era for self-driving autonomous laboratories. Yet, without a coordinated scientific approach, the landscape risks becoming fragmented with disparate solutions that lack seamless integration, leading to siloed “smart” labs and user facilities. The Interconnected Science Ecosystem (INTERSECT) initiative is co-designing a common ecosystem and building interoperable “self-driving” autonomous laboratories across the ORNL directorates. The primary goal of the INTERSECT initiative is to lead and coordinate smart laboratory efforts across ORNL to unlock unprecedented efficiencies and groundbreaking new research approaches/outcomes.
3:00 P.M. EST
Speaker:
Paul Lin
Accurate in-situ in-transit analysis of particle diffusion for large-scale tokamak simulations
Associated Organizations: Lawrence Berkeley Laboratory, Oak Ridge National Laboratory, Princeton Plasma Physics Laboratory
» Abstract
In turbulent and stochastic magnetically confined fusion plasmas, transport analysis based on grid quantities often misses important physics, such as subgrid phenomena and particle auto-correlation. Traditional in-line analysis significantly increases the memory footprint and slows down the main simulation. An accurate in-situ in-transit workflow is presented here to demonstrate a streaming analysis method focused on the large-scale data generated from plasma particles in magnetic confinement fusion simulations, specifically using XGC. This workflow streams the computational data from simulation nodes to dedicated analysis nodes using asynchronous data movement. It thus enables parallelization of the data analysis and the main simulation without affecting the memory footprint or speed of the main simulation, offering insights into fusion plasma behavior with unparalleled detail and efficiency in near real time.
4:00 P.M. EST
Speaker:
Alex Lovell-Troy
OpenCHAMI: Open Source HPC System Management for Future Generations of DOE HPC
Associated Organizations: Los Alamos National Laboratory
» Abstract
Several DOE labs are collaboratively building a set of composable system management tools with the goal of being able to run the next generation(s) of large HPC procurements. Alex Lovell-Troy from LANL will demonstrate several of the components, including hardware discovery, federation, and inventory management, and discuss the overall architecture of the project. This will be especially useful for system administrators responsible for managing HPC systems. The OpenCHAMI consortium is focused on reducing toil for sysadmins with responsibility for multiple systems.
5:00 P.M. EST
Speaker:
Derek Mariscal
Sidekick System: AI-enabled High Repetition Rate Laser experiment software
Associated Organizations: Lawrence Livermore National Laboratory
» Abstract
State-of-the-art physics drivers, such as Lawrence Livermore National Laboratory’s National Ignition Facility and ITER, along with multi-million-dollar university-scale facilities, are not optimal platforms for pioneering wide-ranging explorations in digital infrastructure and closed-loop control schemes using AI, as they are engaged in critical scientific research with expensive equipment. To address this challenge, we are developing modular, flexible small-scale surrogate facilities, termed “sidekick facilities,” which replicate the complex non-physics aspects of closed-loop autonomous operations and enhance data generation and acquisition rates.
WEDNESDAY, NOV. 20
Station 1
10:00 A.M. EST
Speaker:
David Rogers
IRI Early Technologies and Applications Demos
Associated Organizations: Argonne National Laboratory, Lawrence Berkeley Laboratory, Oak Ridge National Laboratory, SLAC National Accelerator Laboratory, ESnet, Lawrence Livermore National Laboratory
» Abstract
The DOE’s Integrated Research Infrastructure aims to empower researchers to meld DOE’s world-class research tools, infrastructure, and user facilities seamlessly and securely in novel ways to radically accelerate discovery and innovation. The tools and technology emerging from this effort are already beginning to open new avenues for how we use experimental and computational scientific instruments together. This session will show video demos of technologies and applications that define the current state of practice for time-sensitive, data-integration-intensive, and long-term campaign patterns. As entry points into this new design space, they offer a glimpse of what could be achieved if we coordinate to meet the difficult challenges of deploying IRI.
11:00 A.M. EST
Speaker:
Craig Vineyard
(+ Christian Mayr)
The SpiNNaker2 Neuromorphic Computing Architecture – LLMs, Optimization, & AI/ML
Associated Organizations: Sandia National Laboratories
» Abstract
Inspired by principles of the brain, SpiNNaker2 is a many-core neuromorphic chip designed for large-scale asynchronous processing. The flexibility provided by its reconfigurability, the scalability afforded by its real-time, large-scale mesh, and its native support for hybrid acceleration of symbolic spiking and deep neural networks make SpiNNaker2 a unique computing platform. Sandia National Laboratories has partnered with SpiNNcloud to explore the computational advantages neuromorphic computing can enable for a variety of applications. This demo will showcase the SpiNNaker2 architecture across a range of applications – large language models (LLMs), optimization, and AI/ML.
12:00 P.M. EST
Speaker:
Craig Vineyard
(+ Christian Mayr)
The SpiNNaker2 Neuromorphic Computing Architecture – LLMs, Optimization, & AI/ML
Associated Organizations: Sandia National Laboratories
» Abstract
Inspired by principles of the brain, SpiNNaker2 is a many-core neuromorphic chip designed for large-scale asynchronous processing. The flexibility provided by its reconfigurability, the scalability afforded by its real-time, large-scale mesh, and its native support for hybrid acceleration of symbolic spiking and deep neural networks make SpiNNaker2 a unique computing platform. Sandia National Laboratories has partnered with SpiNNcloud to explore the computational advantages neuromorphic computing can enable for a variety of applications. This demo will showcase the SpiNNaker2 architecture across a range of applications – large language models (LLMs), optimization, and AI/ML.
1:00 P.M. EST
Speaker:
Jim Brandt
Data-Driven Autonomous Operations
Associated Organizations and partnerships: Sandia National Laboratories
» Abstract
This demonstration shows how the Lightweight Distributed Metric Service (LDMS) enables autonomous operations by gathering and performing runtime analysis of application and system data and by providing a real-time feedback path to optimize resource utilization.
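A toy monitor-analyze-feedback loop conveys the pattern; the metric names, thresholds, and corrective actions below are hypothetical and do not use the LDMS API.

```python
# Toy monitor-analyze-feedback loop in the spirit of what is described above.
# Metric names, thresholds, and the "actions" are hypothetical stand-ins.
import random, time

def sample_metrics():
    """Stand-in for metrics gathered from nodes (e.g., by a metrics sampler)."""
    return {"mem_used_frac": random.uniform(0.5, 0.99),
            "io_wait_frac": random.uniform(0.0, 0.4)}

def decide(metrics):
    if metrics["mem_used_frac"] > 0.9:
        return "throttle job launches on this node"
    if metrics["io_wait_frac"] > 0.3:
        return "redirect output to a less loaded storage target"
    return None

for step in range(5):                       # runtime analysis loop
    m = sample_metrics()
    action = decide(m)
    if action:
        print(f"step {step}: feedback -> {action}  ({m})")
    time.sleep(0.1)
```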
2:00 P.M. EST
Speaker:
Vardan Gyurjyan
(Amitoj Singh)
Practical Hardware Accelerated Real-Time Multi-facility Streaming Workflow
Associated Organizations: Jefferson Laboratory
» Abstract
The CLAS12 experiment at Jefferson Lab produces immense volumes of raw data that demand immediate processing to enable prompt physics analysis. We have developed an innovative approach that seamlessly streams this data in real time across the ESnet6 network backbone to multiple high-performance computing (HPC) facilities, including the Perlmutter supercomputer at NERSC, the Defiant cluster at Oak Ridge National Laboratory, and additional computational resources in the NSF FABRIC testbed. These distributed resources collaboratively process the data stream and return results to Jefferson Lab in real time for validation, persistence, and final analysis—all accomplished without buffering, temporary storage, data loss, or latency issues.
This achievement is underpinned by three cutting-edge technologies developed by ESnet and Jefferson Lab. (1) EJFAT (ESnet JLab FPGA Accelerated Transport) is a high-speed data transport and load-balancing mechanism that utilizes FPGA acceleration to optimize real-time data transmission over ESnet6. (2) JIRIAF (JLab Integrated Research Infrastructure Across Facilities) provides a framework that streamlines resource management and optimizes HPC workloads across heterogeneous environments by leveraging Kubernetes and Virtual Kubelet to manage resources within user space dynamically. (3) ERSAP (Environment for Real-Time Streaming, Acquisition and Processing) is a reactive, actor-model, flow-based programming framework that decomposes data processing applications into small, monofunctional actors. This decomposition allows for independent scaling and optimization of each actor and context-aware data processing, facilitating operation in heterogeneous environments and utilizing diverse accelerators.
Our demonstration confirms that real-time remote data stream processing over high-speed networks without intermediate storage is feasible and highly efficient. This approach represents a significant advancement in data analysis workflows for large-scale physics experiments, offering a scalable and resilient solution for real-time scientific computing.
3:00 P.M. EST
Speaker:
Craig Vineyard
TOPNMC: Exploring Neuromorphic Computing Impact
Associated Organizations and partnerships: Sandia National Laboratories
» Abstract
The field of neuromorphic computing looks to the brain for inspiration in designing algorithms and architectures. As the field matures, it is crucial to explore the capabilities of top neuromorphic systems. Here we explore trends in the development of large-scale neuromorphic systems, as well as consider applications neuromorphic approaches may impact. By highlighting these advancements, we aim to demonstrate the emerging role neuromorphic computing may have for HPC.
4:00 P.M. EST
Speaker:
Verónica G. Melesse Vergara
DOE-NIH-NSF Collaboration: Deploying Biomedical Retrieval Augmented Generation pipelines on Frontier as part of the NAIRR Secure Pilot
Associated Organizations: Oak Ridge National Laboratory
» Abstract
This demonstration will showcase new capabilities being piloted by the Oak Ridge Leadership Computing Facility (OLCF) to support deployment of biomedical Large Language Model (LLM) pipelines on ORNL’s Frontier Citadel environment. This demonstration features work resulting from a collaboration between ORNL, NIH’s National Center for Advancing Translational Sciences (NCATS), and the National Science Foundation (NSF) as part of the NAIRR Secure Pilot program. The demonstration showcases biomedical application workflows through JupyterHub on OLCF’s Frontier supercomputer and a prototype for “Ask AIthena”, an intuitive chat interface using private and secure LLMs on OLCF’s Frontier supercomputer and NCATS’s DALÍ HPC system. Presenters: Nick Schaub, Hugo Hernández, Matt Ezell, Verónica Melesse Vergara
5:00 P.M. EST
Speaker:
Sutanay Choudhury
AI-guided Hypothesis Generation and Design of Catalysts with Complex Morphologies and Reaction Networks
Associated Organizations: Pacific Northwest National Laboratory
» Abstract
We present an AI-driven framework for catalyst discovery, combining linguistic reasoning with quantum chemistry feedback. Our approach uses large language models (LLMs) to generate hypotheses and graph neural networks (GNNs) to evaluate 3D atomistic structures. The iterative process incorporates structural evaluation, reaction pathways, and stability assessments. Automated planning methods guide the exploration, rivaling expert-driven approaches. This integration of language-guided reasoning and computational chemistry feedback accelerates trustworthy catalyst discovery for sustainable chemical processes.
Station 2
10:00 A.M. EST
Speaker:
Joaquin Chung
(+ Caitao Zhan)
SeQUeNCe, a Customizable Discrete-Event Simulator of Quantum Networks
Associated Organizations: Argonne National Laboratory
» Abstract
Quantum networks promise to deliver revolutionary applications such as distributing cryptographic keys with provable security, ultra-high-precision distributed sensing, and synchronizing clocks with unprecedented accuracy. Recent breakthroughs in quantum engineering have allowed experimental realizations of quantum network prototypes. A key engineering challenge is building networks that scale along the dimensions of node count, user count, distance, and application diversity. Achieving this goal requires advances in hardware engineering, network architectures and protocols. Quantum network simulations can help in understanding the tradeoffs of alternative quantum network architectures, optimizing quantum hardware, and developing a robust control plane. Simulator of QUantum Network Communication (SeQUeNCe) is a customizable, discrete-event quantum network simulator that models quantum hardware and network protocols. SeQUeNCe uses a modularized design that allows the testing of alternative quantum network protocols and hardware models and the study of their interactions. In this demo, we will introduce SeQUeNCe and present its design, interface, and capabilities.
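The discrete-event style of simulation SeQUeNCe uses can be illustrated with a generic event loop: timestamped events sit in a priority queue and are processed in time order. The event types and delays below are hypothetical, and this is not SeQUeNCe’s API.

```python
# Generic discrete-event loop illustrating the simulation style described
# above: timestamped events in a priority queue, processed in time order.
import heapq

events = []                                   # (time_ns, sequence, callback)
seq = 0

def schedule(t, callback):
    global seq
    heapq.heappush(events, (t, seq, callback))
    seq += 1

def emit_photon(t, node):
    delay = 50_000                            # assumed 50 us fiber delay
    print(f"{t} ns: {node} emits photon")
    schedule(t + delay, lambda t2: detect(t2, "NodeB"))

def detect(t, node):
    print(f"{t} ns: {node} detects photon")

schedule(0, lambda t: emit_photon(t, "NodeA"))
schedule(1_000, lambda t: emit_photon(t, "NodeA"))

while events:                                 # run the simulation
    t, _, cb = heapq.heappop(events)
    cb(t)
```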
11:00 A.M. EST
Speaker:
Imran Latif
Advancing Sustainability in Data Centers: Evaluation of Hybrid Air/Liquid Cooling Schemes for IT Payload using Sea Water
Associated Organizations: Brookhaven National Laboratory
» Abstract
The SINES DC facility in Sines, Portugal, developed by Start Campus, is a 1.2 GW data center campus poised to become Europe’s largest and most sustainable by 2030. This ambitious project features a groundbreaking seawater cooling system, using the ocean as a natural heat sink without depleting water resources. With an investment of €8.5 billion, SINES DC will operate entirely on renewable energy, aiming for an industry-leading Power Usage Effectiveness (PUE) of 1.1. Its first building, SIN01, is designed for up to 15 MW of IT load and incorporates both liquid- and air-cooling technologies. The facility’s Liquid Cooled Lab (LCL) will support 1 MW of IT load, providing a testing environment for cutting-edge cooling technologies like direct-to-chip and immersion cooling. By utilizing ocean water, the campus achieves a Water Usage Effectiveness (WUE) of zero. BNL researchers leading the study will discuss carbon reduction and chilled-water flow optimization, which set a new benchmark for sustainability and efficiency in data center operations worldwide.
12:00 P.M. EST
Speaker:
Rohith Anand Varikoti
CACTUS: Harnessing Open-Source LLMs and Domain-Specific Tools for Advanced Chemistry Reasoning
Associated Organizations: Pacific Northwest National Laboratory
» Abstract
The rapid advancement of large language models (LLMs) has revolutionized various domains, including chemistry and molecular discovery. However, the ability of LLMs to access and reason over domain-specific knowledge and tools remains a significant challenge. In this demo, we present CACTUS (Chemistry AI Agent Connecting Tool-Usage to Science), an enhanced version of the CACTUS agent that leverages open-source LLMs and integrates domain-specific tools to enable accurate and efficient reasoning and problem-solving in chemistry. I will discuss the performance of state-of-the-art open-source LLMs, including Gemma-7b, Falcon-7b, MPT-7b, Llama3-8b, and Mistral-7b, on a comprehensive benchmark of chemistry questions. I will also highlight the impact of domain-specific prompting and hardware configurations on model performance, emphasizing the significance of prompt engineering and the feasibility of deploying smaller models on consumer-grade hardware without compromising accuracy. We will present real-world applications, including molecular discovery and material design, where CACTUS aids in hypothesis testing and validation, accelerating the discovery process and enabling data-driven decision-making.
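The agent pattern, an LLM choosing a domain tool and the framework executing it, can be sketched with a toy dispatcher. The “LLM” below is a hard-coded stub and the tools are hypothetical lookups, not CACTUS internals.

```python
# Toy tool-dispatch loop showing the agent pattern described above: the model
# picks a domain tool and the framework executes it. Names are hypothetical.
TOOLS = {
    "molecular_weight": lambda formula: {"H2O": 18.02, "CO2": 44.01}.get(formula),
    "is_aromatic":      lambda smiles: smiles in {"c1ccccc1"},
}

def fake_llm(question):
    """Stand-in for an open-source LLM choosing a tool and its argument."""
    if "weight" in question:
        return ("molecular_weight", "CO2")
    return ("is_aromatic", "c1ccccc1")

def agent(question):
    tool, arg = fake_llm(question)            # reasoning step (stubbed)
    result = TOOLS[tool](arg)                 # tool execution step
    return f"{tool}({arg}) -> {result}"

print(agent("What is the molecular weight of CO2?"))
print(agent("Is benzene aromatic?"))
```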
1:00 P.M. EST
Speaker:
Ana Gainaru
I/O Innovations for Modern HPC Workflows
Associated Organizations and partnerships: Oak Ridge National Laboratory
» Abstract
High-performance computing applications are increasingly executed as complex workflows integrating AI, visualization, and in-situ analysis. To meet the demands of these modern applications, middleware tools must adapt to the challenges of managing distributed datasets, complex queries, and large-scale data analysis. This demonstration showcases new features designed to efficiently support scientific campaigns involving multiple runs, terabytes of data per run, and the visualization and analysis of primary variables and derived data. We will highlight solutions for efficient local and in-situ data handling, ensuring optimal performance and usability for researchers.
2:00 P.M. EST
Speaker:
Ana Gainaru
I/O Innovations for Modern HPC Workflows
Associated Organizations and partnerships: Oak Ridge National Laboratory
» Abstract
High-performance computing applications are increasingly executed as complex workflows integrating AI, visualization, and in-situ analysis. To meet the demands of these modern applications, middleware tools must adapt to the challenges of managing distributed datasets, complex queries, and large-scale data analysis. This demonstration showcases new features designed to efficiently support scientific campaigns involving multiple runs, terabytes of data per run, and the visualization and analysis of primary variables and derived data. We will highlight solutions for efficient local and in-situ data handling, ensuring optimal performance and usability for researchers.
3:00 P.M. EST
Speaker:
Albert Vong
Live Experiment-time Analysis of Advanced Photon Source Experiment Data using ALCF’s Polaris Supercomputer
Associated Organizations: Argonne National Laboratory
» Abstract
This demonstration showcases the use of the ALCF Polaris supercomputer for processing data (both online and offline) from Advanced Photon Source (APS) experiments in near real time. It shows how the APS utilizes Globus Compute to create end-to-end data workflows connecting APS instruments with computing resources at the ALCF, and how ALCF infrastructure innovations, including beamline service accounts, facility allocations and instrument suballocations, and management nodes, facilitate this integration. These capabilities will be demonstrated for a high-impact technique at the APS, X-ray Photon Correlation Spectroscopy (XPCS). The APS, located at Argonne National Laboratory, is a synchrotron light source funded by the U.S. Department of Energy (DOE) Office of Science-Basic Energy Sciences (BES) to produce high-energy, high-brightness x-ray beams. The APS has become one of the largest scattering user facilities in the world, averaging 5,500 unique users and producing more than 2,000 scientific publications every year. Scientists from every state in the U.S., along with international users, utilize the 68 beamlines at the APS to conduct cutting-edge basic and applied research in the fields of materials science, biological and life science, physics, chemistry, environmental, geophysical, and planetary science, and innovative x-ray instrumentation. As part of the APS Upgrade project, the facility has replaced the storage ring and is in the midst of commissioning new and upgraded beamlines and instruments. More than ever before, advanced computational approaches and technologies are essential to fully unlocking the scientific potential of the facility. The upgraded source opens the door to new measurement techniques and increases in throughput, which, coupled with technological advances in detectors, new multi-modal data, and advances in data analysis algorithms, including artificial intelligence and machine learning (AI/ML), will open a new era of research enabled by synchrotron light sources. In particular, the high brightness and increased coherent x-ray flux of the new APS are leading to significant increases in data rates and experiment complexity that can only be addressed with advanced computing capabilities. Join us to see firsthand how collaboration between the APS and the ALCF is driving cutting-edge research forward.
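A minimal sketch of the Globus Compute submission pattern mentioned above appears below; it assumes the globus-compute-sdk package and an active endpoint, the endpoint ID is a placeholder, and the per-frame analysis function is hypothetical.

```python
# Minimal sketch of the Globus Compute pattern mentioned above: submit a
# function to run on a remote endpoint (e.g., one hosted at an HPC facility).
# The endpoint ID is a placeholder and the analysis function is hypothetical;
# requires the globus-compute-sdk package and an active endpoint.
from globus_compute_sdk import Executor

def reduce_frame(frame_path):
    """Hypothetical per-frame reduction step for an XPCS-style workflow."""
    return {"frame": frame_path, "status": "reduced"}

ENDPOINT_ID = "00000000-0000-0000-0000-000000000000"   # placeholder UUID

with Executor(endpoint_id=ENDPOINT_ID) as ex:
    future = ex.submit(reduce_frame, "/data/aps/xpcs/frame_0001.h5")
    print(future.result())
```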
4:00 P.M. EST
Speaker:
Alex Lovell-Troy
OpenCHAMI: Open Source HPC System Management for Future Generations of DOE HPC
Associated Organizations: Los Alamos National Laboratory
» Abstract
Several DOE labs are collaboratively building a set of composable system management tools with the goal of being able to run the next generation(s) of large HPC procurements. Alex Lovell-Troy from LANL will demonstrate several of the components, including hardware discovery, federation, and inventory management, and discuss the overall architecture of the project. This will be especially useful for system administrators responsible for managing HPC systems. The OpenCHAMI consortium is focused on reducing toil for sysadmins with responsibility for multiple systems.
5:00 P.M. EST
Speaker:
Valentine Anantharaj
A Digital Twin Toward Monitoring and Forecasting of Severe Weather Events in the Future
Associated Organizations: Oak Ridge National Laboratory
» Abstract
Significant improvements in numerical weather prediction (NWP) can be attributed to modern satellite observing systems and the data assimilation techniques that enable the incorporation of satellite observations in NWP models. When new instruments are developed, it is essential to understand their usefulness for weather prediction and climate monitoring, especially for extreme events, in the context of existing observations. The value added by a new satellite instrument can be evaluated using a digital twin surrogate of the instrument and by performing Observing System Simulation Experiments. The next-generation NOAA Geostationary Extended Observations (GeoXO) satellites are planned to be operational in the early 2030s. The GeoXO payload will include an atmospheric sounder instrument (GXS) that will enable timely and improved forecasting of hurricanes and severe weather. We will demonstrate and discuss the potential advantages of GXS, using a digital twin prototype for the GXS, toward forecasting a hypothetical severe weather event. The prototype GXS digital twin was developed at the University of Wisconsin-Madison Space Science and Engineering Center, using data from a global 1-km seasonal simulation completed by the European Centre for Medium-Range Weather Forecasts at the Oak Ridge Leadership Computing Facility via a DOE ASCR INCITE award.
THURSDAY, NOV. 21
Station 1
10:00 A.M. EST
Speaker:
Sai Munikoti
Next-Generation AI Tools for Environmental Review and Permitting Efficiency
Associated Organizations: Pacific Northwest National Laboratory
» Abstract
The National Environmental Policy Act (NEPA) of 1969 is a bedrock and enduring environmental law in the United States with the express intent of fostering a productive harmony between humans and the environment for present and future generations. The NEPA statute and the implementing regulations of the Council on Environmental Quality establish procedures requiring all federal agencies to consider and communicate to the public the environmental effects of their planning and decisions. Thus, agencies conduct environmental reviews and prepare written documents disclosing the potential environmental impact of proposed actions. Additionally, the scientific community conducts various studies to measure the environmental impact of actions and projects, which in turn support NEPA review. Environmental data serves as a fundamental building block in streamlining NEPA reviews: the rich information contained in historical review and scientific documents could enable us to efficiently retrieve, analyze, and find patterns that can inform future NEPA reviews. Currently, these documents are distributed across several agencies, and their raw (PDF) form keeps them from being searchable or coupled with AI applications. We present two AI-ready data platforms, SearchNEPA and WindAssist, that offer seamless access to past review and scientific documents. SearchNEPA is a cloud-driven, AI-ready search tool that enables fast retrieval of policy-relevant information from over 28k environmental review documents. We developed several data standardization and augmentation techniques to improve the quality of and access to the data records, along with a NEPA ontology that includes data objects such as project, process, document, public involvement, comments, and GIS. SearchNEPA runs on (i) a single low-cost cloud data store that accommodates diverse data types, (ii) open, standardized, AI-compatible storage formats that facilitate comprehensive and fast data consumption, and (iii) end-to-end streaming that automates PDF data ingestion, metadata enrichment, and application endpoints. SearchNEPA opens a wide range of applications for both insight and foresight on NEPA performance and risk, including deep searches at multiple hierarchy levels, chatting with a collection of NEPA documents using Retrieval Augmented Generation (RAG) techniques, analytics from historical reviews that can inform future studies, and geo-visualization of NEPA projects using GIS information extracted from their documents. More recently, we publicly released the text corpus of more than 28k NEPA documents that powers SearchNEPA, the National Environmental Policy Act Text Corpus (NEPATEC1.0). WindAssist is a multimodal, AI-driven data and knowledge discovery platform for wind energy-related documents collected from the Tethys database, consisting of scientific articles, technical reports, and NEPA reviews. It curates and unifies diverse data modalities into a structured, AI-ready form via multimodal vectorization that facilitates efficient information access. Specifically, we leverage large language model-based parsing to extract text, images, and tables from 2,709 PDF documents, comprising 125k text chunks and 30k images. The platform offers a multimodal search tool, WindAssist-Search, which independently retrieves text and images that are semantically relevant to the query.
In addition to search, the multimodal conversation assistant tool, WindAssist-Chat, compiles the retrieved documents and images to generate responses to user queries. Unlike traditional keyword-based search and chat assistant tools, our database ensures comprehensive data retrieval at a granular level, down to individual pages or text within images, and the tools make effective use of this information to craft high-quality responses. SearchNEPA and WindAssist align with DOE efforts to transform the department’s vast repository of data into a high-quality, AI-ready form and to accelerate the deployment of clean energy by streamlining siting and permitting processes, providing a one-stop platform for various federal agencies (via OneID, login.gov, etc.) to (i) manage NEPA review and scientific documents, (ii) search for relevant information, and (iii) surface hidden insights that together support agency-specific NEPA workflows.
11:00 A.M. EST
Speaker:
Craig Vineyard
TOPNMC: Exploring Neuromorphic Computing Impact
Associated Organizations and partnerships: Sandia National Laboratories
» Abstract
The field of neuromorphic computing looks to the brain for inspiration in designing algorithms and architectures. As the field matures, it is crucial to explore the capabilities of top neuromorphic systems. Here we explore trends in the development of large-scale neuromorphic systems, as well as consider applications neuromorphic approaches may impact. By highlighting these advancements, we aim to demonstrate the emerging role neuromorphic computing may have for HPC.
12:00 P.M. EST
Speaker:
Jim Brandt
Data-Driven Autonomous Operations
Associated Organizations and partnerships: Sandia National Laboratories
» Abstract
This demonstration shows how the Lightweight Distributed Metric Service (LDMS) enables autonomous operations by gathering and performing runtime analysis of application and system data and by providing a real-time feedback path to optimize resource utilization.
2:00 P.M. EST
Speaker:
Amir Shehata
HPC/QC Integration Framework
Associated Organizations and partnerships: Oak Ridge National Laboratory
» Abstract
In recent years, quantum computing has demonstrated the potential to revolutionize specific algorithms and applications by solving problems exponentially faster than classical computers. However, its widespread adoption for general computing remains a future prospect. This demo discusses the integration of quantum computing within High-Performance Computing (HPC) environments, focusing on a resource management framework designed to streamline the use of quantum simulators and enhance runtime performance and efficiency.
Station 2
10:00 A.M. EST
Speaker:
Rohith Anand Varikoti
CACTUS: Harnessing Open-Source LLMs and Domain-Specific Tools for Advanced Chemistry Reasoning
Associated Organizations: Pacific Northwest National Laboratory
» Abstract
The rapid advancement of large language models (LLMs) has revolutionized various domains, including chemistry and molecular discovery. However, the ability of LLMs to access and reason over domain-specific knowledge and tools remains a significant challenge. In this demo, we present CACTUS (Chemistry AI Agent Connecting Tool-Usage to Science), an enhanced agent that leverages open-source LLMs and integrates domain-specific tools to enable accurate and efficient reasoning and problem-solving in chemistry. I will discuss the performance of state-of-the-art open-source LLMs, including Gemma-7b, Falcon-7b, MPT-7b, Llama3-8b, and Mistral-7b, on a comprehensive benchmark of chemistry questions. I will also highlight the impact of domain-specific prompting and hardware configurations on model performance, emphasizing the significance of prompt engineering and the feasibility of deploying smaller models on consumer-grade hardware without compromising accuracy. We will present real-world applications, including molecular discovery and material design, where CACTUS aids in hypothesis testing and validation, accelerating the discovery process and enabling data-driven decision-making.
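As a rough illustration of the tool-calling pattern an agent like CACTUS relies on, the sketch below wires a stubbed language model to a single toy cheminformatics tool. The llm() stub, the atom_count tool, and the JSON tool-call format are assumptions made for illustration, not the project’s actual interfaces.

```python
# Hedged sketch of an LLM agent's tool-calling loop.
# The llm() stub and the single "atom_count" tool are illustrative stand-ins,
# not the project's actual models or cheminformatics tools.
import json, re

def atom_count(smiles: str) -> int:
    """Toy tool: count heavy-atom symbols in a SMILES string."""
    return len(re.findall(r"Cl|Br|[BCNOSPFI]", smiles))

TOOLS = {"atom_count": atom_count}

def llm(prompt: str) -> str:
    # Stand-in for an open-source LLM (e.g., Mistral-7b) deciding which tool
    # to call; a real agent would parse the model's generated output here.
    return json.dumps({"tool": "atom_count", "args": {"smiles": "CCO"}})

def run_agent(question: str) -> str:
    decision = json.loads(llm(f"Question: {question}\nAvailable tools: {list(TOOLS)}"))
    result = TOOLS[decision["tool"]](**decision["args"])
    # A second LLM call would normally turn the tool result into prose.
    return f"{decision['tool']}({decision['args']}) -> {result}"

print(run_agent("How many heavy atoms are in ethanol?"))
```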
11:00 A.M. EST
Speaker:
Benjamin Mintz
Interconnected Science Ecosystem (INTERSECT) for Autonomous Laboratories
Associated Organizations: Oak Ridge National Laboratory
» Abstract
The convergence of recent advancements in artificial intelligence, robotics, advanced instrumentation, edge and high-performance computing, and high-throughput networks is ushering in a transformative era for self-driving autonomous laboratories. Yet, without a coordinated scientific approach, the landscape risks becoming fragmented with disparate solutions that lack seamless integration, leading to siloed “smart” labs and user facilities. The Interconnected Science Ecosystem (INTERSECT) initiative is co-designing a common ecosystem and building interoperable “self-driving” autonomous laboratories across the ORNL directorates. The primary goal of the INTERSECT initiative is to lead and coordinate smart laboratory efforts across ORNL to unlock unprecedented efficiencies and groundbreaking new research approaches/outcomes.
1:00 P.M. EST
Speaker:
Max Lupo-Pasini
HydraGNN: a scalable graph neural network architecture for accelerated material discovery and design
Associated Organizations: Oak Ridge National Laboratory
» Abstract
Deep learning (DL) models have shown the potential to greatly accelerate first-principles calculations while maintaining a high degree of accuracy. In particular, graph neural networks (GNNs) are effective DL models for materials science because they take advantage of the topological information of the data samples by representing the atomic configuration of each atomistic structure as a graph. In this representation, each atom is a node, and the edges between nodes represent the bonding interactions between atoms. In this demo, Max and his team will detail the utility and use of HydraGNN, a distributed graph neural network architecture for robust and scalable training on large volumes of atomistic materials modeling data. They will discuss how to effectively scale HydraGNN models on the Perlmutter supercomputer at NERSC as well as the Summit and Frontier supercomputers at the OLCF.
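For readers new to the graph representation described above, the minimal sketch below builds a node/edge structure for a single molecule. The water geometry, the 1.2 Å neighbor cutoff, and the array layout are illustrative assumptions, not HydraGNN’s own data pipeline.

```python
# Minimal sketch of the graph representation GNN-based materials models consume:
# atoms become nodes and near-neighbor pairs become edges.
import numpy as np

# Atomic numbers and Cartesian coordinates (angstroms) for one water molecule.
atomic_numbers = np.array([8, 1, 1])
positions = np.array([[0.000, 0.000, 0.000],
                      [0.957, 0.000, 0.000],
                      [-0.240, 0.927, 0.000]])

def build_edges(pos: np.ndarray, cutoff: float = 1.2) -> np.ndarray:
    """Connect every pair of atoms closer than the cutoff distance."""
    dists = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    src, dst = np.where((dists < cutoff) & (dists > 0))
    return np.stack([src, dst])  # shape (2, num_edges), as GNN libraries expect

edge_index = build_edges(positions)
node_features = atomic_numbers[:, None].astype(float)  # one feature per node
print(edge_index)        # the two O-H pairs, in both directions
print(node_features.T)   # [[8. 1. 1.]]
```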
2:00 P.M. EST
Speaker:
Max Lupo-Pasini
HydraGNN: a scalable graph neural network architecture for accelerated material discovery and design
Associated Organizations: Oak Ridge National Laboratory
» Abstract
Deep learning (DL) models have shown the potential to greatly accelerate first-principles calculations while maintaining a high degree of accuracy. In particular, graph neural networks (GNNs) are effective DL models for materials science because they take advantage of the topological information of the data samples by representing the atomic configuration of each atomistic structure as a graph. In this representation, each atom is a node, and the edges between nodes represent the bonding interactions between atoms. In this demo, Max and his team will detail the utility and use of HydraGNN, a distributed graph neural network architecture for robust and scalable training on large volumes of atomistic materials modeling data. They will discuss how to effectively scale HydraGNN models on the Perlmutter supercomputer at NERSC as well as the Summit and Frontier supercomputers at the OLCF.
Previous Years
» 2023
Monday, Nov. 13
Station 1
7 p.m.
Ann Gentile; Jim Brandt; Benjamin Schwaller; Tom Tucker (SNL) “AppSysFusion: ‘Always-on Monitoring’ on Sandia HPC Systems”
8 p.m.
Ann Gentile; Jim Brandt; Benjamin Schwaller; Tom Tucker (SNL) “AppSysFusion: ‘Always-on Monitoring’ on Sandia HPC Systems”
Monday, Nov. 13
Station 2
7 p.m.
Prasanna Balaprakash; Feiyi Wang; Sajal Dash; Junqi Yin; Dan Lu; Ashwin Aji; Leon Song (ORNL)“Accelerating scientific discoveries with DeepSpeed for Science and AMD-powered Frontier exascale supercomputer”
8 p.m.
Prasanna Balaprakash; Feiyi Wang; Sajal Dash; Junqi Yin; Dan Lu; Ashwin Aji; Leon Song (ORNL)“Accelerating scientific discoveries with DeepSpeed for Science and AMD-powered Frontier exascale supercomputer”
Tuesday, Nov. 14
Station 1
10 a.m.
Christian Trott; Bruno Turcksin; Daniel Arndt; Nevin Liber; Rahulkumar Gayatri; Sivasankaran Rajamanickam; Luc Berger-Vergiat (ANL, LBL)“Achieving performance portability with Kokkos”
11 a.m.
Hannah Parraga, Michael Prince (ANL)“Empowering Scientific Discovery at the APS with Integrated Computing”
12 p.m.
Mariam Kiran; Anastasiia Butko; Ren Cooper; Imtiaz Mahmud, Nirmalendu Patra; Matthew Verlie (ORNL, LBL)“5G on the Showfloor”
2 p.m.
Brad Richardson; Magne Haveraaen (Lawrence Berkeley Laboratory)“Fortran generics for 202y”
3 p.m.
Christine Simpson, Tom Uram, Rachana Ananthakrishnan, David Schissel, Hannah Parraga, Michael Prince (ANL)“Flexible cross-facility experimental data analysis at ALCF”
Tuesday, Nov. 14
Station 2
10 a.m.
Yao Xu; Gene Cooperman; Rebecca Hartman-Baker (LBL)“Transparent Checkpointing on Perlmutter for Long-Running Jobs”
11 a.m.
Thomas Applencourt; Abhishek Bagusetty (Argonne National Laboratory)“oneAPI and SYCL”
12 p.m.
Ann Gentile; Jim Brandt; Benjamin Schwaller; Tom Tucker (SNL)“AppSysFusion: ‘Always-on Monitoring’ on Sandia HPC Systems”
1 p.m.
Ann Gentile; Jim Brandt; Benjamin Schwaller; Tom Tucker (SNL)“AppSysFusion: ‘Always-on Monitoring’ on Sandia HPC Systems”
2 p.m.
Jean Luca Bez; Hammad Ather; Suren Byna; John Wu (LBL)“Drishti: Where is the I/O bottleneck?”
3 p.m.
Ann Gentile; Jim Brandt; Benjamin Schwaller; Tom Tucker (SNL)“AppSysFusion: ‘Always-on Monitoring’ on Sandia HPC Systems”
4 p.m.
Ann Gentile; Jim Brandt; Benjamin Schwaller; Tom Tucker (SNL)“AppSysFusion: ‘Always-on Monitoring’ on Sandia HPC Systems”
5 p.m.
Ann Gentile; Jim Brandt; Benjamin Schwaller; Tom Tucker (SNL)“AppSysFusion: ‘Always-on Monitoring’ on Sandia HPC Systems”
Wednesday, Nov. 15
Station 1
10 a.m.
Caetano Melone (Lawrence Livermore National Laboratory)“Dynamically Allocating Resources for Spack CI Builds”
11 a.m.
Caetano Melone (Lawrence Livermore National Laboratory)“Dynamically Allocating Resources for Spack CI Builds”
12 p.m.
Ann Gentile; Jim Brandt; Benjamin Schwaller; Tom Tucker (SNL)“AppSysFusion: ‘Always-on Monitoring’ on Sandia HPC Systems”
1 p.m.
Mariam Kiran; Muneer Alshowkan; Brian Williams; Joseph Chapman (ORNL) “Quantum Networks a Reality”
2 p.m.
Yatish Kumar (Lawrence Berkeley National Laboratory) “Open Source ESnet P4 FPGA smartNIC”
3 p.m.
Flavio Castro; Joaquin Chung; Se-young Yu (ANL) “SciStream: Architecture and Toolkit for Data Streaming between Federated Science Instruments”
4 p.m.
Ann Gentile; Jim Brandt; Benjamin Schwaller; Tom Tucker (SNL) “AppSysFusion: ‘Always-on Monitoring’ on Sandia HPC Systems”
5 p.m.
Ann Gentile; Jim Brandt; Benjamin Schwaller; Tom Tucker (SNL) “AppSysFusion: ‘Always-on Monitoring’ on Sandia HPC Systems”
Wednesday, Nov. 15
Station 2
10 a.m.
Marco Minutoli (Pacific Northwest National Laboratory)“Maximizing the Influence at the ExaScale with Ripples”
11 a.m.
Imran Latif (Brookhaven National Laboratory)“CPU performance and power Optimization”
12 p.m.
Mariam Kiran; Anastasiia Butko; Ren Cooper; Imtiaz Mahmud, Nirmalendu Patra; Matthew Verlie (ORNL, LBL)“5G on the Showfloor”
1 p.m.
Christine Simpson, Tom Uram, Rachana Ananthakrishnan, David Schissel, Hannah Parraga, Michael Prince (ANL)“Flexible cross-facility experimental data analysis at ALCF”
2 p.m.
Imran Latif (Brookhaven National Laboratory) “CPU performance and power Optimization”
3 p.m.
Sam Wellborn; Bjoern Enders; Peter Ercius; Chris Harris; Deborah Bard (LBL)“Live Streaming of Large Electron Microscope Data to NERSC”
4 p.m.
Charles Shiflett (Lawrence Berkeley National Laboratory)“Long-distance high-speed data transfer with EScp”
Thursday, Nov. 16
Station 1
10 a.m.
Christian Trott; Bruno Turcksin; Daniel Arndt; Nevin Liber; Rahulkumar Gayatri; Sivasankaran Rajamanickam; Luc Berger-Vergiat (ANL, LBL)“Kokkos ecosystem beyond performance portable code”
11 a.m.
Flavio Castro; Joaquin Chung; Se-young Yu (ANL) “SciStream: Architecture and Toolkit for Data Streaming between Federated Science Instruments”
1 p.m.
Mariam Kiran; Muneer Alshowkan; Brian Williams; Joseph Chapman (ORNL)“Quantum Networks a Reality”
2 p.m.
Mariam Kiran; Muneer Alshowkan; Brian Williams; Joseph Chapman (ORNL)“Quantum Networks a Reality”
Thursday, Nov. 16
Station 2
10 a.m.
Christian Mayr (Sandia National Laboratory)“The SpiNNaker2 Neuromorphic Computing Architecture”
11 a.m.
Christian Mayr (Sandia National Laboratory)“The SpiNNaker2 Neuromorphic Computing Architecture”
2 p.m.
Ramesh Balakrishnan (Argonne National Laboratory)“Large Eddy Simulation of Turbulent flows in a Classroom”
» 2022
Monday, Nov. 14
Station 1
7 p.m.
Peer-Timo Bremer (Lawrence Livermore National Laboratory) – “Autonomous ‘Laser’ Experiments”
8 p.m.
Peer-Timo Bremer (Lawrence Livermore National Laboratory) – “Autonomous ‘Laser’ Experiments”
Monday, Nov. 14
Station 2
7 p.m.
James Brandt (SNL) – “AppSysFusion: Providing Run Time Insight Using Application and System Data”
8 p.m.
James Brandt (SNL) – “AppSysFusion: Providing Run Time Insight Using Application and System Data”
Tuesday, Nov. 15
Station 1
10 a.m.
Shahzeb Siddiqui (LBNL)“NERSC Spack Infrastructure Project – Leverage Gitlab for automating Software Stack Deployment”
11 a.m.
Sunita Chandrasekaran (BNL)“Using Frontier for CAAR Particle-In-Cell (PIC) on GPU application”
12 p.m.
Mariam Kiran (LBNL)“Global Petascale to Exascale – Networks go beyond lab border with 5G”
1 p.m.
Mariam Kiran (LBNL)“Global Petascale to Exascale – Networks go beyond lab border with 5G”
2 p.m.
Lee Liming (Argonne National Laboratory)“Automating Beamline Science at Scale with Globus”
3 p.m.
Tom Scogland (Lawrence Livermore National Laboratory)“Flux: Next Generation Resource Management”
4 p.m.
Sunita Chandrasekaran (BNL)“Using Frontier for CAAR Particle-In-Cell (PIC) on GPU application”
5 p.m.
Sunita Chandrasekaran (BNL)“Using Frontier for CAAR Particle-In-Cell (PIC) on GPU application”
Tuesday, Nov. 15
Station 2
10 a.m.
James Brandt (SNL)“AppSysFusion: Providing Run Time Insight Using Application and System Data”
11 a.m.
James Brandt (SNL)“AppSysFusion: Providing Run Time Insight Using Application and System Data”
12 p.m.
Sunita Chandrasekaran (BNL) “Using Frontier for CAAR Particle-In-Cell (PIC) on GPU application”
1 p.m.
Peer-Timo Bremer (Lawrence Livermore National Laboratory)“Autonomous ‘Laser’ Experiments”
2 p.m.
Peer-Timo Bremer (Lawrence Livermore National Laboratory)“Autonomous ‘Laser’ Experiments”
3 p.m.
Peer-Timo Bremer (Lawrence Livermore National Laboratory)“Autonomous ‘Laser’ Experiments”
4 p.m.
James Brandt (SNL)“AppSysFusion: Providing Run Time Insight Using Application and System Data”
5 p.m.
James Brandt (SNL)“AppSysFusion: Providing Run Time Insight Using Application and System Data”
Wednesday, Nov. 16
Station 1
10 a.m.
Hubertus (Huub) Van Dam (BNL)“Chimbuko: Workflow Performance Analysis @Exascale”
11 a.m.
Lee Liming (Argonne National Laboratory)“Automating Beamline Science at Scale with Globus”
1 p.m.
Peer-Timo Bremer (Lawrence Livermore National Laboratory) “Autonomous ‘Laser’ Experiments”
2 p.m.
Peer-Timo Bremer (Lawrence Livermore National Laboratory)“Autonomous ‘Laser’ Experiments”
3 p.m.
Peer-Timo Bremer (Lawrence Livermore National Laboratory)“Autonomous ‘Laser’ Experiments”
4 p.m.
Peer-Timo Bremer (Lawrence Livermore National Laboratory)“Autonomous ‘Laser’ Experiments”
5 p.m.
Peer-Timo Bremer (Lawrence Livermore National Laboratory)“Autonomous ‘Laser’ Experiments”
Wednesday, Nov. 16
Station 2
10 a.m.
Rajkumar Kettimuthu“SciStream: Architecture and Toolkit for Data Streaming between Federated Science Instruments”
11 a.m.
Rajkumar Kettimuthu“SciStream: Architecture and Toolkit for Data Streaming between Federated Science Instruments”
1 p.m.
Ramesh Balakrishnan (ANL)“Direct Numerical Simulation of Separating/Reattaching Turbulent Flow Over a Boeing Speed Bump at Very High Reynolds Numbers”
2 p.m.
James Brandt (SNL)“AppSysFusion: Providing Run Time Insight Using Application and System Data”
3 p.m.
James Brandt (SNL)“AppSysFusion: Providing Run Time Insight Using Application and System Data”
4 p.m.
James Brandt (SNL)“AppSysFusion: Providing Run Time Insight Using Application and System Data”
5 p.m.
James Brandt (SNL)“AppSysFusion: Providing Run Time Insight Using Application and System Data”
Thursday, Nov. 17
Station 1
10 a.m.
Ezra Kissel and Charles Shiflett (LBNL)“Janus Container Management and the EScp Data Mover”
Thursday, Nov. 17
Station 2
10 a.m.
James Brandt (SNL)“AppSysFusion: Providing Run Time Insight Using Application and System Data”
11 a.m.
James Brandt (SNL)“AppSysFusion: Providing Run Time Insight Using Application and System Data”
1 p.m.
Ramesh Balakrishnan (ANL)“Large Eddy Simulation of Turbulent flows in a Classroom”
2 p.m.
Ramesh Balakrishnan (ANL)“Large Eddy Simulation of Turbulent flows in a Classroom”
» 2021
Tuesday, Nov. 16
10 a.m.
Anees Al Najjar (ORNL) – “Demonstrating the Functionalities of Virtual Federated Science Instrument Environment (VFSIE)”
11 a.m.
Shahzeb Siddiqui (Lawrence Berkeley National Laboratory) – “Building a Spack Pipeline in Gitlab”
12 p.m.
Joseph Insley (ANL and collaborators) – “Intel® oneAPI Rendering Toolkit: Interactive Rendering for Science at Scale”
1 p.m.
Mekena Metcalf, Mariam Kiran, and Anastasiia Butko (LBNL) – “Towards Autonomous Quantum Network Control”
2 p.m.
Laurie Stephey (LBNL), Jong Choi (ORNL), Michael Churchill (PPPL), Ralph Kube (PPPL), Jason Wang (ORNL) – “Streaming Data for Near Real-Time Analysis from the KSTAR Fusion Experiment to NERSC”
3 p.m.
Kevin Harms (Argonne National Laboratory and collaborators) – “DAOS + Optane For Heterogenous APPs”
Wednesday, Nov. 17
10 a.m.
Narasinga Rao Miniskar and Aaron Young (ORNL) – “An Efficient FPGA Design Environment for Scientific Machine Learning”
11 a.m.
Pieter Ghysels (LBNL) – “Preconditioning Large Scale High-Frequency Wave Equations with STRUMPACK and ButterflyPACK”
12 p.m.
Mariam Kiran, Nicholas Buraglio, and Scott Campbell (LBNL) – “Hecate: Towards Self-Driving Networks in the Real World”
2 p.m.
Bjoern Enders (LBNL) – “Supporting Data Workflows with the NERSC Superfacility API”
Thursday, Nov. 18
10 a.m.
Ezra Kissel (Lawrence Berkeley National Lab) – “Janus: High-Performance DTN-as-a-Service”
11 a.m.
Rajkumar Kettimuthu, Joaquin Chung, and Aniket Tekawade (ANL) – “AI-Steer: AI-Driven Online Steering of Light Source Experiments + SciStream: Architecture and Toolkit for Data Streaming between Federated Science Instruments”
12 p.m.
Prasanna Balaprakash (ANL) – “DeepHyper: Scalable Neural Architecture and Hyperparameter Search for Deep Neural Networks”
» 2019
Tuesday, Nov. 19
Station 1
10 a.m.
“Distributed Computing and Data Ecosystem: A Pilot Project by the Future Laboratory Computing Working Group” – Arjun Shankar (Oak Ridge National Laboratory, multi-lab)
We demonstrate progress and share findings from a federated Distributed Computing and Data Ecosystem (DCDE) pilot that incorporates tools, capabilities, services, and governance policies aiming to enable researchers across DOE science laboratories to seamlessly use cross-lab resources (i.e., scientific instruments, local clusters, large facilities, storage, enabling systems software, and networks). This pilot aims to present small research teams a range of distributed resources through a coherent and simple set of interfaces, to allow them to establish and manage experimental and computational pipelines and the related data lifecycle. Envisioned as a cross-lab environment, a DCDE would eventually be overseen by a governing body that includes the relevant stakeholders to create effective use and participation guidelines.
12 p.m.
“The Kokkos C++ Performance Portability Ecosystem” – Christian Trott (Sandia National Laboratory)
The Kokkos C++ Performance Portability Ecosystem is a production-level solution for writing modern C++ applications in a hardware-agnostic way. It is part of the U.S. Department of Energy’s Exascale Computing Project—the leading effort in the U.S. to prepare the HPC community for the next generation of supercomputing platforms. The Ecosystem consists of multiple libraries addressing the primary concerns for developing and maintaining applications in a portable way. The three main components are the Kokkos Core Programming Model; the Kokkos Kernels Math Libraries; and the Kokkos Profiling, Debugging, and Tuning Tools. Led by Sandia National Laboratories, the Kokkos team includes developers at five DOE laboratories.
2 p.m.
“Accelerating Interactive Experimental Science and HPC with Jupyter” – Matthew Henderson (Lawrence Berkeley National Laboratory)
Large-scale experimental science workflows require support for a unified, interactive, real-time platform that can manage a distributed set of resources connected to HPC systems. Here we demonstrate how the Jupyter platform plays a key role in this space—it provides the ease of use and interactivity of a web science gateway while allowing scientists to build custom, ad-hoc workflows in a composable way. Using real-world use cases from the National Center for Electron Microscopy and the Advanced Light Source, we show how Jupyter facilitates interactive analysis of data at scale on NERSC HPC resources.
Tuesday, Nov. 19
Station 2
11 a.m.
“Network Traffic Prediction for Flow and Bandwidth” – Mariam Kiran (Lawrence Berkeley National Laboratory)
Predicting traffic on network links can help engineers estimate the percentage bandwidth that will be utilized. Efficiently managing this bandwidth can allow engineers to have reliable file transfers and run networks hotter to send more data on current resources. Toward this end, ESnet researchers are developing advanced deep learning LSTM-based models as a library to predict network traffic for multiple future hours on network links. In this demonstration, we will show traffic peak predictions multiple hours into the future on complicated network topologies such as ESnet. We will also demonstrate how this can be used to configure network transfers to optimize network performance and utilize underused links.
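The sketch below gives a hedged picture of the LSTM forecasting idea described above: a small Keras model trained on a synthetic diurnal utilization trace predicts the next hour’s link utilization. The synthetic data, window length, and model size are illustrative choices, not ESnet’s production models.

```python
# Hedged sketch of LSTM-based link-utilization forecasting.
# The synthetic traffic trace, window size, and model width are illustrative.
import numpy as np
import tensorflow as tf

# Synthetic hourly utilization (%) with a daily cycle plus noise.
hours = np.arange(24 * 60)
traffic = 50 + 30 * np.sin(2 * np.pi * hours / 24) + np.random.normal(0, 3, hours.size)

WINDOW = 24  # use the past 24 hours to predict the next hour
X = np.stack([traffic[i:i + WINDOW] for i in range(traffic.size - WINDOW)])[..., None]
y = traffic[WINDOW:]

model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW, 1)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64, verbose=0)

# Forecast the next hour from the most recent 24-hour window.
next_hour = model.predict(traffic[-WINDOW:][None, :, None], verbose=0)
print(f"predicted utilization next hour: {next_hour[0, 0]:.1f}%")
```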
1 p.m.
“Cinema Database Creation and Exploration” – John Patchett (Los Alamos National Laboratory)
Since cinema databases were first introduced in 2016, the ability for simulations and tools to produce them and viewers to explore them have improved. We will demonstrate ParaView Cinema Database creation and a number of viewers for exploring different types of cinema databases.
3 p.m.
“A Mathematical Method to Enable Autonomous Experimental Decision Making Without Human Interaction” – Marcus Noack (Lawrence Berkeley National Laboratory)
Modern scientific instruments are acquiring data at ever-increasing rates, leading to an exponential increase in the size of data sets. Taking full advantage of these acquisition rates will require corresponding advancements in the speed and efficiency of data analytics and experimental control. A significant step forward would come from automatic decision-making methods that enable scientific instruments to autonomously explore scientific problems, that is, to intelligently explore parameter spaces without human intervention, selecting high-value measurements to perform based on the continually growing experimental data set. Here, we develop such an autonomous decision-making algorithm based on Gaussian process regression that is physics-agnostic, generalizable, and operates in an abstract multi-dimensional parameter space. Our approach relies on constructing a surrogate model that fits and interpolates the available experimental data and is continuously refined as more data is gathered. The distribution and correlation of the data are used to generate a corresponding uncertainty across the surrogate model. By suggesting follow-up measurements in regions of greatest uncertainty, the algorithm maximally increases knowledge with each added measurement. This procedure is applied repeatedly, with the algorithm iteratively reducing model error and thus efficiently sampling the parameter space with each new measurement that it requests. The method has already been used to steer several experiments at various beamlines at NSLS-II and the ALS. The results have been astounding; experiments that were only possible through constant monitoring by an expert were run entirely autonomously, discovering new science along the way.
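The uncertainty-driven loop described above can be sketched in a few lines with an off-the-shelf Gaussian process. The toy one-dimensional “instrument,” kernel settings, and candidate grid below are illustrative stand-ins for a real beamline and parameter space, not the authors’ implementation.

```python
# Hedged sketch of uncertainty-driven autonomous measurement selection with a
# Gaussian process surrogate model.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def instrument(x):
    """Stand-in for a beamline measurement at parameter setting x."""
    return np.sin(3 * x) + 0.1 * np.random.randn(*np.shape(x))

candidates = np.linspace(0, 2, 200)[:, None]   # parameter space to explore
X = np.array([[0.1], [1.9]])                   # two seed measurements
y = instrument(X).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-2)
for step in range(10):
    gp.fit(X, y)                               # refit surrogate to all data so far
    _, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(sigma)]      # most uncertain point is measured next
    X = np.vstack([X, x_next])
    y = np.append(y, instrument(x_next[None, :]).ravel())
    print(f"step {step}: measured x = {x_next[0]:.2f}")
```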
Wednesday, Nov. 20
Station 1
10 a.m.
“Real-Time Analysis of Streaming Synchrotron Data” – Tekin Bicer (Argonne National Laboratory)
Advances in detector technologies enable increasingly complex experiments and more rapid data acquisition for experiments carried out at synchrotron light sources. The data generation rates, coupled with long experimentation times, necessitate real-time analysis and feedback for timely insights about experiments. However, the computational demands of timely analysis of high-volume and high-velocity experimental data typically exceed the locally available resources and require utilization of large-scale clusters or supercomputers. In this demo, we will simulate experimental data generation at Advanced Photon Source beamlines and stream data to the Argonne Leadership Computing Facility for real-time image reconstruction. The reconstructed image data will then be denoised and enhanced using machine learning techniques and streamed back for 2D or 3D volume visualization.
12 p.m.
“The Kokkos C++ Performance Portability Ecosystem” – Christian Trott (Sandia National Laboratory)
The Kokkos C++ Performance Portability Ecosystem is a production-level solution for writing modern C++ applications in a hardware-agnostic way. It is part of the U.S. Department of Energy’s Exascale Computing Project—the leading effort in the U.S. to prepare the HPC community for the next generation of supercomputing platforms. The Ecosystem consists of multiple libraries addressing the primary concerns for developing and maintaining applications in a portable way. The three main components are the Kokkos Core Programming Model; the Kokkos Kernels Math Libraries; and the Kokkos Profiling, Debugging, and Tuning Tools. Led by Sandia National Laboratories, the Kokkos team includes developers at five DOE laboratories.
2 p.m.
“Performance Evaluation using TAU and TAU Commander” – Sameer Shende (multi-lab)
TAU Performance System® is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Python, and Java. It is capable of gathering performance information through instrumentation of functions, methods, basic blocks, and statements. TAU supports HPC runtimes including MPI, pthread, OpenMP, CUDA, OpenCL, OpenACC, HIP, and Kokkos. All C++ language features are supported including templates and namespaces. The API also provides selection of profiling groups for organizing and controlling instrumentation. The instrumentation can be inserted in the source code using an automatic instrumentor tool based on the Program Database Toolkit (PDT), dynamically using DyninstAPI, at runtime using library preloading using tau_exec, and interpreter level instrumentation using Python, or even manually using the instrumentation API. TAU internally uses OTF2 to generate traces that may be visualized using the Vampir toolkit. TAU’s profile visualization tool, paraprof, provides graphical displays of all the performance analysis results in aggregate and single node/context/thread forms. The user can quickly identify sources of performance bottlenecks in the application using the graphical interface. In addition, TAU can generate event traces that can be displayed with the Vampir, Paraver or JumpShot trace visualization tools. TAU provides integrated instrumentation, measurement, and analysis capabilities in a cross-platform tool suite, plus additional tools for performance data management, data mining, and interoperation. The TAU project has developed strong interactions with the ASC/NNSA, ECP, and SciDAC. TAU has been ported to the leadership-class facilities at ANL, ORNL, LLNL, Sandia, LANL, and NERSC, including GPU Linux clusters, IBM, and Cray systems. TAU Commander simplifies the workflow of TAU and provides support for experiment management, instrumentation, measurement, and analysis tools.
4 p.m.
“Accelerating Interactive Experimental Science and HPC with Jupyter” – Matthew Henderson (Lawrence Berkeley National Laboratory)
Large-scale experimental science workflows require support for a unified, interactive, real-time platform that can manage a distributed set of resources connected to HPC systems. Here we demonstrate how the Jupyter platform plays a key role in this space—it provides the ease of use and interactivity of a web science gateway while allowing scientists to build custom, ad-hoc workflows in a composable way. Using real-world use cases from the National Center for Electron Microscopy and the Advanced Light Source, we show how Jupyter facilitates interactive analysis of data at scale on NERSC HPC resources.
Wednesday, Nov. 20
Station 2
11 a.m.
“Advances in HPC Monitoring, Run-Time Performance Analysis, Visualization, and Feedback” – Jim Brandt (Sandia National Laboratory)
During HPC system acquisition, significant consideration is given to the desired performance, which, in turn, drives the selection of processing components, memory, high-speed interconnects, file systems, and more. The achieved performance, however, is highly dependent on operational conditions and on which applications are being run concurrently along with associated workflows. Therefore, the performance bottleneck discovery and assessment process is critical to performance optimization. HPC system monitoring has been a long-standing need for administrators to assess the health of their systems, detect abnormal conditions, and take informed actions when restoring system health. Moreover, users strive to understand how well their jobs run and which architectural limits restrict the performance of their jobs. In this demonstration, we will present new features of a large-scale HPC monitoring framework called the Lightweight Distributed Metric Service (LDMS). We will demonstrate a new anomaly detection capability that uses machine learning in conjunction with monitoring data to analyze system and application health in advanced HPC systems. We will also demonstrate a Top-down Microarchitecture Analysis (TMA) implementation that uses hardware performance counter data and computes a hierarchical classification of how an execution has utilized various parts of the hardware architecture. Our new Distributed Scalable Object Store (DSOS) will be presented and demonstrated, along with performance numbers that enable comparison with other current database technologies.
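As a worked example of the TMA-style classification mentioned above, the sketch below computes the four level-1 top-down categories from hardware counter totals. The formulas follow Intel’s published level-1 TMA breakdown for a 4-wide core; the counter values are made up, and this is not the LDMS implementation itself.

```python
# Hedged sketch of a level-1 Top-down Microarchitecture Analysis (TMA) breakdown.
# Counter names follow Intel's published level-1 formulas; values are invented.
counters = {  # hypothetical per-job counter totals
    "CPU_CLK_UNHALTED.THREAD": 1_000_000,
    "IDQ_UOPS_NOT_DELIVERED.CORE": 600_000,
    "UOPS_ISSUED.ANY": 2_600_000,
    "UOPS_RETIRED.RETIRE_SLOTS": 2_200_000,
    "INT_MISC.RECOVERY_CYCLES": 20_000,
}

slots = 4 * counters["CPU_CLK_UNHALTED.THREAD"]          # issue slots on a 4-wide core
frontend_bound = counters["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots
bad_speculation = (counters["UOPS_ISSUED.ANY"] - counters["UOPS_RETIRED.RETIRE_SLOTS"]
                   + 4 * counters["INT_MISC.RECOVERY_CYCLES"]) / slots
retiring = counters["UOPS_RETIRED.RETIRE_SLOTS"] / slots
backend_bound = 1.0 - frontend_bound - bad_speculation - retiring

for name, frac in [("frontend bound", frontend_bound), ("bad speculation", bad_speculation),
                   ("retiring", retiring), ("backend bound", backend_bound)]:
    print(f"{name:16s} {frac:6.1%}")
```

The four fractions sum to one, which is what lets a hierarchical classification attribute every issue slot of an execution to exactly one category.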
1 p.m.
“Network Traffic Prediction for Flow and Bandwidth” – Mariam Kiran (Lawrence Berkeley National Laboratory)
Predicting traffic on network links can help engineers estimate the percentage bandwidth that will be utilized. Efficiently managing this bandwidth can allow engineers to have reliable file transfers and run networks hotter to send more data on current resources. Toward this end, ESnet researchers are developing advanced deep learning LSTM-based models as a library to predict network traffic for multiple future hours on network links. In this demonstration, we will show traffic peak predictions multiple hours into the future on complicated network topologies such as ESnet. We will also demonstrate how this can be used to configure network transfers to optimize network performance and utilize underused links.
3 p.m.
“Tools and Techniques at NERSC for Cross-Facility Workflows” – Cory Snavely (Lawrence Berkeley National Laboratory)
Workflows that span instrument and computational facilities are driving new requirements in automation, data management and transfer, and development of science gateways. NERSC has initiated engagements with a number of projects that drive these emerging needs and is developing new capabilities to meet them. Learn more about NERSC’s plans and see demonstrations of a new API for interacting with NERSC systems and demonstrations of Spin, a Docker-based science gateway infrastructure.
5 p.m.
“Performance Evaluation using TAU and TAU Commander” – Nick Chaimov (multi-lab)
TAU Performance System® is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Python, and Java. It is capable of gathering performance information through instrumentation of functions, methods, basic blocks, and statements. TAU supports HPC runtimes including MPI, pthread, OpenMP, CUDA, OpenCL, OpenACC, HIP, and Kokkos. All C++ language features are supported including templates and namespaces. The API also provides selection of profiling groups for organizing and controlling instrumentation. The instrumentation can be inserted in the source code using an automatic instrumentor tool based on the Program Database Toolkit (PDT), dynamically using DyninstAPI, at runtime using library preloading using tau_exec, and interpreter level instrumentation using Python, or even manually using the instrumentation API. TAU internally uses OTF2 to generate traces that may be visualized using the Vampir toolkit. TAU’s profile visualization tool, paraprof, provides graphical displays of all the performance analysis results in aggregate and single node/context/thread forms. The user can quickly identify sources of performance bottlenecks in the application using the graphical interface. In addition, TAU can generate event traces that can be displayed with the Vampir, Paraver or JumpShot trace visualization tools. TAU provides integrated instrumentation, measurement, and analysis capabilities in a cross-platform tool suite, plus additional tools for performance data management, data mining, and interoperation. The TAU project has developed strong interactions with the ASC/NNSA, ECP, and SciDAC. TAU has been ported to the leadership-class facilities at ANL, ORNL, LLNL, Sandia, LANL, and NERSC, including GPU Linux clusters, IBM, and Cray systems. TAU Commander simplifies the workflow of TAU and provides support for experiment management, instrumentation, measurement, and analysis tools.
» 2018
Tuesday, Nov. 13
Station 1
10 a.m.
“Real-time Performance Analysis of Applications and Workflow” – Gyorgy Matyasfalvi, Brookhaven National Laboratory
As part of the ECP CODAR project, Brookhaven National Laboratory, in collaboration with the University of Oregon TAU team, has developed unique capabilities to analyze, reduce, and visualize single-application and complete workflow performance data in situ. The resulting tool enables researchers to examine and explore their workflow performance as it is being executed.
11 a.m.
“BigData Express: Toward Predictable, Schedulable, and High-performance Data Transfer” – Wenji Wu, Fermilab
In DOE research communities, the emergence of distributed, extreme-scale science applications is generating significant challenges regarding data transfer. The data transfer challenges of the extreme-scale era are typically characterized by two relevant dimensions: high-performance challenges and time-constraint challenges. To meet these challenges, DOE’s ASCR office has funded Fermilab and Oak Ridge National Laboratory to collaboratively work on the BigData Express project (http://bigdataexpress.fnal.gov). BigData Express seeks to provide a schedulable, predictable, and high-performance data transfer service for DOE’s large-scale science computing facilities and their collaborators. Software-defined technologies are key enablers for BigData Express. In particular, BigData Express makes use of software-defined networking (SDN) and software-defined storage (SDS) to develop a data-transfer-centric architecture that optimally orchestrates the various resources in an end-to-end data transfer loop. With end-to-end integration and coordination, network congestion and storage I/O contention are effectively reduced or eliminated. As a result, data transfer performance is significantly improved. BigData Express has recently gained growing attention in the community. The BigData Express software is being deployed at multiple research institutions, including UMD, StarLight, FNAL, KISTI (South Korea), UVA, and Ciena. Meanwhile, the BigData Express research team is collaborating with StarLight to deploy BigData Express on various research platforms, including the Pacific Research Platform, National Research Platform, and Global Research Platform. Ultimately, we are working toward building a multi-domain, multi-tenant software-defined infrastructure (SDI) for high-performance data transfer. In this demo, we use BDE software to demonstrate bulk data movement over wide area networks. Our goal is to demonstrate that BDE can successfully address the high-performance and time-constraint challenges of data transfer to support extreme-scale science applications.
12 p.m.
“ParaView Running on a Cluster” – W. Alan Scott, Sandia National Laboratories
ParaView will be running with large data on a remote cluster at Sandia National Laboratories.
1 p.m.
“Performance Evaluation using TAU and TAU Commander” – Sameer Shende, Sandia National Laboratories/Univ. of Oregon
TAU Performance System is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Python, and Java. It is capable of gathering performance information through instrumentation of functions, methods, basic blocks, and statements. All C++ language features are supported, including templates and namespaces. The API also provides selection of profiling groups for organizing and controlling instrumentation. The instrumentation can be inserted in the source code using an automatic instrumentor tool based on the Program Database Toolkit (PDT), dynamically using DyninstAPI, at runtime using library preloading with tau_exec, at the interpreter level for Python, or even manually using the instrumentation API. TAU internally uses OTF2 to generate traces that may be visualized using the Vampir toolkit.
2 p.m.
“High-Performance Multi-Mode X-ray Ptychography Reconstruction on Distributed GPUs” – Meifeng Lin, Brookhaven National Laboratory
X-ray ptychography is an important tool for reconstructing high-resolution specimen images from scanning diffraction measurements. As an inverse problem, ptychographic reconstruction has no unique solution; one of the best working approaches is the so-called difference map algorithm, in which the illumination and object profiles are updated iteratively with the amplitude of their product constrained by the measured intensity at every iteration. Although this approach converges very fast (typically in fewer than 100 iterations), it is a computationally intensive task and often requires several hours to retrieve the result on a single CPU, which is a disadvantage especially for beamline users who have limited access time. We accelerate this ptychography calculation by utilizing multiple GPUs and MPI communication. We take a scatter-and-gather approach, splitting the measurement data and sending each portion to a GPU node. Since data movement between the host and the device is expensive, the data is kept and the calculation is performed entirely on the GPU, and only the updated probe and object are broadcast at the end of each iteration. We show that our program exhibits excellent weak scaling and enables users to obtain results in sub-minutes instead of hours, which is crucial for visualization, real-time feedback, and efficient adjustment of experiments. This program is already in production at the HXN beamline at NSLS-II, and a graphical user interface is also provided.
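The scatter-and-gather structure described above can be sketched with mpi4py as below. The reconstruct() placeholder, array sizes, and final reduction are illustrative assumptions, not the production multi-GPU difference-map code.

```python
# Hedged sketch of the scatter-and-gather pattern: rank 0 splits the
# measurements, each rank processes its share, and updates are combined.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Rank 0 splits the diffraction measurements into one chunk per rank.
data = np.arange(size * 4, dtype=np.float64).reshape(size, 4) if rank == 0 else None
chunk = comm.scatter(data, root=0)          # each rank receives its share of scan points

def reconstruct(local_measurements):
    """Placeholder for the per-GPU probe/object update."""
    return local_measurements.sum()

local_update = reconstruct(chunk)
# Combine per-rank updates; the real code instead broadcasts the updated
# probe and object at the end of each iteration rather than moving raw data.
combined = comm.allreduce(local_update, op=MPI.SUM)
if rank == 0:
    print("combined update:", combined)
```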
3 p.m.
“Innovative Architectures for Experimental and Observational Science: Bringing Compute to the Data” – David Donofrio, Lawrence Berkeley National Laboratory
As the volume and velocity of data generated by experiments continue to increase, we find the need to move data analysis and reduction operations closer to the source of the data to reduce the burden on existing HPC facilities that threaten to be overrun by the surge of experimental and observational data. Furthermore, remote facilities producing data, including astronomy observatories and particle accelerators such as SLAC LCLS-II, do not necessarily have dedicated HPC facilities on-site. These remote sites are often power or space constrained, making the construction of a traditional data center or HPC facility unreasonable. Further complicating these scenarios, each experiment often needs a blend of specialized and programmable hardware that is closely tied to the needs of the individual experiment. We propose a hardware generation methodology based on open-source components to rapidly design and deploy these data filtering and analysis computing devices. Here we demonstrate a potential near-sensor, real-time data processing solution developed using an innovative open-source hardware generation technique, allowing potentially more effective use of experimental devices such as electron microscopes.
4 p.m.
“Co-design of Parallelware Tools by Appentra and ORNL: Addressing the Challenges of Developing Future Exascale HPC Applications” – Oscar Hernandez, Oak Ridge National Laboratory
The ongoing four-year collaboration between ORNL and Appentra is enabling the development of new tools using Parallelware technology to address the needs of leading HPC centers. First, we will present Parallelware Trainer (https://www.appentra.com/products/parallelware-trainer/), a scalable interactive teaching environment for HPC education and training targeting OpenMP and OpenACC on multicores and GPUs. Second, we will present Parallelware Analyzer, a new command-line reporting tool aimed at improving the productivity of HPC application developers. The tool has been selected as an SC18 Emerging Technology, and we will demonstrate how it can help facilitate the analysis of data layout and data scoping across procedure boundaries. The presentation will emphasize the new and upcoming technical features of the underlying Parallelware technology.
Tuesday, Nov. 13
Station 2
11 a.m.
“Exciting New Developments in Large-Scale HPC Monitoring & Analysis” – Jim Brandt, Sandia National Laboratories
This demonstration will provide a brief overview of a suite of HPC monitoring and analysis tools developed in collaboration between Sandia National Laboratories (SNL), Los Alamos National Laboratories (LANL), and Open Grid Computing (OGC). The demonstration will provide a highlight overview of: 1) our Lightweight Distributed Metric Service (LDMS) for data collection, transport, and storage (including use case examples of system and job based analyses with visualization) and 2) Baler tool for mapping log messages into patterns and performing a variety of pattern based analyses. For additional information about our suite of HPC monitoring and analysis tools please email [email protected] or visit http://www.opengridcomputing.com/sc18
12 p.m.
“Accelerator Data Management in OpenMP 5.0” – Lingda Li, Brookhaven National Laboratory
Accelerators such as GPUs play an essential role in today’s HPC systems. However, programming accelerators is often challenging; one of the most difficult parts is managing accelerator memory. OpenMP has supported accelerator offloading for some time and is gaining more and more usage, but it currently requires users to explicitly manage memory mapping between host and accelerator, which demands significant effort from programmers. OpenMP 5.0 introduces user-defined mappers and support for unified memory to facilitate accelerator data management. In this demo, we will show how to use these features to improve applications.
1 p.m.
“In situ Visualization with SENSEI” – Burlen Loring, Lawrence Berkeley National Laboratory
In situ visualization and analysis is an important component of the path to exascale computing. Coupling simulation codes directly to analysis codes reduces their I/O while increasing the temporal fidelity of the analysis. SENSEI, a lightweight in situ framework, gives simulations access to a diverse set of analysis back ends through a simple API and data model. SENSEI currently supports ParaView Catalyst, VisIt Libsim, ADIOS, Python, and VTK-m based back ends and is easy to extend. In this presentation we introduce SENSEI and demonstrate its use with IAMR, an AMReX-based compressible Navier-Stokes simulation code.
2 p.m.
“On-line Memory Coupling of the XGC1 Code to ParaView and VisIt Using the ADIOS Software Framework” – Scott Klasky, Oak Ridge National Laboratory
The trends in high performance computing, where far more data can be computed than can ever be stored, have made online processing techniques an important area of research and development. In this demonstration, we show online visualization of data from XGC1, a particle-in-cell code used to study plasmas in fusion tokamak devices. We use the ADIOS software framework for online data management and the production tools ParaView and VisIt for visualization of the simulation data.
3 p.m.
“Exciting New Developments in Large-Scale HPC Monitoring & Analysis” – Jim Brandt, Sandia National Laboratories
This demonstration will provide a brief overview of a suite of HPC monitoring and analysis tools developed in collaboration between Sandia National Laboratories (SNL), Los Alamos National Laboratories (LANL), and Open Grid Computing (OGC). The demonstration will provide a highlight overview of: 1) our Lightweight Distributed Metric Service (LDMS) for data collection, transport, and storage (including use case examples of system and job based analyses with visualization) and 2) Baler tool for mapping log messages into patterns and performing a variety of pattern based analyses. For additional information about our suite of HPC monitoring and analysis tools please email [email protected] or visit http://www.opengridcomputing.com/sc18
Wednesday, Nov. 14
Station 1
10 a.m.
“In situ Visualization of a Multi-physics Simulation on LLNL’s Sierra Supercomputer” – Cyrus Harrison, Lawrence Livermore National Laboratory
11 a.m.
“In Situ Visualization with SENSEI” – Burlen Loring, Lawrence Berkeley National Laboratory
In situ visualization and analysis is an important component of the path to exascale computing. Coupling simulation codes directly to analysis codes reduces their I/O while increasing the temporal fidelity of the analysis. SENSEI, a lightweight in situ framework, gives simulations access to a diverse set of analysis back ends through a simple API and data model. SENSEI currently supports ParaView Catalyst, VisIt Libsim, ADIOS, Python, and VTK-m based back ends and is easy to extend. In this presentation we introduce SENSEI and demonstrate its use with IAMR, an AMReX-based compressible Navier-Stokes simulation code.
12 p.m.
“Real-time Performance Analysis of Applications and Workflow” – Gyorgy Matyasfalvi, Brookhaven National Laboratory
As part of the ECP CODAR project, Brookhaven National Laboratory, in collaboration with the University of Oregon TAU team, has developed unique capabilities to analyze, reduce, and visualize single-application and complete workflow performance data in situ. The resulting tool enables researchers to examine and explore their workflow performance as it is being executed.
1 p.m.
“Real-time Performance Analysis of Applications and Workflow” – Gyorgy Matyasfalvi, Brookhaven National Laboratory
As part of the ECP CODAR project, Brookhaven National Laboratory, in collaboration with the University of Oregon TAU team, has developed unique capabilities to analyze, reduce, and visualize single-application and complete workflow performance data in situ. The resulting tool enables researchers to examine and explore their workflow performance as it is being executed.
2 p.m.
“Charliecloud: LANL’s Lightweight Container Runtime for HPC” – Reid Priedhorsky, Los Alamos National Laboratory
Charliecloud provides user-defined software stacks (UDSS) for HPC centers. This “bring your own software stack” functionality addresses needs such as: software dependencies that are numerous, complex, unusual, differently configured, or simply newer/older than what the center provides; build-time requirements unavailable within the center, such as relatively unfettered internet access; validated software stacks and configuration to meet the standards of a particular field of inquiry; portability of environments between resources, including workstations and other test and development system not managed by the center; consistent environments, even archivally so, that can be easily, reliably, and verifiably reproduced in the future; and/or usability and comprehensibility. Charliecloud uses Linux user namespaces to run containers with no privileged operations or daemons and minimal configuration changes on center resources. This simple approach avoids most security risks while maintaining access to the performance and functionality already on offer. Container images can be built using Docker or anything else that can generate a standard Linux filesystem tree. We will present a brief introduction to Charliecloud, then demonstrate running portable Charliecloud containers of various flavors at native speed, including hello world, traditional MPI, data-intensive (e.g., Apache Spark), and GPU-accelerated (e.g., TensorFlow).
3 p.m.
“Using Federated Identity to Improve the Superfacility User Experience” – Mark Day, Lawrence Berkeley National Laboratory
The superfacility vision combines multiple complementary user facilities into a virtual facility offering fundamentally greater capability than the standalone facilities provide on their own. For example, integrating beamlines at the Advanced Light Source (ALS), with HPC resources at NERSC via ESnet provides scientific capabilities unavailable at any single facility. This use of disparate facilities is not always convenient, and the logistics of setting up multiple user accounts, and managing multiple credentials adds unnecessary friction to the scientific process. We will demonstrate a simple portal, based on off-the-shelf technologies, that combines federated authentication with metadata collected at the time of the experiment and preserved at the HPC facility to allow a scientist to use their home institutional identity and login processes to access superfacility experimental data and results.
4 p.m.
“Innovative Architectures for Experimental and Observational Science: Bringing Compute to the Data” – David Donofrio, Lawrence Berkeley National Laboratory
As the volume and velocity of data generated by experiments continue to increase, we find the need to move data analysis and reduction operations closer to the source of the data to reduce the burden on existing HPC facilities that threaten to be overrun by the surge of experimental and observational data. Furthermore, remote facilities producing data, including astronomy observatories and particle accelerators such as SLAC LCLS-II, do not necessarily have dedicated HPC facilities on-site. These remote sites are often power or space constrained, making the construction of a traditional data center or HPC facility unreasonable. Further complicating these scenarios, each experiment often needs a blend of specialized and programmable hardware that is closely tied to the needs of the individual experiment. We propose a hardware generation methodology based on open-source components to rapidly design and deploy these data filtering and analysis computing devices. Here we demonstrate a potential near-sensor, real-time data processing solution developed using an innovative open-source hardware generation technique, allowing potentially more effective use of experimental devices such as electron microscopes.
Wednesday, Nov. 14
Station 2
10 a.m.
“Exciting New Developments in Large-scale HPC Monitoring & Analysis” – Jim Brandt, Sandia National Laboratories
This demonstration will provide a brief overview of a suite of HPC monitoring and analysis tools developed in collaboration between Sandia National Laboratories (SNL), Los Alamos National Laboratories (LANL), and Open Grid Computing (OGC). The demonstration will provide a highlight overview of: 1) our Lightweight Distributed Metric Service (LDMS) for data collection, transport, and storage (including use case examples of system and job based analyses with visualization) and 2) Baler tool for mapping log messages into patterns and performing a variety of pattern based analyses. For additional information about our suite of HPC monitoring and analysis tools please email [email protected] or visit http://www.opengridcomputing.com/sc18
11 a.m.
“Exciting New Developments in Large-scale HPC Monitoring & Analysis” – Jim Brandt, Sandia National Laboratories
This demonstration will provide a brief overview of a suite of HPC monitoring and analysis tools developed in collaboration between Sandia National Laboratories (SNL), Los Alamos National Laboratories (LANL), and Open Grid Computing (OGC). The demonstration will provide a highlight overview of: 1) our Lightweight Distributed Metric Service (LDMS) for data collection, transport, and storage (including use case examples of system and job based analyses with visualization) and 2) Baler tool for mapping log messages into patterns and performing a variety of pattern based analyses. For additional information about our suite of HPC monitoring and analysis tools please email [email protected] or visit http://www.opengridcomputing.com/sc18
1 p.m.
“ECP SDK Software in HPC Container Environments” – Sameer Shende, Sandia National Laboratories/Univ. of Oregon
The ECP SDK project is providing software developed under the ECP project using Spack [http://www.spack.io] as the primary means of software distribution. Using Spack, we have also created container images of packaged ECP ST products in the Docker, Singularity, Shifter, and Charliecloud environments that may be deployed on HPC systems. This demo will show how to use these container images for software development and describe packaging software in Spack. These images will be distributed on USB sticks.
2 p.m.
“Spin: A Docker-based System at NERSC for Deploying Science Gateways Integrated with HPC Resources” – Cory Snavely, Lawrence Berkeley National Laboratory
This demonstration presents Spin, a Docker-based platform at NERSC that enables researchers to design, build, and manage their own science gateways and other services to complement their computational jobs, present data or visualizations produced by computational processes, conduct complex workflows, and more. After explaining the rationale behind building Spin and describing its basic architecture, staff will show how services can be created in just a few minutes using simple tools. A discussion of services implemented with Spin and ideas for the future will follow.
4 p.m.
“ParaView Running on a Cluster” – W. Alan Scott, Sandia National Laboratories
ParaView will be running with large data on a remote cluster at Sandia National Laboratories.
Thursday, Nov. 15
Station 1
10 a.m.
“SOLLVE: Scaling OpenMP with LLVM for Exascale Performance and Portability” – Sunita Chandrasekaran, Argonne National Laboratory/Oak Ridge National Laboratory
This demo will present an overview and some key software components of the SOLLVE ECP project. SOLLVE aims at enhancing OpenMP to cover the major requirements of ECP application codes. In addition, the project has set the goal of delivering a high-quality, robust implementation of OpenMP and project extensions in LLVM, an open-source compiler infrastructure with an active developer community that impacts the DOE pre-exascale systems (CORAL). The project further develops the LLVM BOLT runtime system to exploit lightweight threading for scalability and to facilitate interoperability with MPI. SOLLVE is also creating a validation suite to assess our progress and that of vendors, ensuring that quality implementations of OpenMP are delivered to exascale systems. The project also encourages the accelerated development of similarly high-quality, complete vendor implementations and facilitates extensive interactions between application developers and OpenMP developers in industry.