Publications

2024

SC 2024 (a)
LexiQL: Quantum Natural Language Processing on NISQ-era Machines.
Daniel Silver, Aditya Ranjan, Rakesh Achutha, Tirthak Patel, Devesh Tiwari.

SC 2024 (b)
ECO-LIFE: High-Performance and Carbon-Aware Serverless Workloads Scheduling via Multi-generation Hardware.
Yankai Jiang, Rohan Basu Roy, Baolin Li, Devesh Tiwari.

SC 2024 (c)
Incentive-Based Power Efficiency Mechanisms on the Fugaku Supercomputer.
Ana Luisa Veroneze Solorzano, Kento Sato, Keiji Yamamoto, Jim Brandt, Benjamin Schwaller, Sara Petra Walton, Jennifer Green, Fumiyoshi Shoji, Devesh Tiwari.

SC 2024 (d)
Stellaris: Staleness-aware Distributed Reinforcement Learning with Serverless Computing.
Hanfei Yu, Hao Wang, Jian Li, Seung-Jong Park, Devesh Tiwari.

IPDPS 2024 [Paper] [Artifact] [Presentation]
Interpretable Analysis of Production GPU Clusters Monitoring Data via Association Rule Mining.
Baolin Li, Siddharth Samsi, Vijay Gadepally, Devesh Tiwari.

SIGMETRICS 2024 [Paper] [Artifact]
StarShip: Mitigating I/O Bottlenecks in Serverless Computing for Scientific Workflows.
Rohan Basu Roy, Devesh Tiwari.

ASPLOS 2024 (a) [Paper] [Artifact]
CodeCrunch: Improving Serverless Performance via Function Compression and Cost-Aware Warmup Location Optimization.
Rohan Basu Roy, Tirthak Patel, Rohan Garg, Devesh Tiwari.

ASPLOS 2024 (b) [Paper] [Artifact]
RainbowCake: Mitigating Cold-starts in Serverless with Layer-wise Container Caching and Sharing.
Hanfei Yu, Rohan Basu Roy, Christian Fontenot, Devesh Tiwari, Jian Li, Hong Zhang, Hao Wang, Seung-Jong Park.

HotCarbon Workshop
Carbon in Motion: Characterizing Open-Sora on the Sustainability of Generative AI for Video Generation.
Baolin Li, Yankai Jiang, Devesh Tiwari.

arXiv [Paper]
Toward Sustainable GenAI using Generation Directives for Carbon-Friendly Large Language Model Inference.
Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari.

FGCS
The Globus Compute Dataset: An Open Function-as-a-Service Dataset from the Edge to the Cloud.
André Bauer, Haochen Pan, Ryan Chard, Yadu N. Babuji, Josh Bryan, Devesh Tiwari, Ian T. Foster, Kyle Chard.

2023

SC 2023 (a) [Paper] [Artifact]
GRAPHINE: Generating Application-Specific Neutral Atom Topologies for Improved Quantum Computing Performance.
Tirthak Patel, Daniel Silver, Devesh Tiwari.

SC 2023 (b) [Paper] [Artifact] [Presentation]
Clover: Toward Sustainable AI with Carbon-Aware Machine Learning Inference Service.
Baolin Li, Siddharth Samsi, Vijay Gadepally, Devesh Tiwari.

SC 2023 (c)
Comprehensive Experimental Evaluation and Analysis of a Universal Photonic Quantum Computer.
Aditya Ranjan, Tirthak Patel, Harshitta Gandhi, Daniel Silver, William Cutler, Devesh Tiwari.

SC 2023 (d)
Sustainable HPC: Modeling, Characterization, and Implications of Carbon Footprint in Modern HPC Systems.
Baolin Li, Rohan Basu Roy, Daniel Wong, Sid Samsi, Vijay Gadepally, Devesh Tiwari.

AAAI 2023
SLIQ: Resource-Efficient Quantum Similarity Networks for Unlabeled Data on Noisy Quantum Computers.
Daniel Silver, Tirthak Patel, Aditya Ranjan, Harshitta Gandhi, William Cutler, Devesh Tiwari.

HPDC 2023 (a)
Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources.
Baolin Li, Sid Samsi, Vijay Gadepally, Devesh Tiwari.

HPDC 2023 (b)
ProPack: Executing Concurrent Serverless Functions Faster and Cheaper.
Rohan Basu Roy, Tirthak Patel, Richmond Liew, Yadu Nand Babuji, Ryan Chard, Devesh Tiwari.

ECCV 2023
MosaiQ: Enabling High-Quality Image Generation on Quantum Computers.
Daniel Silver, Tirthak Patel, William Cutler, Aditya Ranjan, Harshitta Gandhi, Devesh Tiwari.

SoCC 2023
Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale.
Dan Zhao, Siddharth Samsi, Joseph McDonald, Baolin Li, David Bestor, Michael Jones, Devesh Tiwari, Vijay Gadepally.

MICRO 2023
SupeRBNN: Randomized Binary Neural Network Using Adiabatic Superconductor Josephson Devices.
Zhengang Li, Geng Yuan, Tomoharu Yamauchi, Masoud Zabihi, Yanyue Xie, Peiyan Dong, Xulong Tang, Nobuyuki Yoshikawa, Devesh Tiwari, Yanzhi Wang, Olivia Chen.

DAC 2023
Invited: Building Robust Quantum System Software for Technology-Specific Characteristics.
Tirthak Patel, Devesh Tiwari.

HPEC 2023
From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference.
Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, Vijay Gadepally.

arXiv
Toward Privacy in Quantum Program Execution On Untrusted Quantum Cloud Computing Machines for Business-sensitive Quantum Needs.
Tirthak Patel, Daniel Silver, Aditya Ranjan, Harshitta Gandhi, William Cutler, Devesh Tiwari.

2022

SC 2022 (a)
DayDream: Executing Dynamic Scientific Workflows on Serverless Platforms with Hot Starts.

SC 2022 (b)
CHARTER: Identifying the Most-Critical Gate Operations in Quantum Circuits via Amplified Gate Reversibility.

AAAI 2022
QUILT: Effective Multi-Class Classification on Quantum Computers Using an Ensemble of Diverse Quantum Classifiers.

ASPLOS 2022 (a)
IceBreaker: Warming Serverless Functions Better with Heterogeneity.

ASPLOS 2022 (b)
QUEST: Systematically Approximating Quantum Circuits for Higher Output Fidelity.

ISCA 2022
Geyser: A Compilation Framework for Quantum Computing with Neutral Atoms.

HPCA 2022
AI-Enabling Workloads on Large-Scale GPU-Accelerated System: Characterization, Opportunities, and Implications.

PPoPP 2022
Mashup: Making Serverless Computing Useful for HPC Workflows via Hybrid Execution.

SoCC 2022
MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant GPU Clusters.

NAACL 2022
Great Power, Great Responsibility: Recommendations for Reducing Energy for Training Language Models.

DATE 2022 (a)
OPTIC: A Practical Quantum Binary Classifier for Near-Term Quantum Computers.

DATE 2022 (b)
Do Temperature and Humidity Exposures Hurt or Benefit Your SSDs?

2021

SC 2021 (a)
Systematically Inferring I/O Performance Variability by Examining Repetitive Job Behavior.

SC 2021 (b)
Ribbon: Cost-Effective and QoS-Aware Deep Learning Model Inference using a Diverse Pool of Cloud Computing Instances.

ISCA 2021
SATORI: Efficient and Fair Resource Partitioning by Sacrificing Short-Term Benefits for Long-Term Gains.

HPCA 2021
Operating Liquid-Cooled Large-Scale Systems: Long-Term Monitoring, Reliability Analysis, and Efficiency Measures.

DSN 2021
Examining Failures and Repairs on Supercomputers with Multi-GPU Compute Nodes.

PLDI 2021
BLISS: Auto-tuning Complex Applications Using A Pool of Diverse Lightweight Learning Models.

IISWC 2021
Serverless Storage Scalability Challenges: Characterization, Implications, and Mitigation.

HPEC 2021
Serving Machine Learning Inference Using Heterogeneous Hardware.

ASPLOS 2021
QRAFT: Reverse Your Quantum Circuit and Know the Correct Program Output.

2020

USENIX ATC 2020
UREQA: Leveraging Operation-Aware Error Rates for Effective Quantum Circuit Mapping on NISQ-Era Quantum Computers.

USENIX FAST 2020 (a)
GIFT: A Coupon Based Throttle-and-Reward Mechanism for Fair and Efficient I/O Bandwidth Management on Parallel Storage Systems.

USENIX FAST 2020 (b)
Uncovering Access, Reuse, and Sharing Characteristics of I/O-Intensive Files on Large-Scale Production HPC Systems.

USENIX FAST 2020 (c)
Making Disk Failure Predictions SMARTer!

SC 2020 (a)
VERITAS: Accurately Estimating the Correct Output on Noisy Intermediate-Scale Quantum Computers.

SC 2020 (b)
Experimental Evaluation of NISQ Quantum Computers: Error Measurement, Characterization, and Implications.

SC 2020 (c)
Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification and Implications.

ICCAD 2020
DisQ: A Novel Quantum Output State Classification Method on IBM Quantum Computers using OpenPulse.

HPCA 2020
CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers.

IPDPS 2020
What does the Power Consumption Behavior of HPC Jobs Reveal?

JSNAM 2020
Resilience and Coevolution of Preferential Interdependent Networks.

JMR 2020
Comparing Performances of Five Distinct Automatic Classifiers for Fin Whale Vocalizations in Beamformed Spectrograms of Coherent Hydrophone Array.

TDSC 2020
Characterizing and Exploiting Soft Error Vulnerability Phase Behavior in GPU Applications.

2019

TPDS 2019
An Analysis Workflow-Aware Storage System for Multi-Core Active Flash Arrays.

SC 2019
Revisiting I/O Behavior in Large-Scale Storage Systems: The Expected and the Unexpected.

HPDC 2019
PERQ: Fair and Efficient Power Management of Power-Constrained Large-Scale Computing Systems.

DAC 2019
What Does Vibration Do To Your SSD?

CLOUD 2019
Exploring Potential for Non-Disruptive Vertical Auto Scaling and Resource Estimation in Kubernetes.

ICAC 2019
Characterizing Disk Health Degradation and Proactively Protecting Against Disk Failures for Reliable Storage Systems.

CCGrid 2019
Towards Enabling Dynamic Resource Estimation and Correction for Improving Utilization in an Apache Mesos Cloud Environment.

DATE 2019
PCFI: Program Counter Guided Fault Injection for Accelerating GPU Reliability Assessment.

2018

BIGDATA 2018
Reliability Characterization of Solid State Drives in a Scalable Production Datacenter.

ASONAM 2018
Resilience and the Coevolution of Interdependent Multiplex Networks.

ICCCN 2018
Exploring the Optimal Platform Configuration for Power-Constrained HPC Workflows.

DSN 2018 (a)
Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput.

DSN 2018 (b)
Machine Learning Models for GPU Error Prediction in a Large Scale HPC System.

DSN 2018 (c)
Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System.

2017

SC 2017 (a)
Failures in Large Scale Systems: Long-Term Measurement, Analysis, and Implications.

SC 2017 (b)
GUIDE: A Scalable Information Directory Service to Collect, Federate, and Analyze Logs for Operational Insights into a Leadership HPC Facility.

MASCOTS 2017 (a)
Toward Managing HPC Burst Buffers Effectively: Draining Strategy to Regulate Bursty I/O Behavior.

MASCOTS 2017 (b)
Characterizing Temperature, Power, and Soft-error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities.

CLUSTER 2017
Effective Running of End-to-end HPC Workflows on Emerging Heterogeneous Architectures.

MWSCAS 2017
Combining Architectural Fault-injection and Neutron Beam Testing Approaches Toward Better Understanding of GPU Soft-error Resilience.

TECS 2017
Compiler-directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR.

TOMPECS 2017
Obtaining and Managing Answer Quality for Online Data-intensive Services.

2016

SC 2016 (a)
Granularity and the Cost of Error Recovery in Resilient AMR Scientific Applications.

SC 2016 (b)
Compiler Directed Lightweight, Fine-grained, Guaranteed Recovery for Soft Error Resilience. (Best Student Paper Award Finalist)

MICRO 2016
Low-Cost Soft Error Resilience with Unified Data Verification and Fine-Grained Recovery for Acoustic Sensor Based Detection.

ICAC 2016
Adaptive Power Profiling for Many-Core HPC Architectures.

DSN 2016
Power-aware Checkpointing: Toward the Optimal Checkpointing Interval under Power Capping.

IPDPS 2016
Reducing Waste in Large Scale Systems Through Introspective Analysis.

HPCA 2016
A Large-Scale Study of Soft-Errors on GPUs in the Field.

2015

SC 2015 (a)
Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility.

SC 2015 (b)
A Practical Approach to Reconciling Availability, Performance, and Capacity in Provisioning Extreme-scale Storage Systems.

SC 2015 (c)
AnalyzeThis: An Analysis Workflow-Aware Storage System.

SC 2015 (d)
Node Variability in Large-Scale Power Measurements: Perspectives from the Green500, Top500 and EEHPCWG.

ICAC 2015
Ubora: Measuring and Managing Answer Quality for Online Data-Intensive Services.

DSN 2015
Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems.

LCTES 2015
Clover: Compiler Directed Lightweight Soft Error Resilience.

HPCA 2015
Understanding GPU Errors on Large-scale HPC Systems and the Implications for System Design and Operation.

CUG 2015
Experience with GPUs on the Titan Supercomputer from a Reliability, Performance and Power Perspective.

JPDC 2015
Application Configuration Predication for Energy-Efficient Execution on Multicore Systems.

2014

SC 2014
Best Practices and Lessons Learned from Deploying and Operating Large-Scale Data-Centric Parallel File Systems.

DSN 2014
Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems.

IPDPS 2014
MapReuse: Reusing Computation in an In-Memory MapReduce System.

CUG 2014
I/O Router Placement and Fine-Grained Routing on Titan to Support Spider II.

ICPADS 2014
Improving Large-scale Storage System Performance via Topology-aware and Balanced Data Placement.

LUG 2014
SSD Provisioning for Exascale Storage System: When, Where and How Much?

2013 and before

FAST 2013
Active Flash: Towards Energy-Efficient, In-Situ Data Analytics on Extreme-Scale Machines.

HotPower 2012
Reducing Data Movement Cost using Energy-Efficient Active Computation on SSD.

IPDPS 2012
Modeling and Analyzing Key Performance Factors of Shared Memory Map Reduce.

ISPASS 2012
Architectural Characterization and Similarity Analysis of SunSpider and Google’s V8 JavaScript Benchmarks.

HPCA 2011
HAQu: Hardware Accelerated Queueing for Fine-Grained Threading on a Chip Multi-Processor.

IPDPS 2010
MMT: Exploiting Fine-Grained Parallelism in Dynamic Memory Management.

MEDEA Workshop PACT 2009
Memory Management Thread for Heap Intensive Sequential Applications.

Wild and Crazy Idea Session 2009
Explicit Sequential Programming for Implicit Parallel Performance on Many Cores.

Ceramics International 2009
Simulation of Thermal and Electric Field Evolution during Spark Plasma Sintering.

Ceramics International 2009
Is Weibull distribution the most appropriate statistical strength distribution for brittle materials?