Publications

2023

[AAAI 2023]
SLIQ: Resource-Efficient Quantum Similarity Networks for Unlabeled Data on Noisy Quantum Computers.

2022

[SC 2022 (a)]
DayDream: Executing Dynamic Scientific Workflows on Serverless Platforms with Hot Starts.

[SC 2022 (b)]
CHARTER: Identifying the Most-Critical Gate Operations in Quantum Circuits via Amplified Gate Reversibility.

[AAAI 2022]
QUILT: Effective Multi-Class Classification on Quantum Computers Using an Ensemble of Diverse Quantum Classifiers.

[ASPLOS 2022 (a)]
IceBreaker: Warming Serverless Functions Better with Heterogeneity.

[ASPLOS 2022 (b)]
QUEST: Systematically Approximating Quantum Circuits for Higher Output Fidelity.

[ISCA 2022]
Geyser: A Compilation Framework for Quantum Computing with Neutral Atoms.

[HPCA 2022]
AI-Enabling Workloads on Large-Scale GPU-Accelerated System: Characterization, Opportunities, and Implications.

[PPoPP 2022]
Mashup: Making Serverless Computing Useful for HPC Workflows via Hybrid Execution.

[SOCC 2022]
MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant GPU Clusters.

[NAACL 2022]
Great Power, Great Responsibility: Recommendations for Reducing Energy for Training Language Models.

[DATE 2022 (a)]
OPTIC: A Practical Quantum Binary Classifier for Near-Term Quantum Computers.

[DATE 2022 (b)]
Do Temperature and Humidity Exposures Hurt or Benefit Your SSDs?

2021

[SC 2021 (a)]
Systematically Inferring I/O Performance Variability by Examining Repetitive Job Behavior.

[SC 2021 (b)]
Ribbon: Cost-Effective and QoS-Aware Deep Learning Model Inference using a Diverse Pool of Cloud Computing Instances.

[ISCA 2021]
SATORI: Efficient and Fair Resource Partitioning by Sacrificing Short-Term Benefits for Long-Term Gains.

[HPCA 2021]
Operating Liquid-Cooled Large-Scale Systems: Long-Term Monitoring, Reliability Analysis, and Efficiency Measures.

[DSN 2021]
Examining Failures and Repairs on Supercomputers with Multi-GPU Compute Nodes.

[PLDI 2021]
BLISS: Auto-tuning Complex Applications Using A Pool of Diverse Lightweight Learning Models.

[IISWC 2021]
Serverless Storage Scalability Challenges: Characterization, Implications, and Mitigation.

[HPEC 2021]
Serving Machine Learning Inference Using Heterogeneous Hardware.

[ASPLOS 2021]
QRAFT: Reverse Your Quantum Circuit and Know the Correct Program Output.

2020

[USENIX ATC 2020]
UREQA: Leveraging Operation-Aware Error Rates for Effective Quantum Circuit Mapping on NISQ-Era Quantum Computers.

[USENIX FAST 2020 (a)]
GIFT: A Coupon Based Throttle-and-Reward Mechanism for Fair and Efficient I/O Bandwidth Management on Parallel Storage Systems.

[USENIX FAST 2020 (b)]
Uncovering Access, Reuse, and Sharing Characteristics of I/O-Intensive Files on Large-Scale Production HPC Systems.

[USENIX FAST 2020 (c)]
Making Disk Failure Predictions SMARTer!

[SC 2020 (a)]
VERITAS: Accurately Estimating the Correct Output on Noisy Intermediate-Scale Quantum Computers.

[SC 2020 (b)]
Experimental Evaluation of NISQ Quantum Computers: Error Measurement, Characterization, and Implications.

[SC 2020 (c)]
Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification and Implications.

[ICCAD 2020]
DisQ: A Novel Quantum Output State Classification Method on IBM Quantum Computers using OpenPulse.

[HPCA 2020]
CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers.

[IPDPS 2020]
What does the Power Consumption Behavior of HPC Jobs Reveal?

[JSNAM 2020]
Resilience and Coevolution of Preferential Interdependent Networks.

[JMR 2020]
Comparing Performances of Five Distinct Automatic Classifiers for Fin Whale Vocalizations in Beamformed Spectrograms of Coherent Hydrophone Array.

[TDSC 2020]
Characterizing and Exploiting Soft Error Vulnerability Phase Behavior in GPU Applications.

2019

[TPDS 2019]
An Analysis Workflow-Aware Storage System for Multi-Core Active Flash Arrays.

[SC 2019]
Revisiting I/O Behavior in Large-Scale Storage Systems: The Expected and the Unexpected.

[HPDC 2019]
PERQ: Fair and Efficient Power Management of Power-Constrained Large-Scale Computing Systems.

[DAC 2019]
What Does Vibration Do To Your SSD?

[CLOUD 2019]
Exploring Potential for Non-Disruptive Vertical Auto Scaling and Resource Estimation in Kubernetes.

[ICAC 2019]
Characterizing Disk Health Degradation and Proactively Protecting Against Disk Failures for Reliable Storage Systems.

[CCGrid 2019]
Towards Enabling Dynamic Resource Estimation and Correction for Improving Utilization in an Apache Mesos Cloud Environment.

[DATE 2019]
PCFI: Program Counter Guided Fault Injection for Accelerating GPU Reliability Assessment.

2018

[BIGDATA 2018]
Reliability Characterization of Solid State Drives in a Scalable Production Datacenter.

[ASONAM 2018]
Resilience and the Coevolution of Interdependent Multiplex Networks.

[ICCCN 2018]
Exploring the Optimal Platform Configuration for Power-Constrained HPC Workflows.

[DSN 2018 (a)]
Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput.

[DSN 2018 (b)]
Machine Learning Models for GPU Error Prediction in a Large Scale HPC System.

[DSN 2018 (c)]
Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System.

2017

[SC 2017 (a)]
Failures in Large Scale Systems: Long-Term Measurement, Analysis, and Implications.

[SC 2017 (b)]
GUIDE: A Scalable Information Directory Service to Collect, Federate, and Analyze Logs for Operational Insights into a Leadership HPC Facility.

[MASCOTS 2017 (a)]
Toward Managing HPC Burst Buffers Effectively: Draining Strategy to Regulate Bursty I/O Behavior.

[MASCOTS 2017 (b)]
Characterizing Temperature, Power, and Soft-error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities.

[CLUSTER 2017]
Effective Running of End-to-end HPC Workflows on Emerging Heterogeneous Architectures.

[MWSCAS 2017]
Combining Architectural Fault-injection and Neutron Beam Testing Approaches Toward Better Understanding of GPU Soft-error Resilience.

[TECS 2017]
Compiler-directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR.

[TOMPECS 2017]
Obtaining and Managing Answer Quality for Online Data-intensive Services.

2016

[SC 2016 (a)]
Granularity and the Cost of Error Recovery in Resilient AMR Scientific Applications.

[SC 2016 (b)]
Compiler Directed Lightweight, Fine-grained, Guaranteed Recovery for Soft Error Resilience. (Best Student Paper Award Finalist)

[MICRO 2016]
Low-Cost Soft Error Resilience with Unified Data Verification and Fine-Grained Recovery for Acoustic Sensor Based Detection.

[ICAC 2016]
Adaptive Power Profiling for Many-Core HPC Architectures.

[DSN 2016]
Power-aware Checkpointing: Toward the Optimal Checkpointing Interval under Power Capping.

[IPDPS 2016]
Reducing Waste in Large Scale Systems Through Introspective Analysis.

[HPCA 2016]
A Large-Scale Study of Soft-Errors on GPUs in the Field.

2015

[SC 2015 (a)]
Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility.

[SC 2015 (b)]
A Practical Approach to Reconciling Availability, Performance, and Capacity in Provisioning Extreme-scale Storage Systems.

[SC 2015 (c)]
AnalyzeThis: An Analysis Workflow-Aware Storage System.

[SC 2015 (d)]
Node Variability in Large-Scale Power Measurements: Perspectives from the Green500, Top500 and EEHPCWG.

[ICAC 2015]
Ubora: Measuring and Managing Answer Quality for Online Data-Intensive Services.

[DSN 2015]
Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems.

[LCTES 2015]
Clover: Compiler Directed Lightweight Soft Error Resilience.

[HPCA 2015]
Understanding GPU Errors on Large-scale HPC Systems and the Implications for System Design and Operation.

[CUG 2015]
Experience with GPUs on the Titan Supercomputer from a Reliability, Performance and Power Perspective.

[JPDC 2015]
Application Configuration Predication for Energy-Efficient Execution on Multicore Systems.

2014

[SC 2014]
Best Practices and Lessons Learned from Deploying and Operating Large-Scale Data-Centric Parallel File Systems.

[DSN 2014]
Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems.

[IPDPS 2014]
MapReuse: Reusing Computation in an In-Memory MapReduce System.

[CUG 2014]
I/O Router Placement and Fine-Grained Routing on Titan to Support Spider II.

[ICPADS 2014]
Improving Large-scale Storage System Performance via Topology-aware and Balanced Data Placement.

[LUG 2014]
SSD Provisioning for Exascale Storage System: When, Where and How much?

2013 and before

[FAST 2013]
Active Flash: Towards Energy-Efficient, In-Situ Data Analytics on Extreme-Scale Machine.

[HotPower 2012]
Reducing Data Movement Cost using Energy-Efficient Active Computation on SSD.

[IPDPS 2012]
Modeling and Analyzing Key Performance Factors of Shared Memory Map Reduce.

[ISPASS 2012]
Architectural Characterization and Similarity Analysis of Sunspider and Google’s V8 Javascript Benchmarks.

[HPCA 2011]
HAQu: Hardware Accelerated Queueing for Fine-Grained Threading on a Chip Multi-Processor.

[IPDPS 2010]
MMT: Exploiting Fine-Grained Parallelism in Dynamic Memory Management.

[MEDEA Workshop PACT 2009]
Memory Management Thread for Heap Intensive Sequential Applications.

[Wild and Crazy Idea Session 2009]
Explicit Sequential Programming for Implicit Parallel Performance on Many Cores.

[Ceramics International 2009 (a)]
Simulation of Thermal and Electric Field Evolution during Spark Plasma Sintering.

[Ceramics International 2009 (b)]
Is Weibull distribution the most appropriate statistical strength distribution for brittle materials?