Selected Recent Publications

[SC 2020 (a)] Experimental Evaluation of NISQ Quantum Computers: Error Measurement, Characterization, and Implications (Best Paper Finalist, Best Student Paper Finalist)
[SC 2020 (b)] VERITAS: Accurately Estimating the Correct Output on Noisy Intermediate-Scale Quantum Computers
[SC 2020 (c)] Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification and Implications
[ICCAD 2020] DISQ: A Novel Quantum Output State Classification Method on IBM Quantum Computers using OpenPulse (Best Paper Finalist)
[USENIX ATC 2020] UREQA: Leveraging Operation-Aware Error Rates for Effective Quantum Circuit Mapping on NISQ-Era Quantum Computers
[FAST 2020 (a)] Uncovering Access, Reuse, and Sharing Characteristics of I/O-Intensive Files on Large-Scale Production HPC Systems
[FAST 2020 (b)] Making Disk Failure Predictions SMARTer!
[FAST 2020 (c)] GIFT: A Coupon Based Throttle-and-Reward Mechanism for Fair and Efficient I/O Bandwidth Management on Parallel Storage Systems
[HPCA 2020] CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers
[IPDPS 2020] What does Power Consumption Behavior of HPC Jobs Reveal?: Demystifying, Quantifying, and Predicting Power Consumption Characteristics
[DAC 2019] What does Vibration do to Your SSD?
[SC 2019] Revisiting I/O behavior in large-scale storage systems: the expected and the unexpected
[HPDC 2019] PERQ: Fair and Efficient Power Management of Power-Constrained Large-Scale Computing Systems
[DATE 2019] PCFI: Program Counter Guided Fault Injection for Accelerating GPU Reliability Assessment
[ICAC 2019] Characterizing Disk Health Degradation and Proactively Protecting Against Disk Failures for Reliable Storage Systems
[DSN 2018 (a)] Machine Learning Models for GPU Error Prediction in a Large Scale HPC System
[DSN 2018 (b)] Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput
[DSN 2018 (c)] Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System
[BigData 2018] Reliability Characterization of Solid State Drives in a Scalable Production Datacenter
[SC 2017 (a)] GUIDE: a scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility
[SC 2017 (b)] Failures in large scale systems: long-term measurement, analysis, and implications
[MASCOTS 2017 (a)] Toward Managing HPC Burst Buffers Effectively: Draining Strategy to Regulate Bursty I/O Behavior
[MASCOTS 2017 (b)] Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities
[CLUSTER 2017] Effective Running of End-to-End HPC Workflows on Emerging Heterogeneous Architectures
[SC 2016 (a)] Granularity and the cost of error recovery in resilient AMR scientific applications
[SC 2016 (b)] Compiler-directed lightweight checkpointing for fine-grained guaranteed soft error recovery
[HPCA 2016] A large-scale study of soft-errors on GPUs in the field
[IPDPS 2016] Reducing Waste in Extreme Scale Systems through Introspective Analysis
[DSN 2016] Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy
[MICRO 2016] Low-cost soft error resilience with unified data verification and fine-grained recovery for acoustic sensor based detection