All Publications

[SC 2020 (a)] Experimental Evaluation of NISQ Quantum Computers: Error Measurement, Characterization, and Implications (Best Paper Finalist, Best Student Paper Finalist)
[SC 2020 (b)] VERITAS: Accurately Estimating the Correct Output on Noisy Intermediate-Scale Quantum Computers
[SC 2020 (c)] Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification and Implications
[ICCAD 2020] DISQ: A Novel Quantum Output State Classification Method on IBM Quantum Computers using OpenPulse (Best Paper Finalist)
[USENIX ATC 2020] UREQA: Leveraging Operation-Aware Error Rates for Effective Quantum Circuit Mapping on NISQ-Era Quantum Computers
[FAST 2020 (a)] Uncovering Access, Reuse, and Sharing Characteristics of I/O-Intensive Files on Large-Scale Production HPC Systems
[FAST 2020 (b)] Making Disk Failure Predictions SMARTer!
[FAST 2020 (c)] GIFT: A Coupon Based Throttle-and-Reward Mechanism for Fair and Efficient I/O Bandwidth Management on Parallel Storage Systems
[IPDPS 2020] What does Power Consumption Behavior of HPC Jobs Reveal?: Demystifying, Quantifying, and Predicting Power Consumption Characteristics
[SNAM 2019] Resilience and coevolution of preferential interdependent networks
[RS 2020] Comparing Performances of Five Distinct Automatic Classifiers for Fin Whale Vocalizations in Beamformed Spectrograms of Coherent Hydrophone Array
[HPCA 2020] CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers
[DAC 2019] What does Vibration do to Your SSD?
[SC 2019] Revisiting I/O behavior in large-scale storage systems: the expected and the unexpected
[HPDC 2019] PERQ: Fair and Efficient Power Management of Power-Constrained Large-Scale Computing Systems
[DATE 2019] PCFI: Program Counter Guided Fault Injection for Accelerating GPU Reliability Assessment
[ICAC 2019] Characterizing Disk Health Degradation and Proactively Protecting Against Disk Failures for Reliable Storage Systems
[MTAGS 2017] Two stage cluster for resource optimization with Apache Mesos
[CCGRID 2019] Towards Enabling Dynamic Resource Estimation and Correction for Improving Utilization in an Apache Mesos Cloud Environment
[CLOUD 2019] Exploring Potential for Non-Disruptive Vertical Auto Scaling and Resource Estimation in Kubernetes
[TPDS 2019] An Analysis Workflow-Aware Storage System for Multi-Core Active Flash Arrays
[DSN 2018 (a)] Machine Learning Models for GPU Error Prediction in a Large Scale HPC System
[DSN 2018 (b)] Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput
[DSN 2018 (c)] Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System
[BigData 2018] Reliability Characterization of Solid State Drives in a Scalable Production Datacenter
[ACM 2018] Resilience and the Coevolution of Interdependent Multiplex Networks
[ICCCN 2018] Exploring the Optimal Platform Configuration for Power-Constrained HPC Workflows
[SC 2017 (a)] GUIDE: a scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility
[SC 2017 (b)] Failures in large scale systems: long-term measurement, analysis, and implications
[MASCOTS 2017 (a)] Toward Managing HPC Burst Buffers Effectively: Draining Strategy to Regulate Bursty I/O Behavior
[MASCOTS 2017 (b)] Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities
[TOMPECS 2017] Obtaining and Managing Answer Quality for Online Data-Intensive Services
[CLUSTER 2017] Effective Running of End-to-End HPC Workflows on Emerging Heterogeneous Architectures
[TECS 2017] Compiler-Directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR
[MWSCAS 2017] Combining architectural fault-injection and neutron beam testing approaches toward better understanding of GPU soft-error resilience
[SC 2016 (a)] Granularity and the cost of error recovery in resilient AMR scientific applications
[SC 2016 (b)] Compiler-directed lightweight checkpointing for fine-grained guaranteed soft error recovery
[HPCA 2016] A large-scale study of soft-errors on GPUs in the field
[IPDPS 2016] Reducing Waste in Extreme Scale Systems through Introspective Analysis
[DSN 2016] Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy
[MICRO 2016] Low-cost soft error resilience with unified data verification and fine-grained recovery for acoustic sensor based detection
[JPDC 2016] Application configuration selection for energy-efficient execution on multicore systems
[ICAC 2016] Adaptive Power Profiling for Many-Core HPC Architectures
[SC 2015 (a)] Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility
[SC 2015 (b)] AnalyzeThis: an analysis workflow-aware storage system
[SC 2015 (c)] A practical approach to reconciling availability, performance, and capacity in provisioning extreme-scale storage systems
[HPCA 2015] Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
[DSN 2015] Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems
[CORR 2015] Measuring and Managing Answer Quality for Online Data-Intensive Services
[DSDIS 2015] Low Power Job Scheduler for Supercomputers: A Rule-Based Power-Aware Scheduler
[LCTES 2015] Clover: Compiler Directed Lightweight Soft Error Resilience
[IPDPS 2014] MapReuse: Reusing Computation in an In-Memory MapReduce System
[DSN 2014] Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems
[ICPADS 2014] Improving large-scale storage system performance via topology-aware and balanced data placement
[SC 2014] Best Practices and Lessons Learned from Deploying and Operating Large-Scale Data-Centric Parallel File Systems
[FAST 2013] Active flash: towards energy-efficient, in-situ data analytics on extreme-scale machines
[OSDI 2012] Reducing Data Movement Costs Using Energy-Efficient, Active Computation on SSD
[IPDPS 2012] Modeling and Analyzing Key Performance Factors of Shared Memory MapReduce
[ISPASS 2012] Architectural characterization and similarity analysis of sunspider and Google's V8 Javascript benchmarks
[HPCS 2011] HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor
[IPDPS 2010] MMT: Exploiting fine-grained parallelism in dynamic memory management