ACM Transactions on Architecture and Code Optimization

(The median citation count of ACM Transactions on Architecture and Code Optimization is 1. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2020-04-01 to 2024-04-01.)
Title | Citations
Domain-Specific Multi-Level IR Rewriting for GPU | 24
A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures | 21
A Black-box Monitoring Approach to Measure Microservices Runtime Performance | 18
Compiler Support for Sparse Tensor Computations in MLIR | 17
Dynamic Precision Autotuning with TAFFO | 16
Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks | 12
A Case For Intra-rack Resource Disaggregation in HPC | 12
SLO-Aware Inference Scheduler for Heterogeneous Processors in Edge Platforms | 12
Inter-kernel Reuse-aware Thread Block Scheduling | 12
A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels | 11
Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators | 11
Vitruvius+: An Area-Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications | 11
Securing Branch Predictors with Two-Level Encryption | 11
Efficient Auto-Tuning of Parallel Programs with Interdependent Tuning Parameters via Auto-Tuning Framework (ATF) | 10
Exploiting Parallelism Opportunities with Deep Learning Frameworks | 10
An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication | 9
PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM | 9
On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond | 8
GPU Fast Convolution via the Overlap-and-Save Method in Shared Memory | 8
Low I/O Intensity-aware Partial GC Scheduling to Reduce Long-tail Latency in SSDs | 8
Schedule Synthesis for Halide Pipelines on GPUs | 8
CoMeT: An Integrated Interval Thermal Simulation Toolchain for 2D, 2.5D, and 3D Processor-Memory Systems | 8
Low-precision Logarithmic Number Systems | 8
Architecting Optically Controlled Phase Change Memory | 8
Performance Evaluation of Intel Optane Memory for Managed Workloads | 8
Register-Pressure-Aware Instruction Scheduling Using Ant Colony Optimization | 7
ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes | 7
Bayesian Optimization for Efficient Accelerator Synthesis | 7
Task-RM: A Resource Manager for Energy Reduction in Task-Parallel Applications under Quality of Service Constraints | 6
ReuseTracker: Fast Yet Accurate Multicore Reuse Distance Analyzer | 6
Scale-out Systolic Arrays | 6
A Conflict-free Scheduler for High-performance Graph Processing on Multi-pipeline FPGAs | 6
Cooperative Software-hardware Acceleration of K-means on a Tightly Coupled CPU-FPGA System | 5
A Reusable Characterization of the Memory System Behavior of SPEC2017 and SPEC2006 | 5
Understanding Cache Compression | 5
Optimizing Small-Sample Disk Fault Detection Based on LSTM-GAN Model | 5
E-BATCH: Energy-Efficient and High-Throughput RNN Batching | 5
Autotuning Convolutions Is Easier Than You Think | 5
Polyhedral Specification and Code Generation of Sparse Tensor Contraction with Co-iteration | 5
Effective Loop Fusion in Polyhedral Compilation Using Fusion Conflict Graphs | 5
Energy-efficient In-Memory Address Calculation | 5
HeapCheck: Low-cost Hardware Support for Memory Safety | 5
Refresh Triggered Computation | 5
An FPGA-based Approach to Evaluate Thermal and Resource Management Strategies of Many-core Processors | 4
Preserving Addressability Upon GC-Triggered Data Movements on Non-Volatile Memory | 4
GPU Domain Specialization via Composable On-Package Architecture | 4
MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation | 4
Practical Software-Based Shadow Stacks on x86-64 | 4
Performance and Power Prediction for Concurrent Execution on GPUs | 4
On Architectural Support for Instruction Set Randomization | 4
YaConv: Convolution with Low Cache Footprint | 3
CASHT: Contention Analysis in Shared Hierarchies with Thefts | 3
Spiking Neural Networks in Spintronic Computational RAM | 3
Solving Sparse Assignment Problems on FPGAs | 3
Early Address Prediction | 3
A Pressure-Aware Policy for Contention Minimization on Multicore Systems | 3
SecNVM: An Efficient and Write-Friendly Metadata Crash Consistency Scheme for Secure NVM | 3
Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache | 3
Systems-on-Chip with Strong Ordering | 3
A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks | 3
On Predictable Reconfigurable System Design | 2
Locality-Aware CTA Scheduling for Gaming Applications | 2
Source Matching and Rewriting for MLIR Using String-Based Automata | 2
Scenario-Aware Program Specialization for Timing Predictability | 2
PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service | 2
Unified Buffer: Compiling Image Processing and Machine Learning Applications to Push-Memory Accelerators | 2
Triangle Dropping: An Occluded-geometry Predictor for Energy-efficient Mobile GPUs | 2
Design and Evaluation of an Ultra Low-power Human-quality Speech Recognition System | 2
ASA: Accelerating Sparse Accumulation in Column-wise SpGEMM | 2
An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism | 2
SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs | 2
Automatic Sublining for Efficient Sparse Memory Accesses | 2
FlexHM: A Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations | 2
GiantVM: A Novel Distributed Hypervisor for Resource Aggregation with DSM-aware Optimizations | 2
User-driven Online Kernel Fusion for SYCL | 2
Leveraging Value Equality Prediction for Value Speculation | 2
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs | 2
Just-In-Time Compilation on ARM—A Closer Look at Call-Site Code Consistency | 2
Cryptographic Software IP Protection without Compromising Performance or Timing Side-channel Leakage | 2
Irregular Register Allocation for Translation of Test-pattern Programs | 2
RegCPython: A Register-based Python Interpreter for Better Performance | 2
MetaSys: A Practical Open-source Metadata Management System to Implement and Evaluate Cross-layer Optimizations | 2
Low-power Near-data Instruction Execution Leveraging Opcode-based Timing Analysis | 2
Memory-Aware Functional IR for Higher-Level Synthesis of Accelerators | 2
COX: Exposing CUDA Warp-level Functions to CPUs | 2
Network Interface Architecture for Remote Indirect Memory Access (RIMA) in Datacenters | 2
Building a Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage | 1
Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks | 1
MAPPER: Managing Application Performance via Parallel Efficiency Regulation | 1
FPDetect | 1
TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware Scheduling | 1
Cooperative Slack Management: Saving Energy of Multicore Processors by Trading Performance Slack Between QoS-Constrained Applications | 1
Runtime Design Space Exploration and Mapping of DCNNs for the Ultra-Low-Power Orlando SoC | 1
Reducing Minor Page Fault Overheads through Enhanced Page Walker | 1
A Distributed Hardware Monitoring System for Runtime Verification on Multi-Tile MPSoCs | 1
Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure | 1
SAC: An Ultra-Efficient Spin-based Architecture for Compressed DNNs | 1
Puppeteer: A Random Forest Based Manager for Hardware Prefetchers Across the Memory Hierarchy | 1
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration | 1
Iterative Compilation Optimization Based on Metric Learning and Collaborative Filtering | 1
gPPM: A Generalized Matrix Operation and Parallel Algorithm to Accelerate the Encoding/Decoding Process of Erasure Codes | 1
ACTION: Adaptive Cache Block Migration in Distributed Cache Architectures | 1
Towards Enhanced System Efficiency while Mitigating Row Hammer | 1
Online Application Guidance for Heterogeneous Memory Systems | 1
ULEEN: A Novel Architecture for Ultra-low-energy Edge Neural Networks | 1
SpecTerminator: Blocking Speculative Side Channels Based on Instruction Classes on RISC-V | 1
Reliability Analysis for Unreliable FSM Computations | 1
Performance-Energy Trade-off in Modern CMPs | 1
An Application-oblivious Memory Scheduling System for DNN Accelerators | 1
SMT-Based Contention-Free Task Mapping and Scheduling on 2D/3D SMART NoC with Mixed Dimension-Order Routing | 1
Efficient Nearest-Neighbor Data Sharing in GPUs | 1
Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance | 1
Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions | 1
TokenSmart: Distributed, Scalable Power Management in the Many-core Era | 1
Reliability-aware Garbage Collection for Hybrid HBM-DRAM Memories | 1
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads | 1
Accelerating Video Captioning on Heterogeneous System Architectures | 1