ACM Transactions on Architecture and Code Optimization

Papers
(The median citation count for ACM Transactions on Architecture and Code Optimization is 1. The table below lists papers whose CrossRef citation counts exceed that threshold [max. 250 papers], restricted to publications from the past four years, i.e., from 2020-07-01 to 2024-07-01.)
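As a rough illustration of the selection rule described above, the sketch below filters a set of records to those strictly above the median CrossRef citation count, restricts them to the stated four-year window, sorts them by citation count, and caps the list at 250 entries. The record format, field names, and sample rows are placeholders assumed for this example only; they are not the actual data pipeline behind this table.

```python
from datetime import date
from statistics import median

# Hypothetical records: (title, crossref_citation_count, publication_date).
# Titles and dates are placeholders, not real TACO metadata.
papers = [
    ("Paper A", 36, date(2020, 9, 1)),
    ("Paper B", 2, date(2022, 3, 15)),
    ("Paper C", 0, date(2023, 11, 1)),
    ("Paper D", 1, date(2021, 6, 30)),
]

WINDOW_START, WINDOW_END = date(2020, 7, 1), date(2024, 7, 1)
MAX_ROWS = 250

# Keep only publications inside the four-year window stated above.
in_window = [p for p in papers if WINDOW_START <= p[2] <= WINDOW_END]

# Threshold: the median citation count. The note above reports 1 for TACO;
# here it is simply computed from the placeholder records.
threshold = median(c for _, c, _ in in_window)

# Papers strictly above the median, most-cited first, at most 250 rows.
table = sorted(
    (p for p in in_window if p[1] > threshold),
    key=lambda p: p[1],
    reverse=True,
)[:MAX_ROWS]

for title, citations, _ in table:
    print(f"{title} | {citations}")
```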
Article | Citations
SMAUG | 36
IR2VEC | 35
Domain-Specific Multi-Level IR Rewriting for GPU | 25
A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures | 23
Grus | 21
PERI | 19
A Black-box Monitoring Approach to Measure Microservices Runtime Performance | 19
Compiler Support for Sparse Tensor Computations in MLIR | 19
LLOV | 18
PAVER | 17
SLO-Aware Inference Scheduler for Heterogeneous Processors in Edge Platforms | 17
A Case For Intra-rack Resource Disaggregation in HPC | 16
Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks | 14
Efficient Auto-Tuning of Parallel Programs with Interdependent Tuning Parameters via Auto-Tuning Framework (ATF) | 13
Inter-kernel Reuse-aware Thread Block Scheduling | 13
Vitruvius+: An Area-Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications | 12
KernelFaRer | 12
Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators | 12
Securing Branch Predictors with Two-Level Encryption | 11
A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels | 11
OD-SGD | 10
AsynGraph | 10
PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM | 10
Exploiting Parallelism Opportunities with Deep Learning Frameworks | 10
Low I/O Intensity-aware Partial GC Scheduling to Reduce Long-tail Latency in SSDs | 10
EchoBay | 10
On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond | 10
An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication | 10
PolyDL | 10
Gem5-X | 10
Low-precision Logarithmic Number Systems | 9
GRAM | 9
GEVO | 9
CoMeT: An Integrated Interval Thermal Simulation Toolchain for 2D, 2.5D, and 3D Processor-Memory Systems | 9
Bayesian Optimization for Efficient Accelerator Synthesis | 8
GPU Fast Convolution via the Overlap-and-Save Method in Shared Memory | 8
Schedule Synthesis for Halide Pipelines on GPUs | 8
ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes | 8
Architecting Optically Controlled Phase Change Memory | 8
Performance Evaluation of Intel Optane Memory for Managed Workloads | 8
Register-Pressure-Aware Instruction Scheduling Using Ant Colony Optimization | 7
Polyhedral Specification and Code Generation of Sparse Tensor Contraction with Co-iteration | 7
Autotuning Convolutions Is Easier Than You Think | 7
A Reusable Characterization of the Memory System Behavior of SPEC2017 and SPEC2006 | 6
Scale-out Systolic Arrays | 6
Understanding Cache Compression | 6
Task-RM: A Resource Manager for Energy Reduction in Task-Parallel Applications under Quality of Service Constraints | 6
Optimizing Small-Sample Disk Fault Detection Based on LSTM-GAN Model | 6
ReuseTracker: Fast Yet Accurate Multicore Reuse Distance Analyzer | 6
Effective Loop Fusion in Polyhedral Compilation Using Fusion Conflict Graphs | 5
Energy-efficient In-Memory Address Calculation | 5
Practical Software-Based Shadow Stacks on x86-64 | 5
Gretch | 5
Refresh Triggered Computation | 5
MemSZ | 5
E-BATCH: Energy-Efficient and High-Throughput RNN Batching | 5
GraphPEG | 5
FastPath_MP | 5
LargeGraph | 5
Cooperative Software-hardware Acceleration of K-means on a Tightly Coupled CPU-FPGA System | 5
HeapCheck: Low-cost Hardware Support for Memory Safety | 5
MC-DeF | 5
High-performance Deterministic Concurrency Using Lingua Franca | 4
An FPGA-based Approach to Evaluate Thermal and Resource Management Strategies of Many-core Processors | 4
Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache | 4
GraphAttack | 4
On Architectural Support for Instruction Set Randomization | 4
A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks | 4
Zeroploit | 4
Performance and Power Prediction for Concurrent Execution on GPUs | 4
SPX64 | 4
Unified Buffer: Compiling Image Processing and Machine Learning Applications to Push-Memory Accelerators | 4
Preserving Addressability Upon GC-Triggered Data Movements on Non-Volatile Memory | 4
GPU Domain Specialization via Composable On-Package Architecture | 4
MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation | 4
Just-In-Time Compilation on ARM—A Closer Look at Call-Site Code Consistency | 3
SecNVM: An Efficient and Write-Friendly Metadata Crash Consistency Scheme for Secure NVM | 3
Automatic Sublining for Efficient Sparse Memory Accesses | 3
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs | 3
PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service | 3
Irregular Register Allocation for Translation of Test-pattern Programs | 3
YaConv: Convolution with Low Cache Footprint | 3
Early Address Prediction | 3
Systems-on-Chip with Strong Ordering | 3
A Pressure-Aware Policy for Contention Minimization on Multicore Systems | 3
Solving Sparse Assignment Problems on FPGAs | 3
On Predictable Reconfigurable System Design | 3
WaFFLe | 3
CASHT: Contention Analysis in Shared Hierarchies with Thefts | 3
Spiking Neural Networks in Spintronic Computational RAM | 3
Multi-objective Hardware-aware Neural Architecture Search with Pareto Rank-preserving Surrogate Models | 3
Memory-Aware Functional IR for Higher-Level Synthesis of Accelerators | 2
SortCache | 2
Rcmp: Reconstructing RDMA-Based Memory Disaggregation via CXL | 2
Reducing Minor Page Fault Overheads through Enhanced Page Walker | 2
PICO | 2
SpecTerminator: Blocking Speculative Side Channels Based on Instruction Classes on RISC-V | 2
RegCPython: A Register-based Python Interpreter for Better Performance | 2
ASA: Accelerating Sparse Accumulation in Column-wise SpGEMM | 2
Iterative Compilation Optimization Based on Metric Learning and Collaborative Filtering | 2
MetaSys: A Practical Open-source Metadata Management System to Implement and Evaluate Cross-layer Optimizations | 2
Triangle Dropping: An Occluded-geometry Predictor for Energy-efficient Mobile GPUs | 2
Lock-Free High-performance Hashing for Persistent Memory via PM-aware Holistic Optimization | 2
Design and Evaluation of an Ultra Low-power Human-quality Speech Recognition System | 2
An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism | 2
SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs | 2
Locality-Aware CTA Scheduling for Gaming Applications | 2
FlexHM: A Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations | 2
NNBench-X | 2
Scenario-Aware Program Specialization for Timing Predictability | 2
User-driven Online Kernel Fusion for SYCL | 2
DisGCo | 2
Leveraging Value Equality Prediction for Value Speculation | 2
Cryptographic Software IP Protection without Compromising Performance or Timing Side-channel Leakage | 2
COX: Exposing CUDA Warp-level Functions to CPUs | 2
A Distributed Hardware Monitoring System for Runtime Verification on Multi-Tile MPSoCs | 2
ECOTLB | 2
Source Matching and Rewriting for MLIR Using String-Based Automata | 2
Puppeteer: A Random Forest Based Manager for Hardware Prefetchers Across the Memory Hierarchy | 2
GiantVM: A Novel Distributed Hypervisor for Resource Aggregation with DSM-aware Optimizations | 2
Low-power Near-data Instruction Execution Leveraging Opcode-based Timing Analysis | 2
TokenSmart: Distributed, Scalable Power Management in the Many-core Era | 1
Reliability-aware Garbage Collection for Hybrid HBM-DRAM Memories | 1
Towards Enhanced System Efficiency while Mitigating Row Hammer | 1
Online Application Guidance for Heterogeneous Memory Systems | 1
WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs | 1
Weaving Synchronous Reactions into the Fabric of SSA-form Compilers | 1
SHASTA | 1
PRISM | 1
An Application-oblivious Memory Scheduling System for DNN Accelerators | 1
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration | 1
PETRA | 1
Cost-aware Service Placement and Scheduling in the Edge-Cloud Continuum | 1
CARL: Compiler Assigned Reference Leasing | 1
Cache Programming for Scientific Loops Using Leases | 1
Cooperative Slack Management: Saving Energy of Multicore Processors by Trading Performance Slack Between QoS-Constrained Applications | 1
Occam: Optimal Data Reuse for Convolutional Neural Networks | 1
BullsEye: Scalable and Accurate Approximation Framework for Cache Miss Calculation | 1
Accelerating Video Captioning on Heterogeneous System Architectures | 1
Building a Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage | 1
Acceleration of Parallel-Blocked QR Decomposition of Tall-and-Skinny Matrices on FPGAs | 1
SAC: An Ultra-Efficient Spin-based Architecture for Compressed DNNs | 1
Efficient Nearest-Neighbor Data Sharing in GPUs | 1
TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency | 1
Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance | 1
SMT-Based Contention-Free Task Mapping and Scheduling on 2D/3D SMART NoC with Mixed Dimension-Order Routing | 1
Fast One-Sided RDMA-Based State Machine Replication for Disaggregated Memory | 1
EXPERTISE: An Effective Software-level Redundant Multithreading Scheme against Hardware Faults | 1
gPPM: A Generalized Matrix Operation and Parallel Algorithm to Accelerate the Encoding/Decoding Process of Erasure Codes | 1
ACTION: Adaptive Cache Block Migration in Distributed Cache Architectures | 1
TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware Scheduling | 1
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads | 1
ULEEN: A Novel Architecture for Ultra-low-energy Edge Neural Networks | 1
Performance-Energy Trade-off in Modern CMPs | 1
Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure | 1
MAPPER: Managing Application Performance via Parallel Efficiency Regulation | 1
Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks | 1
FPDetect | 1
SGXL | 1
CIB-HIER | 1
The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead | 1
rNdN: Fast Query Compilation for NVIDIA GPUs | 1
Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions | 1