ACM Transactions on Architecture and Code Optimization

Papers
(The median citation count of ACM Transactions on Architecture and Code Optimization is 1. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2020-03-01 to 2024-03-01.)
Article	Citations
SMAUG	31
IR2Vec	28
Domain-Specific Multi-Level IR Rewriting for GPU	22
A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures	21
A Black-box Monitoring Approach to Measure Microservices Runtime Performance	18
Grus	18
PERI	17
Compiler Support for Sparse Tensor Computations in MLIR	17
Dynamic Precision Autotuning with TAFFO	16
LLOV	16
ArmorAll	16
PAVER	15
SLO-Aware Inference Scheduler for Heterogeneous Processors in Edge Platforms	12
A Case For Intra-rack Resource Disaggregation in HPC	12
Optimizing the SSD Burst Buffer by Traffic Detection	11
Inter-kernel Reuse-aware Thread Block Scheduling	11
Vitruvius+: An Area-Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications	11
Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks	11
Efficient Auto-Tuning of Parallel Programs with Interdependent Tuning Parameters via Auto-Tuning Framework (ATF)	10
Exploiting Parallelism Opportunities with Deep Learning Frameworks	10
A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels	10
Dynamic Colocation Policies with Reinforcement Learning	10
OD-SGD	10
Gem5-X	9
A Novel, Highly Integrated Simulator for Parallel and Distributed Systems	9
AsynGraph	9
Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators	9
An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication	9
EchoBay	9
KernelFaRer	9
GEVO	8
On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond	8
Performance Evaluation of Intel Optane Memory for Managed Workloads	8
Securing Branch Predictors with Two-Level Encryption	8
CoMeT: An Integrated Interval Thermal Simulation Toolchain for 2D, 2.5D, and 3D Processor-Memory Systems	8
Schedule Synthesis for Halide Pipelines on GPUs	8
Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor	8
PolyDL	7
GRAM	7
Bayesian Optimization for Efficient Accelerator Synthesis	7
ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes	7
Low I/O Intensity-aware Partial GC Scheduling to Reduce Long-tail Latency in SSDs	7
Architecting Optically Controlled Phase Change Memory	7
GPU Fast Convolution via the Overlap-and-Save Method in Shared Memory	7
Informed Prefetching for Indirect Memory Accesses	6
A Conflict-free Scheduler for High-performance Graph Processing on Multi-pipeline FPGAs	6
Register-Pressure-Aware Instruction Scheduling Using Ant Colony Optimization	6
Task-RM: A Resource Manager for Energy Reduction in Task-Parallel Applications under Quality of Service Constraints	6
PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM	6
ReuseTracker: Fast Yet Accurate Multicore Reuse Distance Analyzer	6
HeapCheck: Low-cost Hardware Support for Memory Safety	5
Low-precision Logarithmic Number Systems	5
Polyhedral Specification and Code Generation of Sparse Tensor Contraction with Co-iteration	5
Scale-out Systolic Arrays	5
SIMT-X	5
E-BATCH: Energy-Efficient and High-Throughput RNN Batching	5
Autotuning Convolutions Is Easier Than You Think	5
MC-DeF	5
Energy-efficient In-Memory Address Calculation	5
A Reusable Characterization of the Memory System Behavior of SPEC2017 and SPEC2006	5
Optimizing Small-Sample Disk Fault Detection Based on LSTM-GAN Model	5
Effective Loop Fusion in Polyhedral Compilation Using Fusion Conflict Graphs	5
Cooperative Software-hardware Acceleration of K-means on a Tightly Coupled CPU-FPGA System	5
Understanding Cache Compression	4
An FPGA-based Approach to Evaluate Thermal and Resource Management Strategies of Many-core Processors	4
FastPath_MP	4
Application-Specific Arithmetic in High-Level Synthesis Tools	4
GraphPEG	4
Zeroploit	4
GraphAttack	4
On Architectural Support for Instruction Set Randomization	4
Practical Software-Based Shadow Stacks on x86-64	4
Preserving Addressability Upon GC-Triggered Data Movements on Non-Volatile Memory	4
MemSZ	4
MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation	4
Early Address Prediction	3
Gretch	3
CASHT: Contention Analysis in Shared Hierarchies with Thefts	3
Systems-on-Chip with Strong Ordering	3
A Pressure-Aware Policy for Contention Minimization on Multicore Systems	3
LargeGraph	3
Performance and Power Prediction for Concurrent Execution on GPUs	3
SPX64	3
Spiking Neural Networks in Spintronic Computational RAM	3
SecNVM: An Efficient and Write-Friendly Metadata Crash Consistency Scheme for Secure NVM	3
Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache	3
GPU Domain Specialization via Composable On-Package Architecture	3
Refresh Triggered Computation	3
Just-In-Time Compilation on ARM—A Closer Look at Call-Site Code Consistency	2
On Predictable Reconfigurable System Design	2
Automatic Sublining for Efficient Sparse Memory Accesses	2
RegCPython: A Register-based Python Interpreter for Better Performance	2
ASA: Accelerating Sparse Accumulation in Column-wise SpGEMM	2
PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service	2
Low-power Near-data Instruction Execution Leveraging Opcode-based Timing Analysis	2
DisGCo	2
Unified Buffer: Compiling Image Processing and Machine Learning Applications to Push-Memory Accelerators	2
YaConv: Convolution with Low Cache Footprint	2
Cryptographic Software IP Protection without Compromising Performance or Timing Side-channel Leakage	2
An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism	2
PICO	2
ECOTLB	2
NNBench-X	2
User-driven Online Kernel Fusion for SYCL	2
GiantVM: A Novel Distributed Hypervisor for Resource Aggregation with DSM-aware Optimizations	2
SortCache	2
Memory-Aware Functional IR for Higher-Level Synthesis of Accelerators	2
Leveraging Value Equality Prediction for Value Speculation	2
COX: Exposing CUDA Warp-level Functions to CPUs	2
Network Interface Architecture for Remote Indirect Memory Access (RIMA) in Datacenters	2
Irregular Register Allocation for Translation of Test-pattern Programs	2
Locality-Aware CTA Scheduling for Gaming Applications	2
FlexHM: A Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations	2
A Model-Based Software Solution for Simultaneous Multiple Kernels on GPUs	2
MetaSys: A Practical Open-source Metadata Management System to Implement and Evaluate Cross-layer Optimizations	2
Scenario-Aware Program Specialization for Timing Predictability	2
A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks	2
Triangle Dropping: An Occluded-geometry Predictor for Energy-efficient Mobile GPUs	2
Design and Evaluation of an Ultra Low-power Human-quality Speech Recognition System	2
WaFFLe	2
A Distributed Hardware Monitoring System for Runtime Verification on Multi-Tile MPSoCs	1
Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure	1
SHASTA	1
An Application-oblivious Memory Scheduling System for DNN Accelerators	1
Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance	1
SMT-Based Contention-Free Task Mapping and Scheduling on 2D/3D SMART NoC with Mixed Dimension-Order Routing	1
Efficient Nearest-Neighbor Data Sharing in GPUs	1
MAPPER: Managing Application Performance via Parallel Efficiency Regulation	1
Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions	1
Reliability-aware Garbage Collection for Hybrid HBM-DRAM Memories	1
Towards Enhanced System Efficiency while Mitigating Row Hammer	1
Online Application Guidance for Heterogeneous Memory Systems	1
Building a Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage	1
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs	1
Reliability Analysis for Unreliable FSM Computations	1
SGXL	1
Iterative Compilation Optimization Based on Metric Learning and Collaborative Filtering	1
FPDetect	1
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration	1
TokenSmart: Distributed, Scalable Power Management in the Many-core Era	1
ACTION: Adaptive Cache Block Migration in Distributed Cache Architectures	1
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads	1
Accelerating Video Captioning on Heterogeneous System Architectures	1
Source Matching and Rewriting for MLIR Using String-Based Automata	1
SpecTerminator: Blocking Speculative Side Channels Based on Instruction Classes on RISC-V	1
Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks	1
Performance-Energy Trade-off in Modern CMPs	1
PETRA	1
PRISM	1
Improving Memory Efficiency in Heterogeneous MPSoCs through Row-Buffer Locality-aware Forwarding	1
CIB-HIER	1
Cooperative Slack Management: Saving Energy of Multicore Processors by Trading Performance Slack Between QoS-Constrained Applications	1
Runtime Design Space Exploration and Mapping of DCNNs for the Ultra-Low-Power Orlando SoC	1
Reducing Minor Page Fault Overheads through Enhanced Page Walker	1