ACM Transactions on Architecture and Code Optimization

Papers
(The median citation count of ACM Transactions on Architecture and Code Optimization is 1. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2021-05-01 to 2025-05-01.)
Article | Citations
An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication | 37
Object Intersection Captures on Interactive Apps to Drive a Crowd-sourced Replay-based Compiler Optimization | 32
Spiking Neural Networks in Spintronic Computational RAM | 31
ASM: An Adaptive Secure Multicore for Co-located Mutually Distrusting Processes | 30
TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency | 29
TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework | 25
An Intelligent Scheduling Approach on Mobile OS for Optimizing UI Smoothness and Power | 23
Highly Efficient Self-checking Matrix Multiplication on Tiled AMX Accelerators | 18
ModNEF: An Open Source Modular Neuromorphic Emulator for FPGA for Low-Power In-Edge Artificial Intelligence | 17
DCMA: Accelerating Parallel DMA Transfers with a Multi-Port Direct Cached Memory Access in a Massive-Parallel Vector Processor | 16
Building a Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage | 16
Source Matching and Rewriting for MLIR Using String-Based Automata | 15
SIMD-Matcher: A SIMD-based Arbitrary Matching Framework | 15
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs | 14
Tiaozhuan: A General and Efficient Indirect Branch Optimization for Binary Translation | 14
COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign | 14
A Concise Concurrent B+-Tree for Persistent Memory | 13
Locality-Aware CTA Scheduling for Gaming Applications | 13
Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product | 12
DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster | 12
Accelerating Video Captioning on Heterogeneous System Architectures | 12
Domain-Specific Multi-Level IR Rewriting for GPU | 12
A NUMA-Aware Version of an Adaptive Self-Scheduling Loop Scheduler | 11
iSwap: A New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments | 11
COX: Exposing CUDA Warp-level Functions to CPUs | 10
GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems | 10
Quantifying Resource Contention of Co-located Workloads with the System-level Entropy | 10
AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUs | 10
Accelerating Nearest Neighbor Search in 3D Point Cloud Registration on GPUs | 9
ODGS: Dependency-Aware Scheduling for High-Level Synthesis with Graph Neural Network and Reinforcement Learning | 9
Flexible and Effective Object Tiering for Heterogeneous Memory Systems | 9
Joint Program and Layout Transformations to Enable Convolutional Operators on Specialized Hardware Based on Constraint Programming | 8
An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism | 8
SnsBooster: Enhancing Sampling-Based μArch Evaluation Efficiency through Online Performance Sensitivity Analysis | 8
BridgeGC: An Efficient Cross-Level Garbage Collector for Big Data Frameworks | 7
Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping | 7
EXPERTISE: An Effective Software-level Redundant Multithreading Scheme against Hardware Faults | 7
NEM-GNN: DAC/ADC-less, Scalable, Reconfigurable, Graph and Sparsity-Aware Near-Memory Accelerator for Graph Neural Networks | 6
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion | 6
Understanding Cache Compression | 6
TPRepair: Tree-based Pipelined Repair in Clustered Storage Systems | 6
A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks | 6
Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture | 6
Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions | 6
DTAP: Accelerating Strongly-Typed Programs with Data Type-Aware Hardware Prefetching | 5
Environmental Condition Aware Super-Resolution Acceleration Framework in Server-Client Hierarchies | 5
PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service | 5
Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals | 5
Low-power Near-data Instruction Execution Leveraging Opcode-based Timing Analysis | 5
Towards High Performance QNNs via Distribution-Based CNOT Gate Reduction | 5
Low I/O Intensity-aware Partial GC Scheduling to Reduce Long-tail Latency in SSDs | 5
System-level Early-stage Modeling and Evaluation of IVR-assisted Processor Power Delivery System | 5
RaNAS: Resource-Aware Neural Architecture Search for Edge Computing | 5
HyGain: High-performance, Energy-efficient Hybrid Gain Cell-based Cache Hierarchy | 5
Multi-objective Hardware-aware Neural Architecture Search with Pareto Rank-preserving Surrogate Models | 5
HEngine: A High Performance Optimization Framework on a GPU for Homomorphic Encryption | 5
xMeta: SSD-HDD-hybrid Optimization for Metadata Maintenance of Cloud-scale Object Storage | 4
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs | 4
ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes | 4
Byte-Select Compression | 4
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing | 4
Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks | 4
GraphTune: An Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph Processing | 4
A Stable Idle Time Detection Platform for Real I/O Workloads | 4
Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs | 4
Improving Utilization of Dataflow Unit for Multi-Batch Processing | 4
OptiFX: Automatic Optimization for Convolutional Neural Networks with Aggressive Operator Fusion on GPUs | 4
Sniper: Exploiting Spatial and Temporal Sampling for Large-Scale Performance Analysis | 4
FlexHM: A Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations | 4
Koala: Efficient Pipeline Training through Automated Schedule Searching on Domain-Specific Language | 3
JiuJITsu: Removing Gadgets with Safe Register Allocation for JIT Code Generation | 3
BullsEye: Scalable and Accurate Approximation Framework for Cache Miss Calculation | 3
Shift-CIM: In-SRAM Alignment To Support General-Purpose Bit-level Sparsity Exploration in SRAM Multiplication | 3
Architecting Optically Controlled Phase Change Memory | 3
Preserving Addressability Upon GC-Triggered Data Movements on Non-Volatile Memory | 3
Abakus: Accelerating k-mer Counting with Storage Technology | 3
WIPE: A Write-Optimized Learned Index for Persistent Memory | 3
CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling | 3
Architectural Support for Sharing, Isolating and Virtualizing FPGA Resources | 3
Scale-out Systolic Arrays | 3
MicroProf: Code-level Attribution of Unnecessary Data Transfer in Microservice Applications | 3
CoNST: Code Generator for Sparse Tensor Networks | 3
Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache | 3
CASHT: Contention Analysis in Shared Hierarchies with Thefts | 3
TSN Cache: Exploiting Data Localities in Graph Computing Applications | 3
Automatic Sublining for Efficient Sparse Memory Accesses | 3
All-gather Algorithms Resilient to Imbalanced Process Arrival Patterns | 3
FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler | 3
SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs | 3
An Optimized GPU Implementation for GIST Descriptor | 2
A Case For Intra-rack Resource Disaggregation in HPC | 2
PANDA: Adaptive Prefetching and Decentralized Scheduling for Dataflow Architectures | 2
ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel | 2
Memory-Aware Functional IR for Higher-Level Synthesis of Accelerators | 2
An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU | 2
An FPGA-based Approach to Evaluate Thermal and Resource Management Strategies of Many-core Processors | 2
A Pressure-Aware Policy for Contention Minimization on Multicore Systems | 2
E-BATCH: Energy-Efficient and High-Throughput RNN Batching | 2
The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead | 2
SMT-Based Contention-Free Task Mapping and Scheduling on 2D/3D SMART NoC with Mixed Dimension-Order Routing | 2
GraphService: Topology-aware Constructor for Large-scale Graph Applications | 2
Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access | 2
PAVER | 2
Jointly Optimizing Job Assignment and Resource Partitioning for Improving System Throughput in Cloud Datacenters | 2
Iterating Pointers: Enabling Static Analysis for Loop-based Pointers | 2
TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware Scheduling | 2
Register-Pressure-Aware Instruction Scheduling Using Ant Colony Optimization | 2
SSD-SGD: Communication Sparsification for Distributed Deep Learning Training | 2
PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation | 2
Puppeteer: A Random Forest Based Manager for Hardware Prefetchers Across the Memory Hierarchy | 2
Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance | 2
PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on Heterogeneous Systems | 2
High-performance Deterministic Concurrency Using Lingua Franca | 2
SAL: Optimizing the Dataflow of Spin-based Architectures for Lightweight Neural Networks | 2
Towards Enhanced System Efficiency while Mitigating Row Hammer | 2
Consequence-based Clustered Architecture | 2
A Data-Loader Tunable Knob to Shorten GPU Idleness for Distributed Deep Learning | 2
The Forward Slice Core: A High-Performance, Yet Low-Complexity Microarchitecture | 2
MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation | 2
Data Deduplication Based on Content Locality of Transactions to Enhance Blockchain Scalability | 2
Optimization of Sparse Matrix Computation for Algebraic Multigrid on GPUs | 1
Lock-Free High-performance Hashing for Persistent Memory via PM-aware Holistic Optimization | 1
Conflict Management in Vector Register Files | 1
A Lock-free RDMA-friendly Index in CPU-parsimonious Environments | 1
KernelFaRer | 1
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration | 1
Gem5-X | 1
Turn-based Spatiotemporal Coherence for GPUs | 1
VersaTile: Flexible Tiled Architectures via Associative Processors | 1
Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints | 1
Low-precision Logarithmic Number Systems | 1
Compiler Support for Sparse Tensor Computations in MLIR | 1
Symbolic Analysis for Data Plane Programs Specialization | 1
Assessing the Impact of Compiler Optimizations on GPUs Reliability | 1
MetaSys: A Practical Open-source Metadata Management System to Implement and Evaluate Cross-layer Optimizations | 1
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads | 1
IBing: An Efficient Interleaved Bidirectional Ring All-Reduce Algorithm for Gradient Synchronization | 1
PICO | 1
Solving Sparse Assignment Problems on FPGAs | 1
ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors | 1
Hardware-hardened Sandbox Enclaves for Trusted Serverless Computing | 1
Bubble-Swap Flow Control | 1
LitTLS: Lightweight Thread-Level Speculation on Little Cores | 1
Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks | 1
ShieldCXL: A Practical Obliviousness Support with Sealed CXL Memory | 1
ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads | 1
SuccinctKV: a CPU-efficient LSM-tree Based KV Store with Scan-based Compaction | 1
MUA-Router: Maximizing the Utility-of-Allocation for On-chip Pipelining Routers | 1
ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection | 1
Unleashing Parallelism with Elastic-Barriers | 1
Understanding Silent Data Corruption in Processors for Mitigating its Effects | 1
CIB-HIER | 1
GPU Domain Specialization via Composable On-Package Architecture | 1
PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM | 1
SPIRIT: Scalable and Persistent In-Memory Indices for Real-Time Search | 1
Overlapping Aware Data Placement Optimizations for LSM Tree-Based Store on ZNS SSDs | 1
QuCloud+: A Holistic Qubit Mapping Scheme for Single/Multi-programming on 2D/3D NISQ Quantum Computers | 1
Constructing a Supplementary Benchmark Suite to Represent Android Applications with User Interactions by using Performance Counters | 1
Supporting Dynamic Program Sizes in Deep Learning-Based Cost Models for Code Optimization | 1
Fast One-Sided RDMA-Based State Machine Replication for Disaggregated Memory | 1
SAC: An Ultra-Efficient Spin-based Architecture for Compressed DNNs | 1
Unveiling and Evaluating Vulnerabilities in Branch Predictors via a Three-Step Modeling Methodology | 1
SpecTerminator: Blocking Speculative Side Channels Based on Instruction Classes on RISC-V | 1
Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs | 1
Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing | 1
HAIR: Halving the Area of the Integer Register File with Odd/Even Banking | 1
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-efficient Genome Analysis | 1
Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication | 1
Critical Data Backup with Hybrid Flash-Based Consumer Devices | 1
GraphAttack | 1
Optimizing Garbage Collection for ZNS SSDs via In-storage Data Migration and Address Remapping | 1
WaFFLe | 1
GiantVM: A Novel Distributed Hypervisor for Resource Aggregation with DSM-aware Optimizations | 1
PARADISE: Criticality-Aware Instruction Reordering for Power Attack Resistance | 1
CARL: Compiler Assigned Reference Leasing | 1