ACM Transactions on Architecture and Code Optimization

Papers
(The TQCC of ACM Transactions on Architecture and Code Optimization is 4. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2022-05-01 to 2026-05-01.)
Article | Citations
TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency | 52
Object Intersection Captures on Interactive Apps to Drive a Crowd-sourced Replay-based Compiler Optimization | 44
ASM: An Adaptive Secure Multicore for Co-located Mutually Distrusting Processes | 30
Performance, Energy and NVM Lifetime-Aware Data Structure Refinement and Placement for Heterogeneous Memory Systems | 28
TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework | 28
Highly Efficient Self-checking Matrix Multiplication on Tiled AMX Accelerators | 27
ESMPC: An Efficient Neural Network Training Framework for Secure Two- and Three-Party Computation | 26
ModNEF: An Open Source Modular Neuromorphic Emulator for FPGA for Low-Power In-Edge Artificial Intelligence | 26
Accelerating Verifiable Queries over Blockchain Database System Using Processing-in-memory | 25
Intra-request Lag-aware Cache Management to Enhance I/O Responsiveness of SSDs | 24
Supporting QoS Guarantee in Heterogeneous Object Storage System: A Spatio-Temporal Graph Data Processing Method | 22
An Intelligent Scheduling Approach on Mobile OS for Optimizing UI Smoothness and Power | 22
An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication | 19
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs | 17
A Concise Concurrent B+-Tree for Persistent Memory | 17
COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign | 16
DCMA: Accelerating Parallel DMA Transfers with a Multi-Port Direct Cached Memory Access in a Massive-Parallel Vector Processor | 16
Tiaozhuan: A General and Efficient Indirect Branch Optimization for Binary Translation | 16
Building a Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage | 16
Source Matching and Rewriting for MLIR Using String-Based Automata | 15
SIMD-Matcher: A SIMD-based Arbitrary Matching Framework | 14
Mitigating the Bandwidth Wall via Data-Streaming System–Accelerator Co-Design | 14
A NUMA-Aware Version of an Adaptive Self-Scheduling Loop Scheduler | 13
DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster | 13
iSwap: A New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments | 12
FlashGEMM: Optimizing Sequences of Matrix Multiplication by Exploiting Data Reuse on CPUs | 12
Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product | 11
Accelerating Video Captioning on Heterogeneous System Architectures | 11
Osiris: A Systolic Approach to Accelerating Fully Homomorphic Encryption | 11
MLKAPS: Machine Learning and Adaptive Sampling for HPC Kernel Auto-tuning | 10
ODGS: Dependency-Aware Scheduling for High-Level Synthesis with Graph Neural Network and Reinforcement Learning | 10
COX: Exposing CUDA Warp-level Functions to CPUs | 10
SnsBooster: Enhancing Sampling-based μArch Evaluation Efficiency through Online Performance Sensitivity Analysis | 10
AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUs | 10
BridgeGC: An Efficient Cross-Level Garbage Collector for Big Data Frameworks | 9
Accelerating Parallel Structures in DNNs via Parallel Fusion and Operator Co-Optimization | 9
Quantifying Resource Contention of Co-located Workloads with the System-level Entropy | 9
Flexible and Effective Object Tiering for Heterogeneous Memory Systems | 9
Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping | 9
Accelerating Nearest Neighbor Search in 3D Point Cloud Registration on GPUs | 9
FDSR: Efficient Model Training via Adaptive Tensor Quantization Based on Frequency Domain Division and Similarity Data Reuse | 9
EXPERTISE: An Effective Software-level Redundant Multithreading Scheme against Hardware Faults | 9
Towards high scalability and fine-grained parallelism on distributed HPC platforms | 9
An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism | 9
GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems | 9
NEM-GNN: DAC/ADC-less, Scalable, Reconfigurable, Graph and Sparsity-Aware Near-Memory Accelerator for Graph Neural Networks | 8
Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture | 8
HyGain: High-performance, Energy-efficient Hybrid Gain Cell-based Cache Hierarchy | 8
CLAP: Cross-Layer Adaptive Pipelining Inference Scheduling for Resource-Efficient Edge-Cloud Vision Systems | 8
Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions | 8
Optimizing Attention for Large Language Model Inference on the MT-3000 Many-Core Processor | 8
A Step toward Stateful HW-SW Migration: An Architecture-agnostic Checkpointing-rollback Toolchain | 8
A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks | 8
HEngine: A High Performance Optimization Framework on a GPU for Homomorphic Encryption | 7
TPRepair: Tree-based Pipelined Repair in Clustered Storage Systems | 7
Multi-objective Hardware-aware Neural Architecture Search with Pareto Rank-preserving Surrogate Models | 7
WSGraph: A Framework for Tackling Redundant and Irregular Data Access in Streaming Graph Processing | 7
Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals | 7
Towards High Performance QNNs via Distribution-Based CNOT Gate Reduction | 7
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion | 7
Environmental Condition Aware Super-Resolution Acceleration Framework in Server-Client Hierarchies | 7
A Decoupled Analytical Model for Tile Size Selection in Affine Programs | 7
RaNAS: Resource-Aware Neural Architecture Search for Edge Computing | 7
DTAP: Accelerating Strongly-Typed Programs with Data Type-Aware Hardware Prefetching | 7
PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service | 7
Enabling Low-Latency, GPU-Efficient Serverless Inference with Model Swapping | 7
PctoDL: Adaptive GPU Throughput Optimization for Deep Learning Inference with Power Constraints | 7
A Stable Idle Time Detection Platform for Real I/O Workloads | 6
RACER: Avoiding End-to-End Slowdowns in Accelerated Chip Multi-Processors | 6
HiSo: Co-optimizing the Intra-layer and Inter-layer Scheduling Schemes with the Hybrid Data Flow for PIM Architectures | 6
Toward Comprehensive Design Space Exploration on Heterogeneous Multi-core Processors | 6
Mobile-3DCNN: An Acceleration Framework for Ultra-Real-Time Execution of Large 3D CNNs on Mobile Devices | 6
EDAS: Enabling Fast Data Loading for GPU Serverless Computing | 6
gECC: A GPU-based high-throughput framework for Elliptic Curve Cryptography | 6
Lightweight Code Outlining for Android Applications | 6
SimTrace: Exploiting Spatial and Temporal Sampling for Large-Scale Performance Analysis | 6
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing | 6
Towards Optimizing Learned Index for High Performance, Memory Efficiency and NUMA Awareness | 6
Pac-PIM: A Parallel Communication Framework for Commodity Processing-in-memory Systems | 6
A Memory-Aware Sparse Matrix-Matrix Multiplication on Multicore Architectures | 6
JiuJITsu: Removing Gadgets with Safe Register Allocation for JIT Code Generation | 5
Capability-Based Efficient Data Transmission Mechanism for Serverless Computing | 5
OptiFX: Automatic Optimization for Convolutional Neural Networks with Aggressive Operator Fusion on GPUs | 5
Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks | 5
Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs | 5
Efficient and Scalable Hybrid Parallelization of Unstructured Computational Fluid Dynamics with Geometric Multigrid | 5
Shift-CIM: In-SRAM Alignment To Support General-Purpose Bit-level Sparsity Exploration in SRAM Multiplication | 5
WIPE: A Write-Optimized Learned Index for Persistent Memory | 5
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs | 5
Improving Utilization of Dataflow Unit for Multi-Batch Processing | 5
Accelerating the Simulation of Parallel Workloads using Loop-Bounded Checkpoints | 5
CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling | 5
TSN Cache: Exploiting Data Localities in Graph Computing Applications | 5
GraphTune: An Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph Processing | 5
FlexHM: A Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations | 5
xMeta: SSD-HDD-hybrid Optimization for Metadata Maintenance of Cloud-scale Object Storage | 5
Performance Prediction of Concurrent DNN Training Tasks in GPU Spatial Sharing Environments | 5
CGCGraph: Efficient CPU-GPU Co-execution for Concurrent Dynamic Graph Processing | 5
FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler | 4
An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU | 4
Koala: Efficient Pipeline Training through Automated Schedule Searching on Domain-Specific Language | 4
Scale-out Systolic Arrays | 4
Address/Data Instruction Steering in Clustered General Purpose Processors | 4
BLG-Tuning: Benchmark-Based Low-Cost General-Purpose I/O Modeling and Tuning | 4
Iterating Pointers: Enabling Static Analysis for Loop-based Pointers | 4
3D GNLM: Efficient 3D Non-Local Means Kernel with Nested Reuse Strategies for Embedded GPUs | 4
Architecting Optically Controlled Phase Change Memory | 4
Efficient Flexible Edge Inference for Mixed-Precision Quantized DNN using Customized RISC-V Core | 4
RaKV: A Write-Optimized LSM Store for Cloud Block Storage with Robust SLA | 4
Rethinking Variable-Length Encoding: Exploiting Bit Sparsity for Parallel Decoding in LLM Accelerators | 4
Consequence-based Clustered Architecture | 4
Matrix: Multi-Cipher Structures Dataflow for Parallel and Pipelined TFHE Accelerator | 4
Optimizing OpenCL Barrier Synchronization and Memory Efficiency on Multi-Core DSPs | 4
Architectural Support for Sharing, Isolating and Virtualizing FPGA Resources | 4
MetaEC: An Efficient and Resilient Erasure-Coded KV Store on Disaggregated Memory | 4
BullsEye: Scalable and Accurate Approximation Framework for Cache Miss Calculation | 4
SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs | 4