ACM Transactions on Architecture and Code Optimization

Papers
(The median citation count of ACM Transactions on Architecture and Code Optimization is 0. The table below lists the papers at or above that threshold, ranked by CrossRef citation count [max. 250 papers]. It covers papers published in the past four years, i.e., from 2021-02-01 to 2025-02-01.)
Article | Citations
Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals | 25
The Droplet Search Algorithm for Kernel Scheduling | 23
Decreasing the Miss Rate and Eliminating the Performance Penalty of a Data Filter Cache | 23
Highly Efficient Self-checking Matrix Multiplication on Tiled AMX Accelerators | 23
The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead | 22
ASM: An Adaptive Secure Multicore for Co-located Mutually Distrusting Processes | 21
SMT-Based Contention-Free Task Mapping and Scheduling on 2D/3D SMART NoC with Mixed Dimension-Order Routing | 20
Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing | 18
Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training | 18
Multiply-and-Fire: An Event-Driven Sparse Neural Network Accelerator | 15
PETRA | 13
Puppeteer: A Random Forest Based Manager for Hardware Prefetchers Across the Memory Hierarchy | 13
Object Intersection Captures on Interactive Apps to Drive a Crowd-sourced Replay-based Compiler Optimization | 13
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing | 12
MUA-Router: Maximizing the Utility-of-Allocation for On-chip Pipelining Routers | 12
An Intelligent Scheduling Approach on Mobile OS for Optimizing UI Smoothness and Power | 12
The Forward Slice Core: A High-Performance, Yet Low-Complexity Microarchitecture | 12
FlexPointer: Fast Address Translation Based on Range TLB and Tagged Pointers | 11
ASA: Accelerating Sparse Accumulation in Column-wise SpGEMM | 10
ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes | 10
Spiking Neural Networks in Spintronic Computational RAM | 9
MasterPlan: A Reinforcement Learning Based Scheduler for Archive Storage | 9
Performance Evaluation of Intel Optane Memory for Managed Workloads | 9
An Application-oblivious Memory Scheduling System for DNN Accelerators | 9
TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency | 8
An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication | 8
D²Comp: Efficient Offload of LSM-tree Compaction with Data Processing Units on Disaggregated Storage | 8
A Survey of General-purpose Polyhedral Compilers | 7
XEngine: Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments | 7
MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation | 7
Data Deduplication Based on Content Locality of Transactions to Enhance Blockchain Scalability | 7
PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation | 6
RACE: An Efficient Redundancy-aware Accelerator for Dynamic Graph Neural Network | 6
An Optimized GPU Implementation for GIST Descriptor | 6
A Stable Idle Time Detection Platform for Real I/O Workloads | 6
Potamoi: Accelerating Neural Rendering via a Unified Streaming Architecture | 6
GraphTune: An Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph Processing | 6
Improving Utilization of Dataflow Unit for Multi-Batch Processing | 5
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs | 5
QuCloud+: A Holistic Qubit Mapping Scheme for Single/Multi-programming on 2D/3D NISQ Quantum Computers | 5
Byte-Select Compression | 5
WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs | 5
Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs | 5
KernelFaRer | 5
GRAM | 5
Source Matching and Rewriting for MLIR Using String-Based Automata | 5
SSD-SGD: Communication Sparsification for Distributed Deep Learning Training | 5
Fast Key-Value Lookups with Node Tracker | 5
RegCPython: A Register-based Python Interpreter for Better Performance | 5
Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks | 5
TEA+: A Novel Temporal Graph Random Walk Engine with Hybrid Storage Architecture | 4
gHyPart: GPU-friendly End-to-End Hypergraph Partitioner | 4
Energy-efficient In-Memory Address Calculation | 4
KINDRED: Heterogeneous Split-Lock Architecture for Safe Autonomous Machines | 4
Enhancing High-Throughput GPU Random Walks Through Multi-Task Concurrency Orchestration | 4
xMeta: SSD-HDD-hybrid Optimization for Metadata Maintenance of Cloud-scale Object Storage | 4
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs | 4
Building a Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage | 3
ReuseTracker: Fast Yet Accurate Multicore Reuse Distance Analyzer | 3
Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure | 3
Leveraging the Hardware Resources to Accelerate cryo-EM Reconstruction of RELION on the New Sunway Supercomputer | 3
PARADISE: Criticality-Aware Instruction Reordering for Power Attack Resistance | 3
CoMeT: An Integrated Interval Thermal Simulation Toolchain for 2D, 2.5D, and 3D Processor-Memory Systems | 3
A Concise Concurrent B+-Tree for Persistent Memory | 3
SLAP: Segmented Reuse-Time-Label Based Admission Policy for Content Delivery Network Caching | 3
LiteCON: An All-photonic Neuromorphic Accelerator for Energy-efficient Deep Learning | 3
Tiaozhuan: A General and Efficient Indirect Branch Optimization for Binary Translation | 3
SAC: An Ultra-Efficient Spin-based Architecture for Compressed DNNs | 3
DAG-Order: An Order-Based Dynamic DAG Scheduling for Real-Time Networks-on-Chip | 3
SpecTerminator: Blocking Speculative Side Channels Based on Instruction Classes on RISC-V | 3
FlexHM: A Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations | 3
SIMD-Matcher: A SIMD-based Arbitrary Matching Framework | 3
Conflict Management in Vector Register Files | 3
Optimization of Sparse Matrix Computation for Algebraic Multigrid on GPUs | 3
PICO | 2
Architecting Optically Controlled Phase Change Memory | 2
Pac-Sim: Simulation of Multi-threaded Workloads using Intelligent, Live Sampling | 2
Access Characteristic-Guided Remote Swapping Across Mobile Devices | 2
AIS: An Active Idleness I/O Scheduler to Reduce Buffer-Exhausted Degradation of Solid-State Drives | 2
ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors | 2
LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs | 2
Phronesis: Efficient Performance Modeling for High-dimensional Configuration Tuning | 2
Extension VM: Interleaved Data Layout in Vector Memory | 2
SuccinctKV: a CPU-efficient LSM-tree Based KV Store with Scan-based Compaction | 2
HAIR: Halving the Area of the Integer Register File with Odd/Even Banking | 2
PMGraph: Accelerating Concurrent Graph Queries over Streaming Graphs | 2
WIPE: A Write-Optimized Learned Index for Persistent Memory | 2
DELTA: Memory-Efficient Training via Dynamic Fine-Grained Recomputation and Swapping | 2
Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks | 2
COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign | 2
Bubble-Swap Flow Control | 2
GPU Domain Specialization via Composable On-Package Architecture | 2
Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product | 2
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads | 2
Domain-Specific Multi-Level IR Rewriting for GPU | 2
A²: Towards Accelerator Level Parallelism for Autonomous Micromobility Systems | 2
PRAGA: A Priority-Aware Hardware/Software Co-design for High-Throughput Graph Processing Acceleration | 2
Fixed-point Encoding and Architecture Exploration for Residue Number Systems | 2
Characterizing and Optimizing LDPC Performance on 3D NAND Flash Memories | 2
Achieving Tunable Erasure Coding with Cluster-Aware Redundancy Transitioning | 2
exZNS: Extending Zoned Namespace to Support Byte-loggable Zones | 2
Early Address Prediction | 2
Understanding Silent Data Corruption in Processors for Mitigating its Effects | 2
ShieldCXL: A Practical Obliviousness Support with Sealed CXL Memory | 1
Automatic Sublining for Efficient Sparse Memory Accesses | 1
Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints | 1
ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection | 1
Smart-DNN+: A Memory-efficient Neural Networks Compression Framework for the Model Inference | 1
JiuJITsu: Removing Gadgets with Safe Register Allocation for JIT Code Generation | 1
Locality-Aware CTA Scheduling for Gaming Applications | 1
MPU: Memory-centric SIMT Processor via In-DRAM Near-bank Computing | 1
Lavender: An Efficient Resource Partitioning Framework for Large-Scale Job Colocation | 1
Maximizing Data and Hardware Reuse for HLS with Early-Stage Symbolic Partitioning | 1
A NUMA-Aware Version of an Adaptive Self-Scheduling Loop Scheduler | 1
DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration | 1
Towards Enhanced System Efficiency while Mitigating Row Hammer | 1
ULEEN: A Novel Architecture for Ultra-low-energy Edge Neural Networks | 1
Scale-out Systolic Arrays | 1
DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster | 1
Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey | 1
Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication | 1
Reducing Minor Page Fault Overheads through Enhanced Page Walker | 1
Knowledge-Augmented Mutation-Based Bug Localization for Hardware Design Code | 1
SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs | 1
Compiler Support for Sparse Tensor Computations in MLIR | 1
CASHT: Contention Analysis in Shared Hierarchies with Thefts | 1
An Efficient ReRAM-based Accelerator for Asynchronous Iterative Graph Processing | 1
GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems | 1
Taming Flexible Job Packing in Deep Learning Training Clusters | 1
LargeGraph | 1
Gretch | 1
iSwap: A New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments | 1
GraphAttack | 1
SPIRIT: Scalable and Persistent In-Memory Indices for Real-Time Search | 1
Architectural Support for Sharing, Isolating and Virtualizing FPGA Resources | 1
CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling | 1
Task-RM: A Resource Manager for Energy Reduction in Task-Parallel Applications under Quality of Service Constraints | 1
BullsEye: Scalable and Accurate Approximation Framework for Cache Miss Calculation | 1
ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management | 1
Accelerating Video Captioning on Heterogeneous System Architectures | 1
An Optimized Framework for Matrix Factorization on the New Sunway Many-core Platform | 1
Online Application Guidance for Heterogeneous Memory Systems | 1
Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache | 1
Just-In-Time Compilation on ARM—A Closer Look at Call-Site Code Consistency | 1
MST: Topology-Aware Message Aggregation for Exascale Graph Processing of Traversal-Centric Algorithms | 1
Vitruvius+: An Area-Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications | 1
Performance and Power Prediction for Concurrent Execution on GPUs | 1
Intermediate Address Space: virtual memory optimization of heterogeneous architectures for cache-resident workloads | 1
QoS-pro: A QoS-enhanced Transaction Processing Framework for Shared SSDs | 1
High-performance Deterministic Concurrency Using Lingua Franca | 0
Agile C-states: A Core C-state Architecture for Latency Critical Applications Optimizing both Transition and Cold-Start Latency | 0
Critical Data Backup with Hybrid Flash-Based Consumer Devices | 0
Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks | 0
PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on Heterogeneous Systems | 0
GiantVM: A Novel Distributed Hypervisor for Resource Aggregation with DSM-aware Optimizations | 0
PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM | 0
A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks | 0
HyGain: High-performance, Energy-efficient Hybrid Gain Cell-based Cache Hierarchy | 0
Symbolic Analysis for Data Plane Programs Specialization | 0
Using Barrier Elision to Improve Transactional Code Generation | 0
Device Hopping | 0
HeapCheck: Low-cost Hardware Support for Memory Safety | 0
SAL: Optimizing the Dataflow of Spin-based Architectures for Lightweight Neural Networks | 0
All-gather Algorithms Resilient to Imbalanced Process Arrival Patterns | 0
A Reusable Characterization of the Memory System Behavior of SPEC2017 and SPEC2006 | 0
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration | 0
Cost-aware Service Placement and Scheduling in the Edge-Cloud Continuum | 0
RaNAS: Resource-Aware Neural Architecture Search for Edge Computing | 0
Steered Bubble: An Interposer-based Deadlock Recovery Algorithm for Multi-chiplet Systems | 0
Occam: Optimal Data Reuse for Convolutional Neural Networks | 0
ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel | 0
Towards High Performance QNNs via Distribution-Based CNOT Gate Reduction | 0
SecNVM: An Efficient and Write-Friendly Metadata Crash Consistency Scheme for Secure NVM | 0
IBing: An Efficient Interleaved Bidirectional Ring All-Reduce Algorithm for Gradient Synchronization | 0
A Case For Intra-rack Resource Disaggregation in HPC | 0
On Predictable Reconfigurable System Design | 0
Cross-core Data Sharing for Energy-efficient GPUs | 0
Cooperative Slack Management: Saving Energy of Multicore Processors by Trading Performance Slack Between QoS-Constrained Applications | 0
MicroProf: Code-level Attribution of Unnecessary Data Transfer in Microservice Applications | 0
Optimizing Garbage Collection for ZNS SSDs via In-storage Data Migration and Address Remapping | 0
Weaving Synchronous Reactions into the Fabric of SSA-form Compilers | 0
Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs | 0
GraphService: Topology-aware Constructor for Large-scale Graph Applications | 0
GraphPEG | 0
Environmental Condition Aware Super-Resolution Acceleration Framework in Server-Client Hierarchies | 0
User-driven Online Kernel Fusion for SYCL | 0
Multiple Function Merging for Code Size Reduction | 0
Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance | 0
Multi-objective Hardware-aware Neural Architecture Search with Pareto Rank-preserving Surrogate Models | 0
TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware Scheduling | 0
Unified Buffer: Compiling Image Processing and Machine Learning Applications to Push-Memory Accelerators | 0
EA4RCA: Efficient AIE accelerator design framework for regular Communication-Avoiding Algorithm | 0
Unveiling and Evaluating Vulnerabilities in Branch Predictors via a Three-Step Modeling Methodology | 0
Consequence-based Clustered Architecture | 0
CIB-HIER | 0
PRISM | 0
Hyperion: A Highly Effective Page and PC Based Delta Prefetcher | 0
An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism | 0
Camouflage: Utility-Aware Obfuscation for Accurate Simulation of Sensitive Program Traces | 0
TPRepair: Tree-Based Pipelined Repair in Clustered Storage Systems | 0
COWS for High Performance: Cost Aware Work Stealing for Irregular Parallel Loop | 0
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion | 0
MAPPER: Managing Application Performance via Parallel Efficiency Regulation | 0
Cache Programming for Scientific Loops Using Leases | 0
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-efficient Genome Analysis | 0
A Data-Loader Tunable Knob to Shorten GPU Idleness for Distributed Deep Learning | 0
YaConv: Convolution with Low Cache Footprint | 0
PERI | 0
An Instruction Inflation Analyzing Framework for Dynamic Binary Translators | 0
DTAP: Accelerating Strongly-Typed Programs with Data Type-Aware Hardware Prefetching | 0
Time-Aware Spectrum-Based Bug Localization for Hardware Design Code with Data Purification | 0
Fine-grain Quantitative Analysis of Demand Paging in Unified Virtual Memory | 0
Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses | 0
Low-power Near-data Instruction Execution Leveraging Opcode-based Timing Analysis | 0
Grus | 0
Shining Light on the Inter-procedural Code Obfuscation: Keep Pace with Progress in Binary Diffing | 0
PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service | 0
Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access | 0
Characterizing Multi-Chip GPU Data Sharing | 0
MetaSys: A Practical Open-source Metadata Management System to Implement and Evaluate Cross-layer Optimizations | 0
Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators | 0
Constructing a Supplementary Benchmark Suite to Represent Android Applications with User Interactions by using Performance Counters | 0
E-BATCH: Energy-Efficient and High-Throughput RNN Batching | 0
Dependence-aware Slice Execution to Boost MLP in Slice-out-of-order Cores | 0
An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU | 0
Acceleration of Parallel-Blocked QR Decomposition of Tall-and-Skinny Matrices on FPGAs | 0
System-level Early-stage Modeling and Evaluation of IVR-assisted Processor Power Delivery System | 0
GCNTrain+: A Versatile and Efficient Accelerator for Graph Convolutional Neural Network Training | 0
A Non-Intrusive Tool Chain to Optimize MPSoC End-to-End Systems | 0
Turn-based Spatiotemporal Coherence for GPUs | 0
Low-precision Logarithmic Number Systems | 0
Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems | 0
COVER: Alleviating Crash-Consistency Error Amplification in Secure Persistent Memory Systems | 0
LIA: Latency-Improved Adaptive routing for Dragonfly networks | 0
COX: Exposing CUDA Warp-level Functions to CPUs | 0
Low I/O Intensity-aware Partial GC Scheduling to Reduce Long-tail Latency in SSDs | 0
Scenario-Aware Program Specialization for Timing Predictability | 0
Rcmp: Reconstructing RDMA-Based Memory Disaggregation via CXL | 0
A High Scalability Memory NoC with Shared-Inside Hierarchical-Groupings for Triplet-Based Many-Core Architecture | 0
Understanding Cache Compression | 0
CacheInspector | 0
Register-Pressure-Aware Instruction Scheduling Using Ant Colony Optimization | 0
Preserving Addressability Upon GC-Triggered Data Movements on Non-Volatile Memory | 0
Characterizing and Understanding HGNN Training on GPUs | 0
A Pressure-Aware Policy for Contention Minimization on Multicore Systems | 0
FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler | 0
Polyhedral Specification and Code Generation of Sparse Tensor Contraction with Co-iteration | 0
CARL: Compiler Assigned Reference Leasing | 0
Iterative Compilation Optimization Based on Metric Learning and Collaborative Filtering | 0