ACM Transactions on Architecture and Code Optimization

Papers
(The median citation count of ACM Transactions on Architecture and Code Optimization is 1. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2020-11-01 to 2024-11-01.)
Article | Citations
SMAUG | 39
IR2Vec | 39
Domain-Specific Multi-Level IR Rewriting for GPU | 25
A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures | 25
Grus | 23
PERI | 21
SLO-Aware Inference Scheduler for Heterogeneous Processors in Edge Platforms | 21
A Black-box Monitoring Approach to Measure Microservices Runtime Performance | 20
Compiler Support for Sparse Tensor Computations in MLIR | 20
LLOV | 19
A Case For Intra-rack Resource Disaggregation in HPC | 18
Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks | 18
Vitruvius+: An Area-Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications | 18
PAVER | 17
Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators | 15
Efficient Auto-Tuning of Parallel Programs with Interdependent Tuning Parameters via Auto-Tuning Framework (ATF) | 15
PolyDL | 13
Low I/O Intensity-aware Partial GC Scheduling to Reduce Long-tail Latency in SSDs | 13
Gem5-X | 13
CoMeT: An Integrated Interval Thermal Simulation Toolchain for 2D, 2.5D, and 3D Processor-Memory Systems | 12
PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM | 12
KernelFaRer | 12
A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels | 12
An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication | 12
Exploiting Parallelism Opportunities with Deep Learning Frameworks | 11
Architecting Optically Controlled Phase Change Memory | 11
Scale-out Systolic Arrays | 10
On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond | 10
GRAM | 9
Polyhedral Specification and Code Generation of Sparse Tensor Contraction with Co-iteration | 9
Performance Evaluation of Intel Optane Memory for Managed Workloads | 9
Low-precision Logarithmic Number Systems | 9
GEVO | 9
ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes | 9
Autotuning Convolutions Is Easier Than You Think | 9
Register-Pressure-Aware Instruction Scheduling Using Ant Colony Optimization | 8
ReuseTracker: Fast Yet Accurate Multicore Reuse Distance Analyzer | 8
Understanding Cache Compression | 8
Bayesian Optimization for Efficient Accelerator Synthesis | 8
Unified Buffer: Compiling Image Processing and Machine Learning Applications to Push-Memory Accelerators | 7
Optimizing Small-Sample Disk Fault Detection Based on LSTM-GAN Model | 7
GraphPEG | 7
A Reusable Characterization of the Memory System Behavior of SPEC2017 and SPEC2006 | 6
FastPath_MP | 6
High-performance Deterministic Concurrency Using Lingua Franca | 6
Performance and Power Prediction for Concurrent Execution on GPUs | 6
HeapCheck: Low-cost Hardware Support for Memory Safety | 6
LargeGraph | 6
Task-RM: A Resource Manager for Energy Reduction in Task-Parallel Applications under Quality of Service Constraints | 6
Practical Software-Based Shadow Stacks on x86-64 | 5
MC-DeF | 5
GraphAttack | 5
GPU Domain Specialization via Composable On-Package Architecture | 5
Spiking Neural Networks in Spintronic Computational RAM | 5
E-BATCH: Energy-Efficient and High-Throughput RNN Batching | 5
A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks | 5
SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs | 5
Refresh Triggered Computation | 5
MemSZ | 5
MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation | 5
Multi-objective Hardware-aware Neural Architecture Search with Pareto Rank-preserving Surrogate Models | 5
WaFFLe | 5
Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache | 5
Gretch | 5
Energy-efficient In-Memory Address Calculation | 5
GiantVM: A Novel Distributed Hypervisor for Resource Aggregation with DSM-aware Optimizations | 4
An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism | 4
ULEEN: A Novel Architecture for Ultra-low-energy Edge Neural Networks | 4
Design and Evaluation of an Ultra Low-power Human-quality Speech Recognition System | 4
YaConv: Convolution with Low Cache Footprint | 4
SPX64 | 4
An FPGA-based Approach to Evaluate Thermal and Resource Management Strategies of Many-core Processors | 4
Preserving Addressability Upon GC-Triggered Data Movements on Non-Volatile Memory | 4
On Architectural Support for Instruction Set Randomization | 4
Irregular Register Allocation for Translation of Test-pattern Programs | 3
Just-In-Time Compilation on ARM—A Closer Look at Call-Site Code Consistency | 3
PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service | 3
Early Address Prediction | 3
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs | 3
Solving Sparse Assignment Problems on FPGAs | 3
SecNVM: An Efficient and Write-Friendly Metadata Crash Consistency Scheme for Secure NVM | 3
Automatic Sublining for Efficient Sparse Memory Accesses | 3
Efficient Nearest-Neighbor Data Sharing in GPUs | 3
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads | 3
WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs | 3
COX: Exposing CUDA Warp-level Functions to CPUs | 3
On Predictable Reconfigurable System Design | 3
A Pressure-Aware Policy for Contention Minimization on Multicore Systems | 3
CASHT: Contention Analysis in Shared Hierarchies with Thefts | 3
SortCache | 3
Systems-on-Chip with Strong Ordering | 3
ASA: Accelerating Sparse Accumulation in Column-wise SpGEMM | 3
Locality-Aware CTA Scheduling for Gaming Applications | 2
DAG-Order: An Order-Based Dynamic DAG Scheduling for Real-Time Networks-on-Chip | 2
SpecTerminator: Blocking Speculative Side Channels Based on Instruction Classes on RISC-V | 2
NNBench-X | 2
MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework | 2
PRISM | 2
An Efficient Hybrid Deep Learning Accelerator for Compact and Heterogeneous CNNs | 2
Lock-Free High-performance Hashing for Persistent Memory via PM-aware Holistic Optimization | 2
gPPM: A Generalized Matrix Operation and Parallel Algorithm to Accelerate the Encoding/Decoding Process of Erasure Codes | 2
Cryptographic Software IP Protection without Compromising Performance or Timing Side-channel Leakage | 2
Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey | 2
PICO | 2
FlexHM: A Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations | 2
RegCPython: A Register-based Python Interpreter for Better Performance | 2
Puppeteer: A Random Forest Based Manager for Hardware Prefetchers Across the Memory Hierarchy | 2
User-driven Online Kernel Fusion for SYCL | 2
Scenario-Aware Program Specialization for Timing Predictability | 2
Triangle Dropping: An Occluded-geometry Predictor for Energy-efficient Mobile GPUs | 2
Leveraging Value Equality Prediction for Value Speculation | 2
TokenSmart: Distributed, Scalable Power Management in the Many-core Era | 2
TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware Scheduling | 2
Reducing Minor Page Fault Overheads through Enhanced Page Walker | 2
SAC: An Ultra-Efficient Spin-based Architecture for Compressed DNNs | 2
LiteCON: An All-photonic Neuromorphic Accelerator for Energy-efficient Deep Learning | 2
Source Matching and Rewriting for MLIR Using String-Based Automata | 2
Iterative Compilation Optimization Based on Metric Learning and Collaborative Filtering | 2
MetaSys: A Practical Open-source Metadata Management System to Implement and Evaluate Cross-layer Optimizations | 2
Cost-aware Service Placement and Scheduling in the Edge-Cloud Continuum | 2
Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions | 2
Memory-Aware Functional IR for Higher-Level Synthesis of Accelerators | 2
Assessing the Impact of Compiler Optimizations on GPUs Reliability | 2
A Distributed Hardware Monitoring System for Runtime Verification on Multi-Tile MPSoCs | 2
Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access | 1
QuCloud+: A Holistic Qubit Mapping Scheme for Single/Multi-programming on 2D/3D NISQ Quantum Computers | 1
CARL: Compiler Assigned Reference Leasing | 1
XEngine: Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments | 1
ISP Agent: A Generalized In-storage-processing Workload Offloading Framework by Providing Multiple Optimization Opportunities | 1
SGXL | 1
Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph Transformations | 1
The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead | 1
Cache Programming for Scientific Loops Using Leases | 1
Reliability-aware Garbage Collection for Hybrid HBM-DRAM Memories | 1
Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs | 1
Architectural Support for Sharing, Isolating and Virtualizing FPGA Resources | 1
Low-power Near-data Instruction Execution Leveraging Opcode-based Timing Analysis | 1
Accelerating Video Captioning on Heterogeneous System Architectures | 1
MAPPER: Managing Application Performance via Parallel Efficiency Regulation | 1
SIMD-Matcher: A SIMD-based Arbitrary Matching Framework | 1
Acceleration of Parallel-Blocked QR Decomposition of Tall-and-Skinny Matrices on FPGAs | 1
Building a Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage | 1
Hardware-hardened Sandbox Enclaves for Trusted Serverless Computing | 1
Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks | 1
rNdN: Fast Query Compilation for NVIDIA GPUs | 1
TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency | 1
Fast One-Sided RDMA-Based State Machine Replication for Disaggregated Memory | 1
PETRA | 1
DxPU: Large-scale Disaggregated GPU Pools in the Datacenter | 1
Highly Efficient Self-checking Matrix Multiplication on Tiled AMX Accelerators | 1
ACTION: Adaptive Cache Block Migration in Distributed Cache Architectures | 1
Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs | 1
Towards Enhanced System Efficiency while Mitigating Row Hammer | 1
QoS-pro: A QoS-enhanced Transaction Processing Framework for Shared SSDs | 1
Performance-Energy Trade-off in Modern CMPs | 1
Online Application Guidance for Heterogeneous Memory Systems | 1
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration | 1
Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure | 1
CIB-HIER | 1
SLAP: Segmented Reuse-Time-Label Based Admission Policy for Content Delivery Network Caching | 1
COWS for High Performance: Cost Aware Work Stealing for Irregular Parallel Loop | 1
RACE: An Efficient Redundancy-aware Accelerator for Dynamic Graph Neural Network | 1
EXPERTISE: An Effective Software-level Redundant Multithreading Scheme against Hardware Faults | 1
An Application-oblivious Memory Scheduling System for DNN Accelerators | 1
Rcmp: Reconstructing RDMA-Based Memory Disaggregation via CXL | 1
SMT-Based Contention-Free Task Mapping and Scheduling on 2D/3D SMART NoC with Mixed Dimension-Order Routing | 1
Occam: Optimal Data Reuse for Convolutional Neural Networks | 1
Cooperative Slack Management: Saving Energy of Multicore Processors by Trading Performance Slack Between QoS-Constrained Applications | 1
Abakus: Accelerating k-mer Counting with Storage Technology | 1
Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication | 1
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-efficient Genome Analysis | 1
BullsEye: Scalable and Accurate Approximation Framework for Cache Miss Calculation | 1
Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance | 1
WIPE: A Write-Optimized Learned Index for Persistent Memory | 1
Weaving Synchronous Reactions into the Fabric of SSA-form Compilers | 1
A Concise Concurrent B+-Tree for Persistent Memory | 1