OOIR: Observatory of International Research

Papers

(The median citation count of ACM Transactions on Architecture and Code Optimization is 1. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2021-05-01 to 2025-05-01.)

Article	Citations
An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication	37
Object Intersection Captures on Interactive Apps to Drive a Crowd-sourced Replay-based Compiler Optimization	32
Spiking Neural Networks in Spintronic Computational RAM	31
ASM: An Adaptive Secure Multicore for Co-located Mutually Distrusting Processes	30
TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency	29
TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework	25
An Intelligent Scheduling Approach on Mobile OS for Optimizing UI Smoothness and Power	23
Highly Efficient Self-checking Matrix Multiplication on Tiled AMX Accelerators	18
ModNEF: An Open Source Modular Neuromorphic Emulator for FPGA for Low-Power In-Edge Artificial Intelligence	17
DCMA: Accelerating Parallel DMA Transfers with a Multi-Port Direct Cached Memory Access in a Massive-Parallel Vector Processor	16
Building a Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage	16
SIMD-Matcher: A SIMD-based Arbitrary Matching Framework	15
Source Matching and Rewriting for MLIR Using String-Based Automata	15
COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign	14
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs	14
Tiaozhuan: A General and Efficient Indirect Branch Optimization for Binary Translation	14
Locality-Aware CTA Scheduling for Gaming Applications	13
A Concise Concurrent B⁺-Tree for Persistent Memory	13
Domain-Specific Multi-Level IR Rewriting for GPU	12
Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product	12
DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster	12
Accelerating Video Captioning on Heterogeneous System Architectures	12
iSwap: A New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments	11
A NUMA-Aware Version of an Adaptive Self-Scheduling Loop Scheduler	11
AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUs	10
COX: Exposing CUDA Warp-level Functions to CPUs	10
GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems	10
Quantifying Resource Contention of Co-located Workloads with the System-level Entropy	10
Flexible and Effective Object Tiering for Heterogeneous Memory Systems	9
Accelerating Nearest Neighbor Search in 3D Point Cloud Registration on GPUs	9
ODGS: Dependency-Aware Scheduling for High-Level Synthesis with Graph Neural Network and Reinforcement Learning	9
An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism	8
SnsBooster: Enhancing Sampling-Based μArch Evaluation Efficiency through Online Performance Sensitivity Analysis	8
Joint Program and Layout Transformations to Enable Convolutional Operators on Specialized Hardware Based on Constraint Programming	8
EXPERTISE: An Effective Software-level Redundant Multithreading Scheme against Hardware Faults	7
BridgeGC: An Efficient Cross-Level Garbage Collector for Big Data Frameworks	7
Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping	7
A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks	6
Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture	6
Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions	6
NEM-GNN: DAC/ADC-less, Scalable, Reconfigurable, Graph and Sparsity-Aware Near-Memory Accelerator for Graph Neural Networks	6
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion	6
Understanding Cache Compression	6
TPRepair: Tree-based Pipelined Repair in Clustered Storage Systems	6
RaNAS: Resource-Aware Neural Architecture Search for Edge Computing	5
HyGain: High-performance, Energy-efficient Hybrid Gain Cell-based Cache Hierarchy	5
Multi-objective Hardware-aware Neural Architecture Search with Pareto Rank-preserving Surrogate Models	5
HEngine: A High Performance Optimization Framework on a GPU for Homomorphic Encryption	5
DTAP: Accelerating Strongly-Typed Programs with Data Type-Aware Hardware Prefetching	5
Environmental Condition Aware Super-Resolution Acceleration Framework in Server-Client Hierarchies	5
PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service	5
Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals	5
Low-power Near-data Instruction Execution Leveraging Opcode-based Timing Analysis	5
Towards High Performance QNNs via Distribution-Based CNOT Gate Reduction	5
Low I/O Intensity-aware Partial GC Scheduling to Reduce Long-tail Latency in SSDs	5
System-level Early-stage Modeling and Evaluation of IVR-assisted Processor Power Delivery System	5
Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs	4
Improving Utilization of Dataflow Unit for Multi-Batch Processing	4
OptiFX: Automatic Optimization for Convolutional Neural Networks with Aggressive Operator Fusion on GPUs	4
Sniper: Exploiting Spatial and Temporal Sampling for Large-Scale Performance Analysis	4
FlexHM: A Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations	4
xMeta: SSD-HDD-hybrid Optimization for Metadata Maintenance of Cloud-scale Object Storage	4
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs	4
ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes	4
Byte-Select Compression	4
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing	4
Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks	4
GraphTune: An Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph Processing	4
A Stable Idle Time Detection Platform for Real I/O Workloads	4
Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache	3
CASHT: Contention Analysis in Shared Hierarchies with Thefts	3
TSN Cache: Exploiting Data Localities in Graph Computing Applications	3
Automatic Sublining for Efficient Sparse Memory Accesses	3
All-gather Algorithms Resilient to Imbalanced Process Arrival Patterns	3
FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler	3
SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs	3
Koala: Efficient Pipeline Training through Automated Schedule Searching on Domain-Specific Language	3
JiuJITsu: Removing Gadgets with Safe Register Allocation for JIT Code Generation	3
BullsEye: Scalable and Accurate Approximation Framework for Cache Miss Calculation	3
Shift-CIM: In-SRAM Alignment To Support General-Purpose Bit-level Sparsity Exploration in SRAM Multiplication	3
Architecting Optically Controlled Phase Change Memory	3
Preserving Addressability Upon GC-Triggered Data Movements on Non-Volatile Memory	3
Abakus: Accelerating k-mer Counting with Storage Technology	3
WIPE: A Write-Optimized Learned Index for Persistent Memory	3
CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling	3
Architectural Support for Sharing, Isolating and Virtualizing FPGA Resources	3
Scale-out Systolic Arrays	3
MicroProf: Code-level Attribution of Unnecessary Data Transfer in Microservice Applications	3
CoNST: Code Generator for Sparse Tensor Networks	3
Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance	2
PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on Heterogeneous Systems	2
High-performance Deterministic Concurrency Using Lingua Franca	2
SAL: Optimizing the Dataflow of Spin-based Architectures for Lightweight Neural Networks	2
Towards Enhanced System Efficiency while Mitigating Row Hammer	2
Consequence-based Clustered Architecture	2
A Data-Loader Tunable Knob to Shorten GPU Idleness for Distributed Deep Learning	2
The Forward Slice Core: A High-Performance, Yet Low-Complexity Microarchitecture	2
MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation	2
Data Deduplication Based on Content Locality of Transactions to Enhance Blockchain Scalability	2
An Optimized GPU Implementation for GIST Descriptor	2
A Case For Intra-rack Resource Disaggregation in HPC	2
PANDA: Adaptive Prefetching and Decentralized Scheduling for Dataflow Architectures	2
ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel	2
Memory-Aware Functional IR for Higher-Level Synthesis of Accelerators	2
An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU	2
An FPGA-based Approach to Evaluate Thermal and Resource Management Strategies of Many-core Processors	2
A Pressure-Aware Policy for Contention Minimization on Multicore Systems	2
E-BATCH: Energy-Efficient and High-Throughput RNN Batching	2
The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead	2
SMT-Based Contention-Free Task Mapping and Scheduling on 2D/3D SMART NoC with Mixed Dimension-Order Routing	2
GraphService: Topology-aware Constructor for Large-scale Graph Applications	2
Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access	2
PAVER	2
Jointly Optimizing Job Assignment and Resource Partitioning for Improving System Throughput in Cloud Datacenters	2
Iterating Pointers: Enabling Static Analysis for Loop-based Pointers	2
TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware Scheduling	2
Register-Pressure-Aware Instruction Scheduling Using Ant Colony Optimization	2
SSD-SGD: Communication Sparsification for Distributed Deep Learning Training	2
PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation	2
Puppeteer: A Random Forest Based Manager for Hardware Prefetchers Across the Memory Hierarchy	2
Supporting Dynamic Program Sizes in Deep Learning-Based Cost Models for Code Optimization	1
Fast One-Sided RDMA-Based State Machine Replication for Disaggregated Memory	1
Unveiling and Evaluating Vulnerabilities in Branch Predictors via a Three-Step Modeling Methodology	1
SAC: An Ultra-Efficient Spin-based Architecture for Compressed DNNs	1
SpecTerminator: Blocking Speculative Side Channels Based on Instruction Classes on RISC-V	1
Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs	1
Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing	1
HAIR: Halving the Area of the Integer Register File with Odd/Even Banking	1
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-efficient Genome Analysis	1
Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication	1
Critical Data Backup with Hybrid Flash-Based Consumer Devices	1
GraphAttack	1
Optimizing Garbage Collection for ZNS SSDs via In-storage Data Migration and Address Remapping	1
WaFFLe	1
GiantVM: A Novel Distributed Hypervisor for Resource Aggregation with DSM-aware Optimizations	1
PARADISE: Criticality-Aware Instruction Reordering for Power Attack Resistance	1
CARL: Compiler Assigned Reference Leasing	1
Optimization of Sparse Matrix Computation for Algebraic Multigrid on GPUs	1
Lock-Free High-performance Hashing for Persistent Memory via PM-aware Holistic Optimization	1
Conflict Management in Vector Register Files	1
A Lock-free RDMA-friendly Index in CPU-parsimonious Environments	1
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration	1
KernelFaRer	1
Gem5-X	1
Turn-based Spatiotemporal Coherence for GPUs	1
VersaTile: Flexible Tiled Architectures via Associative Processors	1
Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints	1
Low-precision Logarithmic Number Systems	1
Compiler Support for Sparse Tensor Computations in MLIR	1
Symbolic Analysis for Data Plane Programs Specialization	1
Assessing the Impact of Compiler Optimizations on GPUs Reliability	1
MetaSys: A Practical Open-source Metadata Management System to Implement and Evaluate Cross-layer Optimizations	1
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads	1
IBing: An Efficient Interleaved Bidirectional Ring All-Reduce Algorithm for Gradient Synchronization	1
PICO	1
Solving Sparse Assignment Problems on FPGAs	1
ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors	1
Hardware-hardened Sandbox Enclaves for Trusted Serverless Computing	1
LitTLS: Lightweight Thread-Level Speculation on Little Cores	1
Bubble-Swap Flow Control	1
Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks	1
ShieldCXL: A Practical Obliviousness Support with Sealed CXL Memory	1
ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads	1
SuccinctKV: a CPU-efficient LSM-tree Based KV Store with Scan-based Compaction	1
MUA-Router: Maximizing the Utility-of-Allocation for On-chip Pipelining Routers	1
ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection	1
Unleashing Parallelism with Elastic-Barriers	1
Understanding Silent Data Corruption in Processors for Mitigating its Effects	1
CIB-HIER	1
GPU Domain Specialization via Composable On-Package Architecture	1
PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM	1
SPIRIT: Scalable and Persistent In-Memory Indices for Real-Time Search	1
Overlapping Aware Data Placement Optimizations for LSM Tree-Based Store on ZNS SSDs	1
QuCloud+: A Holistic Qubit Mapping Scheme for Single/Multi-programming on 2D/3D NISQ Quantum Computers	1
Constructing a Supplementary Benchmark Suite to Represent Android Applications with User Interactions by using Performance Counters	1