OOIR: Observatory of International Research

Papers

(The TQCC of ACM Transactions on Architecture and Code Optimization is 4. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2022-05-01 to 2026-05-01.)

Article	Citations
TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency	52
Object Intersection Captures on Interactive Apps to Drive a Crowd-sourced Replay-based Compiler Optimization	44
ASM: An Adaptive Secure Multicore for Co-located Mutually Distrusting Processes	30
Performance, Energy and NVM Lifetime-Aware Data Structure Refinement and Placement for Heterogeneous Memory Systems	28
TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework	28
Highly Efficient Self-checking Matrix Multiplication on Tiled AMX Accelerators	27
ESMPC: An Efficient Neural Network Training Framework for Secure Two- and Three-Party Computation	26
ModNEF: An Open Source Modular Neuromorphic Emulator for FPGA for Low-Power In-Edge Artificial Intelligence	26
Accelerating Verifiable Queries over Blockchain Database System Using Processing-in-memory	25
Intra-request Lag-aware Cache Management to Enhance I/O Responsiveness of SSDs	24
An Intelligent Scheduling Approach on Mobile OS for Optimizing UI Smoothness and Power	22
Supporting QoS Guarantee in Heterogeneous Object Storage System: A Spatio-Temporal Graph Data Processing Method	22
An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication	19
A Concise Concurrent B⁺-Tree for Persistent Memory	17
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs	17
DCMA: Accelerating Parallel DMA Transfers with a Multi-Port Direct Cached Memory Access in a Massive-Parallel Vector Processor	16
Tiaozhuan: A General and Efficient Indirect Branch Optimization for Binary Translation	16
Building a Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage	16
COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign	16
Source Matching and Rewriting for MLIR Using String-Based Automata	15
Mitigating the Bandwidth Wall via Data-Streaming System–Accelerator Co-Design	14
SIMD-Matcher: A SIMD-based Arbitrary Matching Framework	14
DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster	13
A NUMA-Aware Version of an Adaptive Self-Scheduling Loop Scheduler	13
FlashGEMM: Optimizing Sequences of Matrix Multiplication by Exploiting Data Reuse on CPUs	12
iSwap: A New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments	12
Accelerating Video Captioning on Heterogeneous System Architectures	11
Osiris: A Systolic Approach to Accelerating Fully Homomorphic Encryption	11
Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product	11
ODGS: Dependency-Aware Scheduling for High-Level Synthesis with Graph Neural Network and Reinforcement Learning	10
COX: Exposing CUDA Warp-level Functions to CPUs	10
SnsBooster: Enhancing Sampling-based μArch Evaluation Efficiency through Online Performance Sensitivity Analysis	10
AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUs	10
MLKAPS: Machine Learning and Adaptive Sampling for HPC Kernel Auto-tuning	10
Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping	9
Accelerating Nearest Neighbor Search in 3D Point Cloud Registration on GPUs	9
FDSR: Efficient Model Training via Adaptive Tensor Quantization Based on Frequency Domain Division and Similarity Data Reuse	9
EXPERTISE: An Effective Software-level Redundant Multithreading Scheme against Hardware Faults	9
Towards high scalability and fine-grained parallelism on distributed HPC platforms	9
An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism	9
GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems	9
BridgeGC: An Efficient Cross-Level Garbage Collector for Big Data Frameworks	9
Accelerating Parallel Structures in DNNs via Parallel Fusion and Operator Co-Optimization	9
Quantifying Resource Contention of Co-located Workloads with the System-level Entropy	9
Flexible and Effective Object Tiering for Heterogeneous Memory Systems	9
Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions	8
Optimizing Attention for Large Language Model Inference on the MT-3000 Many-Core Processor	8
A Step toward Stateful HW-SW Migration: An Architecture-agnostic Checkpointing-rollback Toolchain	8
A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks	8
NEM-GNN: DAC/ADC-less, Scalable, Reconfigurable, Graph and Sparsity-Aware Near-Memory Accelerator for Graph Neural Networks	8
Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture	8
HyGain: High-performance, Energy-efficient Hybrid Gain Cell-based Cache Hierarchy	8
CLAP: Cross-Layer Adaptive Pipelining Inference Scheduling for Resource-Efficient Edge-Cloud Vision Systems	8
Towards High Performance QNNs via Distribution-Based CNOT Gate Reduction	7
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion	7
Environmental Condition Aware Super-Resolution Acceleration Framework in Server-Client Hierarchies	7
A Decoupled Analytical Model for Tile Size Selection in Affine Programs	7
RaNAS: Resource-Aware Neural Architecture Search for Edge Computing	7
DTAP: Accelerating Strongly-Typed Programs with Data Type-Aware Hardware Prefetching	7
PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service	7
Enabling Low-Latency, GPU-Efficient Serverless Inference with Model Swapping	7
PctoDL: Adaptive GPU Throughput Optimization for Deep Learning Inference with Power Constraints	7
HEngine: A High Performance Optimization Framework on a GPU for Homomorphic Encryption	7
TPRepair: Tree-based Pipelined Repair in Clustered Storage Systems	7
Multi-objective Hardware-aware Neural Architecture Search with Pareto Rank-preserving Surrogate Models	7
WSGraph: A Framework for Tackling Redundant and Irregular Data Access in Streaming Graph Processing	7
Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals	7
Mobile-3DCNN: An Acceleration Framework for Ultra-Real-Time Execution of Large 3D CNNs on Mobile Devices	6
EDAS: Enabling Fast Data Loading for GPU Serverless Computing	6
gECC: A GPU-based high-throughput framework for Elliptic Curve Cryptography	6
Lightweight Code Outlining for Android Applications	6
SimTrace: Exploiting Spatial and Temporal Sampling for Large-Scale Performance Analysis	6
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing	6
Towards Optimizing Learned Index for High Performance, Memory Efficiency and NUMA Awareness	6
Pac-PIM: A Parallel Communication Framework for Commodity Processing-in-memory Systems	6
A Memory-Aware Sparse Matrix-Matrix Multiplication on Multicore Architectures	6
A Stable Idle Time Detection Platform for Real I/O Workloads	6
RACER: Avoiding End-to-End Slowdowns in Accelerated Chip Multi-Processors	6
HiSo: Co-optimizing the Intra-layer and Inter-layer Scheduling Schemes with the Hybrid Data Flow for PIM Architectures	6
Toward Comprehensive Design Space Exploration on Heterogeneous Multi-core Processors	6
FlexHM: A Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations	5
WIPE: A Write-Optimized Learned Index for Persistent Memory	5
xMeta: SSD-HDD-hybrid Optimization for Metadata Maintenance of Cloud-scale Object Storage	5
Performance Prediction of Concurrent DNN Training Tasks in GPU Spatial Sharing Environments	5
CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling	5
OptiFX: Automatic Optimization for Convolutional Neural Networks with Aggressive Operator Fusion on GPUs	5
TSN Cache: Exploiting Data Localities in Graph Computing Applications	5
Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks	5
Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs	5
CGCGraph: Efficient CPU-GPU Co-execution for Concurrent Dynamic Graph Processing	5
JiuJITsu: Removing Gadgets with Safe Register Allocation for JIT Code Generation	5
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs	5
Capability-Based Efficient Data Transmission Mechanism for Serverless Computing	5
Improving Utilization of Dataflow Unit for Multi-Batch Processing	5
Accelerating the Simulation of Parallel Workloads using Loop-Bounded Checkpoints	5
Efficient and Scalable Hybrid Parallelization of Unstructured Computational Fluid Dynamics with Geometric Multigrid	5
GraphTune: An Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph Processing	5
Shift-CIM: In-SRAM Alignment To Support General-Purpose Bit-level Sparsity Exploration in SRAM Multiplication	5
Architectural Support for Sharing, Isolating and Virtualizing FPGA Resources	4
MetaEC: An Efficient and Resilient Erasure-Coded KV Store on Disaggregated Memory	4
BullsEye: Scalable and Accurate Approximation Framework for Cache Miss Calculation	4
SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs	4
FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler	4
An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU	4
Koala: Efficient Pipeline Training through Automated Schedule Searching on Domain-Specific Language	4
Scale-out Systolic Arrays	4
Address/Data Instruction Steering in Clustered General Purpose Processors	4
BLG-Tuning: Benchmark-Based Low-Cost General-Purpose I/O Modeling and Tuning	4
Iterating Pointers: Enabling Static Analysis for Loop-based Pointers	4
3D GNLM: Efficient 3D Non-Local Means Kernel with Nested Reuse Strategies for Embedded GPUs	4
Architecting Optically Controlled Phase Change Memory	4
Efficient Flexible Edge Inference for Mixed-Precision Quantized DNN using Customized RISC-V Core	4
RaKV: A Write-Optimized LSM Store for Cloud Block Storage with Robust SLA	4
Rethinking Variable-Length Encoding: Exploiting Bit Sparsity for Parallel Decoding in LLM Accelerators	4
Consequence-based Clustered Architecture	4
Matrix: Multi-Cipher Structures Dataflow for Parallel and Pipelined TFHE Accelerator	4
Optimizing OpenCL Barrier Synchronization and Memory Efficiency on Multi-Core DSPs	4