IEEE Computer Architecture Letters

Papers
(The median citation count of IEEE Computer Architecture Letters is 1. The table below lists the papers above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2022-05-01 to 2026-05-01.)
Article | Citations
Old is Gold: Optimizing Single-Threaded Applications With ExGen-Malloc | 90
The Architectural Sustainability Indicator | 24
Speculative Multi-Level Access in LSM Tree-Based KV Store | 23
A Characterization of Generative Recommendation Models: Study of Hierarchical Sequential Transduction Unit | 21
Toward Practical 128-Bit General Purpose Microarchitectures | 21
Characterization and Analysis of Text-to-Image Diffusion Models | 20
Exploration of Algorithm-Hardware Co-Design for Floating-Point Digital Compute-in-Memory | 20
Accelerating Programmable Bootstrapping Targeting Contemporary GPU Microarchitecture | 19
Time Series Machine Learning Models for Precise SSD Access Latency Prediction | 17
De-Quantization Penalties for Interactive LLM Inference on Prosumer GPUs | 17
SCALES: SCALable and Area-Efficient Systolic Accelerator for Ternary Polynomial Multiplication | 16
Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure | 13
Context-Aware Set Dueling for Dynamic Policy Arbitration | 13
A Quantitative Analysis of Mamba-2-Based Large Language Model: Study of State Space Duality | 12
MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference | 12
In-Depth Characterization of Machine Learning on an Optimized Multi-Party Computing Library | 12
AiDE: Attention-FFN Disaggregated Execution for Cost-Effective LLM Decoding on CXL-PNM | 11
Improving Energy-Efficiency of Capsule Networks on Modern GPUs | 10
SoCurity: A Design Approach for Enhancing SoC Security | 10
Straw: A Stress-Aware WL-Based Read Reclaim Technique for High-Density NAND Flash-Based SSDs | 10
A Flexible Embedding-Aware Near Memory Processing Architecture for Recommendation System | 10
OASIS: Outlier-Aware KV Cache Clustering for Scaling LLM Inference in CXL Memory Systems | 10
Exploring KV Cache Quantization in Multimodal Large Language Model Inference | 10
In-Memory Versioning (IMV) | 9
RouteReplies: Alleviating Long Latency in Many-Chip-Module GPUs | 9
REDIT: Redirection-Enabled Memory-Side Directory Architecture for CXL Memory Fabric | 8
StreamDQ: HBM-Integrated On-the-Fly DeQuantization via Memory Load for Large Language Models | 8
A Case for In-Memory Random Scatter-Gather for Fast Graph Processing | 8
Disaggregated Speculative Decoding for Carbon-Efficient LLM Serving | 7
Enabling Computation and Communication Overlap in PIMs for On-Device LLM Inference | 7
Exploring the DIMM PIM Architecture for Accelerating Time Series Analysis | 7
QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture | 6
Security Helper Chiplets: A New Paradigm for Secure Hardware Monitoring | 6
PUDTune: Multi-Level Charging for High-Precision Calibration in Processing-Using-DRAM | 6
Thread-Adaptive: High-Throughput Parallel Architectures of SLH-DSA on GPUs | 6
Accelerating Deep Reinforcement Learning via Phase-Level Parallelism for Robotics Applications | 6
Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference | 6
Improving Performance on Tiered Memory With Semantic Data Placement | 6
Mitigating Timing-Based NoC Side-Channel Attacks With LLC Remapping | 6
Managing Prefetchers With Deep Reinforcement Learning | 5
NoHammer: Preventing Row Hammer With Last-Level Cache Management | 5
Efficient Deadlock Avoidance by Considering Stalling, Message Dependencies, and Topology | 5
SparseLeakyNets: Classification Prediction Attack Over Sparsity-Aware Embedded Neural Networks Using Timing Side-Channel Information | 5
LADIO: Leakage-Aware Direct I/O for I/O-Intensive Workloads | 5
pNet-gem5: Full-System Simulation With High-Performance Networking Enabled by Parallel Network Packet Processing | 5
High-Performance Winograd Based Accelerator Architecture for Convolutional Neural Network | 5
Memory-Centric MCM-GPU Architecture | 5
DeMM: A Decoupled Matrix Multiplication Engine Supporting Relaxed Structured Sparsity | 5
SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency | 5
RAESC: A Reconfigurable AES Countermeasure Architecture for RISC-V With Enhanced Power Side-Channel Resilience | 4
H3: Hybrid Architecture Using High Bandwidth Memory | 4
ZoneBuffer: An Efficient Buffer Management Scheme for ZNS SSDs | 4
Adaptive Web Browsing on Mobile Heterogeneous Multi-cores | 4
Nighthawk: Zero-Copy Cache Quarantine for Invisible Speculation | 4
Enhancing the Reach and Reliability of Quantum Annealers by Pruning Longer Chains | 4
Xami: Expert-Aware Adaptive Compression for Mi | 4
FPGA-Accelerated Data Preprocessing for Personalized Recommendation Systems | 4
Hisui: Unlocking Tiered Memory Efficiency for FaaS Workloads | 4
PreGNN: Hardware Acceleration to Take Preprocessing Off the Critical Path in Graph Neural Networks | 4
ReplayOpt: Optimizer-State Replay to Resolve Critical-Path Bottlenecks in Offloaded Training | 4
A Flexible Hybrid Interconnection Design for High-Performance and Energy-Efficient Chiplet-Based Systems | 4
Primate: A Framework to Automatically Generate Soft Processors for Network Applications | 4
Exploring Volatile FPGAs Potential for Accelerating Energy-Harvesting IoT Applications | 3
Fast Inter-Enclave Communication Encryption | 3
Guard Cache: Creating Noisy Side-Channels | 3
Direct-Coding DNA With Multilevel Parallelism | 3
Accelerators & Security: The Socket Approach | 3
SSE: Security Service Engines to Accelerate Enclave Performance in Secure Multicore Processors | 3
Understanding the Performance Behaviors of End-to-End Protein Design Pipelines on GPUs | 3
A Quantum Computer Trusted Execution Environment | 3
LeakDiT: Diffusion Transformers for Trace-Augmented Side-Channel Analysis | 3
Camulator: A Lightweight and Extensible Trace-Driven Cache Simulator for Embedded Multicore SoCs | 3
Enabling Cost-Efficient LLM Inference on Mid-Tier GPUs With NMP DIMMs | 3
Driving the Core Frontend With LiteBTB | 3
T-CAT: Dynamic Cache Allocation for Tiered Memory Systems With Memory Interleaving | 3
Fast Performance Prediction for Efficient Distributed DNN Training | 3
gem5-accel: A Pre-RTL Simulation Toolchain for Accelerator Architecture Validation | 2
Overcoming Memory Capacity Wall of GPUs With Heterogeneous Memory Stack | 2
Energy-Efficient Bayesian Inference Using Bitstream Computing | 2
Minimal Counters, Maximum Insight: Simplifying System Performance With HPC Clusters for Optimized Monitoring | 2
Approximate Multiplier Design With LFSR-Based Stochastic Sequence Generators for Edge AI | 2
Characterization and Analysis of the 3D Gaussian Splatting Rendering Pipeline | 2
Analyzing and Exploiting Memory Hierarchy Parallelism With MLP Stacks | 2
Computational CXL-Memory Solution for Accelerating Memory-Intensive Applications | 2
FullPack: Full Vector Utilization for Sub-Byte Quantized Matrix-Vector Multiplication on General Purpose CPUs | 2
Redundant Array of Independent Memory Devices | 2
IntervalSim++: Enhanced Interval Simulation for Unbalanced Processor Designs | 2
LWAL: Lightweight Adaptive Learning-Driven Cache Bypassing for GPUs | 2
Architectural Implications of GNN Aggregation Programming Abstractions | 2
DRAM-CAM: General-Purpose Bit-Serial Exact Pattern Matching | 2
CABANA: Cluster-Aware Query Batching for Accelerating Billion-Scale ANNS With Intel AMX | 2
PINSim: A Processing In- and Near-Sensor Simulator to Model Intelligent Vision Sensors | 2
Enhancing DNN Training Efficiency Via Dynamic Asymmetric Architecture | 2
Capacity-Latency Tradeoffs in CXL Memory Expander at Hyperscale | 2
Enhancing DCIM Efficiency with Multi-Storage-Row Architecture for Edge AI Workloads | 2
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System | 2
A Case Study of a DRAM-NVM Hybrid Memory Allocator for Key-Value Stores | 2
Hungarian Qubit Assignment for Optimized Mapping of Quantum Circuits on Multi-Core Architectures | 2
A First-Order Model to Assess Computer Architecture Sustainability | 2
Per-Row Activation Counting on Real Hardware: Demystifying Performance Overheads | 2
Reducing the Silicon Area Overhead of Counter-Based Rowhammer Mitigations | 2
EgDiff: An Enhanced Global Load Value Predictor | 2
SEMS: Scalable Embedding Memory System for Accelerating Embedding-Based DNNs | 2
On Internally Tagged Instruction Set Architectures | 2
Cost-Effective Extension of DRAM-PIM for Group-Wise LLM Quantization | 2
Halis: A Hardware-Software Co-Designed Near-Cache Accelerator for Graph Pattern Mining | 2
R.I.P. Geomean Speedup Use Equal-Work (Or Equal-Time) Harmonic Mean Speedup Instead | 2
Accelerating Page Migrations in Operating Systems With Intel DSA | 2
Characterization and Implementation of Radar System Applications on a Reconfigurable Dataflow Architecture | 2
FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems | 2
Unleashing the Potential of PIM: Accelerating Large Batched Inference of Transformer-Based Generative Models | 1
Approximate SFQ-Based Computing Architecture Modeling With Device-Level Guidelines | 1
SPGPU: Spatially Programmed GPU | 1
GraNDe: Near-Data Processing Architecture With Adaptive Matrix Mapping for Graph Convolutional Networks | 1
Balancing Performance Against Cost and Sustainability in Multi-Chip-Module GPUs | 1
Contention-Aware GPU Thread Block Scheduler for Efficient GPU-SSD | 1
A Data Prefetcher-Based 1000-Core RISC-V Processor for Efficient Processing of Graph Neural Networks | 1
Hashing ATD Tags for Low-Overhead Safe Contention Monitoring | 1
Tulip: Turn-Free Low-Power Network-on-Chip | 1
SPAM: Streamlined Prefetcher-Aware Multi-Threaded Cache Covert-Channel Attack | 1
Characterizing and Understanding End-to-End Multi-Modal Neural Networks on GPUs | 1
Canal: A Flexible Interconnect Generator for Coarse-Grained Reconfigurable Arrays | 1
Characterizing and Understanding HGNNs on GPUs | 1
Address Scaling: Architectural Support for Fine-Grained Thread-Safe Metadata Management | 1
An Intermediate Language for General Sparse Format Customization | 1
Amethyst: Reducing Data Center Emissions With Dynamic Autotuning and VM Management | 1
Architectural Security Regulation | 1
Intelligent SSD Firmware for Zero-Overhead Journaling | 1
On Variable Strength Quantum ECC | 1
Structured Combinators for Efficient Graph Reduction | 1
A Partial Tag–Data Decoupled Architecture for Last-Level Cache Optimization | 1
JBOC: Just a Bunch of CXL-enabled SSDs for Resource-Efficient LLM Checkpointing | 1
Exploiting Intel AMX Power Gating | 1
I/O-ETEM: An I/O-Aware Approach for Estimating Execution Time of Machine Learning Workloads | 1
Exploiting Direct Memory Operands in GPU Instructions | 1
MixDiT: Accelerating Image Diffusion Transformer Inference With Mixed-Precision MX Quantization | 1
A Multiple-Aspect Optimal CNN Accelerator in Top1 Accuracy, Performance, and Power Efficiency | 1
TeleVM: A Lightweight Virtual Machine for RISC-V Architecture | 1
MajorK: Majority Based kmer Matching in Commodity DRAM | 1
A Case for Hardware Memoization in Server CPUs | 1
CGR-NPU: A Hybrid CGRA and NPU Architecture for Adaptive Neural Computing Workloads | 1
Efficient MoE Model Fine-tuning on Commodity GPU Server with Offloading | 1
Electra: Eliminating the Ineffectual Computations on Bitmap Compressed Matrices | 1
A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models | 1
EONSim: An NPU Simulator for On-Chip Memory and Embedding Vector Operations | 1
GPU-Centric Memory Tiering for LLM Serving With NVIDIA Grace Hopper Superchip | 1
Low-Latency PIM Accelerator for Edge LLM Inference | 1
Fusing Adds and Shifts for Efficient Dot Products | 1
Pyramid: Accelerating LLM Inference With Cross-Level Processing-in-Memory | 1
X-PPR: Post Package Repair for CXL Memory | 1
HINT: A Hardware Platform for Intra-Host NIC Traffic and SmartNIC Emulation | 1
Supporting a Virtual Vector Instruction Set on a Commercial Compute-in-SRAM Accelerator | 1
eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models | 1
Stardust: Scalable and Transferable Workload Mapping for Large AI on Multi-Chiplet Systems | 1
GEMM the New Gem: The Inevitable Kernel and its Sensitivity to Compiler Optimizations and Libraries | 1
Cache and Near-Data Co-Design for Chiplets | 1
MOST: Memory Oversubscription-Aware Scheduling for Tensor Migration on GPU Unified Storage | 1