IEEE Transactions on Multimedia

Papers
(The TQCC of IEEE Transactions on Multimedia is 16. The table below lists the papers above that threshold, based on CrossRef citation counts (maximum 250 papers). It covers papers published in the past four years, i.e., from 2022-05-01 to 2026-05-01.)
Article | Citations
Improving Vision Anomaly Detection With the Guidance of Language Modality | 904
Focusing on Subtle Differences: A Feature Disentanglement Model for Series Photo Selection | 497
Weakly-Supervised 3D Visual Grounding Based on Visual Language Alignment | 353
SkyML: A MLaaS Federation Design for Multicloud-Based Multimedia Analytics | 325
Self-Mining the Confident Prototypes for Source-Free Unsupervised Domain Adaptation in Image Segmentation | 243
BASNet: Boundary Assisted Network for Image Splicing Forgery Detection | 192
Pixel Bleach Network for Detecting Face Forgery Under Compression | 181
Mix-Based Training Strategies for Learning Implicit Neural Representations | 181
Semantic-Aware Triplet Loss for Image Classification | 176
Ensemble Prototype Networks for Unsupervised Cross-Modal Hashing With Cross-Task Consistency | 175
Semantic Dual-Adversarial Network for Blended-Target Domain Adaptation | 173
Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization | 172
Vision-Controllable Language Model for Image-Guided Story Ending Generation | 154
FoodSAM: Any Food Segmentation | 149
Online Low-Light Sand-Dust Video Enhancement Using Adaptive Dynamic Brightness Correction and a Rolling Guidance Filter | 146
Asymptotics-Aware Multi-View Subspace Clustering | 143
Deep Semantic-Consistent Penalizing Hashing for Cross-Modal Retrieval | 142
Semi-Supervised Domain Adaptation via Joint Transductive and Inductive Subspace Learning | 142
Vulnerability of Feature Extractors in 2D Image-Based 3D Object Retrieval | 140
Self-Guided Discriminative Locality Preserving Projections | 136
Disaggregation Distillation for Person Search | 135
Bias-Correction Feature Learner for Semi-Supervised Instance Segmentation | 134
Disentangled Graph Variational Auto-Encoder for Multimodal Recommendation With Interpretability | 134
BMB: Balanced Memory Bank for Long-Tailed Semi-Supervised Learning | 130
Adaptive HEVC Video Steganography With High Performance Based on Attention-Net and PU Partition Modes | 128
SLCGC: A lightweight Self-supervised Low-Pass Contrastive Graph Clustering Network for Hyperspectral Images | 126
AMS-Net: Adaptive Multi-Scale Network for Image Compressive Sensing | 126
LMAgent: A Large-scale Multimodal Agents Society for Multi-user Simulation | 123
MGKsite: Multi-Modal Knowledge-Driven Site Selection via Intra and Inter-Modal Graph Fusion | 123
Anomaly-Led Prompting Learning Caption Generating Model and Benchmark | 122
Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective | 121
Hierarchical Equalization Loss for Long-Tailed Instance Segmentation | 117
Posture-Movement-Frequency-Enhanced Graph Convolutional Network for Gait Emotion Recognition | 115
Mask-Aware Kernel Learning for Action Recognition | 114
Skeleton-Based Action Recognition With Select-Assemble-Normalize Graph Convolutional Networks | 111
Dynamic Contrastive Distillation for Image-Text Retrieval | 111
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework | 110
Interpretable Graph Convolutional Network for Multi-View Semi-Supervised Learning | 109
Bidirectional Translation Between UHD-HDR and HD-SDR Videos | 109
Exploring Kernel Transformations for Implicit Neural Representations | 107
Rethinking Affine Transform for Efficient Image Enhancement: A Color Space Perspective | 105
One-Shot Human Motion Transfer via Occlusion-Robust Flow Prediction and Neural Texturing | 104
Efficient Cross-Modal Video Retrieval With Meta-Optimized Frames | 101
SCSP: An Unsupervised Image-to-Image Translation Network Based on Semantic Cooperative Shape Perception | 101
Transferable Backdoor Attack on Any CLIP Model With Any Target Class by Pre-Trained Hack Network | 99
Optimal Transport-Based Patch Matching for Image Style Transfer | 98
Long Video Understanding with Learnable Retrieval in Video-Language Models | 96
Feature First: Advancing Image-Text Retrieval Through Improved Visual Features | 96
Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention | 96
Guided Image-to-Image Translation by Discriminator-Generator Communication | 95
Multi-Level Transitional Contrast Learning for Personalized Image Aesthetics Assessment | 94
Progressive Local Filter Pruning for Image Retrieval Acceleration | 93
SGG-Nets: Generic Rotation-Invariant Plugin Networks for Point Cloud Analysis | 93
Semi-Supervised Domain Adaptation for Major Depressive Disorder Detection | 92
ViDR-GNN: Vision Implicit Discriminative Reorganization Graph Neural Networks | 92
Late Fusion Multiple Kernel Clustering With Local Kernel Alignment Maximization | 92
HRVFusion: Video-based Long-Term Heart Rate Variability Measurement with Conditional Diffusion Models | 91
Dual-Task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding | 91
ICE: Interactive 3D Game Character Facial Editing via Dialogue | 90
Weakly-Supervised Video Object Grounding via Learning Uni-Modal Associations | 89
PhotoHelper: Portrait Photographing Guidance Via Deep Feature Retrieval and Fusion | 89
Towards Substation Semantic Segmentation: A benchmark dataset and a cross-attention embedded hierarchical network | 89
Outliers Adaptation Exploration and Centroids Matching Label Refinement for Unsupervised Person Re-identification | 87
Few-Shot Generative Model Adaptation via Style-Guided Prompt | 87
Distributed Deep Point Cloud Feature Compression for Vehicle-to-Vehicle Cooperative Perception | 86
Robust Multi-Stage Tracking via Multi-Scale and Multi-Level Representation Learning | 86
Quality Assessment for DIBR-Synthesized Views Based on Wavelet Transform and Gradient Magnitude Similarity | 86
MVPC-CLIP: Multi-Granularity Visual Prompt Co-Operative for Aerial Video Recognition | 85
Semi-Supervised Contrastive Learning With Similarity Co-Calibration | 85
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition | 83
PropMambaSR: Lightweight Image Super-Resolution with Propagation State Space Model | 81
DWSF-Net: A Dynamic Wavelet-based Spatial-frequency Fusion Network for Multispectral Object Detection | 79
3D-SceneQ: Empowering 3D LLM with Query-Guided Adaptive Pruning and Multi-modal Feature Enhancement | 78
Perceptual Image Hashing Using Feature Fusion of Orthogonal Moments | 78
Adaptive Weight Generator for Multi-Task Image Recognition by Task Grouping Prompt | 77
Revisiting the Adversarial Transferability: Towards a Perspective of Semantic Preservation | 77
Watch Where You Move: Region-Aware Dynamic Aggregation and Excitation for Gait Recognition | 76
MHRN: A Multimodal Hierarchical Reasoning Network for Topic Detection | 76
Unsupervised Learning-Based Framework for Deepfake Video Detection | 75
Neighborhood Contrastive Transformer for Change Captioning | 75
Scale Up Composed Image Retrieval Learning via Modification Text Generation | 75
Siamese Alignment Network for Weakly Supervised Video Moment Retrieval | 74
Towards Neural Codec-Empowered 360° Video Streaming: A Saliency-Aided Synergistic Approach | 74
Towards Temporal Event Detection: A Dataset, Benchmarks and Challenges | 73
Image-Based Structured Vehicle Behavior Analysis Inspired by Interactive Cognition | 73
Benchmark Dataset and Pair-Wise Ranking Method for Quality Evaluation of Night-Time Image Enhancement | 70
Human-Centric Behavior Description in Videos: New Benchmark and Model | 70
Sparse Transformer for Ultra-Sparse Sampled Video Compressive Sensing | 70
Exploiting EfficientSAM and Temporal Coherence for Audio-Visual Segmentation | 69
VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition | 68
Sounding Depressed? Personalized Deep Learning Model for Depression Detection from Speech and Text | 68
Anchor-guided Discrete Multi-view Clustering | 66
Compositional Text-to-Image Synthesis with Training-Free Layout-Guided Diffusion | 66
EPM-Net: Efficient Feature Extraction, Point-Pair Feature Matching for Robust 6-D Pose Estimation | 66
Exploring Local and Global Consistent Correlation on Hypergraph for Rotation Invariant Point Cloud Analysis | 65
Prototypical Bidirectional Adaptation and Learning for Cross-Domain Semantic Segmentation | 64
ALCER3D: Adaptive Learning Constraints for Enhanced Retrieval of Complex Indoor 3D Scenarios | 64
High Specificity Guided Cross-Domain Few-Shot Segmentation | 63
Cps-STS: Bridging the Gap Between Content and Position for Coarse-Point-Supervised Scene Text Spotter | 63
RUL: Region Uncertainty Learning for Robust Face Recognition | 63
Improving Fine-Grained Image Classification With Multimodal Information | 62
Knowledge Distillation-Based Domain-Invariant Representation Learning for Domain Generalization | 61
A Two-Stream Hybrid Convolution-Transformer Network Architecture for Clothing-Change Person Re-Identification | 61
RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios | 61
Supervised Contrastive Learning for Indoor Point Cloud Oversegmentation | 59
Dynamic Strategy Prompt Reasoning for Emotional Support Conversation | 59
Primary Code Guided Targeted Attack against Cross-modal Hashing Retrieval | 59
Personalized Fashion Recommendation With Discrete Content-Based Tensor Factorization | 58
Cooperative Bargaining Game Based Adaptive Video Multicast Over Mobile Edge Networks | 58
TPE-ADE: Thumbnail-Preserving Encryption Based on Adaptive Deviation Embedding for JPEG Images | 58
JPEG AI Compressed Domain Face Detection: A Multi-Scale Bridging Perspective | 58
Denoised Semantic Features for Local Consistent No-Reference Image Quality Assessment | 57
No-Reference Bitstream-Layer Model for Perceptual Quality Assessment of V-PCC Encoded Point Clouds | 57
Universal Infrared Image Nonuniformity Correction via Stripe-Aware Attention Network | 57
Probabilistic Temporal Masked Attention for Cross-View Online Action Detection | 57
Style-Agnostic Representation Learning for Visible-Infrared Person Re-Identification | 57
Motion Direction Awareness: A Biomimetic Dynamic Capture Mechanism for Video Prediction | 56
Foodfusion: A Novel Approach for Food Image Composition via Diffusion Models | 56
Investigating the Effective Dynamic Information of Spectral Shapes for Audio Classification | 56
Multi-View User Preference Modeling for Personalized Text-to-Image Generation | 56
DEHand: Deformable Encoding for Photo-Realistic Free-View and Free-Pose Hand Rendering | 56
Action-Responsive Contrastive Network for Fine-Grained Skeleton-Based Action Recognition | 56
Depth Map Super-Resolution via Deep Cross-Modality and Cross-Scale Guidance | 56
DREAMT: Diversity Enlarged Mutual Teaching for Unsupervised Domain Adaptive Person Re-Identification | 55
High Fidelity Face-Swapping With Style ConvTransformer and Latent Space Selection | 54
Boosting Universal Adversarial Attack on Deep Neural Networks | 54
Spatial-Temporal Saliency Guided Unbiased Contrastive Learning for Video Scene Graph Generation | 53
Semi-Supervised Authentically Distorted Image Quality Assessment With Consistency-Preserving Dual-Branch Convolutional Neural Network | 53
Reconstructed Graph Constrained Auto-Encoders for Multi-View Representation Learning | 53
Exploring Kernel-Based Texture Transfer for Pose-Guided Person Image Generation | 53
Show, Tell and Rephrase: Diverse Video Captioning via Two-Stage Progressive Training | 53
UniCrossGait: Unified Cross-modal Gait Recognition Based on Knowledge Distillation | 52
SDE2D: Semantic-Guided Discriminability Enhancement Feature Detector and Descriptor | 52
Improving Out-of-Distribution Generalization on Point Clouds with Cross-Domain Adversarial Distillation | 52
HP-C4D: A Fast Camera and 4D Radar Fusion Framework with Height Prediction For 3D Object Detection | 52
Generalizing Beyond Patterns: Dynamic Moment Query Recalibrating for Out-of-Distribution Video Temporal Localization | 51
Vulnerabilities in AI-Generated Image Detection: The Challenge of Adversarial Attacks | 51
SegTrans: Transferable Adversarial Examples for Segmentation Models | 51
C-CTX: Cubic-Checkerboard Context Entropy Model for Learned Image Compression | 50
Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement | 50
Can Machines Generate Personalized Music? A Hybrid Favorite-Aware Method for User Preference Music Transfer | 50
Low-Light Image Enhancement via Self-Reinforced Retinex Projection Model | 50
Test-Time Model Adaptation for Visual Question Answering With Debiased Self-Supervisions | 50
Dual Representation Aggregation Network for Blind Image Super-Resolution via Iterative Bi-level Optimization | 50
Twin Tensor Learning for Consistency and Inconsistency: A Unified Affinity Learning Framework for Multi-View Clustering | 50
Synthesize Boundaries: A Boundary-Aware Self-Consistent Framework for Weakly Supervised Salient Object Detection | 50
DA-Net: Density-Aware 3D Object Detection Network for Point Clouds | 50
CMANet: Context-Aware Mutual Attention Network for Referring Image Segmentation | 50
A Multidimensional Media Adaptation Framework for Live Holographic Communication | 50
CRSOT: Cross-Resolution Object Tracking Using Unaligned Frame and Event Cameras | 49
Motion Deblur by Learning Residual From Events | 49
Deep Unfolding Network for Image Compressed Sensing by Content-Adaptive Gradient Updating and Deformation-Invariant Non-Local Modeling | 49
Cross-Domain Sample Relationship Learning for Facial Expression Recognition | 49
FFFN: Frame-By-Frame Feedback Fusion Network for Video Super-Resolution | 49
Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification | 49
Saliency-Aware Adversarial Attacks on Visual Trackers | 49
Video Instance Segmentation by Instance Flow Assembly | 49
RSNet: Relation Separation Network for Few-Shot Similar Class Recognition | 49
Rate-Adaptive Neural Network for Image Compressive Sensing | 48
Video-to-Music Recommendation Using Temporal Alignment of Segments | 48
FedSH: Towards Privacy-Preserving Text-Based Person Re-Identification | 48
GLFF: Global and Local Feature Fusion for AI-Synthesized Image Detection | 48
Underwater Image Enhancement With Cascaded Contrastive Learning | 48
STNet: Scale Tree Network With Multi-Level Auxiliator for Crowd Counting | 48
IEIRNet: Inconsistency Exploiting Based Identity Rectification for Face Forgery Detection | 48
Sentiment-Enhanced Graph-Based Sarcasm Explanation in Dialogue | 47
Multimodal Sentiment Analysis With Image-Text Interaction Network | 47
Enhanced Context Mining and Filtering for Learned Video Compression | 47
Flow Guidance Deformable Compensation Network for Video Frame Interpolation | 47
Pedestrian Trajectory Prediction Based on Social Interactions Learning With Random Weights | 47
Blind Video Quality Assessment at the Edge | 47
Dense Video Captioning With Early Linguistic Information Fusion | 46
CenterTube: Tracking Multiple 3D Objects With 4D Tubelets in Dynamic Point Clouds | 46
OpenSlot: Mixed Open-Set Recognition With Object-Centric Learning | 46
Neural-Enhanced Rate Adaptation and Computation Distribution for Emerging mmWave Multi-User 3D Video Streaming Systems | 46
Multi-Grained Vision-and-Language Model for Medical Image and Text Alignment | 45
Augment One With Others: Generalizing to Unforeseen Variations for Visual Tracking | 45
SwimVG: Step-Wise Multimodal Fusion and Adaption for Visual Grounding | 45
Graph Convolutional Network With Unknown Class Number | 45
Decoupled Prototype Learning for Reliable Test-Time Adaptation | 45
Underwater Adaptive Video Transmissions Using MIMO-Based Software-Defined Acoustic Modems | 45
Prune and Merge: Efficient Token Compression for Vision Transformer With Spatial Information Preserved | 45
Compact-Yet-Separate: Proto-Centric Multi-Modal Hashing With Pronounced Category Differences for Multi-Modal Retrieval | 45
Geometric Continuity and Consistency Learning for Self-Supervised Point Cloud Completion | 45
Heterogeneous Multimodal Federated Learning with Missing Modality via Mask-Restoration and Self-Guidance | 44
Semantics Alternating Enhancement and Bidirectional Aggregation for Referring Video Object Segmentation | 44
A Blockchain and Improved Perception Hash Based Copyright Protection Scheme for Purely Chromatic Background Images | 44
Develop Then Rival: A Human Vision-Inspired Framework for Superimposed Image Decomposition | 44
YACT-Net: Asymmetric YUV Color Transfer for Reference-Based Colorization | 44
VRTNet: Vector Rectifier Transformer for Two-View Correspondence Learning | 44
Category-Contrastive Fine-Grained Crowd Counting and Beyond | 44
Tuning-Free High-Resolution Video Diffusion With Spatial-Temporal Latent Grouping | 43
Tensorformer: Normalized Matrix Attention Transformer for High-Quality Point Cloud Reconstruction | 43
Visibility-Based Geometry Pruning of Neural Plenoptic Scene Representations | 43
MGHead: Motion-Aware Animated Gaussian Head Avatars with Anchored Skeletal Structures | 43
CMI-Net: Cross-View Message Token Interaction Network for 3D Shape Recognition | 43
Edge-Assisted Massive Video Delivery Over Cell-Free Massive MIMO | 43
Compression of Plenoptic Point Cloud Attributes Using 6-D Point Clouds and 6-D Transforms | 43
TR-Adapter: Parameter-Efficient Transfer Learning for Video Question Answering | 43
Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization | 43
Inexactly Matched Referring Expression Comprehension With Rationale | 43
Progressive Learning of Instance-Level Proxy Semantics for Few-Shot Action Recognition | 43
Interpretable Multi-View Representation Learning Towards Complex Scenes: From Homogeneity to Heterogeneity | 43
Instruction-Driven 3D Facial Expression Generation and Transition | 42
ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization | 42
MVL-Net: Pairwise Learning for Multi-View Multiple People Labelling | 42
Unleashing Knowledge Potential of Source Hypothesis for Source-Free Domain Adaptation | 42
Face De-Occlusion With Deep Cascade Guidance Learning | 42
Geo-SelfSSC: Integrating Dense Geometric Priors for Enhanced Self-Supervised Semantic Scene Completion | 42
Neuromorphic Similarity Measurement of Tactile Stimuli in Human–Machine Interface | 42
Harnessing Attention Weight Tables for Computationally Efficient Multiple Object Tracking With Transformers | 42
LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering | 42
SSPNet: Predicting Visual Saliency Shifts | 42
Geometry-Aware 3D Gaussian Representation for Real-Time Rendering of Large-Scale Scenes | 41
Cross-Modality Feature Fusion for Forward-Looking Sonar Image Segmentation in Complex Underwater Environments | 41
Improving Visual Object Tracking Through Visual Prompting | 41
Towards a Multi-Granulated Statistical Framework for Human–Machine Collaboration in Image Classification | 41
Scene Graph Knowledge Enhanced Hashing with Contrastive Learning for Image-Text Retrieval | 41
Soundscape Captioning Using Sound Affective Quality Network and Large Language Model | 41
Exploring Cross-Modal Mutual Prompt Learning for Video Quality Assessment | 41
Simultaneously Training and Compressing Vision-and-Language Pre-Training Model | 41
AFANet: Adaptive Frequency-Aware Network for Weakly-Supervised Few-Shot Semantic Segmentation | 41
Unsupervised Deepfake Detection via Camera Source Clustering and Temporal-Spatial Features | 41
Auxiliary Representation Guided Network for Visible-Infrared Person Re-Identification | 41
Cross-Scatter Sparse Dictionary Pair Learning for Cross-Domain Classification | 41
Toward General Cross-Modal Signal Reconstruction for Robotic Teleoperation | 41
Progressive Learning Model for Big Data Analysis Using Subnetwork and Moore-Penrose Inverse | 41
Noise Aware Audio-Visual Speech Denoising | 40
Point Cloud Soft Multicast for Untethered XR Users | 40
TrackletGait: A Robust Framework for Gait Recognition in the Wild | 40
Question Understanding and Temporality Guiding for Video Question Answering | 40
Learning Semantic Polymorphic Mapping for Text-Based Person Retrieval | 40
Reordered k-Means: A New Baseline for View-Unaligned Multi-View Clustering | 40
Bidirectional Maximum Entropy Training With Word Co-Occurrence for Video Captioning | 40
Hierarchical Motion-Enhanced Matching Framework for Few-Shot Action Recognition | 40
Manifold-Based Incomplete Multi-View Clustering via Bi-Consistency Guidance | 40
STFE: A Comprehensive Video-Based Person Re-Identification Network Based on Spatio-Temporal Feature Enhancement | 39
FGDNet: Fine-Grained Detection Network Towards Face Anti-Spoofing | 39
Multimodal Evidential Learning for Open-World Weakly-Supervised Video Anomaly Detection | 39
CNIE: Content-Aware Non-Transferable Information Extraction for Fine-Grained Visual Categorization | 39
Textual Enhanced Adaptive Meta-Fusion for Few-Shot Visual Recognition | 39
Make Graph-Based Referring Expression Comprehension Great Again Through Expression-Guided Dynamic Gating and Regression | 39
Towards Structure-Aware Model for Multi-Modal Knowledge Graph Completion | 39
LARNet: Towards Lightweight, Accurate and Real-Time Salient Object Detection | 39
MPPM: A Mobile-Efficient Part Model for Object re-ID | 39
EraW-Net: Enhance-Refine-Align W-Net for Scene-Associated Driver Attention Estimation | 39
Joint Intra & Inter-Grained Reasoning: A New Look Into Semantic Consistency of Image-Text Retrieval | 39
CLCT: Complementary Local Consensus Transformer for Two-View Correspondence Pruning | 39
Semi-Supervised Knowledge Distillation for Cross-Modal Hashing | 39
Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction | 38
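
The selection rule stated above the table (papers above the citation threshold, published within the four-year window, at most 250 entries) can be sketched in Python as follows. This is a minimal illustration, not the site's actual procedure: the `Paper` record, the `select_top_cited` function, the strict "greater than" comparison, the inclusive date bounds, and the sample records are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    # Hypothetical record type; field names are assumptions for this sketch.
    title: str
    citations: int
    published: date


def select_top_cited(papers, threshold, start, end, limit=250):
    """Keep papers published in [start, end] with citations above `threshold`,
    sorted by citation count (highest first) and capped at `limit` entries."""
    eligible = [
        p for p in papers
        if start <= p.published <= end and p.citations > threshold
    ]
    eligible.sort(key=lambda p: p.citations, reverse=True)
    return eligible[:limit]


# Made-up sample records, not taken from the table.
sample = [
    Paper("Hypothetical paper A", 40, date(2023, 3, 1)),
    Paper("Hypothetical paper B", 16, date(2024, 6, 1)),   # equals threshold: excluded
    Paper("Hypothetical paper C", 120, date(2021, 1, 1)),  # outside the window: excluded
    Paper("Hypothetical paper D", 75, date(2025, 9, 1)),
]

selected = select_top_cited(
    sample, threshold=16, start=date(2022, 5, 1), end=date(2026, 5, 1)
)
print([p.title for p in selected])
# prints ['Hypothetical paper D', 'Hypothetical paper A']
```

Because of the 250-entry cap, the lowest count shown in a table like this one (38 here) can sit well above the threshold itself (16).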