OOIR: Observatory of International Research

Papers

The median citation count of IEEE Transactions on Multimedia is 5. The table below lists the papers whose CrossRef citation counts are above that threshold (maximum 250 papers). It covers papers published in the past four years, i.e., from 2022-06-01 to 2026-06-01.
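The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not OOIR's actual implementation: the function name `select_papers`, the tuple layout, and the sample records are hypothetical, and it assumes that the median is taken over the papers inside the date window, that "above" means strictly greater, and that both window dates are inclusive.

```python
from datetime import date
from statistics import median


def select_papers(papers, start, end, cap=250):
    """Return papers cited more often than the median, most cited first.

    papers: list of (title, citations, published) tuples (hypothetical layout).
    Assumptions: the median is computed over papers inside the window,
    "above" means strictly greater, and both window dates are inclusive.
    """
    in_window = [p for p in papers if start <= p[2] <= end]
    if not in_window:
        return []
    threshold = median(p[1] for p in in_window)
    above = [p for p in in_window if p[1] > threshold]
    above.sort(key=lambda p: p[1], reverse=True)  # descending by citations
    return above[:cap]  # keep at most `cap` rows


# Made-up records for illustration only (not real citation data).
sample = [
    ("A", 10, date(2023, 1, 1)),
    ("B", 5, date(2023, 5, 1)),
    ("C", 1, date(2024, 1, 1)),
    ("D", 50, date(2020, 1, 1)),  # outside the window, ignored
    ("E", 7, date(2025, 1, 1)),
]
selected = select_papers(sample, date(2022, 6, 1), date(2026, 6, 1))
for title, citations, _ in selected:
    print(f"{title}\t{citations}")
```

With a cap in place, the lowest count shown in a table can sit well above the median, because only the most cited rows survive the cut.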

Article	Citations
Improving Vision Anomaly Detection With the Guidance of Language Modality	931
Focusing on Subtle Differences: A Feature Disentanglement Model for Series Photo Selection	509
Vulnerability of Feature Extractors in 2D Image-Based 3D Object Retrieval	365
SkyML: A MLaaS Federation Design for Multicloud-Based Multimedia Analytics	336
Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention	248
Rethinking Affine Transform for Efficient Image Enhancement: A Color Space Perspective	200
FoodSAM: Any Food Segmentation	191
Online Low-Light Sand-Dust Video Enhancement Using Adaptive Dynamic Brightness Correction and a Rolling Guidance Filter	184
Semi-Supervised Domain Adaptation via Joint Transductive and Inductive Subspace Learning	183
Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization	180
SGG-Nets: Generic Rotation-Invariant Plugin Networks for Point Cloud Analysis	176
ViDR-GNN: Vision Implicit Discriminative Reorganization Graph Neural Networks	175
Dual-Task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding	157
Ensemble Prototype Networks for Unsupervised Cross-Modal Hashing With Cross-Task Consistency	153
Late Fusion Multiple Kernel Clustering With Local Kernel Alignment Maximization	153
Few-Shot Generative Model Adaptation via Style-Guided Prompt	150
Outliers Adaptation Exploration and Centroids Matching Label Refinement for Unsupervised Person Re-identification	149
Feature First: Advancing Image-Text Retrieval Through Improved Visual Features	144
Weakly-Supervised Video Object Grounding via Learning Uni-Modal Associations	144
Quality Assessment for DIBR-Synthesized Views Based on Wavelet Transform and Gradient Magnitude Similarity	143
Towards Substation Semantic Segmentation: A benchmark dataset and a cross-attention embedded hierarchical network	141
HRVFusion: Video-based Long-Term Heart Rate Variability Measurement with Conditional Diffusion Models	139
ICE: Interactive 3D Game Character Facial Editing via Dialogue	135
Revisiting the Adversarial Transferability: Towards a Perspective of Semantic Preservation	133
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition	132
Exploring Kernel Transformations for Implicit Neural Representations	130
Posture-Movement-Frequency-Enhanced Graph Convolutional Network for Gait Emotion Recognition	127
Mask-Aware Kernel Learning for Action Recognition	126
Siamese Alignment Network for Weakly Supervised Video Moment Retrieval	125
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework	125
Anomaly-Led Prompting Learning Caption Generating Model and Benchmark	124
BMB: Balanced Memory Bank for Long-Tailed Semi-Supervised Learning	123
LMAgent: A Large-scale Multimodal Agents Society for Multi-user Simulation	120
Hierarchical Equalization Loss for Long-Tailed Instance Segmentation	119
BASNet: Boundary Assisted Network for Image Splicing Forgery Detection	116
Bias-Correction Feature Learner for Semi-Supervised Instance Segmentation	116
Mix-Based Training Strategies for Learning Implicit Neural Representations	113
Pixel Bleach Network for Detecting Face Forgery Under Compression	113
Skeleton-Based Action Recognition With Select-Assemble-Normalize Graph Convolutional Networks	111
Bidirectional Translation Between UHD-HDR and HD-SDR Videos	110
Neighborhood Contrastive Transformer for Change Captioning	109
Scale Up Composed Image Retrieval Learning via Modification Text Generation	107
Optimal Transport-Based Patch Matching for Image Style Transfer	105
Robust Multi-Stage Tracking via Multi-Scale and Multi-Level Representation Learning	104
Long Video Understanding with Learnable Retrieval in Video-Language Models	103
Transferable Backdoor Attack on Any CLIP Model With Any Target Class by Pre-Trained Hack Network	101
Watch Where You Move: Region-Aware Dynamic Aggregation and Excitation for Gait Recognition	100
Efficient Cross-Modal Video Retrieval With Meta-Optimized Frames	100
DWSF-Net: A Dynamic Wavelet-based Spatial-frequency Fusion Network for Multispectral Object Detection	97
Perceptual Image Hashing Using Feature Fusion of Orthogonal Moments	97
PropMambaSR: Lightweight Image Super-Resolution with Propagation State Space Model	97
3D-SceneQ: Empowering 3D LLM with Query-Guided Adaptive Pruning and Multi-modal Feature Enhancement	97
MVPC-CLIP: Multi-Granularity Visual Prompt Co-Operative for Aerial Video Recognition	96
Adaptive Weight Generator for Multi-Task Image Recognition by Task Grouping Prompt	96
MHRN: A Multimodal Hierarchical Reasoning Network for Topic Detection	96
Semi-Supervised Contrastive Learning With Similarity Co-Calibration	95
Semantic Dual-Adversarial Network for Blended-Target Domain Adaptation	95
Deep Semantic-Consistent Penalizing Hashing for Cross-Modal Retrieval	95
Self-Guided Discriminative Locality Preserving Projections	94
Semantic-Aware Triplet Loss for Image Classification	94
SCSP: An Unsupervised Image-to-Image Translation Network Based on Semantic Cooperative Shape Perception	94
Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective	93
Adaptive HEVC Video Steganography With High Performance Based on Attention-Net and PU Partition Modes	91
Distributed Deep Point Cloud Feature Compression for Vehicle-to-Vehicle Cooperative Perception	90
Unsupervised Learning-Based Framework for Deepfake Video Detection	89
AMS-Net: Adaptive Multi-Scale Network for Image Compressive Sensing	89
Self-Mining the Confident Prototypes for Source-Free Unsupervised Domain Adaptation in Image Segmentation	89
Progressive Local Filter Pruning for Image Retrieval Acceleration	88
Interpretable Graph Convolutional Network for Multi-View Semi-Supervised Learning	87
Guided Image-to-Image Translation by Discriminator-Generator Communication	84
Weakly-Supervised 3D Visual Grounding Based on Visual Language Alignment	84
Disentangled Graph Variational Auto-Encoder for Multimodal Recommendation With Interpretability	82
Dynamic Contrastive Distillation for Image-Text Retrieval	82
Multi-Level Transitional Contrast Learning for Personalized Image Aesthetics Assessment	82
PhotoHelper: Portrait Photographing Guidance Via Deep Feature Retrieval and Fusion	81
SLCGC: A lightweight Self-supervised Low-Pass Contrastive Graph Clustering Network for Hyperspectral Images	80
Asymptotics-Aware Multi-View Subspace Clustering	79
Semi-Supervised Domain Adaptation for Major Depressive Disorder Detection	78
One-Shot Human Motion Transfer via Occlusion-Robust Flow Prediction and Neural Texturing	78
Rényi Entropy Induced Efficient and Balanced One-Step Multi-View Clustering	77
Vision-Controllable Language Model for Image-Guided Story Ending Generation	77
Disaggregation Distillation for Person Search	76
Towards Neural Codec-Empowered 360° Video Streaming: A Saliency-Aided Synergistic Approach	75
MGKsite: Multi-Modal Knowledge-Driven Site Selection via Intra and Inter-Modal Graph Fusion	75
Towards Temporal Event Detection: A Dataset, Benchmarks and Challenges	74
Image-Based Structured Vehicle Behavior Analysis Inspired by Interactive Cognition	73
Benchmark Dataset and Pair-Wise Ranking Method for Quality Evaluation of Night-Time Image Enhancement	72
ALCER3D: Adaptive Learning Constraints for Enhanced Retrieval of Complex Indoor 3D Scenarios	71
Cps-STS: Bridging the Gap Between Content and Position for Coarse-Point-Supervised Scene Text Spotter	71
Dual Representation Aggregation Network for Blind Image Super-Resolution via Iterative Bi-level Optimization	71
Synthesize Boundaries: A Boundary-Aware Self-Consistent Framework for Weakly Supervised Salient Object Detection	70
C-CTX: Cubic-Checkerboard Context Entropy Model for Learned Image Compression	69
Primary Code Guided Targeted Attack against Cross-modal Hashing Retrieval	69
Supervised Contrastive Learning for Indoor Point Cloud Oversegmentation	69
Style-Agnostic Representation Learning for Visible-Infrared Person Re-Identification	67
No-Reference Bitstream-Layer Model for Perceptual Quality Assessment of V-PCC Encoded Point Clouds	66
RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios	66
SegTrans: Transferable Adversarial Examples for Segmentation Models	65
DEHand: Deformable Encoding for Photo-Realistic Free-View and Free-Pose Hand Rendering	65
Twin Tensor Learning for Consistency and Inconsistency: A Unified Affinity Learning Framework for Multi-View Clustering	64
Improving Fine-Grained Image Classification With Multimodal Information	63
DA-Net: Density-Aware 3D Object Detection Network for Point Clouds	63
JPEG AI Compressed Domain Face Detection: A Multi-Scale Bridging Perspective	63
Cross-Domain Sample Relationship Learning for Facial Expression Recognition	62
Can Machines Generate Personalized Music? A Hybrid Favorite-Aware Method for User Preference Music Transfer	62
Deep Unfolding Network for Image Compressed Sensing by Content-Adaptive Gradient Updating and Deformation-Invariant Non-Local Modeling	62
Semi-Supervised Authentically Distorted Image Quality Assessment With Consistency-Preserving Dual-Branch Convolutional Neural Network	62
FedSH: Towards Privacy-Preserving Text-Based Person Re-Identification	61
Reconstructed Graph Constrained Auto-Encoders for Multi-View Representation Learning	61
Motion Direction Awareness: A Biomimetic Dynamic Capture Mechanism for Video Prediction	60
Action-Responsive Contrastive Network for Fine-Grained Skeleton-Based Action Recognition	60
Depth Map Super-Resolution via Deep Cross-Modality and Cross-Scale Guidance	59
Test-Time Model Adaptation for Visual Question Answering With Debiased Self-Supervisions	59
RUL: Region Uncertainty Learning for Robust Face Recognition	59
Foodfusion: A Novel Approach for Food Image Composition via Diffusion Models	59
Investigating the Effective Dynamic Information of Spectral Shapes for Audio Classification	59
Multi-View User Preference Modeling for Personalized Text-to-Image Generation	59
High Specificity Guided Cross-Domain Few-Shot Segmentation	59
Knowledge Distillation-Based Domain-Invariant Representation Learning for Domain Generalization	58
Dynamic Strategy Prompt Reasoning for Emotional Support Conversation	58
Exploiting EfficientSAM and Temporal Coherence for Audio-Visual Segmentation	58
Video Instance Segmentation by Instance Flow Assembly	57
A Two-Stream Hybrid Convolution-Transformer Network Architecture for Clothing-Change Person Re-Identification	57
Generalizing Beyond Patterns: Dynamic Moment Query Recalibrating for Out-of-Distribution Video Temporal Localization	57
STNet: Scale Tree Network With Multi-Level Auxiliator for Crowd Counting	57
Velocity First? Rethinking 3D Object Detection with 4D Millimeter Wave Radar	57
Video-to-Music Recommendation Using Temporal Alignment of Segments	56
Sentiment-Enhanced Graph-Based Sarcasm Explanation in Dialogue	54
Rate-Adaptive Neural Network for Image Compressive Sensing	54
Vulnerabilities in AI-Generated Image Detection: The Challenge of Adversarial Attacks	54
IEIRNet: Inconsistency Exploiting Based Identity Rectification for Face Forgery Detection	54
TPE-ADE: Thumbnail-Preserving Encryption Based on Adaptive Deviation Embedding for JPEG Images	53
Reliable Multi-View Clustering with Graph Neural Network	53
Boosting Universal Adversarial Attack on Deep Neural Networks	53
GLFF: Global and Local Feature Fusion for AI-Synthesized Image Detection	53
Enhanced Context Mining and Filtering for Learned Video Compression	53
Improving Out-of-Distribution Generalization on Point Clouds with Cross-Domain Adversarial Distillation	53
CRSOT: Cross-Resolution Object Tracking Using Unaligned Frame and Event Cameras	52
Sounding Depressed? Personalized Deep Learning Model for Depression Detection from Speech and Text	52
Exploring Local and Global Consistent Correlation on Hypergraph for Rotation Invariant Point Cloud Analysis	52
VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition	52
EPM-Net: Efficient Feature Extraction, Point-Pair Feature Matching for Robust 6-D Pose Estimation	52
Compositional Text-to-Image Synthesis with Training-Free Layout-Guided Diffusion	52
Prototypical Bidirectional Adaptation and Learning for Cross-Domain Semantic Segmentation	52
Sparse Transformer for Ultra-Sparse Sampled Video Compressive Sensing	52
Cooperative Bargaining Game Based Adaptive Video Multicast Over Mobile Edge Networks	52
Saliency-Aware Adversarial Attacks on Visual Trackers	52
Anchor-guided Discrete Multi-view Clustering	52
Human-Centric Behavior Description in Videos: New Benchmark and Model	52
Spatial-Temporal Saliency Guided Unbiased Contrastive Learning for Video Scene Graph Generation	51
Probabilistic Temporal Masked Attention for Cross-View Online Action Detection	51
Denoised Semantic Features for Local Consistent No-Reference Image Quality Assessment	51
Universal Infrared Image Nonuniformity Correction via Stripe-Aware Attention Network	51
RSNet: Relation Separation Network for Few-Shot Similar Class Recognition	51
Personalized Fashion Recommendation With Discrete Content-Based Tensor Factorization	51
A Multidimensional Media Adaptation Framework for Live Holographic Communication	50
DREAMT: Diversity Enlarged Mutual Teaching for Unsupervised Domain Adaptive Person Re-Identification	50
Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification	50
CMANet: Context-Aware Mutual Attention Network for Referring Image Segmentation	50
High Fidelity Face-Swapping With Style ConvTransformer and Latent Space Selection	50
HP-C4D: A Fast Camera and 4D Radar Fusion Framework with Height Prediction For 3D Object Detection	49
FFFN: Frame-By-Frame Feedback Fusion Network for Video Super-Resolution	49
Show, Tell and Rephrase: Diverse Video Captioning via Two-Stage Progressive Training	49
Exploring Kernel-Based Texture Transfer for Pose-Guided Person Image Generation	49
UniCrossGait: Unified Cross-modal Gait Recognition Based on Knowledge Distillation	49
SDE2D: Semantic-Guided Discriminability Enhancement Feature Detector and Descriptor	49
Bidirectional Prototype-Reward co-Evolution for Test-Time Adaptation of Vision-Language Models	49
Multimodal Sentiment Analysis With Image-Text Interaction Network	48
Motion Deblur by Learning Residual From Events	48
Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement	48
Underwater Image Enhancement With Cascaded Contrastive Learning	48
Low-Light Image Enhancement via Self-Reinforced Retinex Projection Model	48
Pedestrian Trajectory Prediction Based on Social Interactions Learning With Random Weights	48
Dense Video Captioning With Early Linguistic Information Fusion	47
Flow Guidance Deformable Compensation Network for Video Frame Interpolation	47
CenterTube: Tracking Multiple 3D Objects With 4D Tubelets in Dynamic Point Clouds	47
Unleashing Knowledge Potential of Source Hypothesis for Source-Free Domain Adaptation	46
Semi-Supervised Knowledge Distillation for Cross-Modal Hashing	46
Going the Extra Mile in Face Image Quality Assessment: A Novel Database and Model	46
DIP: Diffusion Learning of Inconsistency Pattern for General DeepFake Detection	46
USD: Uncertainty-Based One-Phase Learning to Enhance Pseudo-Label Reliability for Semi-Supervised Object Detection	46
Rethinking the Role of Vector Quantization for Blind Image Restoration	46
VatLM: Visual-Audio-Text Pre-Training With Unified Masked Prediction for Speech Representation Learning	46
RaFPN: Relation-Aware Feature Pyramid Network for Dense Image Prediction	45
Augment One With Others: Generalizing to Unforeseen Variations for Visual Tracking	45
Underwater Adaptive Video Transmissions Using MIMO-Based Software-Defined Acoustic Modems	45
Semantics Alternating Enhancement and Bidirectional Aggregation for Referring Video Object Segmentation	45
Blind Video Quality Assessment at the Edge	45
OpenSlot: Mixed Open-Set Recognition With Object-Centric Learning	45
Decoupled Prototype Learning for Reliable Test-Time Adaptation	45
Prune and Merge: Efficient Token Compression for Vision Transformer With Spatial Information Preserved	45
SwimVG: Step-Wise Multimodal Fusion and Adaption for Visual Grounding	45
LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering	45
Graph Convolutional Network With Unknown Class Number	45
Geometric Continuity and Consistency Learning for Self-Supervised Point Cloud Completion	45
Category-Contrastive Fine-Grained Crowd Counting and Beyond	45
SSPNet: Predicting Visual Saliency Shifts	44
Soundscape Captioning Using Sound Affective Quality Network and Large Language Model	44
Auxiliary Representation Guided Network for Visible-Infrared Person Re-Identification	44
Unsupervised Deepfake Detection via Camera Source Clustering and Temporal-Spatial Features	44
Cross-Modality Feature Fusion for Forward-Looking Sonar Image Segmentation in Complex Underwater Environments	44
Geometry-Aware 3D Gaussian Representation for Real-Time Rendering of Large-Scale Scenes	44
Scene Graph Knowledge Enhanced Hashing with Contrastive Learning for Image-Text Retrieval	44
Cross-Scatter Sparse Dictionary Pair Learning for Cross-Domain Classification	43
Toward General Cross-Modal Signal Reconstruction for Robotic Teleoperation	43
A Blockchain and Improved Perception Hash Based Copyright Protection Scheme for Purely Chromatic Background Images	43
Inexactly Matched Referring Expression Comprehension With Rationale	43
Progressive Learning of Instance-Level Proxy Semantics for Few-Shot Action Recognition	43
Exploring Cross-Modal Mutual Prompt Learning for Video Quality Assessment	43
Towards a Multi-Granulated Statistical Framework for Human–Machine Collaboration in Image Classification	43
TR-Adapter: Parameter-Efficient Transfer Learning for Video Question Answering	43
Tensorformer: Normalized Matrix Attention Transformer for High-Quality Point Cloud Reconstruction	43
YACT-Net: Asymmetric YUV Color Transfer for Reference-Based Colorization	43
Simultaneously Training and Compressing Vision-and-Language Pre-Training Model	43
VRTNet: Vector Rectifier Transformer for Two-View Correspondence Learning	43
Edge-Assisted Massive Video Delivery Over Cell-Free Massive MIMO	43
Compression of Plenoptic Point Cloud Attributes Using 6-D Point Clouds and 6-D Transforms	43
Interpretable Multi-View Representation Learning Towards Complex Scenes: From Homogeneity to Heterogeneity	42
Instruction-Driven 3D Facial Expression Generation and Transition	42
Visibility-Based Geometry Pruning of Neural Plenoptic Scene Representations	42
CMI-Net: Cross-View Message Token Interaction Network for 3D Shape Recognition	42
Harnessing Attention Weight Tables for Computationally Efficient Multiple Object Tracking With Transformers	42
Tuning-Free High-Resolution Video Diffusion With Spatial-Temporal Latent Grouping	42
Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization	42
MGHead: Motion-Aware Animated Gaussian Head Avatars with Anchored Skeletal Structures	42
Point Cloud Soft Multicast for Untethered XR Users	41
FGDNet: Fine-Grained Detection Network Towards Face Anti-Spoofing	41
Joint Intra & Inter-Grained Reasoning: A New Look Into Semantic Consistency of Image-Text Retrieval	41
STFE: A Comprehensive Video-Based Person Re-Identification Network Based on Spatio-Temporal Feature Enhancement	41
TrackletGait: A Robust Framework for Gait Recognition in the Wild	41
MVL-Net: Pairwise Learning for Multi-View Multiple People Labelling	41
Question Understanding and Temporality Guiding for Video Question Answering	41
Reordered k-Means: A New Baseline for View-Unaligned Multi-View Clustering	41
Make Graph-Based Referring Expression Comprehension Great Again Through Expression-Guided Dynamic Gating and Regression	41
Towards Structure-Aware Model for Multi-Modal Knowledge Graph Completion	41
ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization	41
Progressive Learning Model for Big Data Analysis Using Subnetwork and Moore-Penrose Inverse	41
Bidirectional Maximum Entropy Training With Word Co-Occurrence for Video Captioning	41
EraW-Net: Enhance-Refine-Align W-Net for Scene-Associated Driver Attention Estimation	41
MPPM: A Mobile-Efficient Part Model for Object re-ID	41
CNIE: Content-Aware Non-Transferable Information Extraction for Fine-Grained Visual Categorization	41
Geo-SelfSSC: Integrating Dense Geometric Priors for Enhanced Self-Supervised Semantic Scene Completion	41
Noise Aware Audio-Visual Speech Denoising	41
Manifold-Based Incomplete Multi-View Clustering via Bi-Consistency Guidance	40
A Pixel Distribution Remapping and Multi-Prior Retinex Variational Model for Underwater Image Enhancement	40
Hierarchical Motion-Enhanced Matching Framework for Few-Shot Action Recognition	40
Low-Light Image Enhancement With SAM-Based Structure Priors and Guidance	40
Learning Semantic Polymorphic Mapping for Text-Based Person Retrieval	40
Multimodal Evidential Learning for Open-World Weakly-Supervised Video Anomaly Detection	40
Face De-Occlusion With Deep Cascade Guidance Learning	40