OOIR: Observatory of International Research

Papers

(The median citation count of IEEE-ACM Transactions on Audio Speech and Language Processing is 8. The table below lists the papers whose CrossRef citation counts are above that threshold [max. 250 papers]. It covers papers published in the past four years, i.e., from 2022-06-01 to 2026-06-01.)

Article	Citations
Representation Learning With Hidden Unit Clustering for Low Resource Speech Applications	340
Decorrelation in Feedback Delay Networks	251
CET2: Modelling Topic Transitions for Coherent and Engaging Knowledge-Grounded Conversations	237
WDEA: The Structure and Semantic Fusion With Wasserstein Distance for Low-Resource Language Entity Alignment	185
$\mathcal{P}$owMix: A Versatile Regularizer for Multimodal Sentiment Analysis	178
Towards Generating Diverse Audio Captions via Adversarial Training	171
MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation	164
Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition	136
DropAttack: A Random Dropped Weight Attack Adversarial Training for Natural Language Understanding	130
Reverberant Source Separation Using NTF With Delayed Subsources and Spatial Priors	117
Audio-Only Phonetic Segment Classification Using Embeddings Learned From Audio and Ultrasound Tongue Imaging Data	98
Envelope-Based Multichannel Noise Reduction for Cochlear Implant Applications	88
Generalizing Speaker Verification for Spoof Awareness in the Embedding Space	84
Attention-Based Speech Enhancement Using Human Quality Perception Modeling	82
Efficient Lightweight Speaker Verification With Broadcasting CNN-Transformer and Knowledge Distillation Training of Self-Attention Maps	77
Learning Discriminative Representations and Decision Boundaries for Open Intent Detection	71
Multi-Channel to Multi-Channel Noise Reduction and Reverberant Speech Preservation in Time-Varying Acoustic Scenes for Binaural Reproduction	70
Improvement of Accent Classification Models Through Grad-Transfer From Spectrograms and Gradient-Weighted Class Activation Mapping	62
A User-Centric Approach for Deep Residual-Echo Suppression in Double-Talk	61
Review of Methods for Automatic Speaker Verification	60
Enhancing Robustness of Speech Watermarking Using a Transformer-Based Framework Exploiting Acoustic Features	60
AudioLM: A Language Modeling Approach to Audio Generation	55
The VoxCeleb Speaker Recognition Challenge: A Retrospective	53
Adaptive Multi-Domain Dialogue State Tracking on Spoken Conversations	50
COVID-19 Detection via Fusion of Modulation Spectrum and Linear Prediction Speech Features	48
Label-Correction Capsule Network for Hierarchical Text Classification	48
IEEE Signal Processing Society Information	47
Pronunciation Dictionary-Free Multilingual Speech Synthesis Using Learned Phonetic Representations	42
ReZero: Region-Customizable Sound Extraction	41
Emotion Prediction Oriented Method With Multiple Supervisions for Emotion-Cause Pair Extraction	41
Complex-Domain Pitch Estimation Algorithm for Narrowband Speech Signals	41
Disentangled Text Representation Learning With Information-Theoretic Perspective for Adversarial Robustness	40
Spherically Steerable Vector Differential Microphone Arrays	39
Knowledge-Guided Transformer for Joint Theme and Emotion Classification of Chinese Classical Poetry	39
Exploiting Low-Rank Tensor-Train Deep Neural Networks Based on Riemannian Gradient Descent With Illustrations of Speech Processing	38
Implicit Self-Supervised Language Representation for Spoken Language Diarization	38
Enhanced Multi-Domain Dialogue State Tracker With Second-Order Slot Interactions	38
Distinctive and Natural Speaker Anonymization via Singular Value Transformation-Assisted Matrix	37
Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation	37
Audio-Visual Cross-Attention Network for Robotic Speaker Tracking	36
SPEC: Summary Preference Decomposition for Low-Resource Abstractive Summarization	36
Integrated Syntactic and Semantic Tree for Targeted Sentiment Classification Using Dual-Channel Graph Convolutional Network	35
Predicting Level-Dependent Changes in Concurrent Vowel Scores Using the 2D-CNN Models	35
Phrase-Aware Financial Sentiment Analysis Based on Constituent Syntax	35
Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach	35
Blind Identification of Ambisonic Reduced Room Impulse Response	34
Source Separation of Piano Concertos Using Musically Motivated Augmentation Techniques	34
End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations	34
Sound Field Estimation Based on Physics-Constrained Kernel Interpolation Adapted to Environment	34
Hate Speech Detection via Dual Contrastive Learning	34
Artificial Vocal Learning Guided by Phoneme Recognition and Visual Information	33
MusicYOLO: A Vision-Based Framework for Automatic Singing Transcription	33
Howling Detection and Gain Control for Speech Reinforcement in a Noisy Car Cabin Environment	33
Training a Singing Transcription Model Using Connectionist Temporal Classification Loss and Cross-Entropy Loss	32
Generalized Hyperbolic Tangent Based Random Fourier Conjugate Gradient Filter for Nonlinear Active Noise Control	32
Weighted Frequency Smoothing for Enhanced Speaker Localization	32
Multi-Layer Combined Frequency and Periodicity Representations for Multi-Pitch Estimation of Multi-Instrument Music	32
TOE: A Grid-Tagging Discontinuous NER Model Enhanced by Embedding Tag/Word Relations and More Fine-Grained Tags	31
Grouped Feedback Delay Networks With Frequency-Dependent Coupling	31
Learning Dynamic and Static Representations for Extrapolation-Based Temporal Knowledge Graph Reasoning	31
FlowHash: Accelerating Audio Search With Balanced Hashing via Normalizing Flow	30
Tackling Interpretability in Audio Classification Networks With Non-negative Matrix Factorization	30
Neural Multi-Channel and Multi-Microphone Acoustic Echo Cancellation	30
$F0$ Estimation and Voicing Detection With Cascade Architecture in Noisy Speech	30
Learning to Perturb for Contrastive Learning of Unsupervised Sentence Representations	30
DiCLET-TTS: Diffusion Model Based Cross-Lingual Emotion Transfer for Text-to-Speech — A Study Between English and Mandarin	30
Multi-Grained Evidence Inference for Multi-Choice Reading Comprehension	30
Cross-Modal Interaction via Reinforcement Feedback for Audio-Lyrics Retrieval	29
Query-Efficient Black-Box Adversarial Attacks on Automatic Speech Recognition	29
PE-Wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS	29
Unsupervised Music Source Separation Using Differentiable Parametric Source Models	29
Improving Speech Translation Accuracy and Time Efficiency With Fine-Tuned wav2vec 2.0-Based Speech Segmentation	29
Magnitude-Corrected and Time-Aligned Interpolation of Head-Related Transfer Functions	29
An Analysis of Traditional Noise Power Spectral Density Estimators Based on the Gaussian Stochastic Volatility Model	28
Unsupervised Disentanglement Learning Model for Exemplar-Guided Paraphrase Generation	28
Towards Recognition for Radio-Echo Speech in Air Traffic Control: Dataset and a Contrastive Learning Approach	28
Refining History for Future-Aware Neural Machine Translation	27
A New Diffusion Filtered-X Affine Projection Algorithm: Performance Analysis and Application in Windy Environment	27
Visually Grounded Few-Shot Word Learning in Low-Resource Settings	27
Anti-Aliasing Speech DOA Estimation Under Spatial Aliasing Conditions	27
Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization	27
Configurable EBEN: Extreme Bandwidth Extension Network to Enhance Body-Conducted Speech Capture	26
Unified Instance and Knowledge Alignment Pretraining for Aspect-Based Sentiment Analysis	26
Joint Multiscale Cross-Lingual Speaking Style Transfer With Bidirectional Attention Mechanism for Automatic Dubbing	26
The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation	26
CoNeTTE: An Efficient Audio Captioning System Leveraging Multiple Datasets With Task Embedding	26
Multi-Channel Speech Separation Using Spatially Selective Deep Non-Linear Filters	26
U-Shaped Transformer With Frequency-Band Aware Attention for Speech Enhancement	26
Attention-Based Encoder-Decoder End-to-End Neural Diarization With Embedding Enhancer	26
Direction of Arrival Estimation of Sound Sources Using Icosahedral CNNs	25
Audio-Visual End-to-End Multi-Channel Speech Separation, Dereverberation and Recognition	25
Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information	25
Decomposition-Based Wiener Filter Using the Kronecker Product and Conjugate Gradient Method	25
Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting	25
Enhancing Conformer-Based Sound Event Detection Using Frequency Dynamic Convolutions and BEATs Audio Embeddings	25
Scalable-Complexity Steered Response Power Based on Low-Rank and Sparse Interpolation	25
DiaPer: End-to-End Neural Diarization With Perceiver-Based Attractors	25
ASiT: Local-Global Audio Spectrogram Vision Transformer for Event Classification	24
The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance	24
TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition	23
Active Discovering New Slots for Task-Oriented Conversation	23
Higher-Order Stereophony	23
Optimal Modal Decomposition for Directionally Biased Sound Field Recording	23
Contrastive Learning for Target Speaker Extraction With Attention-Based Fusion	23
A Perceptually Evaluated Signal Model: Collisions Between a Vibrating Object and an Obstacle	23
Distance Metric-Based Open-Set Domain Adaptation for Speaker Verification	23
Coefficients-Switched Normalized Least-Mean-Squares Adaption in Echo Canceler of Sparse-Echo-Path	23
Minimum Processing Near-End Listening Enhancement	22
Towards Improved Objective Perceptual Audio Quality Assessment - Part 1: A Novel Data-Driven Cognitive Model	22
Enhancing Paraphrase Question Generation With Prior Knowledge	22
A CTC Alignment-Based Non-Autoregressive Transformer for End-to-End Automatic Speech Recognition	22
TriSAT: Trimodal Representation Learning for Multimodal Sentiment Analysis	22
Hybrid-Frequency-Resolution Adaptive Kalman Filter for Online Identification of Long Acoustic Responses With Low Input-Output Latency	22
EchoScan: Scanning Complex Room Geometries via Acoustic Echoes	22
Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction With Multi-Modal Embeddings	21
A Composite T60 Regression and Classification Approach for Speech Dereverberation	21
BEHM-GAN: Bandwidth Extension of Historical Music Using Generative Adversarial Networks	21
Consonant-Vowel Transition Models Based on Deep Learning for Objective Evaluation of Articulation	21
Multi-Source Discriminant Subspace Alignment for Cross-Domain Speech Emotion Recognition	21
A New Virtual Tracking Sub-Algorithm Based Hybrid Active Control System for Narrowband Noise With Impulsive Interference	21
Improving Non-Autoregressive Translation Quality With Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC	21
Meta-AF: Meta-Learning for Adaptive Filters	20
MOSA: Music Motion With Semantic Annotation Dataset for Cross-Modal Music Processing	20
Speech Enhancement and Dereverberation With Diffusion-Based Generative Models	20
Statistical Analysis for Speaker Recognition Evaluation With Data Dependence and Three Score Distributions	20
Multi-Source Localization Using Optimized Time-Frequency Representation and Sparsity Component Analysis	20
MRC-PASCL: A Few-Shot Machine Reading Comprehension Approach via Post-Training and Answer Span-Oriented Contrastive Learning	20
On the Predictive Power of Objective Intelligibility Metrics for the Subjective Performance of Deep Complex Convolutional Recurrent Speech Enhancement Networks	20
ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations	20
Empathetic Response Generation Based on Plug-and-Play Mechanism With Empathy Perturbation	20
A Flexible Architecture Using Temporal, Spatial and Semantic Correlation-Based Algorithms for Story Segmentation of Broadcast News	20
Latent-Domain Predictive Neural Speech Coding	20
Heterogeneous-Graph Reasoning With Context Paraphrase for Commonsense Question Answering	20
List of Reviewers	20
SpeechLM: Enhanced Speech Pre-Training With Unpaired Textual Data	20
En-HACN: Enhancing Hybrid Architecture With Fast Attention and Capsule Network for End-to-end Speech Recognition	19
Decomposed Meta-Learning for Few-Shot Sequence Labeling	19
EmoInt-Trans: A Multimodal Transformer for Identifying Emotions and Intents in Social Conversations	19
Operation-Augmented Numerical Reasoning for Question Answering	19
Interrelate Training and Clustering for Online Speaker Diarization	19
Cross-Domain Aspect-Based Sentiment Classification With Tripartite Graph Modeling	19
Block-Based Perceptually Adaptive Sound Zones With Reproduction Error Constraints	19
LMD: A Learnable Mask Network to Detect Adversarial Examples for Speaker Verification	18
FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection	18
Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis	18
Uncertainty-Driven Knowledge Distillation for Language Model Compression	18
RBA-GCN: Relational Bilevel Aggregation Graph Convolutional Network for Emotion Recognition	18
FxLMS/F Based Tap Decomposed Adaptive Filter for Decentralized Active Noise Control System	18
Multi-Cue Guided Semi-Supervised Learning Toward Target Speaker Separation in Real Environments	18
JMS-QA: A Joint Hierarchical Architecture for Mental Health Question Answering	18
Segment-Less Continuous Speech Separation of Meetings: Training and Evaluation Criteria	18
Dynamic Prompt-Driven Zero-Shot Relation Extraction	18
Gradformer: A Framework for Multi-Aspect Multi-Granularity Pronunciation Assessment	18
Dynamic Convolutional Neural Networks as Efficient Pre-Trained Audio Models	18
Joint Maximum Likelihood Estimation of Microphone Array Parameters for a Reverberant Single Source Scenario	18
Distributed Microphone Array Localization Problem via SDP-SOCP Method	17
Controllable Dialogue Generation With Disentangled Multi-Grained Style Specification and Attribute Consistency Reward	17
Towards Unified Multi-Domain Machine Translation With Mixture of Domain Experts	17
Automatic Detection of Speech Sound Disorder in Cantonese-Speaking Pre-School Children	17
Multi-Task Attentive Residual Networks for Argument Mining	17
Data-Centric Methods for Environmental Sound Classification With Limited Labels	17
Artist Similarity Based on Heterogeneous Graph Neural Networks	17
Direct and Residual Subspace Decomposition of Spatial Room Impulse Responses	16
SinTechSVS: A Singing Technique Controllable Singing Voice Synthesis System	16
Selective Acoustic Feature Enhancement for Speech Emotion Recognition With Noisy Speech	16
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation	16
Joint Dual Learning With Mutual Information Maximization for Natural Language Understanding and Generation in Dialogues	16
NoiseBandNet: Controllable Time-Varying Neural Synthesis of Sound Effects Using Filterbanks	16
Enhancing Low-Resource NLP by Consistency Training With Data and Model Perturbations	16
Interpretable Spectrum Transformation Attacks to Speaker Recognition Systems	16
JoinER-BART: Joint Entity and Relation Extraction With Constrained Decoding, Representation Reuse and Fusion	16
ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks	16
Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition	16
Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR	16
WDSRL: Multi-Domain Neural Machine Translation With Word-Level Domain-Sensitive Representation Learning	15
A Variance-Preserving Interpolation Approach for Diffusion Models With Applications to Single Channel Speech Enhancement and Recognition	15
Localization-Driven Speech Enhancement in Noisy Multi-Speaker Hospital Environments Using Deep Learning and Meta Learning	15
Adaptive Ensemble Self-Distillation With Consistent Gradients for Fast Inference of Pretrained Language Models	15
STFF-SM: Steganalysis Model Based on Spatial and Temporal Feature Fusion for Speech Streams	15
Statistically Guided Near-End Speech Intelligibility Improvement Through Voice Transformation and Transfer Learning	15
Music Source Separation With Band-Split RNN	15
LegoNN: Building Modular Encoder-Decoder Models	15
STFT-Domain Neural Speech Enhancement With Very Low Algorithmic Latency	15
A Two-Stage Deep Representation Learning-Based Speech Enhancement Method Using Variational Autoencoder and Adversarial Training	14
USDnet: Unsupervised Speech Dereverberation via Neural Forward Filtering	14
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research	14
Hierarchical Topic-Aware Contextualized Transformers	14
Lost in Context? On the Sense-Wise Variance of Contextualized Word Embeddings	14
Sound Events Localization and Detection Using Bio-Inspired Gammatone Filters and Temporal Convolutional Neural Networks	14
The Power of Fragmentation: A Hierarchical Transformer Model for Structural Segmentation in Symbolic Music Generation	14
Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning	14
Multi-Channel Conversational Speaker Separation via Neural Diarization	14
Learning Multi-Dimensional Speaker Localization: Axis Partitioning, Unbiased Label Distribution, and Data Augmentation	14
Training-Based Multiple Source Tracking Using Manifold-Learning and Recursive Expectation-Maximization	14
PQG-A2SA: Performance Quantification Guided Audio-to-Score Alignment for Orchestral Music	14
Assessing the Generalization Gap of Learning-Based Speech Enhancement Systems in Noisy and Reverberant Environments	14
Boosting Cross-Domain Speech Recognition With Self-Supervision	14
Recent Trends in Deep Learning Based Textual Emotion Cause Extraction	14
On the Quantization of Neural Models for Speaker Verification	14
Real-Time Multichannel Deep Speech Enhancement in Hearing Aids: Comparing Monaural and Binaural Processing in Complex Acoustic Scenarios	14
Hiformer: Sequence Modeling Networks With Hierarchical Attention Mechanisms	14
Analysis and Design of Head-Tracked Compensation for Bilateral Ambisonics	14
A Multi-Level Supervised Contrastive Learning Framework for Low-Resource Natural Language Inference	14
Robust Subband Adaptive Filter Algorithms-Based Mixture Correntropy and Application to Acoustic Echo Cancellation	13
Can Pretrained English Language Models Benefit Non-English NLP Systems in Low-Resource Scenarios?	13
Zero-Shot Cross-Lingual Named Entity Recognition via Progressive Multi-Teacher Distillation	13
PoLyScriber: Integrated Fine-Tuning of Extractor and Lyrics Transcriber for Polyphonic Music	13
Attention and DCT Based Global Context Modeling for Text-Independent Speaker Recognition	13
Steganalysis of AMR Speech Stream Based on Multi-Domain Information Fusion	13
AmbiSep: Joint Ambisonic-to-Ambisonic Speech Separation and Noise Reduction	13
Contrastive Self-Supervised Speaker Embedding With Sequential Disentanglement	13
Syntax-Aware Data Augmentation for Neural Machine Translation	13
Multi-Task Multi-Attention Transformer for Generative Named Entity Recognition	13
CausalABSC: Causal Inference for Aspect Debiasing in Aspect-Based Sentiment Classification	13
BioPRO: Context-Infused Prompt Learning for Biomedical Entity Linking	13
Overview of the Tenth Dialog System Technology Challenge: DSTC10	13
Datastore Distillation for Nearest Neighbor Machine Translation	13
Inter-Frequency Phase Difference for Phase Reconstruction Using Deep Neural Networks and Maximum Likelihood	13
Dynamic Processing Neural Network Architecture for Hearing Loss Compensation	12
M3S: Scene Graph Driven Multi-Granularity Multi-Task Learning for Multi-Modal NER	12
Masked Graph Learning With Recurrent Alignment for Multimodal Emotion Recognition in Conversation	12
A Dynamic Convolution Framework for Session-Independent Speaker Embedding Learning	12
Oracle Teacher: Leveraging Target Information for Better Knowledge Distillation of CTC Models	12
Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation	12
BaSFormer: A Balanced Sparsity Regularized Attention Network for Transformer	12
APNet: An All-Frame-Level Neural Vocoder Incorporating Direct Prediction of Amplitude and Phase Spectra	12
Boosting Short Text Classification by Solving the OOV Problem	12
Distributed Sensor Selection for Speech Enhancement With Acoustic Sensor Networks	12
Local Periodicity-Based Beat Tracking for Expressive Classical Piano Music	12
CQT-Based Cepstral Features for Classification of Normal vs. Pathological Infant Cry	12
USAT: A Universal Speaker-Adaptive Text-to-Speech Approach	12
TF-CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation	12
Audio Embedding-Aware Dialogue Policy Learning	12
Detecting the Presence of Sperm Whales' Echolocation Clicks in Noisy Environments	12
Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition	12
Unsupervised Speech Enhancement Using Optimal Transport and Speech Presence Probability	12
Topic-Oriented Dialogue Summarization	12
Decoupling Speaker-Independent Emotions for Voice Conversion via Source-Filter Networks	12
Harmonic-Aware Frequency and Time Attention for Automatic Piano Transcription	12
Decomposition and Reorganization of Phonetic Information for Speaker Embedding Learning	12
Unsupervised Face-Masked Speech Enhancement Using Generative Adversarial Networks With Human-in-the-Loop Assessment Metrics	12
Cacophony: An Improved Contrastive Audio-Text Model	11
Enhancing Multimodal Entity and Relation Extraction With Variational Information Bottleneck	11
A Novel Unsupervised Approach for Cross-Lingual Word Alignment in Low Isomorphic Embedding Spaces	11
Sparsity-Promoting Affine Projection Algorithm With Periodically-Updated Gain Matrix and Its Performance Analysis	11
Exploring the Role of Language Families for Building Indic Speech Synthesisers	11
An Interpretable Deep Mutual Information Curriculum Metric for a Robust and Generalized Speech Emotion Recognition System	11
Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement	11
Low-Latency Active Noise Control Using Attentive Recurrent Network	11
Compression of Higher-Order Ambisonic Signals Using Directional Audio Coding	11