IEEE-ACM Transactions on Audio Speech and Language Processing

Papers
(The TQCC of IEEE-ACM Transactions on Audio Speech and Language Processing is 15. The table below lists the papers that are at or above that threshold, based on CrossRef citation counts [max. 250 papers]. Only papers published in the past four years, i.e., from 2022-06-01 to 2026-06-01, are included.)
Article | Citations
Representation Learning With Hidden Unit Clustering for Low Resource Speech Applications | 340
Decorrelation in Feedback Delay Networks | 251
CET2: Modelling Topic Transitions for Coherent and Engaging Knowledge-Grounded Conversations | 237
WDEA: The Structure and Semantic Fusion With Wasserstein Distance for Low-Resource Language Entity Alignment | 185
$\mathcal{P}$owMix: A Versatile Regularizer for Multimodal Sentiment Analysis | 178
Towards Generating Diverse Audio Captions via Adversarial Training | 171
MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation | 164
Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition | 136
DropAttack: A Random Dropped Weight Attack Adversarial Training for Natural Language Understanding | 130
Reverberant Source Separation Using NTF With Delayed Subsources and Spatial Priors | 117
Audio-Only Phonetic Segment Classification Using Embeddings Learned From Audio and Ultrasound Tongue Imaging Data | 98
Envelope-Based Multichannel Noise Reduction for Cochlear Implant Applications | 88
Generalizing Speaker Verification for Spoof Awareness in the Embedding Space | 84
Attention-Based Speech Enhancement Using Human Quality Perception Modeling | 82
Efficient Lightweight Speaker Verification With Broadcasting CNN-Transformer and Knowledge Distillation Training of Self-Attention Maps | 77
Learning Discriminative Representations and Decision Boundaries for Open Intent Detection | 71
Multi-Channel to Multi-Channel Noise Reduction and Reverberant Speech Preservation in Time-Varying Acoustic Scenes for Binaural Reproduction | 70
Improvement of Accent Classification Models Through Grad-Transfer From Spectrograms and Gradient-Weighted Class Activation Mapping | 62
A User-Centric Approach for Deep Residual-Echo Suppression in Double-Talk | 61
Enhancing Robustness of Speech Watermarking Using a Transformer-Based Framework Exploiting Acoustic Features | 60
Review of Methods for Automatic Speaker Verification | 60
AudioLM: A Language Modeling Approach to Audio Generation | 55
The VoxCeleb Speaker Recognition Challenge: A Retrospective | 53
Adaptive Multi-Domain Dialogue State Tracking on Spoken Conversations | 50
COVID-19 Detection via Fusion of Modulation Spectrum and Linear Prediction Speech Features | 48
Label-Correction Capsule Network for Hierarchical Text Classification | 48
IEEE Signal Processing Society Information | 47
Pronunciation Dictionary-Free Multilingual Speech Synthesis Using Learned Phonetic Representations | 42
Emotion Prediction Oriented Method With Multiple Supervisions for Emotion-Cause Pair Extraction | 41
Complex-Domain Pitch Estimation Algorithm for Narrowband Speech Signals | 41
ReZero: Region-Customizable Sound Extraction | 41
Disentangled Text Representation Learning With Information-Theoretic Perspective for Adversarial Robustness | 40
Knowledge-Guided Transformer for Joint Theme and Emotion Classification of Chinese Classical Poetry | 39
Spherically Steerable Vector Differential Microphone Arrays | 39
Enhanced Multi-Domain Dialogue State Tracker With Second-Order Slot Interactions | 38
Exploiting Low-Rank Tensor-Train Deep Neural Networks Based on Riemannian Gradient Descent With Illustrations of Speech Processing | 38
Implicit Self-Supervised Language Representation for Spoken Language Diarization | 38
Distinctive and Natural Speaker Anonymization via Singular Value Transformation-Assisted Matrix | 37
Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation | 37
SPEC: Summary Preference Decomposition for Low-Resource Abstractive Summarization | 36
Audio-Visual Cross-Attention Network for Robotic Speaker Tracking | 36
Predicting Level-Dependent Changes in Concurrent Vowel Scores Using the 2D-CNN Models | 35
Phrase-Aware Financial Sentiment Analysis Based on Constituent Syntax | 35
Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach | 35
Integrated Syntactic and Semantic Tree for Targeted Sentiment Classification Using Dual-Channel Graph Convolutional Network | 35
End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations | 34
Sound Field Estimation Based on Physics-Constrained Kernel Interpolation Adapted to Environment | 34
Hate Speech Detection via Dual Contrastive Learning | 34
Blind Identification of Ambisonic Reduced Room Impulse Response | 34
Source Separation of Piano Concertos Using Musically Motivated Augmentation Techniques | 34
MusicYOLO: A Vision-Based Framework for Automatic Singing Transcription | 33
Howling Detection and Gain Control for Speech Reinforcement in a Noisy Car Cabin Environment | 33
Artificial Vocal Learning Guided by Phoneme Recognition and Visual Information | 33
Generalized Hyperbolic Tangent Based Random Fourier Conjugate Gradient Filter for Nonlinear Active Noise Control | 32
Weighted Frequency Smoothing for Enhanced Speaker Localization | 32
Multi-Layer Combined Frequency and Periodicity Representations for Multi-Pitch Estimation of Multi-Instrument Music | 32
Training a Singing Transcription Model Using Connectionist Temporal Classification Loss and Cross-Entropy Loss | 32
Grouped Feedback Delay Networks With Frequency-Dependent Coupling | 31
Learning Dynamic and Static Representations for Extrapolation-Based Temporal Knowledge Graph Reasoning | 31
TOE: A Grid-Tagging Discontinuous NER Model Enhanced by Embedding Tag/Word Relations and More Fine-Grained Tags | 31
Learning to Perturb for Contrastive Learning of Unsupervised Sentence Representations | 30
DiCLET-TTS: Diffusion Model Based Cross-Lingual Emotion Transfer for Text-to-Speech — A Study Between English and Mandarin | 30
Multi-Grained Evidence Inference for Multi-Choice Reading Comprehension | 30
FlowHash: Accelerating Audio Search With Balanced Hashing via Normalizing Flow | 30
Tackling Interpretability in Audio Classification Networks With Non-negative Matrix Factorization | 30
Neural Multi-Channel and Multi-Microphone Acoustic Echo Cancellation | 30
$F0$ Estimation and Voicing Detection With Cascade Architecture in Noisy Speech | 30
Improving Speech Translation Accuracy and Time Efficiency With Fine-Tuned wav2vec 2.0-Based Speech Segmentation | 29
Magnitude-Corrected and Time-Aligned Interpolation of Head-Related Transfer Functions | 29
Cross-Modal Interaction via Reinforcement Feedback for Audio-Lyrics Retrieval | 29
Query-Efficient Black-Box Adversarial Attacks on Automatic Speech Recognition | 29
PE-Wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS | 29
Unsupervised Music Source Separation Using Differentiable Parametric Source Models | 29
Unsupervised Disentanglement Learning Model for Exemplar-Guided Paraphrase Generation | 28
Towards Recognition for Radio-Echo Speech in Air Traffic Control: Dataset and a Contrastive Learning Approach | 28
An Analysis of Traditional Noise Power Spectral Density Estimators Based on the Gaussian Stochastic Volatility Model | 28
A New Diffusion Filtered-X Affine Projection Algorithm: Performance Analysis and Application in Windy Environment | 27
Visually Grounded Few-Shot Word Learning in Low-Resource Settings | 27
Anti-Aliasing Speech DOA Estimation Under Spatial Aliasing Conditions | 27
Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization | 27
Refining History for Future-Aware Neural Machine Translation | 27
CoNeTTE: An Efficient Audio Captioning System Leveraging Multiple Datasets With Task Embedding | 26
Multi-Channel Speech Separation Using Spatially Selective Deep Non-Linear Filters | 26
Joint Multiscale Cross-Lingual Speaking Style Transfer With Bidirectional Attention Mechanism for Automatic Dubbing | 26
U-Shaped Transformer With Frequency-Band Aware Attention for Speech Enhancement | 26
Attention-Based Encoder-Decoder End-to-End Neural Diarization With Embedding Enhancer | 26
Configurable EBEN: Extreme Bandwidth Extension Network to Enhance Body-Conducted Speech Capture | 26
Unified Instance and Knowledge Alignment Pretraining for Aspect-Based Sentiment Analysis | 26
The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation | 26
Direction of Arrival Estimation of Sound Sources Using Icosahedral CNNs | 25
Audio-Visual End-to-End Multi-Channel Speech Separation, Dereverberation and Recognition | 25
Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information | 25
Decomposition-Based Wiener Filter Using the Kronecker Product and Conjugate Gradient Method | 25
Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting | 25
Enhancing Conformer-Based Sound Event Detection Using Frequency Dynamic Convolutions and BEATs Audio Embeddings | 25
Scalable-Complexity Steered Response Power Based on Low-Rank and Sparse Interpolation | 25
DiaPer: End-to-End Neural Diarization With Perceiver-Based Attractors | 25
ASiT: Local-Global Audio Spectrogram Vision Transformer for Event Classification | 24
The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance | 24
TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition | 23
Active Discovering New Slots for Task-Oriented Conversation | 23
Higher-Order Stereophony | 23
Optimal Modal Decomposition for Directionally Biased Sound Field Recording | 23
Contrastive Learning for Target Speaker Extraction With Attention-Based Fusion | 23
A Perceptually Evaluated Signal Model: Collisions Between a Vibrating Object and an Obstacle | 23
Distance Metric-Based Open-Set Domain Adaptation for Speaker Verification | 23
Coefficients-Switched Normalized Least-Mean-Squares Adaption in Echo Canceler of Sparse-Echo-Path | 23
Hybrid-Frequency-Resolution Adaptive Kalman Filter for Online Identification of Long Acoustic Responses With Low Input-Output Latency | 22
EchoScan: Scanning Complex Room Geometries via Acoustic Echoes | 22
Minimum Processing Near-End Listening Enhancement | 22
Towards Improved Objective Perceptual Audio Quality Assessment - Part 1: A Novel Data-Driven Cognitive Model | 22
Enhancing Paraphrase Question Generation With Prior Knowledge | 22
A CTC Alignment-Based Non-Autoregressive Transformer for End-to-End Automatic Speech Recognition | 22
TriSAT: Trimodal Representation Learning for Multimodal Sentiment Analysis | 22
Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction With Multi-Modal Embeddings | 21
A Composite T60 Regression and Classification Approach for Speech Dereverberation | 21
BEHM-GAN: Bandwidth Extension of Historical Music Using Generative Adversarial Networks | 21
Consonant-Vowel Transition Models Based on Deep Learning for Objective Evaluation of Articulation | 21
Multi-Source Discriminant Subspace Alignment for Cross-Domain Speech Emotion Recognition | 21
A New Virtual Tracking Sub-Algorithm Based Hybrid Active Control System for Narrowband Noise With Impulsive Interference | 21
Improving Non-Autoregressive Translation Quality With Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC | 21
Meta-AF: Meta-Learning for Adaptive Filters | 20
MOSA: Music Motion With Semantic Annotation Dataset for Cross-Modal Music Processing | 20
Speech Enhancement and Dereverberation With Diffusion-Based Generative Models | 20
Statistical Analysis for Speaker Recognition Evaluation With Data Dependence and Three Score Distributions | 20
Multi-Source Localization Using Optimized Time-Frequency Representation and Sparsity Component Analysis | 20
MRC-PASCL: A Few-Shot Machine Reading Comprehension Approach via Post-Training and Answer Span-Oriented Contrastive Learning | 20
On the Predictive Power of Objective Intelligibility Metrics for the Subjective Performance of Deep Complex Convolutional Recurrent Speech Enhancement Networks | 20
ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations | 20
Empathetic Response Generation Based on Plug-and-Play Mechanism With Empathy Perturbation | 20
A Flexible Architecture Using Temporal, Spatial and Semantic Correlation-Based Algorithms for Story Segmentation of Broadcast News | 20
Latent-Domain Predictive Neural Speech Coding | 20
Heterogeneous-Graph Reasoning With Context Paraphrase for Commonsense Question Answering | 20
List of Reviewers | 20
SpeechLM: Enhanced Speech Pre-Training With Unpaired Textual Data | 20
Block-Based Perceptually Adaptive Sound Zones With Reproduction Error Constraints | 19
En-HACN: Enhancing Hybrid Architecture With Fast Attention and Capsule Network for End-to-end Speech Recognition | 19
Decomposed Meta-Learning for Few-Shot Sequence Labeling | 19
EmoInt-Trans: A Multimodal Transformer for Identifying Emotions and Intents in Social Conversations | 19
Operation-Augmented Numerical Reasoning for Question Answering | 19
Interrelate Training and Clustering for Online Speaker Diarization | 19
Cross-Domain Aspect-Based Sentiment Classification With Tripartite Graph Modeling | 19
Joint Maximum Likelihood Estimation of Microphone Array Parameters for a Reverberant Single Source Scenario | 18
LMD: A Learnable Mask Network to Detect Adversarial Examples for Speaker Verification | 18
FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection | 18
Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis | 18
Uncertainty-Driven Knowledge Distillation for Language Model Compression | 18
RBA-GCN: Relational Bilevel Aggregation Graph Convolutional Network for Emotion Recognition | 18
FxLMS/F Based Tap Decomposed Adaptive Filter for Decentralized Active Noise Control System | 18
Multi-Cue Guided Semi-Supervised Learning Toward Target Speaker Separation in Real Environments | 18
JMS-QA: A Joint Hierarchical Architecture for Mental Health Question Answering | 18
Segment-Less Continuous Speech Separation of Meetings: Training and Evaluation Criteria | 18
Dynamic Prompt-Driven Zero-Shot Relation Extraction | 18
Gradformer: A Framework for Multi-Aspect Multi-Granularity Pronunciation Assessment | 18
Dynamic Convolutional Neural Networks as Efficient Pre-Trained Audio Models | 18
Distributed Microphone Array Localization Problem via SDP-SOCP Method | 17
Controllable Dialogue Generation With Disentangled Multi-Grained Style Specification and Attribute Consistency Reward | 17
Towards Unified Multi-Domain Machine Translation With Mixture of Domain Experts | 17
Automatic Detection of Speech Sound Disorder in Cantonese-Speaking Pre-School Children | 17
Multi-Task Attentive Residual Networks for Argument Mining | 17
Data-Centric Methods for Environmental Sound Classification With Limited Labels | 17
Artist Similarity Based on Heterogeneous Graph Neural Networks | 17
Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR | 16
Direct and Residual Subspace Decomposition of Spatial Room Impulse Responses | 16
SinTechSVS: A Singing Technique Controllable Singing Voice Synthesis System | 16
Selective Acoustic Feature Enhancement for Speech Emotion Recognition With Noisy Speech | 16
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation | 16
Joint Dual Learning With Mutual Information Maximization for Natural Language Understanding and Generation in Dialogues | 16
NoiseBandNet: Controllable Time-Varying Neural Synthesis of Sound Effects Using Filterbanks | 16
Enhancing Low-Resource NLP by Consistency Training With Data and Model Perturbations | 16
Interpretable Spectrum Transformation Attacks to Speaker Recognition Systems | 16
JoinER-BART: Joint Entity and Relation Extraction With Constrained Decoding, Representation Reuse and Fusion | 16
ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks | 16
Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition | 16
STFT-Domain Neural Speech Enhancement With Very Low Algorithmic Latency | 15
WDSRL: Multi-Domain Neural Machine Translation With Word-Level Domain-Sensitive Representation Learning | 15
Localization-Driven Speech Enhancement in Noisy Multi-Speaker Hospital Environments Using Deep Learning and Meta Learning | 15
A Variance-Preserving Interpolation Approach for Diffusion Models With Applications to Single Channel Speech Enhancement and Recognition | 15
Adaptive Ensemble Self-Distillation With Consistent Gradients for Fast Inference of Pretrained Language Models | 15
STFF-SM: Steganalysis Model Based on Spatial and Temporal Feature Fusion for Speech Streams | 15
Music Source Separation With Band-Split RNN | 15
Statistically Guided Near-End Speech Intelligibility Improvement Through Voice Transformation and Transfer Learning | 15
LegoNN: Building Modular Encoder-Decoder Models | 15