IEEE/ACM Transactions on Audio, Speech, and Language Processing

Papers
(The H4-Index of IEEE/ACM Transactions on Audio, Speech, and Language Processing is 36. The table below lists the papers that meet or exceed that threshold, based on CrossRef citation counts [max. 250 papers]. It covers publications from the past four years, i.e., from 2020-11-01 to 2024-11-01.)
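For reference, the H4-Index is the h-index restricted to a four-year publication window: the largest number h such that h of the journal's papers from that window have at least h citations each. A minimal sketch of that computation in Python, using a made-up citation list rather than the journal's actual data:

def h_index(citations):
    # Largest h such that h papers have at least h citations each.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have >= rank citations
        else:
            break
    return h

# Illustrative input, not real journal data: the h-index here is 3,
# since three papers have >= 3 citations but no four have >= 4.
print(h_index([10, 8, 5, 3, 1]))  # -> 3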
Article | Citations
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | 891
Pre-Training With Whole Word Masking for Chinese BERT | 589
An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning | 166
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech | 155
FSD50K: An Open Dataset of Human-Labeled Sound Events | 147
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation | 145
CTNet: Conversational Transformer Network for Emotion Recognition | 129
SoundStream: An End-to-End Neural Audio Codec | 119
Wavesplit: End-to-End Speech Separation by Speaker Clustering | 107
Dense CNN With Self-Attention for Time-Domain Speech Enhancement | 106
Two Heads Are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement | 97
AudioLM: A Language Modeling Approach to Audio Generation | 91
PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation | 73
Investigating Typed Syntactic Dependencies for Targeted Sentiment Classification Using Graph Attention Neural Network | 66
Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019 | 66
Robust Sound Source Tracking Using SRP-PHAT and 3D Convolutional Neural Networks | 58
Gated Recurrent Fusion With Joint Training Framework for Robust End-to-End Speech Recognition | 57
The Detection of Parkinson's Disease From Speech Using Voice Source Information | 55
ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild | 55
Analyzing Multimodal Sentiment Via Acoustic- and Visual-LSTM With Channel-Aware Temporal Convolution Network | 54
Diffsound: Discrete Diffusion Model for Text-to-Sound Generation | 50
Towards Model Compression for Deep Learning Based Speech Enhancement | 50
FluentNet: End-to-End Detection of Stuttered Speech Disfluencies With Deep Learning | 49
Speech Enhancement and Dereverberation With Diffusion-Based Generative Models | 48
Multi-Microphone Complex Spectral Mapping for Utterance-Wise and Continuous Speech Separation | 46
Neural Spectrospatial Filtering | 41
A Cross-Entropy-Guided Measure (CEGM) for Assessing Speech Recognition Performance and Optimizing DNN-Based Speech Enhancement | 40
Multiple Source Direction of Arrival Estimations Using Relative Sound Pressure Based MUSIC | 40
Speech Enhancement Using Multi-Stage Self-Attentive Temporal Convolutional Networks | 40
Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog | 39
High-Resolution Piano Transcription With Pedals by Regressing Onset and Offset Times | 39
Speech Emotion Recognition Considering Nonverbal Vocalization in Affective Conversations | 38
Multimodal Sentiment Analysis With Two-Phase Multi-Task Learning | 38
Information Fusion in Attention Networks Using Adaptive and Multi-Level Factorized Bilinear Pooling for Audio-Visual Emotion Recognition | 37
Audio-Visual Deep Neural Network for Robust Person Verification | 37
Fast End-to-End Speech Recognition Via Non-Autoregressive Models and Cross-Modal Knowledge Transferring From BERT | 36
Recent Progress in the CUHK Dysarthric Speech Recognition System | 36
MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis | 36