IEEE-ACM Transactions on Audio Speech and Language Processing

Papers
(The TQCC of IEEE-ACM Transactions on Audio Speech and Language Processing is 8. The table below lists the papers above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2020-11-01 to 2024-11-01.)
Article | Citations
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | 891
Pre-Training With Whole Word Masking for Chinese BERT | 589
An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning | 166
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech | 155
FSD50K: An Open Dataset of Human-Labeled Sound Events | 147
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation | 145
CTNet: Conversational Transformer Network for Emotion Recognition | 129
SoundStream: An End-to-End Neural Audio Codec | 119
Wavesplit: End-to-End Speech Separation by Speaker Clustering | 107
Dense CNN With Self-Attention for Time-Domain Speech Enhancement | 106
Two Heads are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement | 97
AudioLM: A Language Modeling Approach to Audio Generation | 91
PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation | 73
Investigating Typed Syntactic Dependencies for Targeted Sentiment Classification Using Graph Attention Neural Network | 66
Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019 | 66
Robust Sound Source Tracking Using SRP-PHAT and 3D Convolutional Neural Networks | 58
Gated Recurrent Fusion With Joint Training Framework for Robust End-to-End Speech Recognition | 57
ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild | 55
The Detection of Parkinson's Disease From Speech Using Voice Source Information | 55
Analyzing Multimodal Sentiment Via Acoustic- and Visual-LSTM With Channel-Aware Temporal Convolution Network | 54
Towards Model Compression for Deep Learning Based Speech Enhancement | 50
Diffsound: Discrete Diffusion Model for Text-to-Sound Generation | 50
FluentNet: End-to-End Detection of Stuttered Speech Disfluencies With Deep Learning | 49
Speech Enhancement and Dereverberation With Diffusion-Based Generative Models | 48
Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation | 46
Neural Spectrospatial Filtering | 41
Speech Enhancement Using Multi-Stage Self-Attentive Temporal Convolutional Networks | 40
A Cross-Entropy-Guided Measure (CEGM) for Assessing Speech Recognition Performance and Optimizing DNN-Based Speech Enhancement | 40
Multiple Source Direction of Arrival Estimations Using Relative Sound Pressure Based MUSIC | 40
Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog | 39
High-Resolution Piano Transcription With Pedals by Regressing Onset and Offset Times | 39
Speech Emotion Recognition Considering Nonverbal Vocalization in Affective Conversations | 38
Multimodal Sentiment Analysis With Two-Phase Multi-Task Learning | 38
Information Fusion in Attention Networks Using Adaptive and Multi-Level Factorized Bilinear Pooling for Audio-Visual Emotion Recognition | 37
Audio-Visual Deep Neural Network for Robust Person Verification | 37
Fast End-to-End Speech Recognition Via Non-Autoregressive Models and Cross-Modal Knowledge Transferring From BERT | 36
Recent Progress in the CUHK Dysarthric Speech Recognition System | 36
MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis | 36
Steering Study of Linear Differential Microphone Arrays | 35
End-to-End Speech Recognition: A Survey | 35
Expressive TTS Training With Frame and Style Reconstruction Loss | 35
Any-to-Many Voice Conversion With Location-Relative Sequence-to-Sequence Modeling | 35
TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation | 34
Nearest Kronecker Product Decomposition Based Linear-in-The-Parameters Nonlinear Filters | 34
Transfer Learning From Speech Synthesis to Voice Conversion With Non-Parallel Training Data | 33
Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders | 33
Objective Measures of Perceptual Audio Quality Reviewed: An Evaluation of Their Application Domain Dependence | 32
Domain Invariant Feature Learning for Speaker-Independent Speech Emotion Recognition | 32
A Four-Stage Data Augmentation Approach to ResNet-Conformer Based Acoustic Modeling for Sound Event Localization and Detection | 32
Music Source Separation With Band-Split RNN | 31
Multi-Task Sequence Tagging for Emotion-Cause Pair Extraction Via Tag Distribution Refinement | 30
A Unified Target-Oriented Sequence-to-Sequence Model for Emotion-Cause Pair Extraction | 30
Exploiting Adapters for Cross-Lingual Low-Resource Speech Recognition | 30
Modified Magnitude-Phase Spectrum Information for Spoofing Detection | 30
Block-Based High Performance CNN Architectures for Frame-Level Overlapping Speech Detection | 30
Towards Duration Robust Weakly Supervised Sound Event Detection | 29
Zero-Shot Audio Classification Via Semantic Embeddings | 29
Encoder-Decoder Based Attractors for End-to-End Neural Diarization | 28
Multimodal Emotion Recognition With Temporal and Semantic Consistency | 27
Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features | 27
Pretraining Techniques for Sequence-to-Sequence Voice Conversion | 27
Self-Attending RNN for Speech Enhancement to Improve Cross-Corpus Generalization | 27
Multi-Classifier Interactive Learning for Ambiguous Speech Emotion Recognition | 27
Towards Robust Speech Super-Resolution | 27
Neural Network Adaptation and Data Augmentation for Multi-Speaker Direction-of-Arrival Estimation | 26
Multi-View Speech Emotion Recognition Via Collective Relation Construction | 26
DBT-Net: Dual-Branch Federative Magnitude and Phase Estimation With Attention-in-Attention Transformer for Monaural Speech Enhancement | 25
Insights Into Deep Non-Linear Filters for Improved Multi-Channel Speech Enhancement | 25
Kronecker Product Multichannel Linear Filtering for Adaptive Weighted Prediction Error-Based Speech Dereverberation | 25
Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training | 25
SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection | 25
Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech | 25
DUMA: Reading Comprehension With Transposition Thinking | 25
ISNet: Individual Standardization Network for Speech Emotion Recognition | 24
Selective Listening by Synchronizing Speech With Lips | 24
Comparison of Feature Extraction Methods for Sound-Based Classification of Honey Bee Activity | 23
Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones | 23
Speech Emotion Recognition Using Sequential Capsule Networks | 22
Exploiting Morphological and Phonological Features to Improve Prosodic Phrasing for Mongolian Speech Synthesis | 22
Optimal Output-Constrained Active Noise Control Based on Inverse Adaptive Modeling Leak Factor Estimate | 22
Contrastive Information Extraction With Generative Transformer | 22
Drone Audition: Sound Source Localization Using On-Board Microphones | 22
Robust Q-Gradient Subband Adaptive Filter for Nonlinear Active Noise Control | 22
A Joint Diagonalization Based Efficient Approach to Underdetermined Blind Audio Source Separation Using the Multichannel Wiener Filter | 22
Neonatal Bowel Sound Detection Using Convolutional Neural Network and Laplace Hidden Semi-Markov Model | 21
Group Communication With Context Codec for Lightweight Source Separation | 21
High-Order Pair-Wise Aspect and Opinion Terms Extraction With Edge-Enhanced Syntactic Graph Convolution | 21
Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation | 21
The Weighted Cross-Modal Attention Mechanism With Sentiment Prediction Auxiliary Task for Multimodal Sentiment Analysis | 20
A Wave Digital Newton-Raphson Method for Virtual Analog Modeling of Audio Circuits with Multiple One-Port Nonlinearities | 20
Improving Chinese Named Entity Recognition by Large-Scale Syntactic Dependency Graph | 20
Desynchronization Attacks Resilient Watermarking Method Based on Frequency Singular Value Coefficient Modification | 20
Systematic Review of Machine Learning Approaches for Detecting Developmental Stuttering | 20
Receptive Field Regularization Techniques for Audio Classification and Tagging With Deep Convolutional Neural Networks | 20
Multi-Tone Phase Coding of Interaural Time Difference for Sound Source Localization With Spiking Neural Networks | 20
Reinforcement Learning-Based Dialogue Guided Event Extraction to Exploit Argument Relations | 19
Exploiting Temporal Context in CNN Based Multisource DOA Estimation | 19
S-Vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder | 19
Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System | 19
On the Robustness of the Superdirective Beamformer | 19
Hierarchical Neighbor Propagation With Bidirectional Graph Attention Network for Relation Prediction | 19
Affine Projection Algorithm Over Acoustic Sensor Networks for Active Noise Control | 19
Multi-Source DOA Estimation in Reverberant Environments by Jointing Detection and Modeling of Time-Frequency Points | 19
Determined BSS Based on Time-Frequency Masking and Its Application to Harmonic Vector Analysis | 19
Fundamental Approaches to Robust Differential Beamforming With High Directivity Factors | 19
PhaseDCN: A Phase-Enhanced Dual-Path Dilated Convolutional Network for Single-Channel Speech Enhancement | 19
A Time-Frequency Attention Module for Neural Speech Enhancement | 19
Cascaded Random Fourier Filter for Robust Nonlinear Active Noise Control | 19
SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing | 18
Squared Sine Adaptive Algorithm and Its Performance Analysis | 18
Identification of Room Acoustic Impulse Responses via Kronecker Product Decompositions | 18
Mixed Source Sound Field Translation for Virtual Binaural Application With Perceptual Validation | 18
Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition | 18
Proximal Normalized Subband Adaptive Filtering for Acoustic Echo Cancellation | 18
Convolutive Transfer Function-Based Multichannel Nonnegative Matrix Factorization for Overdetermined Blind Source Separation | 18
Sarcasm Detection with Commonsense Knowledge | 18
Binaural Reproduction Based on Bilateral Ambisonics and Ear-Aligned HRTFs | 18
Beamforming with Cube Microphone Arrays Via Kronecker Product Decompositions | 18
Fast Generation of Sound Zones Using Variable Span Trade-Off Filters in the DFT-Domain | 18
Detection of Multiple Steganography Methods in Compressed Speech Based on Code Element Embedding, Bi-LSTM and CNN With Attention Mechanisms | 17
USEV: Universal Speaker Extraction With Visual Cue | 17
Multi-Channel Multi-Frame ADL-MVDR for Target Speech Separation | 17
Many-to-Many Voice Transformer Network | 17
Neural Cascade Architecture for Multi-Channel Acoustic Echo Suppression | 17
Multi-Source Domain Adaptation for Text-Independent Forensic Speaker Recognition | 17
On the Design of Differential Kronecker Product Beamformers | 17
Room Acoustical Parameter Estimation From Room Impulse Responses Using Deep Neural Networks | 17
BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations | 16
SIFTER: A Framework for Robust Rumor Detection | 16
M3S: Scene Graph Driven Multi-Granularity Multi-Task Learning for Multi-Modal NER | 16
Efficient Combinatorial Optimization for Word-Level Adversarial Textual Attack | 16
End-to-End Speaker Verification via Curriculum Bipartite Ranking Weighted Binary Cross-Entropy | 16
Affine-Projection-Like Maximum Correntropy Criteria Algorithm for Robust Active Noise Control | 16
Deep Selective Memory Network With Selective Attention and Inter-Aspect Modeling for Aspect Level Sentiment Classification | 16
Deep Normalization for Speaker Vectors | 16
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | 16
Spatial Active Noise Control Based on Kernel Interpolation of Sound Field | 16
Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis | 16
Enhancing Segment-Based Speech Emotion Recognition by Iterative Self-Learning | 16
Deformable Self-Attention for Text Classification | 16
LSBert: Lexical Simplification Based on BERT | 15
Robust Subband Adaptive Filter Algorithms-Based Mixture Correntropy and Application to Acoustic Echo Cancellation | 15
Quasi-Periodic Parallel WaveGAN: A Non-Autoregressive Raw Waveform Generative Model With Pitch-Dependent Dilated Convolution Neural Network | 15
Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech | 15
Parametric Ambisonic Encoding of Arbitrary Microphone Arrays | 15
Speech Reconstruction With Reminiscent Sound Via Visual Voice Memory | 15
StoRM: A Diffusion-Based Stochastic Regeneration Model for Speech Enhancement and Dereverberation | 15
Phoneme-Unit-Specific Time-Delay Neural Network for Speaker Verification | 15
Audio-Visual Multi-Channel Integration and Recognition of Overlapped Speech | 15
Improving Skip-Gram Embeddings Using BERT | 14
A Novel Loss Function and Training Strategy for Noise-Robust Keyword Spotting | 14
A Study on Reference Microphone Selection for Multi-Microphone Speech Enhancement | 14
Robust Voice Feature Selection Using Interval Type-2 Fuzzy AHP for Automated Diagnosis of Parkinson's Disease | 14
Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning | 14
Multichannel Blind Source Separation Based on Evanescent-Region-Aware Non-Negative Tensor Factorization in Spherical Harmonic Domain | 14
Knowing Where to Leverage: Context-Aware Graph Convolutional Network With an Adaptive Fusion Layer for Contextual Spoken Language Understanding | 14
Learning Deep Direct-Path Relative Transfer Function for Binaural Sound Source Localization | 14
The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance | 14
STFT-Domain Neural Speech Enhancement With Very Low Algorithmic Latency | 14
Diverse Distractor Generation for Constructing High-Quality Multiple Choice Questions | 14
Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation | 14
SNR-Based Features and Diverse Training Data for Robust DNN-Based Speech Enhancement | 14
Meta-AF: Meta-Learning for Adaptive Filters | 14
Direction of Arrival Estimation of Sound Sources Using Icosahedral CNNs | 14
A Time-Domain Real-Valued Generalized Wiener Filter for Multi-Channel Neural Separation Systems | 14
Speech Intelligibility Prediction Using Spectro-Temporal Modulation Analysis | 14
On Improved Training of CNN for Acoustic Source Localisation | 14
Differentiable Artificial Reverberation | 14
Conditioned Source Separation for Musical Instrument Performances | 14
Automatic Lyrics Transcription of Polyphonic Music With Lyrics-Chord Multi-Task Learning | 14
Generating Images From Spoken Descriptions | 13
Unsupervised Speech Segmentation and Variable Rate Representation Learning Using Segmental Contrastive Predictive Coding | 13
Language-Independent Approach for Automatic Computation of Vowel Articulation Features in Dysarthric Speech Assessment | 13
Improved Lite Audio-Visual Speech Enhancement | 13
Meta-Learning With Latent Space Clustering in Generative Adversarial Network for Speaker Diarization | 13
Filtering and Refining: A Collaborative-Style Framework for Single-Channel Speech Enhancement | 13
Non-Autoregressive ASR Modeling Using Pre-Trained Language Models for Chinese Speech Recognition | 13
A Deep Adaptation Network for Speech Enhancement: Combining a Relativistic Discriminator With Multi-Kernel Maximum Mean Discrepancy | 13
TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition | 13
Learning Speech Emotion Representations in the Quaternion Domain | 13
Inference Skipping for More Efficient Real-Time Speech Enhancement With Parallel RNNs | 13
Double-Cross-Correlation Processing for Blind Sampling-Rate and Time-Offset Estimation | 13
Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation | 13
Generation of Personal Sound Zones With Physical Meaningful Constraints and Conjugate Gradient Method | 13
A Novel Approach for Improved Noise Reduction Performance in Feed-Forward Active Noise Control Systems With (Loudspeaker) Saturation Non-Linearity in the Secondary Path | 13
ETEH: Unified Attention-Based End-to-End ASR and KWS Architecture | 13
Enhancement of Noisy Reverberant Speech Using Polynomial Matrix Eigenvalue Decomposition | 13
TDOA-Based Robust Sound Source Localization With Sparse Regularization in Wireless Acoustic Sensor Networks | 13
Sensor Selection for Relative Acoustic Transfer Function Steered Linearly-Constrained Beamformers | 12
Neural Cascade Architecture With Triple-Domain Loss for Speech Enhancement | 12
Decoupled Multiple Speaker Direction-of-Arrival Estimator Under Reverberant Environments | 12
On the Design of 3D Steerable Beamformers With Uniform Concentric Circular Microphone Arrays | 12
Audio-Visual Cross-Attention Network for Robotic Speaker Tracking | 12
Modeling Future Cost for Neural Machine Translation | 12
A Digital Twin Architecture for Wireless Networked Adaptive Active Noise Control | 12
SkipConvGAN: Monaural Speech Dereverberation Using Generative Adversarial Networks via Complex Time-Frequency Masking | 12
Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model With Pitch-Dependent Dilated Convolution Neural Network | 12
ARoBERT: An ASR Robust Pre-Trained Language Model for Spoken Language Understanding | 12
Multiple Acoustic Source Localization in Microphone Array Networks | 12
Acoustic Source Localization in the Circular Harmonic Domain Using Deep Learning Architecture | 12
Towards Energy-Preserving Natural Language Understanding With Spiking Neural Networks | 12
Multi-Channel Talker-Independent Speaker Separation Through Location-Based Training | 12
Localization-Driven Speech Enhancement in Noisy Multi-Speaker Hospital Environments Using Deep Learning and Meta Learning | 12
Scalable and Efficient Neural Speech Coding: A Hybrid Design | 12
MuseMorphose: Full-Song and Fine-Grained Piano Music Style Transfer With One Transformer VAE | 12
A Joint Speech Enhancement and Self-Supervised Representation Learning Framework for Noise-Robust Speech Recognition | 12
Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models | 12
Computation of Spherical Harmonic Representations of Source Directivity Based on the Finite-Distance Signature | 11
Adaptive Adapters: An Efficient Way to Incorporate BERT Into Neural Machine Translation | 11
Extracting and Predicting Word-Level Style Variations for Speech Synthesis | 11
SoundBeam: Target Sound Extraction Conditioned on Sound-Class Labels and Enrollment Clues for Increased Performance and Continuous Learning | 11
Distributed Combined Acoustic Echo Cancellation and Noise Reduction in Wireless Acoustic Sensor and Actuator Networks | 11
Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition | 11
Deep Learning Approaches in Topics of Singing Information Processing | 11
Domain-Shift Conditioning Using Adaptable Filtering Via Hierarchical Embeddings for Robust Chinese Spell Check | 11
Cognitive Load Estimation From Speech Commands to Simulated Aircraft | 11
Sparsity-Based Audio Declipping Methods: Selected Overview, New Algorithms, and Large-Scale Evaluation | 11
Nonlinear Spatial Filtering in Multichannel Speech Enhancement | 11
Incorporating BERT With Probability-Aware Gate for Spoken Language Understanding | 11
Bayesian Learning of LF-MMI Trained Time Delay Neural Networks for Speech Recognition | 11
U-Shaped Transformer With Frequency-Band Aware Attention for Speech Enhancement | 11
Generalized Hyperbolic Tangent Based Random Fourier Conjugate Gradient Filter for Nonlinear Active Noise Control | 11
Bayesian Neural Network Language Modeling for Speech Recognition | 11
On the Design of Sparse Arrays With Frequency-Invariant Beam Pattern | 11
Converting Foreign Accent Speech Without a Reference | 11
Novel Architectures for Unsupervised Information Bottleneck Based Speaker Diarization of Meetings | 11
Privacy and Utility of X-Vector Based Speaker Anonymization | 11
A Joint Model for Named Entity Recognition With Sentence-Level Entity Type Attentions | 11
Language Agnostic Speaker Embedding for Cross-Lingual Personalized Speech Generation | 11
Evolving Multi-Resolution Pooling CNN for Monaural Singing Voice Separation | 11
Improving Automatic Speech Recognition and Speech Translation via Word Embedding Prediction | 10
Towards Contextual Spelling Correction for Customization of End-to-End Speech Recognition Systems | 10
Improved Speech Enhancement Considering Speech PSD Uncertainty | 10
TOE: A Grid-Tagging Discontinuous NER Model Enhanced by Embedding Tag/Word Relations and More Fine-Grained Tags | 10
From LSAT: The Progress and Challenges of Complex Reasoning | 10
PROTOTYPE-TO-STYLE: Dialogue Generation With Style-Aware Editing on Retrieval Memory | 10
Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation | 10
Chinese Lexical Simplification | 10
Low Latency Speech Enhancement for Hearing Aids Using Deep Filtering | 10
End-to-End Multi-Modal Speech Recognition on an Air and Bone Conducted Speech Corpus | 10
Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition | 10
Controlling Elevation and Azimuth Beamwidths With Concentric Circular Microphone Arrays | 10
SBSim: A Sentence-BERT Similarity-Based Evaluation Metric for Indian Language Neural Machine Translation Systems | 10
Reconfigurable Nonuniform Filter Bank for Hearing Aid Systems | 10
DNN-Based Mask Estimation for Distributed Speech Enhancement in Spatially Unconstrained Microphone Arrays | 10
Preordering Encoding on Transformer for Translation | 10
Bayesian Learning for Deep Neural Network Adaptation | 10
Word-Region Alignment-Guided Multimodal Neural Machine Translation | 10
Retrieve-and-Edit Domain Adaptation for End2End Aspect Based Sentiment Analysis | 10
AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining | 10