IEEE-ACM Transactions on Audio Speech and Language Processing

Papers
(The TQCC of IEEE-ACM Transactions on Audio Speech and Language Processing is 7. The table below lists the papers whose CrossRef citation counts exceed that threshold [max. 250 papers]. It covers papers published in the past four years, i.e., from 2020-07-01 to 2024-07-01.)
Article | Citations
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | 632
Pre-Training With Whole Word Masking for Chinese BERT | 508
An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning | 140
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech | 138
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation | 126
CTNet: Conversational Transformer Network for Emotion Recognition | 116
FSD50K: An Open Dataset of Human-Labeled Sound Events | 113
Dense CNN With Self-Attention for Time-Domain Speech Enhancement | 98
Wavesplit: End-to-End Speech Separation by Speaker Clustering | 96
Two Heads are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement | 85
SoundStream: An End-to-End Neural Audio Codec | 68
PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation | 61
Investigating Typed Syntactic Dependencies for Targeted Sentiment Classification Using Graph Attention Neural Network | 60
Gated Recurrent Fusion With Joint Training Framework for Robust End-to-End Speech Recognition | 54
Robust Sound Source Tracking Using SRP-PHAT and 3D Convolutional Neural Networks | 53
Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019 | 53
Analyzing Multimodal Sentiment Via Acoustic- and Visual-LSTM With Channel-Aware Temporal Convolution Network | 49
The Detection of Parkinson's Disease From Speech Using Voice Source Information | 48
FluentNet: End-to-End Detection of Stuttered Speech Disfluencies With Deep Learning | 45
AudioLM: A Language Modeling Approach to Audio Generation | 45
Towards Model Compression for Deep Learning Based Speech Enhancement | 44
Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation | 42
Speech Enhancement Using Multi-Stage Self-Attentive Temporal Convolutional Networks | 38
A Cross-Entropy-Guided Measure (CEGM) for Assessing Speech Recognition Performance and Optimizing DNN-Based Speech Enhancement | 38
Multiple Source Direction of Arrival Estimations Using Relative Sound Pressure Based MUSIC | 38
Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog | 37
Speech Emotion Recognition Considering Nonverbal Vocalization in Affective Conversations | 35
Expressive TTS Training With Frame and Style Reconstruction Loss | 34
Audio-Visual Deep Neural Network for Robust Person Verification | 32
MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis | 32
Information Fusion in Attention Networks Using Adaptive and Multi-Level Factorized Bilinear Pooling for Audio-Visual Emotion Recognition | 32
Multimodal Sentiment Analysis With Two-Phase Multi-Task Learning | 31
Neural Spectrospatial Filtering | 31
Transfer Learning From Speech Synthesis to Voice Conversion With Non-Parallel Training Data | 31
Nearest Kronecker Product Decomposition Based Linear-in-The-Parameters Nonlinear Filters | 31
High-Resolution Piano Transcription With Pedals by Regressing Onset and Offset Times | 30
Steering Study of Linear Differential Microphone Arrays | 30
Recent Progress in the CUHK Dysarthric Speech Recognition System | 30
ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild | 29
Block-Based High Performance CNN Architectures for Frame-Level Overlapping Speech Detection | 29
Fast End-to-End Speech Recognition Via Non-Autoregressive Models and Cross-Modal Knowledge Transferring From BERT | 28
Multi-Task Sequence Tagging for Emotion-Cause Pair Extraction Via Tag Distribution Refinement | 28
Domain Invariant Feature Learning for Speaker-Independent Speech Emotion Recognition | 28
Objective Measures of Perceptual Audio Quality Reviewed: An Evaluation of Their Application Domain Dependence | 28
A Unified Target-Oriented Sequence-to-Sequence Model for Emotion-Cause Pair Extraction | 28
Any-to-Many Voice Conversion With Location-Relative Sequence-to-Sequence Modeling | 28
Diffsound: Discrete Diffusion Model for Text-to-Sound Generation | 27
Modified Magnitude-Phase Spectrum Information for Spoofing Detection | 27
Multi-Classifier Interactive Learning for Ambiguous Speech Emotion Recognition | 26
Towards Duration Robust Weakly Supervised Sound Event Detection | 26
Pretraining Techniques for Sequence-to-Sequence Voice Conversion | 25
Zero-Shot Audio Classification Via Semantic Embeddings | 25
Multi-View Speech Emotion Recognition Via Collective Relation Construction | 25
Towards Robust Speech Super-Resolution | 25
DUMA: Reading Comprehension With Transposition Thinking | 24
DBT-Net: Dual-Branch Federative Magnitude and Phase Estimation With Attention-in-Attention Transformer for Monaural Speech Enhancement | 24
Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training | 23
Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech | 23
Self-Attending RNN for Speech Enhancement to Improve Cross-Corpus Generalization | 23
Optimal Output-Constrained Active Noise Control Based on Inverse Adaptive Modeling Leak Factor Estimate | 22
Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones | 22
Multimodal Emotion Recognition With Temporal and Semantic Consistency | 22
Speech Enhancement and Dereverberation With Diffusion-Based Generative Models | 22
Encoder-Decoder Based Attractors for End-to-End Neural Diarization | 22
Kronecker Product Multichannel Linear Filtering for Adaptive Weighted Prediction Error-Based Speech Dereverberation | 22
Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features | 22
Exploiting Adapters for Cross-Lingual Low-Resource Speech Recognition | 22
Neural Network Adaptation and Data Augmentation for Multi-Speaker Direction-of-Arrival Estimation | 21
Neonatal Bowel Sound Detection Using Convolutional Neural Network and Laplace Hidden Semi-Markov Model | 21
A Joint Diagonalization Based Efficient Approach to Underdetermined Blind Audio Source Separation Using the Multichannel Wiener Filter | 21
Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders | 21
A Wave Digital Newton-Raphson Method for Virtual Analog Modeling of Audio Circuits with Multiple One-Port Nonlinearities | 20
ISNet: Individual Standardization Network for Speech Emotion Recognition | 20
A Four-Stage Data Augmentation Approach to ResNet-Conformer Based Acoustic Modeling for Sound Event Localization and Detection | 20
S-Vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder | 20
Contrastive Information Extraction With Generative Transformer | 20
Reinforcement Learning-Based Dialogue Guided Event Extraction to Exploit Argument Relations | 19
PhaseDCN: A Phase-Enhanced Dual-Path Dilated Convolutional Network for Single-Channel Speech Enhancement | 19
Robust Q-Gradient Subband Adaptive Filter for Nonlinear Active Noise Control | 19
Exploiting Morphological and Phonological Features to Improve Prosodic Phrasing for Mongolian Speech Synthesis | 19
Drone Audition: Sound Source Localization Using On-Board Microphones | 19
High-Order Pair-Wise Aspect and Opinion Terms Extraction With Edge-Enhanced Syntactic Graph Convolution | 19
Group Communication With Context Codec for Lightweight Source Separation | 19
Receptive Field Regularization Techniques for Audio Classification and Tagging With Deep Convolutional Neural Networks | 18
Exploiting Temporal Context in CNN Based Multisource DOA Estimation | 18
Comparison of Feature Extraction Methods for Sound-Based Classification of Honey Bee Activity | 18
Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation | 18
Systematic Review of Machine Learning Approaches for Detecting Developmental Stuttering | 18
Fast Generation of Sound Zones Using Variable Span Trade-Off Filters in the DFT-Domain | 18
Cascaded Random Fourier Filter for Robust Nonlinear Active Noise Control | 17
Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System | 17
On the Robustness of the Superdirective Beamformer | 17
The Weighted Cross-Modal Attention Mechanism With Sentiment Prediction Auxiliary Task for Multimodal Sentiment Analysis | 17
Multi-Source DOA Estimation in Reverberant Environments by Jointing Detection and Modeling of Time-Frequency Points | 17
SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection | 17
Multi-Tone Phase Coding of Interaural Time Difference for Sound Source Localization With Spiking Neural Networks | 17
Affine Projection Algorithm Over Acoustic Sensor Networks for Active Noise Control | 17
Determined BSS Based on Time-Frequency Masking and Its Application to Harmonic Vector Analysis | 17
Fundamental Approaches to Robust Differential Beamforming With High Directivity Factors | 17
Beamforming with Cube Microphone Arrays Via Kronecker Product Decompositions | 16
Improving Chinese Named Entity Recognition by Large-Scale Syntactic Dependency Graph | 16
Desynchronization Attacks Resilient Watermarking Method Based on Frequency Singular Value Coefficient Modification | 16
Insights Into Deep Non-Linear Filters for Improved Multi-Channel Speech Enhancement | 16
Speech Emotion Recognition Using Sequential Capsule Networks | 16
Many-to-Many Voice Transformer Network | 16
Deep Selective Memory Network With Selective Attention and Inter-Aspect Modeling for Aspect Level Sentiment Classification | 15
Efficient Combinatorial Optimization for Word-Level Adversarial Textual Attack | 15
On the Design of Differential Kronecker Product Beamformers | 15
Convolutive Transfer Function-Based Multichannel Nonnegative Matrix Factorization for Overdetermined Blind Source Separation | 15
Binaural Reproduction Based on Bilateral Ambisonics and Ear-Aligned HRTFs | 15
Audio-Visual Multi-Channel Integration and Recognition of Overlapped Speech | 15
Multi-Channel Multi-Frame ADL-MVDR for Target Speech Separation | 15
Deep Normalization for Speaker Vectors | 15
Identification of Room Acoustic Impulse Responses via Kronecker Product Decompositions | 15
Room Acoustical Parameter Estimation From Room Impulse Responses Using Deep Neural Networks | 15
End-to-End Speaker Verification via Curriculum Bipartite Ranking Weighted Binary Cross-Entropy | 14
Selective Listening by Synchronizing Speech With Lips | 14
A Time-Frequency Attention Module for Neural Speech Enhancement | 14
Detection of Multiple Steganography Methods in Compressed Speech Based on Code Element Embedding, Bi-LSTM and CNN With Attention Mechanisms | 14
TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation | 14
Robust Voice Feature Selection Using Interval Type-2 Fuzzy AHP for Automated Diagnosis of Parkinson's Disease | 14
Multi-Source Domain Adaptation for Text-Independent Forensic Speaker Recognition | 14
Enhancing Segment-Based Speech Emotion Recognition by Iterative Self-Learning | 14
LSBert: Lexical Simplification Based on BERT | 14
Proximal Normalized Subband Adaptive Filtering for Acoustic Echo Cancellation | 14
Music Source Separation With Band-Split RNN | 14
Quasi-Periodic Parallel WaveGAN: A Non-Autoregressive Raw Waveform Generative Model With Pitch-Dependent Dilated Convolution Neural Network | 14
Sarcasm Detection with Commonsense Knowledge | 14
Deformable Self-Attention for Text Classification | 14
Generating Images From Spoken Descriptions | 13
A Novel Approach for Improved Noise Reduction Performance in Feed-Forward Active Noise Control Systems With (Loudspeaker) Saturation Non-Linearity in the Secondary Path | 13
Diverse Distractor Generation for Constructing High-Quality Multiple Choice Questions | 13
On Improved Training of CNN for Acoustic Source Localisation | 13
Enhancement of Noisy Reverberant Speech Using Polynomial Matrix Eigenvalue Decomposition | 13
Multichannel Blind Source Separation Based on Evanescent-Region-Aware Non-Negative Tensor Factorization in Spherical Harmonic Domain | 13
Double-Cross-Correlation Processing for Blind Sampling-Rate and Time-Offset Estimation | 13
Learning Deep Direct-Path Relative Transfer Function for Binaural Sound Source Localization | 13
Phoneme-Unit-Specific Time-Delay Neural Network for Speaker Verification | 13
Improved Lite Audio-Visual Speech Enhancement | 13
Neural Cascade Architecture for Multi-Channel Acoustic Echo Suppression | 13
Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis | 13
Hierarchical Neighbor Propagation With Bidirectional Graph Attention Network for Relation Prediction | 13
Knowing Where to Leverage: Context-Aware Graph Convolutional Network With an Adaptive Fusion Layer for Contextual Spoken Language Understanding | 13
TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition | 13
Language-Independent Approach for Automatic Computation of Vowel Articulation Features in Dysarthric Speech Assessment | 13
Speech Intelligibility Prediction Using Spectro-Temporal Modulation Analysis | 13
End-to-End Speech Recognition: A Survey | 13
SNR-Based Features and Diverse Training Data for Robust DNN-Based Speech Enhancement | 13
Mixed Source Sound Field Translation for Virtual Binaural Application With Perceptual Validation | 13
Meta-Learning With Latent Space Clustering in Generative Adversarial Network for Speaker Diarization | 12
Affine-Projection-Like Maximum Correntropy Criteria Algorithm for Robust Active Noise Control | 12
Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech | 12
Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation | 12
Direction of Arrival Estimation of Sound Sources Using Icosahedral CNNs | 12
Speech Reconstruction With Reminiscent Sound Via Visual Voice Memory | 12
Differentiable Artificial Reverberation | 12
Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning | 12
Automatic Lyrics Transcription of Polyphonic Music With Lyrics-Chord Multi-Task Learning | 12
Unsupervised Speech Segmentation and Variable Rate Representation Learning Using Segmental Contrastive Predictive Coding | 12
Robust Subband Adaptive Filter Algorithms-Based Mixture Correntropy and Application to Acoustic Echo Cancellation | 12
A Study on Reference Microphone Selection for Multi-Microphone Speech Enhancement | 12
Modeling Future Cost for Neural Machine Translation | 12
SIFTER: A Framework for Robust Rumor Detection | 12
M3S: Scene Graph Driven Multi-Granularity Multi-Task Learning for Multi-Modal NER | 12
Neural Cascade Architecture With Triple-Domain Loss for Speech Enhancement | 12
Generation of Personal Sound Zones With Physical Meaningful Constraints and Conjugate Gradient Method | 12
Spatial Active Noise Control Based on Kernel Interpolation of Sound Field | 12
Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition | 11
Scalable and Efficient Neural Speech Coding: A Hybrid Design | 11
Multi-Channel Talker-Independent Speaker Separation Through Location-Based Training | 11
Multiple Acoustic Source Localization in Microphone Array Networks | 11
Sparsity-Based Audio Declipping Methods: Selected Overview, New Algorithms, and Large-Scale Evaluation | 11
Meta-AF: Meta-Learning for Adaptive Filters | 11
Bayesian Learning of LF-MMI Trained Time Delay Neural Networks for Speech Recognition | 11
Decoupled Multiple Speaker Direction-of-Arrival Estimator Under Reverberant Environments | 11
Improving Skip-Gram Embeddings Using BERT | 11
Cognitive Load Estimation From Speech Commands to Simulated Aircraft | 11
Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models | 11
TDOA-Based Robust Sound Source Localization With Sparse Regularization in Wireless Acoustic Sensor Networks | 11
Distributed Combined Acoustic Echo Cancellation and Noise Reduction in Wireless Acoustic Sensor and Actuator Networks | 11
On the Design of Sparse Arrays With Frequency-Invariant Beam Pattern | 11
A Deep Adaptation Network for Speech Enhancement: Combining a Relativistic Discriminator With Multi-Kernel Maximum Mean Discrepancy | 11
Inference Skipping for More Efficient Real-Time Speech Enhancement With Parallel RNNs | 11
ETEH: Unified Attention-Based End-to-End ASR and KWS Architecture | 11
Acoustic Source Localization in the Circular Harmonic Domain Using Deep Learning Architecture | 10
Extracting and Predicting Word-Level Style Variations for Speech Synthesis | 10
PROTOTYPE-TO-STYLE: Dialogue Generation With Style-Aware Editing on Retrieval Memory | 10
Filtering and Refining: A Collaborative-Style Framework for Single-Channel Speech Enhancement | 10
Sensor Selection for Relative Acoustic Transfer Function Steered Linearly-Constrained Beamformers | 10
Low Latency Speech Enhancement for Hearing Aids Using Deep Filtering | 10
Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model With Pitch-Dependent Dilated Convolution Neural Network | 10
A Joint Speech Enhancement and Self-Supervised Representation Learning Framework for Noise-Robust Speech Recognition | 10
Computation of Spherical Harmonic Representations of Source Directivity Based on the Finite-Distance Signature | 10
Converting Foreign Accent Speech Without a Reference | 10
Conditioned Source Separation for Musical Instrument Performances | 10
Wave Digital Modeling and Implementation of Nonlinear Audio Circuits With Nullors | 10
A Novel Loss Function and Training Strategy for Noise-Robust Keyword Spotting | 10
BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations | 10
Squared Sine Adaptive Algorithm and Its Performance Analysis | 10
ARoBERT: An ASR Robust Pre-Trained Language Model for Spoken Language Understanding | 10
Nonlinear Spatial Filtering in Multichannel Speech Enhancement | 10
Controlling Elevation and Azimuth Beamwidths With Concentric Circular Microphone Arrays | 10
SBSim: A Sentence-BERT Similarity-Based Evaluation Metric for Indian Language Neural Machine Translation Systems | 10
SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing | 10
A Joint Model for Named Entity Recognition With Sentence-Level Entity Type Attentions | 10
Improved Speech Enhancement Considering Speech PSD Uncertainty | 10
Evolving Multi-Resolution Pooling CNN for Monaural Singing Voice Separation | 10
USEV: Universal Speaker Extraction With Visual Cue | 10
Domain-Shift Conditioning Using Adaptable Filtering Via Hierarchical Embeddings for Robust Chinese Spell Check | 9
Localization-Driven Speech Enhancement in Noisy Multi-Speaker Hospital Environments Using Deep Learning and Meta Learning | 9
Passive Geometry Calibration for Microphone Arrays Based on Distributed Damped Newton Optimization | 9
Relation Extraction in Dialogues: A Deep Learning Model Based on the Generality and Specialty of Dialogue Text | 9
Adaptive Adapters: An Efficient Way to Incorporate BERT Into Neural Machine Translation | 9
Chinese Lexical Simplification | 9
Improving Automatic Speech Recognition and Speech Translation via Word Embedding Prediction | 9
Multi-Turn Dialogue Reading Comprehension With Pivot Turns and Knowledge | 9
A Digital Twin Architecture for Wireless Networked Adaptive Active Noise Control | 9
Generalized Hyperbolic Tangent Based Random Fourier Conjugate Gradient Filter for Nonlinear Active Noise Control | 9
Bayesian Learning for Deep Neural Network Adaptation | 9
MuseMorphose: Full-Song and Fine-Grained Piano Music Style Transfer With One Transformer VAE | 9
Bayesian Neural Network Language Modeling for Speech Recognition | 9
From LSAT: The Progress and Challenges of Complex Reasoning | 9
Privacy and Utility of X-Vector Based Speaker Anonymization | 9
Keyword Search Using Attention-Based End-to-End ASR and Frame-Synchronous Phoneme Alignments | 9
U-Shaped Transformer With Frequency-Band Aware Attention for Speech Enhancement | 9
On the Design of 3D Steerable Beamformers With Uniform Concentric Circular Microphone Arrays | 9
Review and Arrange: Curriculum Learning for Natural Language Understanding | 9
Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation | 9
Mixture Representation Learning for Deep Speaker Embedding | 9
Counterfactually Fair Automatic Speech Recognition | 9
Learning Speech Emotion Representations in the Quaternion Domain | 9
Novel Architectures for Unsupervised Information Bottleneck Based Speaker Diarization of Meetings | 9
Reconfigurable Nonuniform Filter Bank for Hearing Aid Systems | 9
A Time-Domain Real-Valued Generalized Wiener Filter for Multi-Channel Neural Separation Systems | 9
DNN-Based Mask Estimation for Distributed Speech Enhancement in Spatially Unconstrained Microphone Arrays | 9
Language Agnostic Speaker Embedding for Cross-Lingual Personalized Speech Generation | 8
Flexibly Focusing on Supporting Facts, Using Bridge Links, and Jointly Training Specialized Modules for Multi-Hop Question Answering | 8
Adaptive Convolution for Semantic Role Labeling | 8
Word-Region Alignment-Guided Multimodal Neural Machine Translation | 8
Regularized Phrase-Based Topic Model for Automatic Question Classification With Domain-Agnostic Class Labels | 8
A Room Impulse Response Measurement Method Robust Towards Nonlinearities Based on Orthogonal Periodic Sequences | 8
EfficientTDNN: Efficient Architecture Search for Speaker Recognition | 8
iEmoTTS: Toward Robust Cross-Speaker Emotion Transfer and Control for Speech Synthesis Based on Disentanglement Between Prosody and Timbre | 8
Audio-Based Piano Performance Evaluation for Beginners With Convolutional Neural Network and Attention Mechanism | 8
Similarity Measurement of Segment-Level Speaker Embeddings in Speaker Diarization | 8
STFT-Domain Neural Speech Enhancement With Very Low Algorithmic Latency | 8
Switching Independent Vector Analysis and its Extension to Blind and Spatially Guided Convolutional Beamforming Algorithms | 8
Improved Acoustic Word Embeddings for Zero-Resource Languages Using Multilingual Transfer | 8
Directly Comparing the Listening Strategies of Humans and Machines | 8
Retrieve-and-Edit Domain Adaptation for End2End Aspect Based Sentiment Analysis | 8