Language Resources and Evaluation

Papers
(The TQCC of Language Resources and Evaluation is 3. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. Only papers published in the past four years, i.e., from 2021-08-01 to 2025-08-01, are included.)
Article | Citations
Strategies for managing time and costs in speech corpus creation: insights from the Slovenian ARTUR corpus | 61
Spelling errors made by people with dyslexia | 38
Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks | 37
Commonsense based text mining on urban policy | 28
From LIMA to DeepLIMA: following a new path of interoperability | 26
Speech acts in the Dutch COVID-19 Press Conferences | 23
A survey on geocoding: algorithms and datasets for toponym resolution | 23
Hope speech detection in Spanish | 19
Investigating the role of swear words in abusive language detection tasks | 18
AC-IQuAD: Automatically Constructed Indonesian Question Answering Dataset by Leveraging Wikidata | 17
The narratives of war (NoW) corpus of written testimonies of the Russia-Ukraine war | 16
The Visual Language Research Corpus (VLRC): an annotated corpus of comics from Asia, Europe, and the United States | 15
Prompting encoder models for zero-shot classification: a cross-domain study in Italian | 14
Brazilian Portuguese corpora for teaching and translation: the CoMET project | 14
Spontaneous, controlled acts of reference between friends and strangers | 13
Construction of Amharic information retrieval resources and corpora | 11
A new evaluation method: evaluation data and metrics for Chinese grammatical error correction | 11
Understanding conversational interaction in multiparty conversations: the EVA Corpus | 11
A study on methods for revising dependency treebanks: in search of gold | 10
Corpus tools for parallel corpora of theatre plays: an introduction to TAligner and ACM-theatre | 10
Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning | 10
Assessing linguistic generalisation in language models: a dataset for Brazilian Portuguese | 9
CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese | 8
Human–machine interaction in building an English reference dataset for natural language processing tasks | 8
Utilizing phonetic similarity for cross-source and cross-language toponym matching: a benchmark and prototype | 8
LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI | 8
adaptNMT: an open-source, language-agnostic development environment for neural machine translation | 8
Sentiment analysis in Portuguese tweets: an evaluation of diverse word representation models | 7
Perspectivist approaches to natural language processing: a survey | 7
UHated: hate speech detection in Urdu language using transfer learning | 7
Automatic readability assessment for sentences: neural, hybrid and large language models | 7
A comparative analysis of encoder only and decoder only models in intent classification and sentiment analysis: navigating the trade-offs in model size and performance | 7
DoSLex: automatic generation of all domain semantically rich sentiment lexicon | 6
TCMeta: a multilingual dataset of COVID tweets for relation-level metaphor analysis | 6
The Sanskrit Sembank | 6
Chinese-DiMLex: a lexicon of Chinese discourse connectives | 6
An integrated framework for emotion and sentiment analysis in Tamil and Malayalam visual content | 6
Slovenian parliamentary corpus siParl | 6
Conversion of the Spanish WordNet databases into a Prolog-readable format | 6
Uzbek news corpus for named entity recognition | 6
VeLeSpa: An inflected verbal lexicon of Peninsular Spanish and a quantitative analysis of paradigmatic predictability | 5
Language resources for clinical linguistics: introduction to the special issue | 5
Managing, storing, and sharing long-form recordings and their annotations | 5
Open source platform for Estonian speech transcription | 5
The WASABI song corpus and knowledge graph for music lyrics analysis | 5
Ulysses Tesemõ: a new large corpus for Brazilian legal and governmental domain | 5
Benchmarking Hindi-to-English direct speech-to-speech translation with synthetic data | 5
Identifying communicative functions in discourse with content types | 5
Sense through time: diachronic word sense annotations for word sense induction and Lexical Semantic Change Detection | 4
The ParlaMint corpora of parliamentary proceedings | 4
KurdiSent: a corpus for kurdish sentiment analysis | 4
ArgRewrite V.2: an annotated argumentative revisions corpus | 4
Constructing a cross-document event coreference corpus for Dutch | 4
Studying word meaning evolution through incremental semantic shift detection | 4
Correction to: Two sepedi‑english code‑switched speech corpora | 4
Developing and testing syllabification systems for South African Sesotho | 4
A corpus of English learners with Arabic and Hebrew backgrounds | 4
Multi-task learning for multi-dialect Arabic sentiment classification and sarcasm detection | 4
PolitePEER: does peer review hurt? A dataset to gauge politeness intensity in the peer reviews | 4
Harnessing Indigenous Tweets: The Reo Māori Twitter corpus | 4
Abstractive text summarization and new large-scale datasets for agglutinative languages Turkish and Hungarian | 4
kidsNARRATE: a versatile corpus for studying Chinese-english bilingual L2 narrative skills in preschoolers | 4
The Hmong Medical Corpus: a biomedical corpus for a minority language | 3
Correction to: Resources for Turkish natural language processing: A critical survey | 3
Finnish parliament ASR corpus | 3
Design and construction of Guayaquil radio speech corpus (CHARG) | 3
Using BERT models for breast cancer diagnosis from Turkish radiology reports | 3
DILLo: an Italian lexical database for speech-language pathologists | 3
Correction: COLLIE: a broad-coverage ontology and lexicon of verbs in English | 3
Benchmark of public intent recognition services | 3
Correction: Cross-linguistically consistent semantic and syntactic annotation of child-directed speech | 3
Creation of a gold standard Dutch corpus of clinical notes for adverse drug event detection: the Dutch ADE corpus | 3
FullStop: punctuation and segmentation prediction for Dutch with transformers | 3
OMCD: Offensive Moroccan Comments Dataset | 3
Correction to: Semi-automation of gesture annotation by machine learning and human collaboration | 3
Sentiment analysis dataset in Moroccan dialect: bridging the gap between Arabic and Latin scripted dialect | 3
The limitations of irony detection in Dutch social media | 3
Part of speech (POS) tagging in Roman Urdu: datasets and models | 3