Language Resources and Evaluation

(The median citation count of Language Resources and Evaluation is 1. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2020-07-01 to 2024-07-01.)
Title | Citations
Resources and benchmark corpora for hate speech detection: a systematic review | 132
Machine translation systems and quality assessment: a systematic review | 51
DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text | 39
Investigating the effects of gender, dialect, and training size on the performance of Arabic speech recognition | 22
A comparative evaluation and analysis of three generations of Distributional Semantic Models | 21
The Natural Stories corpus: a reading-time corpus of English texts containing rare syntactic constructions | 21
The ParlaMint corpora of parliamentary proceedings | 17
A large English–Thai parallel corpus from the web and machine-generated text | 16
Current limitations in cyberbullying detection: On evaluation criteria, reproducibility, and data scarcity | 16
SENTiVENT: enabling supervised information extraction of company-specific events in economic and financial news | 15
Low resource language specific pre-processing and features for sentiment analysis task | 15
Automatic genre identification: a survey | 15
Introducing the Gab Hate Corpus: defining and applying hate-based rhetoric to social media posts at scale | 14
Roman Urdu toxic comment classification | 12
Machine translation in society: insights from UK users | 11
AI2D-RST: a multimodal corpus of 1000 primary school science diagrams | 11
Developing computational infrastructure for the CorCenCC corpus: The National Corpus of Contemporary Welsh | 8
The Electronic Corpus of 17th- and 18th-century Polish Texts | 8
Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks | 8
The impact of preprocessing on word embedding quality: a comparative study | 7
Multiple annotation for biodiversity: developing an annotation framework among biology, linguistics and text technology | 7
LDC-IL: The Indian repository of resources for language technology | 7
Comparative performance of ensemble machine learning for Arabic cyberbullying and offensive language detection | 6
Resources for Turkish dependency parsing: introducing the BOUN Treebank and the BoAT annotation tool | 6
Improvement of sentiment analysis via re-evaluation of objective words in SenticNet for hotel reviews | 6
Linguistic resources for paraphrase generation in portuguese: a lexicon-grammar approach | 6
MEmoFC: introducing the Multilingual Emotional Football Corpus | 6
The KAS corpus of Slovenian academic writing | 6
Commonsense based text mining on urban policy | 6
Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents | 6
A large and evolving cognate database | 6
Towards alignment strategies in human-agent interactions based on measures of lexical repetitions | 5
SetembroBR: a social media corpus for depression and anxiety disorder prediction | 5
TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese | 5
Investigating the role of swear words in abusive language detection tasks | 5
Abstractive text summarization and new large-scale datasets for agglutinative languages Turkish and Hungarian | 5
Constructing Arabic Reading Comprehension Datasets: Arabic WikiReading and KaifLematha | 4
Representing variation in a spoken corpus of an endangered dialect: the case of Torlak | 4
Labelling the past: data set creation and multi-label classification of Dutch archaeological excavation reports | 4
A multi-source entity-level sentiment corpus for the financial domain: the FinLin corpus | 4
Resources for Turkish natural language processing: A critical survey | 4
Finnish parliament ASR corpus | 4
Detecting explicit lyrics: a case study in Italian music | 4
PRAUTOCAL corpus: a corpus for the study of Down syndrome prosodic aspects | 3
Annotating affective dimensions in user-generated content | 3
Arabic real time entity resolution using inverted indexing | 3
Nonverbal communication with emojis in social media: dissociating hedonic intensity from frequency | 3
A Spanish dataset for reproducible benchmarked offline handwriting recognition | 3
Sentence boundary detection of various forms of Tunisian Arabic | 3
Content-free speech activity records: interviews with people with schizophrenia | 3
Register identification from the unrestricted open Web using the Corpus of Online Registers of English | 3
Unparalleled sarcasm: a framework of parallel deep LSTMs with cross activation functions towards detection and generation of sarcastic statements | 3
TuLeD (Tupían lexical database): introducing a database of a South American language family | 3
LexO: an open-source system for managing OntoLex-Lemon resources | 3
The robotic-surgery propositional bank | 3
Semi-automation of gesture annotation by machine learning and human collaboration | 3
DISCO PAL: Diachronic Spanish sonnet corpus with psychological and affective labels | 3
LanguageCrawl: a generic tool for building language models upon common Crawl | 3
Semantics-aware typographical choices via affective associations | 3
Two languages, one treebank: building a Turkish–German code-switching treebank and its challenges | 3
Predicting lexical complexity in English texts: the Complex 2.0 dataset | 3
The WASABI song corpus and knowledge graph for music lyrics analysis | 3
Making the most of comparable corpora in Neural Machine Translation: a case study | 3
Modelling multi-level prosody and spectral features using deep neural network for an automatic tonal and non-tonal pre-classification-based Indian language identification system | 3
Assessment of pragmatic abilities and cognitive substrates (APACS) brief remote: a novel tool for the rapid and tele-evaluation of pragmatic skills in Italian | 2
Redundancy and coverage aware enriched dragonfly-FL single document summarization | 2
Label modification and bootstrapping for zero-shot cross-lingual hate speech detection | 2
ArgRewrite V.2: an annotated argumentative revisions corpus | 2
Understanding conversational interaction in multiparty conversations: the EVA Corpus | 2
Harnessing Indigenous Tweets: The Reo Māori Twitter corpus | 2
FinnSentiment: a Finnish social media corpus for sentiment polarity annotation | 2
OLID-BR: offensive language identification dataset for Brazilian Portuguese | 2
Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations | 2
Sense representations for Portuguese: experiments with sense embeddings and deep neural language models | 2
Corpus tools for parallel corpora of theatre plays: an introduction to TAligner and ACM-theatre | 2
Towards the benchmarking of question generation: introducing the Monserrate corpus | 2
Speech acts in the Dutch COVID-19 Press Conferences | 2
Broad coverage emotion annotation | 2
Składnica: a constituency treebank of Polish harmonised with the Walenty valency dictionary | 2
Corpora compilation for prosody-informed speech processing | 2
Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser | 2
Live blog summarization | 2
The LRE Map: what does it tell us about the last decade of our field? | 2
A multimodal corpus of simulated consultations between a patient and multiple healthcare professionals | 2
The Visual Language Research Corpus (VLRC): an annotated corpus of comics from Asia, Europe, and the United States | 1
Identifying communicative functions in discourse with content types | 1
Depression symptoms modelling from social media text: an LLM driven semi-supervised learning approach | 1
RastrOS Project: Natural Language Processing contributions to the development of an eye-tracking corpus with predictability norms for Brazilian Portuguese | 1
DiaBLa: a corpus of bilingual spontaneous written dialogues for machine translation | 1
MarIA and BETO are sexist: evaluating gender bias in large language models for Spanish | 1
EventDNA: a dataset for Dutch news event extraction as a basis for news diversification | 1
OMCD: Offensive Moroccan Comments Dataset | 1
FullStop: punctuation and segmentation prediction for Dutch with transformers | 1
A rich task-oriented dialogue corpus in Vietnamese | 1
UHated: hate speech detection in Urdu language using transfer learning | 1
The B-Subtle framework: tailoring subtitles to your needs | 1
Determinants of grader agreement: an analysis of multiple short answer corpora | 1
An eye-tracking-with-EEG coregistration corpus of narrative sentences | 1
Text complexity of open educational resources in Portuguese: mixing written and spoken registers in a multi-task approach | 1
When MIPVU goes to no man’s land: a new language resource for hybrid, morpheme-based metaphor identification in Hungarian | 1
Jira: a Central Kurdish speech recognition system, designing and building speech corpus and pronunciation lexicon | 1
Spelling errors made by people with dyslexia | 1
Universal Dependencies for Mandarin Chinese | 1
JWSAN: Japanese word similarity and association norm | 1
A corpus of Schlieren photography of speech production: potential methodology to study aerodynamics of labial, nasal and vocalic processes | 1
Manipuri–English comparable corpus for cross-lingual studies | 1
A semantics-aware approach for multilingual natural language inference | 1
The CLARIN infrastructure as an interoperable language technology platform for SSH and beyond | 1
Benchmark of public intent recognition services | 1
Correction to: The LRE Map: what does it tell us about the last decade of our field? | 1
Linguistic annotation of Byzantine book epigrams | 1
Speech emotion recognition for the Urdu language | 1
Automatic generation of creative text in Portuguese: an overview | 1
Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning | 1
Automatic language identification: a case study of Pahari languages | 1
Two sepedi-english code-switched speech corpora | 1
Some consideration on expressive audiovisual speech corpus acquisition using a multimodal platform | 1
Infectious risk events and their novelty in event-based surveillance: new definitions and annotated corpus | 1
Constructing a cross-document event coreference corpus for Dutch | 1
A benchmark dataset and evaluation methodology for Chinese zero pronoun translation | 1
Develop corpora and methods for cross-lingual text reuse detection for English Urdu language pair at lexical, syntactical, and phrasal levels | 1
TEI-friendly annotation scheme for medieval named entities: a case on a Spanish medieval corpus | 1
POMET: a corpus for poetic meter classification | 1
NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese | 1
Rant or rave: variation over time in the language of online reviews | 1
adaptNMT: an open-source, language-agnostic development environment for neural machine translation | 1
ChoCo: a multimodal corpus of the Choctaw language | 1
Building the VisSE Corpus of Spanish SignWriting | 1
Multi-domain adaptation for named entity recognition with multi-aspect relevance learning | 1