Language Resources and Evaluation

(The TQCC of Language Resources and Evaluation is 3. The table below lists the papers that meet or exceed that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2020-07-01 to 2024-07-01.)
Title | Citations
Resources and benchmark corpora for hate speech detection: a systematic review | 132
Machine translation systems and quality assessment: a systematic review | 51
DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text | 39
Investigating the effects of gender, dialect, and training size on the performance of Arabic speech recognition | 22
The Natural Stories corpus: a reading-time corpus of English texts containing rare syntactic constructions | 21
A comparative evaluation and analysis of three generations of Distributional Semantic Models | 21
The ParlaMint corpora of parliamentary proceedings | 17
A large English–Thai parallel corpus from the web and machine-generated text | 16
Current limitations in cyberbullying detection: On evaluation criteria, reproducibility, and data scarcity | 16
SENTiVENT: enabling supervised information extraction of company-specific events in economic and financial news | 15
Low resource language specific pre-processing and features for sentiment analysis task | 15
Automatic genre identification: a survey | 15
Introducing the Gab Hate Corpus: defining and applying hate-based rhetoric to social media posts at scale | 14
Roman Urdu toxic comment classification | 12
Machine translation in society: insights from UK users | 11
AI2D-RST: a multimodal corpus of 1000 primary school science diagrams | 11
Developing computational infrastructure for the CorCenCC corpus: The National Corpus of Contemporary Welsh | 8
The Electronic Corpus of 17th- and 18th-century Polish Texts | 8
Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks | 8
The impact of preprocessing on word embedding quality: a comparative study | 7
Multiple annotation for biodiversity: developing an annotation framework among biology, linguistics and text technology | 7
LDC-IL: The Indian repository of resources for language technology | 7
Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents | 6
A large and evolving cognate database | 6
Comparative performance of ensemble machine learning for Arabic cyberbullying and offensive language detection | 6
Resources for Turkish dependency parsing: introducing the BOUN Treebank and the BoAT annotation tool | 6
Improvement of sentiment analysis via re-evaluation of objective words in SenticNet for hotel reviews | 6
Linguistic resources for paraphrase generation in portuguese: a lexicon-grammar approach | 6
MEmoFC: introducing the Multilingual Emotional Football Corpus | 6
The KAS corpus of Slovenian academic writing | 6
Commonsense based text mining on urban policy | 6
Abstractive text summarization and new large-scale datasets for agglutinative languages Turkish and Hungarian | 5
Towards alignment strategies in human-agent interactions based on measures of lexical repetitions | 5
SetembroBR: a social media corpus for depression and anxiety disorder prediction | 5
TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese | 5
Investigating the role of swear words in abusive language detection tasks | 5
Detecting explicit lyrics: a case study in Italian music | 4
Constructing Arabic Reading Comprehension Datasets: Arabic WikiReading and KaifLematha | 4
Representing variation in a spoken corpus of an endangered dialect: the case of Torlak | 4
Labelling the past: data set creation and multi-label classification of Dutch archaeological excavation reports | 4
A multi-source entity-level sentiment corpus for the financial domain: the FinLin corpus | 4
Resources for Turkish natural language processing: A critical survey | 4
Finnish parliament ASR corpus | 4
Making the most of comparable corpora in Neural Machine Translation: a case study | 3
Modelling multi-level prosody and spectral features using deep neural network for an automatic tonal and non-tonal pre-classification-based Indian language identification system | 3
PRAUTOCAL corpus: a corpus for the study of Down syndrome prosodic aspects | 3
Annotating affective dimensions in user-generated content | 3
Arabic real time entity resolution using inverted indexing | 3
Nonverbal communication with emojis in social media: dissociating hedonic intensity from frequency | 3
A Spanish dataset for reproducible benchmarked offline handwriting recognition | 3
Sentence boundary detection of various forms of Tunisian Arabic | 3
Content-free speech activity records: interviews with people with schizophrenia | 3
Register identification from the unrestricted open Web using the Corpus of Online Registers of English | 3
Unparalleled sarcasm: a framework of parallel deep LSTMs with cross activation functions towards detection and generation of sarcastic statements | 3
TuLeD (Tupían lexical database): introducing a database of a South American language family | 3
LexO: an open-source system for managing OntoLex-Lemon resources | 3
The robotic-surgery propositional bank | 3
Semi-automation of gesture annotation by machine learning and human collaboration | 3
DISCO PAL: Diachronic Spanish sonnet corpus with psychological and affective labels | 3
LanguageCrawl: a generic tool for building language models upon common Crawl | 3
Semantics-aware typographical choices via affective associations | 3
Two languages, one treebank: building a Turkish–German code-switching treebank and its challenges | 3
Predicting lexical complexity in English texts: the Complex 2.0 dataset | 3
The WASABI song corpus and knowledge graph for music lyrics analysis | 3