publications | Michael A. Hedderich

An (often slightly more up-to-date) list of my publications can also be found on Google Scholar.

2025

Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead

Jesujoba O. Alabi, Michael A. Hedderich, David Ifeoluwa Adelani, and Dietrich Klakow

Accepted at EMNLP’25, 2025
What’s the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns

Michael A. Hedderich, Anyi Wang, Raoyuan Zhao, Florian Eichin, Jonas Fischer, and 1 more author

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Prompt engineering for large language models is challenging, as even small prompt perturbations or model changes can significantly impact the generated output texts. Existing evaluation methods of LLM outputs, either automated metrics or human evaluation, have limitations, such as providing limited insights or being labor-intensive. We propose Spotlight, a new approach that combines both automation and human analysis. Based on data mining techniques, we automatically distinguish between random (decoding) variations and systematic differences in language model outputs. This process provides token patterns that describe the systematic differences and guide the user in manually analyzing the effects of their prompts and changes in models efficiently. We create three benchmarks to quantitatively test the reliability of token pattern extraction methods and demonstrate that our approach provides new insights into established prompt data. From a human-centric perspective, through demonstration studies and a user study, we show that our token pattern approach helps users understand the systematic differences of language model outputs. We are further able to discover relevant differences caused by prompt and model changes (e.g. related to gender or culture), thus supporting the prompt engineering process and human-centric model behavior research.
Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set

Florian Eichin, Yang Janet Liu, Barbara Plank, and Michael A. Hedderich

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge that generalizes across languages and frameworks. We address this question along two dimensions: (1) developing a unified discourse relation label set to facilitate cross-lingual and cross-framework discourse analysis, and (2) probing LLMs to assess whether they encode generalizable discourse abstractions. Using multilingual discourse relation classification as a testbed, we examine a comprehensive set of 23 LLMs of varying sizes and multilingual capabilities. Our results show that LLMs, especially those with multilingual training corpora, can generalize discourse information across languages and frameworks. Further layer-wise analyses reveal that language generalization at the discourse level is most salient in the intermediate layers. Lastly, our error analysis provides an account of challenging relation classes.
Explaining crowdworker behaviour through computational rationality

Michael A. Hedderich and Antti Oulasvirta

In Behaviour & Information Technology, 2025
Grokking ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior

Florian Eichin, Yupei Du, Philipp Mondorf, Barbara Plank, and Michael A. Hedderich

In arXiv 2505.20076, 2025
MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs

Raoyuan Zhao, Beiduo Chen, Barbara Plank, and Michael A. Hedderich

Accepted at Findings of EMNLP’25, 2025
Do We Know What LLMs Don’t Know? A Study of Consistency in Knowledge Probing

Raoyuan Zhao, Abdullatif Köksal, Ali Modarressi, Michael A. Hedderich, and Hinrich Schütze

Accepted at Findings of EMNLP’25, 2025
Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics

Florian Eichin, Carolin Schuster, Georg Groh, and Michael A. Hedderich

Accepted at Findings of EMNLP’25, 2025

2024

A Piece of Theatre: Investigating How Teachers Design LLM Chatbots to Assist Adolescent Cyberbullying Education

Michael A. Hedderich, Natalie N. Bazarova, Wenting Zou, Ryun Shim, Xinda Ma, and 1 more author

Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2024
The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models

Bolei Ma, Xinpeng Wang, Tiancheng Hu, Anna-Carolina Haensch, Michael A. Hedderich, and 2 more authors

In EMNLP Findings, 2024
Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination

Qiqi Chen, Xinpeng Wang, Philipp Mondorf, Michael A. Hedderich, and Barbara Plank

In arXiv preprint, 2024
Understanding Human-AI Workflows for Generating Personas

Joongi Shin, Michael A. Hedderich, Bartłomiej Jakub Rey, Andrés Lucero, and Antti Oulasvirta

In Proceedings of the 2024 ACM Designing Interactive Systems Conference (DIS), 2024
Facilitating Asynchronous Idea Generation and Selection with Chatbots

Joongi Shin, Ankit Khatri, Michael A. Hedderich, Andrés Lucero, and Antti Oulasvirta

In Proceedings of OzCHI 2024, 2024

2023

An exploration of knowledge‐organizing technologies to advance transdisciplinary back pain research

Jeffrey C. Lotz, Glen Ropella, Paul Anderson, Qian Yang, Michael A. Hedderich, and 2 more authors

In JOR SPINE, 2023

Chronic low back pain (LBP) is influenced by a broad spectrum of patient-specific factors as codified in domains of the biopsychosocial model (BSM). Operationalizing the BSM into research and clinical care is challenging because most investigators work in silos that concentrate on only one or two BSM domains. Furthermore, the expanding, multidisciplinary nature of BSM research creates practical limitations as to how individual investigators integrate current data into their processes of generating impactful hypotheses. The rapidly advancing field of artificial intelligence (AI) is providing new tools for organizing knowledge, but the practical aspects of how AI may advance LBP research and clinical care are beginning to be explored. The goals of the work presented here are to: (1) explore the current capabilities of knowledge integration technologies (large language models (LLMs), similarity graphs (SGs), and knowledge graphs (KGs)) to synthesize biomedical literature and depict multimodal relationships reflected in the BSM; and (2) highlight limitations, implementation details, and future areas of research to improve performance. We demonstrate preliminary evidence that LLMs, like GPT-3, may be useful in helping scientists analyze and distinguish chronic LBP (cLBP) publications across multiple BSM domains and determine the degree to which the literature supports or contradicts emergent hypotheses. We show that SG representations and KGs enable exploring LBP’s literature in novel ways, possibly providing trans-disciplinary perspectives or insights that are currently difficult, if not infeasible, to achieve. The SG approach is automated, simple, and inexpensive to execute, and thereby may be useful for early-phase literature and narrative explorations beyond one’s areas of expertise. Likewise, we show that KGs can be constructed using automated pipelines, queried to provide semantic information, and analyzed to explore trans-domain linkages. The examples presented support the feasibility of LBP-tailored AI protocols to organize knowledge and support developing and refining trans-domain hypotheses.
Meta Self-Refinement for Robust Learning with Weak Supervision

Dawei Zhu, Xiaoyu Shen, Michael A. Hedderich, and Dietrich Klakow

In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023

Training deep neural networks (DNNs) under weak supervision has attracted increasing research attention as it can significantly reduce the annotation cost. However, labels from weak supervision can be noisy, and the high capacity of DNNs enables them to easily overfit the label noise, resulting in poor generalization. Recent methods leverage self-training to build noise-resistant models, in which a teacher trained under weak supervision is used to provide highly confident labels for teaching the students. Nevertheless, the teacher derived from such frameworks may have fitted a substantial amount of noise and therefore produce incorrect pseudo-labels with high confidence, leading to severe error propagation. In this work, we propose Meta Self-Refinement (MSR), a noise-resistant learning framework, to effectively combat label noise from weak supervision. Instead of relying on a fixed teacher trained with noisy labels, we encourage the teacher to refine its pseudo-labels. At each training step, MSR performs a meta gradient descent on the current mini-batch to maximize the student performance on a clean validation set. Extensive experimentation on eight NLP benchmarks demonstrates that MSR is robust against label noise in all settings and outperforms state-of-the-art methods by up to 11.4% in accuracy and 9.26% in F1 score.
SparseIMU: Computational Design of Sparse IMU Layouts for Sensing Fine-Grained Finger Microgestures

Adwait Sharma, Christina Salchow-Hömmen, Vimal Suresh Mollyn, Aditya Shekhar Nittala, Michael A. Hedderich, and 3 more authors

ACM Trans. Comput.-Hum. Interact., 2023

Gestural interaction with free hands and while grasping an everyday object enables always-available input. To sense such gestures, minimal instrumentation of the user’s hand is desirable. However, the choice of an effective but minimal IMU layout remains challenging, due to the complexity of the multi-factorial space that comprises diverse finger gestures, objects and grasps. We present SparseIMU, a rapid method for selecting minimal inertial sensor-based layouts for effective gesture recognition. Furthermore, we contribute a computational tool to guide designers with optimal sensor placement. Our approach builds on an extensive microgestures dataset that we collected with a dense network of 17 inertial measurement units (IMUs). We performed a series of analyses, including an evaluation of the entire combinatorial space for freehand and grasping microgestures (393K layouts), and quantified the performance across different layout choices, revealing new gesture detection opportunities with IMUs. Finally, we demonstrate the versatility of our method with four scenarios.
Understanding and Mitigating Classification Errors Through Interpretable Token Patterns

Michael A. Hedderich, Jonas Fischer, Dietrich Klakow, and Jilles Vreeken

Abstract at BlackboxNLP @EMNLP’23, 2023
From an Analog to a Digital Workflow: An Introductory Approach to Digital Editions in Assyriology

Timo Homburg, Tim Brandes, Eva-Maria Huber, and Michael A. Hedderich

Cuneiform Digital Library Bulletin, 2023

2022

Chatbots Facilitating Consensus-Building in Asynchronous Co-Design

Joongi Shin, Michael A. Hedderich, Andrés Lucero, and Antti Oulasvirta

In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology (UIST), 2022

Consensus-building is an essential process for the success of co-design projects. To build consensus, stakeholders need to discuss conflicting needs and viewpoints, converge their ideas toward shared interests, and grow their willingness to commit to group decisions. However, managing group discussions is challenging in large co-design projects with multiple stakeholders. In this paper, we investigate the interaction design of a chatbot that can mediate consensus-building conversationally. By interacting with individual stakeholders, the chatbot collects ideas to satisfy conflicting needs and engages stakeholders to consider others’ viewpoints, without having stakeholders directly interact with each other. Results from an empirical study in an educational setting (N = 12) suggest that the approach can increase stakeholders’ commitment to group decisions and maintain the effect even on the group decisions that conflict with personal interests. We conclude that chatbots can facilitate consensus-building in small-to-medium-sized projects, but more work is needed to scale up to larger projects.
MCSE: Multimodal Contrastive Learning of Sentence Embeddings

Miaoran Zhang, Marius Mosbach, David Adelani, Michael Hedderich, and Dietrich Klakow

In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jul 2022

Learning semantically meaningful sentence embeddings is an open problem in natural language processing. In this work, we propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective. Through experiments on a variety of semantic textual similarity tasks, we demonstrate that our approach consistently improves the performance across various datasets and pre-trained encoders. In particular, combining a small amount of multimodal data with a large text-only corpus, we improve the state-of-the-art average Spearman’s correlation by 1.7%. By analyzing the properties of the textual embedding space, we show that our model excels in aligning semantically similar sentences, providing an explanation for its improved performance.
Weak Supervision and Label Noise Handling for Natural Language Processing in Low-Resource Scenarios

Michael A. Hedderich

Jul 2022
Label-Descriptive Patterns and Their Application to Characterizing Classification Errors

Michael A. Hedderich, Jonas Fischer, Dietrich Klakow, and Jilles Vreeken

In International Conference on Machine Learning (ICML), Jul 2022
Task-Adaptive Pre-Training for Boosting Learning With Noisy Labels: A Study on Text Classification for African Languages

Dawei Zhu, Michael A. Hedderich, Fangzhou Zhai, David Ifeoluwa Adelani, and Dietrich Klakow

In Proceedings of the ICLR 2022 Workshop AfricaNLP, Jul 2022
Is BERT Robust to Label Noise? A Study on Learning with Noisy Labels in Text Classification

Dawei Zhu, Michael A. Hedderich, Fangzhou Zhai, David Adelani, and Dietrich Klakow

In Proceedings of the Third Workshop on Insights from Negative Results in NLP, Jul 2022

Incorrect labels in training data occur when human annotators make mistakes or when the data is generated via weak or distant supervision. It has been shown that complex noise-handling techniques - by modeling, cleaning or filtering the noisy instances - are required to prevent models from fitting this label noise. However, we show in this work that, for text classification tasks with modern NLP models like BERT, over a variety of noise types, existing noise-handling methods do not always improve its performance, and may even deteriorate it, suggesting the need for further investigation. We also back our observations with a comprehensive analysis.

2021

On the Correlation of Context-Aware Language Models With the Intelligibility of Polish Target Words to Czech Readers

Klára Jágrová, Michael Hedderich, Marius Mosbach, Tania Avgustinova, and Dietrich Klakow

Frontiers in Psychology, Jul 2021

This contribution seeks to provide a rational probabilistic explanation for the intelligibility of words in a genetically related language that is unknown to the reader, a phenomenon referred to as intercomprehension. In this research domain, linguistic distance, among other factors, has been shown to correlate well with the mutual intelligibility of individual words. However, the role of context for the intelligibility of target words in sentences has been the subject of very few studies. To address this, we analyze data from web-based experiments in which Czech (CS) respondents were asked to translate highly predictable target words at the final position of Polish sentences. We compare correlations of target word intelligibility with data from 3-gram language models (LMs) to their correlations with data obtained from context-aware LMs. More specifically, we evaluate two context-aware LM architectures: Long Short-Term Memory networks (LSTMs) that can, theoretically, take infinitely long-distance dependencies into account and Transformer-based LMs which can access the whole input sequence at the same time. We investigate how their use of context affects surprisal and its correlation with intelligibility.
Proceedings of the First Workshop on Weakly Supervised Learning (WeaSuL)

Jul 2021
Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

Jul 2021
Estimating Formulas for Model Performance Under Noisy Labels Using Symbolic Regression

In Proceedings of the 31st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Jul 2021
A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios

Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow

In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Jul 2021
Analysing the Noise Model Error for Realistic Noisy Label Data

Michael A. Hedderich, Dawei Zhu, and Dietrich Klakow

In Thirty-Fifth AAAI Conference on Artificial Intelligence, Jul 2021
SoloFinger: Robust Microgestures While Grasping Everyday Objects

Adwait Sharma, Michael A. Hedderich, Divyanshu Bhardwaj, Bruno Fruchard, Jess McIntosh, and 4 more authors

In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Jul 2021

Using microgestures, prior work has successfully enabled gestural interactions while holding objects. Yet, these existing methods are prone to false activations caused by natural finger movements while holding or manipulating the object. We address this issue with SoloFinger, a novel concept that allows design of microgestures that are robust against movements that naturally occur during primary activities. Using a data-driven approach, we establish that single-finger movements are rare in everyday hand-object actions and infer a single-finger input technique resilient to false activation. We demonstrate this concept’s robustness using a white-box classifier on a pre-existing dataset comprising 36 everyday hand-object actions. Our findings validate that simple SoloFinger gestures can relieve the need for complex finger configurations or delimiting gestures and that SoloFinger is applicable to diverse hand-object actions. Finally, we demonstrate SoloFinger’s high performance on commodity hardware using random forest classifiers.
ANEA: Distant Supervision for Low-Resource Named Entity Recognition

Michael A. Hedderich, Lukas Lange, and Dietrich Klakow

In ICML 2021 Workshop on Practical Machine Learning For Developing Countries, Jul 2021
Rekonstruktion von fragmentierten Dokumenten mit NLP

Johannes Bernhard and Michael A. Hedderich

In IDCS Workshop Von Analog zu Digital, Jul 2021

2020

Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages

Michael A. Hedderich, David Adelani, Dawei Zhu, Jesujoba Alabi, Udia Markus, and 1 more author

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Jul 2020

Multilingual transformer models like mBERT and XLM-RoBERTa have obtained great improvements for many NLP tasks on a variety of languages. However, recent works also showed that results from high-resource languages could not be easily transferred to realistic, low-resource scenarios. In this work, we study trends in performance for different amounts of available resources for the three African languages Hausa, isiXhosa and Yorùbá on both NER and topic classification. We show that in combination with transfer learning or distant supervision, these models can achieve with as little as 10 or 100 labeled sentences the same performance as baselines with much more supervised training data. However, we also find settings where this does not hold. Our discussions and additional experiments on assumptions such as time and hardware restrictions highlight challenges and opportunities in low-resource learning.
On the Interplay Between Fine-tuning and Sentence-Level Probing for Linguistic Knowledge in Pre-Trained Transformers

Marius Mosbach, Anna Khokhlova, Michael A. Hedderich, and Dietrich Klakow

In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Nov 2020

Fine-tuning pre-trained contextualized embedding models has become an integral part of the NLP pipeline. At the same time, probing has emerged as a way to investigate the linguistic knowledge captured by pre-trained models. Very little is, however, understood about how fine-tuning affects the representations of pre-trained models and thereby the linguistic knowledge they encode. This paper contributes towards closing this gap. We study three different pre-trained models: BERT, RoBERTa, and ALBERT, and investigate through sentence-level probing how fine-tuning affects their representations. We find that for some probing tasks fine-tuning leads to substantial changes in accuracy, possibly suggesting that fine-tuning introduces or even removes linguistic knowledge from a pre-trained model. These changes, however, vary greatly across different models, fine-tuning and probing tasks. Our analysis reveals that while fine-tuning indeed changes the representations of a pre-trained model and these changes are typically larger for higher layers, only in very few cases, fine-tuning has a positive effect on probing accuracy that is larger than just using the pre-trained model with a strong pooling method. Based on our findings, we argue that both positive and negative effects of fine-tuning on probing require a careful interpretation.
Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùbá

David Ifeoluwa Adelani, Michael A. Hedderich, Dawei Zhu, Esther Berg, and Dietrich Klakow

Nov 2020

2019

Using Multi-Sense Vector Embeddings for Reverse Dictionaries

Michael A. Hedderich, Andrew Yates, Dietrich Klakow, and Gerard de Melo

In Proceedings of the 13th International Conference on Computational Semantics - Long Papers, May 2019

Popular word embedding methods such as word2vec and GloVe assign a single vector representation to each word, even if a word has multiple distinct meanings. Multi-sense embeddings instead provide different vectors for each sense of a word. However, they typically cannot serve as a drop-in replacement for conventional single-sense embeddings, because the correct sense vector needs to be selected for each word. In this work, we study the effect of multi-sense embeddings on the task of reverse dictionaries. We propose a technique to easily integrate them into an existing neural network architecture using an attention mechanism. Our experiments demonstrate that large improvements can be obtained when employing multi-sense embeddings both in the input sequence as well as for the target representation. An analysis of the sense distributions and of the learned attention is provided as well.
Handling Noisy Labels for Robustly Learning from Self-Training Data for Low-Resource Sequence Labeling

Debjit Paul, Mittul Singh, Michael A. Hedderich, and Dietrich Klakow

In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, Jun 2019

In this paper, we address the problem of effectively self-training neural networks in a low-resource setting. Self-training is frequently used to automatically increase the amount of training data. However, in a low-resource scenario, it is less effective due to unreliable annotations created using self-labeling of unlabeled data. We propose to combine self-training with noise handling on the self-labeled data. Directly estimating noise on the combined clean training set and self-labeled data can lead to corruption of the clean data and hence, performs worse. Thus, we propose the Clean and Noisy Label Neural Network which trains on clean and noisy self-labeled data simultaneously by explicitly modelling clean and noisy labels separately. In our experiments on Chunking and NER, this approach performs more robustly than the baselines. Complementary to this explicit approach, noise can also be handled implicitly with the help of an auxiliary learning task. To such a complementary approach, our method is more beneficial than other baseline methods and together provides the best performance overall.
Feature-Dependent Confusion Matrices for Low-Resource NER Labeling with Noisy Labels

Lukas Lange, Michael A. Hedderich, and Dietrich Klakow

In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Jun 2019

In low-resource settings, the performance of supervised labeling models can be improved with automatically annotated or distantly supervised data, which is cheap to create but often noisy. Previous works have shown that significant improvements can be reached by injecting information about the confusion between clean and noisy labels in this additional training data into the classifier training. However, for noise estimation, these approaches either do not take the input features (in our case word embeddings) into account, or they need to learn the noise modeling from scratch which can be difficult in a low-resource setting. We propose to cluster the training data using the input features and then compute different confusion matrices for each cluster. To the best of our knowledge, our approach is the first to leverage feature-dependent noise modeling with pre-initialized confusion matrices. We evaluate on low-resource named entity recognition settings in several languages, showing that our methods improve upon other confusion-matrix based methods by up to 9%.

2018

Training a Neural Network in a Low-Resource Setting on Automatically Annotated Noisy Data

Michael A. Hedderich and Dietrich Klakow

In Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP, Jun 2018

Manually labeled corpora are expensive to create and often not available for low-resource languages or domains. Automatic labeling approaches are an alternative way to obtain labeled data in a quicker and cheaper way. However, these labels often contain more errors, which can deteriorate a classifier’s performance when trained on this data. We propose a noise layer that is added to a neural network architecture. This allows modeling the noise and training on a combination of clean and noisy data. We show that in a low-resource NER task we can improve performance by up to 35% by using additional, noisy data and handling the noise.