Overview
- Modern machine learning approaches often require large amounts of labeled training data. We study how one can train such models in low-resource scenarios.
- This includes transfer learning and distant supervision for African low-resource languages.
- Distant and weak supervision make it possible to leverage expert insights efficiently and to label large amounts of unlabeled data automatically. However, this labeling tends to contain errors. We propose methods to model the label noise and leverage these labels more effectively.
WeaSuL Workshop
Weak and distant supervision is a popular topic in machine learning, computer vision and NLP, both from a theoretical and an applied/industry perspective. To bring together researchers from these different perspectives and to help newcomers enter the field, we organize the WeaSuL workshop at ICLR’21.
Workshop website
Visual Guide
As a companion to our survey, we published a more applied and visual guide to low-resource NLP. It is available on Towards Data Science.
Publications
-
Dawei Zhu, Michael A. Hedderich, Fangzhou Zhai, David Ifeoluwa Adelani, and Dietrich Klakow
In Proceedings of the ICLR 2022 Workshop AfricaNLP, 2022
-
Dawei Zhu, Michael A. Hedderich, Fangzhou Zhai, David Adelani, and Dietrich Klakow
In Proceedings of the Third Workshop on Insights from Negative Results in NLP, 2022
Incorrect labels in training data occur when human annotators make mistakes or when the data is generated via weak or distant supervision. It has been shown that complex noise-handling techniques, which model, clean or filter the noisy instances, are required to prevent models from fitting this label noise. However, we show in this work that, for text classification tasks with modern NLP models like BERT, over a variety of noise types, existing noise-handling methods do not always improve the model’s performance and may even deteriorate it, suggesting the need for further investigation. We also back our observations with a comprehensive analysis.
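To make this kind of experiment concrete, below is a minimal sketch of injecting synthetic, class-independent label noise into a text classification dataset before fine-tuning a model such as BERT. The uniform noise type, the 30% rate and the toy labels are illustrative assumptions, not the exact settings of the paper.

```python
import numpy as np

def inject_uniform_label_noise(labels, noise_rate, num_classes, seed=0):
    """Flip each label to a different, uniformly chosen class with
    probability `noise_rate` (class-independent, "uniform" label noise)."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    flip = rng.random(len(noisy)) < noise_rate
    for i in np.where(flip)[0]:
        # pick a wrong class uniformly among the remaining ones
        wrong_classes = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(wrong_classes)
    return noisy

# Illustrative 4-class example: corrupt roughly 30% of the labels.
clean_labels = np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1])
noisy_labels = inject_uniform_label_noise(clean_labels, noise_rate=0.3, num_classes=4)
print(clean_labels)
print(noisy_labels)
```

Class-dependent noise can be simulated analogously by sampling the wrong class from a per-class confusion matrix instead of uniformly.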
-
Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow
In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021
-
Michael A. Hedderich, Dawei Zhu, and Dietrich Klakow
In Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021
-
Michael A. Hedderich, Lukas Lange, and Dietrich Klakow
In ICML 2021 Workshop on Practical Machine Learning For Developing Countries, 2021
-
Michael A. Hedderich, David Adelani, Dawei Zhu, Jesujoba Alabi, Udia Markus, and Dietrich Klakow
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Multilingual transformer models like mBERT and XLM-RoBERTa have obtained great improvements for many NLP tasks on a variety of languages. However, recent work has also shown that results from high-resource languages could not be easily transferred to realistic, low-resource scenarios. In this work, we study trends in performance for different amounts of available resources for the three African languages Hausa, isiXhosa and Yorùbá on both NER and topic classification. We show that, in combination with transfer learning or distant supervision, these models can achieve the same performance with as little as 10 or 100 labeled sentences as baselines trained on much more supervised data. However, we also find settings where this does not hold. Our discussions and additional experiments on assumptions such as time and hardware restrictions highlight challenges and opportunities in low-resource learning.
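As an illustration of the distant supervision used in such settings, here is a minimal sketch of gazetteer-based NER labeling, where tokens are matched against entity lists. The gazetteer entries and the token sequence are made up for illustration; a real pipeline would also match multi-token spans and draw its lists from larger resources such as knowledge bases.

```python
def gazetteer_ner_labels(tokens, gazetteers):
    """Distant supervision for NER: label a token with an entity type if it
    appears in the corresponding entity list (gazetteer), otherwise 'O'.
    Only single-token matches here; real pipelines also match spans."""
    labels = []
    for token in tokens:
        label = "O"
        for entity_type, names in gazetteers.items():
            if token in names:
                label = entity_type
                break
        labels.append(label)
    return labels

# Hypothetical, tiny gazetteers; real ones could be extracted from knowledge bases.
gazetteers = {
    "LOC": {"Kano", "Lagos", "Abuja"},
    "PER": {"Amina", "Musa"},
}
tokens = ["Amina", "ta", "tafi", "Kano"]  # illustrative Hausa-like token sequence
print(list(zip(tokens, gazetteer_ner_labels(tokens, gazetteers))))
```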
-
David Ifeoluwa Adelani, Michael A. Hedderich, Dawei Zhu, Esther Berg, and Dietrich Klakow
2020
-
Debjit Paul, Mittul Singh, Michael A. Hedderich, and Dietrich Klakow
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, 2019
In this paper, we address the problem of effectively self-training neural networks in a low-resource setting. Self-training is frequently used to automatically increase the amount of training data. However, in a low-resource scenario, it is less effective due to unreliable annotations created by self-labeling unlabeled data. We propose to combine self-training with noise handling on the self-labeled data. Directly estimating noise on the combined clean training set and self-labeled data can lead to corruption of the clean data and hence performs worse. Thus, we propose the Clean and Noisy Label Neural Network, which trains on clean and noisy self-labeled data simultaneously by explicitly modelling clean and noisy labels separately. In our experiments on chunking and NER, this approach performs more robustly than the baselines. Complementary to this explicit approach, noise can also be handled implicitly with the help of an auxiliary learning task. Combined with such an implicit approach, our method yields larger gains than the other baselines and achieves the best overall performance.
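The sketch below illustrates the self-labeling step that produces the noisy annotations discussed above, with a scikit-learn classifier standing in for the neural sequence labeler and an illustrative confidence threshold. In the proposed Clean and Noisy Label Neural Network, these self-labels are then modeled in a separate, explicitly noisy output path rather than being mixed directly into the clean training set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_label(clean_X, clean_y, unlabeled_X, confidence=0.9):
    """One round of self-labeling: fit a model on the small clean set and
    label the unlabeled instances it is confident about. These self-labels
    form the noisy data that is handled separately from the clean labels."""
    base = LogisticRegression(max_iter=1000).fit(clean_X, clean_y)
    probs = base.predict_proba(unlabeled_X)
    keep = probs.max(axis=1) >= confidence
    return unlabeled_X[keep], probs[keep].argmax(axis=1)

# Illustrative usage with random features standing in for token representations.
rng = np.random.default_rng(0)
clean_X, clean_y = rng.normal(size=(50, 16)), rng.integers(0, 3, size=50)
unlabeled_X = rng.normal(size=(500, 16))
noisy_X, noisy_y = self_label(clean_X, clean_y, unlabeled_X)
print(len(noisy_y), "self-labeled instances")
```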
-
Lukas Lange, Michael A. Hedderich, and Dietrich Klakow
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
In low-resource settings, the performance of supervised labeling models can be improved with automatically annotated or distantly supervised data, which is cheap to create but often noisy. Previous works have shown that significant improvements can be reached by injecting information about the confusion between clean and noisy labels in this additional training data into the classifier training. However, for noise estimation, these approaches either do not take the input features (in our case word embeddings) into account, or they need to learn the noise modeling from scratch which can be difficult in a low-resource setting. We propose to cluster the training data using the input features and then compute different confusion matrices for each cluster. To the best of our knowledge, our approach is the first to leverage feature-dependent noise modeling with pre-initialized confusion matrices. We evaluate on low-resource named entity recognition settings in several languages, showing that our methods improve upon other confusion-matrix based methods by up to 9%.
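The following sketch outlines the clustering idea under simple assumptions: k-means on the input embeddings and confusion matrices estimated by smoothed counting on a small set that carries both clean and distantly supervised labels. It is an illustration of the approach, not the exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def per_cluster_confusion_matrices(embeddings, clean_labels, noisy_labels,
                                   num_clusters, num_labels, smoothing=1.0):
    """Cluster tokens by their input embeddings, then estimate one
    row-normalized confusion matrix P(noisy label | clean label) per cluster
    from a small set annotated with both clean and distant labels.
    The smoothed counts serve as the pre-initialization of the matrices."""
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    clusters = kmeans.fit_predict(embeddings)
    matrices = np.full((num_clusters, num_labels, num_labels), smoothing)
    for k, c, n in zip(clusters, clean_labels, noisy_labels):
        matrices[k, c, n] += 1.0
    matrices /= matrices.sum(axis=2, keepdims=True)  # each row sums to 1
    return kmeans, matrices
```

At training time, the matrix of a token’s cluster would then be applied on top of the classifier’s output for the distantly supervised data, much like the noise layer sketched for the last publication below.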
-
Michael A. Hedderich, and Dietrich Klakow
In Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP, 2018
Manually labeled corpora are expensive to create and often not available for low-resource languages or domains. Automatic labeling approaches are an alternative way to obtain labeled data in a quicker and cheaper way. However, these labels often contain more errors, which can deteriorate a classifier’s performance when trained on this data. We propose a noise layer that is added to a neural network architecture. This allows modeling the noise and training on a combination of clean and noisy data. We show that in a low-resource NER task we can improve performance by up to 35% by using additional, noisy data and handling the noise.
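As a rough illustration, the following PyTorch sketch shows such a noise layer: a trainable, row-stochastic matrix applied on top of the base classifier’s label distribution and used only for the noisy portion of the training data. The initialization confidence, the feature dimension and the number of labels are illustrative assumptions, not the configuration of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseLayer(nn.Module):
    """A trainable, row-stochastic matrix that maps the base model's (clean)
    label distribution to the distribution of the noisy labels. It is
    initialized close to the identity, i.e. assuming mostly correct labels."""
    def __init__(self, num_labels, init_confidence=0.9):
        super().__init__()
        off_diag = (1.0 - init_confidence) / (num_labels - 1)
        init = torch.full((num_labels, num_labels), off_diag)
        init.fill_diagonal_(init_confidence)
        self.logits = nn.Parameter(init.log())

    def forward(self, clean_probs):
        noise_matrix = F.softmax(self.logits, dim=-1)  # rows sum to 1
        return clean_probs @ noise_matrix

# Training sketch: clean batches are trained on the base output directly,
# noisy batches are passed through the noise layer first.
num_labels = 5
base = nn.Sequential(nn.Linear(100, num_labels), nn.Softmax(dim=-1))
noise_layer = NoiseLayer(num_labels)
x_noisy = torch.randn(8, 100)
y_noisy = torch.randint(0, num_labels, (8,))
loss_noisy = F.nll_loss(torch.log(noise_layer(base(x_noisy)) + 1e-8), y_noisy)
loss_noisy.backward()
```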