Author: Davis David
Africa has over 2000 languages however, these languages are not well represented in the existing Natural language processing (NLP) ecosystem. One of the challenges is the lack of useful African language datasets that can be used to solve different social and economical problems.
In this article, I have compiled a list of African language datasets from across the web. These datasets can be used in numerous NLP tasks such as text classification, named entity recognition, machine translation, sentiment analysis, speech recognition, and topic modeling.
This collection of datasets have been made public to give you an opportunity to use your skills and help solving different challenges.
Text Classification
Text classification datasets are categorized or organized into different groups based on their contents.
Below is the list of African language datasets for Text classification.
1.Swahili news Dataset
The Swahili news dataset contains more than 31,000 news articles from different news categories such as Local, International, Business or Financial, health, sports, and Entertainment. The Swahili language is one of the most spoken languages in Africa, it is spoken by 100-150 million people across East Africa.
The data was collected from different news publication platforms inside and outside of Tanzania. The dataset can be used to develop a multi-class classification model to classify news content according to their specific categories specified.
The model can be used by Swahili online news platforms to automatically group news according to their categories and help readers find the specific news they want to read.
You can also download this dataset from the datasets python library:
from datasets import load_dataset
dataset = load_dataset("swahili_news")
Note: The Swahili news dataset has an imbalance of category distribution. It contains few news articles in the following categories:
- International News( 6.2%)
- Health News(4.9%)
- Business News(4.3%)
2.Chichewa News Dataset
This dataset consists of news articles in Chichewa. Chichewa is a Bantu language spoken in much of Southern, Southeast, and East Africa, namely the countries of Malawi and Zambia, where it is an official language.
The dataset contains a collection of 3,482 articles, containing over 930,000 words, and over 48,000 sentences. The Chichewa news articles have been categorized into 19 categories such as education, law/order.politics, culture, arts and crafts, farming, economy, and wildlife.
You can also download this dataset from the following link: AI4D Malawi News Classification Zindi Challenge.
Named-entity Recognition
Named-entity Recognition datasets are used to extract information by locating and classifying named entities mentioned in unstructured text. Examples of entities are person names, organizations, locations, times, and dates.
NER is an essential component of numerous applications including spellcheckers, conversational agents, and localization of voice and dialogue systems.
Below is the list of African language datasets for Named-entity Recognition.
3.Masakhane-ner Datasets
Masakhane is a grassroots NLP community for Africa, by Africans with a mission to strengthen and spur NLP research in African languages. The community created the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages.
- Amharic
- Hausa
- Igbo
- Kinyarwanda
- Luganda
- Luo
- Naija Pidgin
- Swahili
- Wolof
- Yorùbá
You can read the research paper here MasakhaNER: Named Entity Recognition for African Languages and download the ten NER datasets here.
Machine Translation
Machine translation (MT) is the task of translating a text or speech in a source language to a different target language. Machine translation can be used to translate large volumes of text quickly without any human input.
Machine translation datasets can be used to create MT models for different purpose such as:
- Internal emails and other written or oral communication.
- Documentation and instructions for products or services.
Below is the list of African language datasets for Machine translation.
4.French to Ewe and French to Fongbe Machine Translation Dataset
This is a parallel corpus dataset for machine translation from French to Ewe and French to Fongbe.
Fonbge and Ewe are Niger-Congo languages, Fongbe is spoken in Benin with approximately 4.1 million speakers while Ewe is spoken in Togo and southeastern Ghana with approximately 4.5 million speakers.
This dataset contains roughly 23,000 French to Ewe and 53,000 French to Fongbe parallel sentences, collected from blogs, tales, newspapers, daily conversations, webpages and annotated for neural machine translation.
5.Yorùbá to English Machine Translation Dataset
This is a parallel sentence corpus dataset for machine translation from the Yorùbá language to the English language.
Yorùbá is the Niger-Congo language and it is spoken in West Africa (southwestern Nigeria). The number of Yorùbá speakers is estimated at between 45 to 55 million.
The dataset consists of 10,054 parallel Yorùbá-English sentences from different domains like news, Yorùbá proverbs, movie transcript, localization translation, and books.
6.English to Luganda Machine Translation Dataset
This is a parallel sentence corpus dataset for machine translation from the English language to the Luganda language.
Luganda is a Bantu language and it is one of the major languages in Uganda. It is spoken by more than 8.5 million Baganda and other people in Kampala(capital city of Uganda).
The dataset consists of 15,022 parallel English-Luganda sentences and it was created by a team of researchers from the AI & Data Science research Lab at Makerere University with a team of Luganda teachers, students, and freelancers.
Sentiment Analysis
Sentiment Analysis Datasets are used for the interpretation and classification of emotions (positive, negative, and neutral) within text data using different text analysis methods.
Sentiment analysis has found its applications in various fields such as social media monitoring, brand monitoring, customer service, and market research.
Below is the list of African language datasets for Sentiment Analysis.
7.Tunizi Dataset
Tunizi is the first Tunisian Arabizi sentiment analysis dataset. Tunisian Arabizi represents the Tunisian dialect that is written in Latin characters and numbers rather than Arabic letters.
iCompass gathered comments from social media platforms that express sentiment about popular topics. They extracted 100k comments using public streaming APIs.
The collected comments were manually annotated using an overall polarity:
- Positive (1)
- Negative (-1)
- Neutral (0)
The annotators were diverse in gender, age, and social background.
You can also download this dataset from the datasets python library:
from datasets import load_dataset
dataset = load_dataset("tunizi")
Speech Recognition
Speech recognition, also known as Automatic Speech Recognition (ASR) can be defined as a technology that analyzes human speech and formulates an output, often a written transcription, in real-time. Sometimes referring to the process as “speech to text.”
Don’t confuse this with voice recognition, as voice recognition just seeks to identify an individual user’s voice.
Below is the list of African language datasets for Speech Recognition.
Wolof is the language of Senegal, the Gambia, and Mauritania. It is spoken by more than 10 million people and about 40 percent (approximately 5 million people) of Senegal’s population speak Wolof as their native language.
The ASR dataset has a total of 6,683 audio files and transcriptions and it was created by a team of researchers from Baamtu Datamation company in Senegal.
9. Speech Recognition dataset in Kinyarwanda
Kinyarwanda is the Bantu language and an official language of Rwanda. It is spoken by at least 12 million people in Rwanda, the Eastern Democratic Republic of the Congo, and southern Uganda.
The dataset was created by 895 speakers from different genders and ages in a common voice platform. The dataset has a total of 1,183 hours of validated speech. The current dataset size is 40 GB.
Topic Modeling
Topic modeling uses unsupervised learning techniques to extract the main topic or set of topics that occur in a collection of text documents.
Below is the list of African language datasets for Topic Modelling.
10.South African News Dataset
This is the news dataset from South Africa. The news data were collected from SABC4 Facebook pages. The SABC is the public broadcaster from South Africa.
The dataset contains news headlines (i.e short text) from Setswana and Sepedi languages. Setswana is a Bantu language spoken in Southern Africa by about 8.2 million people while Sepedi is mainly spoken in the northern parts of South Africa by 4.7 million people.
Since the dataset is not annotated, you can use it to create a Topic model to cluster news data into different news topics such as sports, politics, culture, and entertainment.
Final Thoughts on African Language Datasets
I hope you found this list of different African language datasets useful and you can use them in your next data science project. I will be happy to see what applications/solutions you will create from these datasets. If you couldn’t find the dataset you need, please check out the following links:
You can also find the author on Twitter @Davis_McDavid.
Republished from Hackernoon