Hebrew and Arabic NLP Resources
Arabic Stories Corpus
Corpora
Unannotated Corpora
146 Arabic children stories (MSA).
Language
Arabic
License
Apache License 2.0
Arabic Stories Corpus
OSAC
Corpora
Unannotated Datasets
Open Source Arabic Corpora (MSA). 22,000 text documents, each belonging to 1 of 10 categories: Economics, History, Entertainments, etc.
Language
Arabic
License
Unknown
OSAC
ArCOV-19
Corpora
Unannotated Corpora
The First Arabic COVID-19 Twitter Dataset with Propagation Networks. About 3.2M tweets in mixed dialect Arabic associated with COVID-19, from an ongoing collection that started in January 2020.
Language
Arabic
License
Unknown
ArCOV-19
Shami
Corpora
Unannotated Corpora
A Corpus of Levantine Arabic Dialects. 117,805 natural sentences from conversations in various Levantine dialects: Jordanian, Palestinian, Lebanese, and Syrian.
Language
Arabic
License
Apache License 2.0
Shami
Abuelkhair Corpus
Corpora
Unannotated Corpora
More than 5 million newspaper articles in MSA.
Language
Arabic
License
Unknown
Abuelkhair Corpus
ArabicWeb16
Corpora
Unannotated Corpora
A Crawl for Today’s Arabic Web. 150M Arabic Web pages with high coverage of dialectal Arabic, Egyptian, Gulf, Levantine (~7M) and Maghrebi, as well as MSA, from a variety of sources - Wikipedia, Alexa, ArClueWeb09, and Twitter, etc.
Language
Arabic
ArabicWeb16
PADIC
Corpora
Annotated Datasets
A multilingual Parallel Arabic DIalectal Corpus. A parallel corpus of 6,400 sentences in multiple Arabic dialects: Algerian, Maghreb, Syrian, Palestinian and MSA, for dialect detection and machine translation.
Language
Arabic
License
GPLv3
Task
Dialect Identification, Machine Translation
PADIC
QCRI Parallel Tweets
Corpora
Annotated Datasets
Bilingual Corpus of Arabic-English Parallel Tweets. 166,000 bilingual tweets in Arabic and English (parallel) for machine translation tasks.
Language
Arabic
License
Apache License 2.0
Task
Machine Translation
QCRI Parallel Tweets
Kawarith
Corpora
Annotated Datasets, Unannotated Corpora
An Arabic Twitter Corpus for Crisis Events. A large-scale crisis-related multi-dialect Arabic Twitter corpus of 1,658,795 unique tweets from 22 emergency events. This corpus can be leveraged for several tasks, including crisis detection and crisis type classification.
Language
Arabic
License
CC BY-NC 4.0
Task
Topic Classification, Topic Modeling
Kawarith
Habibi
Corpora
Unannotated Corpora
A Multi-Dialect Multi-National Arabic Song Lyrics Corpus. More than 30,000 Arabic song lyrics in 6 Arabic dialects (Egyptian, Levantine, etc.) for singers from 18 different Arab countries, segmented into sentences and words and labeled with song information.
Language
Arabic
License
Unknown
Habibi
Arabic Wiki Data Dumps
Corpora
Unannotated Corpora
Wikipedia, the free encyclopedia, publishes dumps of its content as XML files on a monthly basis.
Language
Arabic
Arabic Wiki Data Dumps
QASR
Corpora
Recorded Speech and Audio Corpora
QCRI Al Jazeera Speech Resource - The largest transcribed Arabic speech corpus with around 2,000 hours with multi-layer annotation, in multi-dialect and code-switching speech, crawled from the Al Jazeera news channel, for speech recognition, dialect identification, punctuation restoration, speaker identification, speaker linking, etc.
Language
Arabic
License
Unknown
Task
Speech Recognition, Speech Synthesis, Speech-to-Text (STT), Text-to-Speech (TTS)
QASR
DiaCorpus
Corpora
Annotated Datasets, Recorded Speech and Audio Corpora
The DiaCorpus project is a collaboration between the Data Science Institute (DSI) and the Israel Innovation Authority. Its purpose is to create a first-of-its-kind Arabic textual repository in a local dialect (Israeli / Palestinian). The project is part of the National Language Processing plan of Israel.
Language
Arabic
License
CC BY-SA 4.0
Task
Coreference Resolution, Emotion Detection, Named Entity Recognition (NER), Sentiment Analysis, Summarization
DiaCorpus
DART
Corpora
Annotated Datasets
A Large Dataset of Dialectal Arabic Tweets. About 25K tweets that are annotated via crowdsourcing for 5 Arabic dialects: Egyptian, Maghrebi, Levantine, Gulf, and Iraqi.
Language
Arabic
License
Unknown
Task
Dialect Identification
DART
IAHLT Named Entities Dataset (Arabic Subset)
Corpora
Annotated Datasets
This dataset contains named entity annotations for Arabic texts from various sources, curated as part of the IAHLT multilingual NER project. The Arabic portion is provided here as a cleaned subset intended for training and evaluation in named entity recognition tasks.
Language
Arabic
License
CC BY 4.0
Task
Named Entity Recognition (NER)
IAHLT Named Entities Dataset (Arabic Subset)
ASTD: Arabic Sentiment Tweets Dataset
Annotated Datasets
Annotated Datasets
10,000 tweets classified as objective, subjective positive, subjective negative, and subjective mixed.
Language
Arabic
License
GPLv2
Task
Sentiment Analysis
ASTD: Arabic Sentiment Tweets Dataset
Arabic 100k Reviews
Corpora
Annotated Datasets
Reviews with three classes from different services. The dataset combines reviews from hotels, books, movies, products and a few airlines. It has three classes (Mixed, Negative and Positive). Most were mapped from reviewers' ratings with 3 being mixed, above 3 positive and below 3 negative. Each row has a label and text separated by a tab (tsv). Text (reviews) were cleaned by removing Arabic diacritics and non-Arabic characters. The dataset has no duplicate reviews. The hotels and book reviews are a subset of HARD and BRAD. The rest were selected from hadyelsahar with a little over 100 airlines reviews collected manually.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
Arabic 100k Reviews
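The labeling and cleaning steps described for Arabic 100k Reviews can be sketched as follows. This is a minimal illustration, not the dataset authors' code: the function names, the exact Unicode ranges used for "Arabic diacritics" and "non-Arabic characters", and the label spellings are assumptions made for the example.

```python
import re

# Assumption: "Arabic diacritics" means the tashkeel marks U+064B..U+0652
# plus the superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Assumption: "non-Arabic characters" means anything outside the basic
# Arabic letter range U+0621..U+064A, other than whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")


def rating_to_label(rating: int) -> str:
    """Map a reviewer rating to a class: 3 is mixed, above 3 positive, below 3 negative."""
    if rating > 3:
        return "Positive"
    if rating < 3:
        return "Negative"
    return "Mixed"


def clean_review(text: str) -> str:
    """Delete diacritics, blank out non-Arabic characters, collapse whitespace."""
    text = DIACRITICS.sub("", text)   # delete, so words are not split
    text = NON_ARABIC.sub(" ", text)  # replace with a space, then collapse
    return " ".join(text.split())


def parse_row(line: str) -> tuple[str, str]:
    """Split one TSV row into (label, text)."""
    label, text = line.rstrip("\n").split("\t", 1)
    return label, text
```

Diacritics are deleted before the non-Arabic filter runs so that a vocalized word stays in one piece instead of being split at each mark.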
HARD
Corpora
Annotated Datasets
Hotel Arabic-Reviews Dataset. This dataset contains 93,700 hotel reviews in Arabic. The reviews were collected from the Booking.com website during June/July 2016. They are written in Modern Standard Arabic as well as dialectal Arabic.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
HARD
BRAD
Corpora
Annotated Datasets
Books Reviews in Arabic Dataset. This dataset contains 510,600 book reviews in Arabic. The reviews were collected from the GoodReads.com website during June/July 2016. This work extends the earlier large-scale Arabic dataset LABR, which has around 63K Arabic book reviews collected from GoodReads.com. The reviews are expressed mainly in Modern Standard Arabic, but there are reviews in dialectal Arabic as well.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
BRAD
Large Multi-Domain Resources for Arabic Sentiment Analysis
Corpora
Automatically Annotated
33K automatically annotated reviews in the domains of movies, hotels, restaurants and products.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
Large Multi-Domain Resources for Arabic Sentiment Analysis
AJGT
Corpora
Annotated Datasets
The Arabic Jordanian General Tweets (AJGT) corpus consists of 1,800 tweets annotated as positive or negative, written in Modern Standard Arabic (MSA) or Jordanian dialect.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
AJGT
ArSAS
Corpora
Annotated Datasets
An Arabic Speech-Act and Sentiment Corpus of Tweets. 21,000 tweets manually annotated for six different classes of speech-act labels.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
ArSAS
AraSenCorpus
Corpora
Annotated Datasets, Unannotated Corpora
4.5 million tweets annotated as positive, negative or neutral. Built with a semi-supervised framework that annotates a large Arabic text corpus by starting from a small portion of manually annotated tweets (15,000 tweets) and extending it with a large set of unlabeled tweets (34.7 million tweets), reducing human annotation effort and providing a middle ground between manual and automatic labeling of a large dataset.
Language
Arabic
License
MIT
Task
Sentiment Analysis
AraSenCorpus
LABR
Corpora
Annotated Datasets
A Large-Scale Arabic Book Reviews Dataset. 63,000 book reviews in mixed dialect Arabic for sentiment analysis. Reviews with ratings of 4 or 5 are considered positive, and those with ratings of 1 or 2 are considered negative. Reviews with a rating of 3 are considered neutral and are not included in the polarity classification.
Language
Arabic
License
GPLv2
Task
Sentiment Analysis
LABR
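The LABR polarity rule described above (4 or 5 positive, 1 or 2 negative, 3 neutral and left out) can be written as a small filter. This is an illustrative sketch only; the function names and the `(rating, text)` input shape are assumptions, not part of the dataset's own tooling.

```python
from typing import Iterable, List, Optional, Tuple


def polarity(rating: int) -> Optional[str]:
    """Return the polarity class for a 1-5 rating, or None for neutral (3)."""
    if rating in (4, 5):
        return "positive"
    if rating in (1, 2):
        return "negative"
    return None  # rating 3: neutral, excluded from polarity classification


def polarity_subset(reviews: Iterable[Tuple[int, str]]) -> List[Tuple[str, str]]:
    """Keep only reviews with a polarity label, as (label, text) pairs."""
    labeled = []
    for rating, text in reviews:
        label = polarity(rating)
        if label is not None:
            labeled.append((label, text))
    return labeled
```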
TEAD
Corpora
Automatically Annotated
Large Scale Arabic Dataset for Sentiment Analysis. 6 million mixed dialect Arabic tweets with a vocabulary of 602,721 distinct entities, annotated by emojis and a sentiment lexicon as subjective positive, subjective negative and neutral (dialectal tweets “translated” into MSA).
Language
Arabic
License
GPLv3
Task
Sentiment Analysis
TEAD
MASC
Corpora
Annotated Datasets
Multi-domain Arabic Sentiment Corpus. 8,860 positive and negative reviews from different domains, in a variety of dialects, as well as a list of 3,880 positive and negative synsets annotated with their part of speech, polarity scores, dialect synsets and inflected forms.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
MASC
KALIMAT
Corpora
Automatically Annotated
A Multipurpose Arabic Corpus. 20,200 articles from the Omani newspaper Al Watan with summaries, named entities, part-of-speech tagging, and morphological analysis.
Language
Arabic
License
Unknown
Task
Morphological Analysis, Named Entity Recognition (NER), Part-of-speech Tagging, Text Summarization
KALIMAT
Dialectal Arabic Datasets
Corpora
Annotated Datasets
1,400 manually segmented and POS-tagged tweets in four dialects: Egyptian, Levantine, Gulf, and Maghrebi.
Language
Arabic
License
Apache License 2.0
Task
Morphological Segmentation, Part-of-speech (POS) Tagging
Dialectal Arabic Datasets
The MADAR Arabic Dialect Corpus
Corpora
Annotated Datasets
A collection of ~12,000 parallel sentences covering the dialects of 25 cities from the Arab World, in addition to English, French, and MSA.
Language
Arabic
License
Custom Terms of Use
Task
Dialect Identification
The MADAR Arabic Dialect Corpus
MSDA
Corpora
Annotated Datasets
An open access NLP dataset for Arabic dialects. More than 50K tweets in five national dialects, labeled for several applications: dialect detection, topic detection and sentiment analysis.
Language
Arabic
License
Unknown
Task
Dialect Identification, Sentiment Analysis, Topic Classification
MSDA
Prague Arabic Dependency Treebank 1.0
Corpora
Annotated Datasets
Language resource for Arabic natural language processing (NLP), a collection of parsed sentences annotated with syntactic structures.
Language
Arabic
License
Custom Terms of Use
Task
Dependency Treebanks, Morphological Analysis, Morphological Segmentation, Part-of-speech (POS) Tagging, Stemming and Lemmatization
Prague Arabic Dependency Treebank 1.0
OCLAR
Corpora
Annotated Datasets
Opinion Corpus for Lebanese Arabic Reviews. 3,900 Arabic customer reviews across a wide range of domains, including restaurants, hotels, hospitals, local shops, etc.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
OCLAR
Arabic Twitter Corpus for Flood Detection
Corpora
Annotated Datasets
4,037 human-labelled Arabic Twitter messages in Middle Eastern dialects, for four high-risk flood events that occurred in 2018, labelled based on relatedness to the crisis and information type.
Language
Arabic
License
Unknown
Task
Topic Classification, Topic Modeling
Arabic Twitter Corpus for Flood Detection
Satirical Fake News Dataset
Corpora
Annotated Datasets
Scraped from two satirical news websites, Al-Hudood and Al-Ahram Al-Mexici, for training a fake news classifier.
Language
Arabic
License
Unknown
Task
Text Classification
Satirical Fake News Dataset
Omcca
Corpora
Annotated Datasets
Opinion Mining: Analysis of Comments Written in Arabic Colloquial. 28,576 reviews representing the sentiments of 5,422 different reviewers and covering 27 different categories, collected from the Jeeran website, in Saudi and Jordanian Arabic.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
Omcca
Annotated Shami Corpus
Corpora
Corpora
Lebanese Arabic corpus annotated for numerous morphological features and for orthography standardization.
Language
Arabic
License
Unknown
Task
Morphological Analysis
Annotated Shami Corpus
NSAR
Corpora
Annotated Datasets
Negation and Speculation in Arabic Review. 3K review sentences annotated with negation and speculation in Egyptian dialect.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
NSAR
Dialectal Arabic Code-Switching Dataset
Corpora
Annotated Datasets
Transcribed audio in Egyptian dialect annotated at word-level for Code Switching (CS).
Language
Arabic
License
MIT
Task
Dialect Identification
Dialectal Arabic Code-Switching Dataset
BAEC
Corpora
Annotated Datasets
The Bangor Arabic–English Code-switching (BAEC) corpus. 45,251 words manually annotated for code-switching between Saudi, Egyptian and MSA Arabic and English.
Language
Arabic
License
Unknown
Task
Dialect Identification
BAEC
AraCOVID19-MFH
Corpora
Annotated Datasets
Arabic COVID-19 Multi-label Fake News & Hate Speech Detection Dataset. 10,828 mixed dialect Arabic tweets annotated with 10 different labels concerning fake news and hate speech.
Language
Arabic
License
CC BY-NC-SA 4.0
Task
Content Moderation, Text Classification
AraCOVID19-MFH
L-HSAB
Corpora
Annotated Datasets
A Levantine Twitter Dataset for Hate Speech and Abusive Language. 5,846 Syrian/Lebanese political tweets labeled as normal, abusive or hate
Language
Arabic
License
Unknown
Task
Content Moderation, Text Classification
L-HSAB
Let-Mi
Corpora
Annotated Datasets
An Arabic Levantine Twitter Dataset for Misogynistic Language. 6,603 tweets in Levantine Arabic annotated as either non-misogynistic or one of seven misogynistic language categories.
Language
Arabic
License
Unknown
Task
Content Moderation, Text Classification
Let-Mi
MPOLD
Corpora
Annotated Datasets
Arabic Offensive Comments dataset from Multiple Social Media Platforms. Annotated social media comment dataset with (not) offensive language tags for Arabic social media comments collected from three different online platforms: Twitter, Facebook and YouTube.
Language
Arabic
License
Apache License 2.0
Task
Content Moderation, Text Classification
MPOLD
A Corpus of Offensive Language in Arabic
Corpora
Annotated Datasets
16,000 comments on YouTube videos from different nationalities annotated for offensive language.
Language
Arabic
License
Unknown
Task
Content Moderation, Text Classification
A Corpus of Offensive Language in Arabic
Religious Hate Speech Detection for Arabic Tweets
Corpora
Annotated Datasets
Tweets in MSA and dialectal Arabic annotated for hate speech. The training set contains 5,569 examples and the test set contains 567 examples.
Language
Arabic
License
Unknown
Task
Content Moderation, Text Classification
Religious Hate Speech Detection for Arabic Tweets
COVID-FAKES
Corpora
Annotated Datasets
Bilingual (Arabic/English) COVID-19 Twitter dataset for misleading information detection. The dataset was annotated automatically, using the information shared on the official websites and Twitter accounts of the WHO, UNICEF, and UN as a source of reliable information. Tweets were annotated using 13 different machine learning algorithms and 7 different feature extraction techniques.
Language
Arabic
License
Unknown
Task
Content Moderation, Text Classification
COVID-FAKES
Adult Content Detection on Arabic Twitter
Corpora
Annotated Datasets
Language
Arabic
License
Unknown
Task
Content Moderation, Text Classification
Adult Content Detection on Arabic Twitter
Fine-Grained Hate Speech Detection on Arabic Twitter
Corpora
Annotated Datasets
12,700 tweets in mixed dialect Arabic, no bias towards specific topics, genres, or dialects, each judged by 3 annotators for offensiveness classified into one of the hate speech types: Race, Religion, Ideology, Disability, Social Class, and Gender, and also judged whether a tweet has vulgar language or violence.
Language
Arabic
License
CC BY 4.0
Task
Content Moderation, Text Classification
Fine-Grained Hate Speech Detection on Arabic Twitter
ArCOV19-Rumors
Corpora
Annotated Datasets
An Arabic COVID-19 Twitter dataset for misinformation detection. Covers 138 verified claims, mostly from popular fact-checking websites, and 9.4K tweets relevant to those claims, manually annotated by veracity to support research on misinformation detection.
Language
Arabic
License
Unknown
Task
Content Moderation, Text Classification
ArCOV19-Rumors
Journalist Questions on Twitter
Corpora
Annotated Datasets
10,000 mixed dialect Arabic tweets manually annotated for question type.
Language
Arabic
License
Unknown
Task
Question Classification
Journalist Questions on Twitter
AraFacts Dataset
Corpora
Annotated Datasets
Language
Arabic
License
CC BY-NC 4.0
Task
Content Moderation, Text Classification
AraFacts Dataset
DAICT
Corpora
Annotated Datasets
A Dialectal Arabic Irony Corpus Extracted from Twitter. 5,588 tweets, written in both MSA and mixed dialectal Arabic, manually annotated for irony by two professional linguists from HBKU.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
DAICT
IDAT
Corpora
Annotated Datasets
Irony Detection in Arabic Tweets. ~5.5k mixed dialect Arabic tweets annotated by two native Arabic speakers, plus another 5.5k tweets randomly sampled from the original unannotated corpus.
Language
Arabic
License
GPLv3
Task
Sentiment Analysis
IDAT
iSarcasm
Corpora
Annotated Datasets
A Dataset of Intended Sarcasm. Dataset of tweets in Arabic and English labeled for sarcasm directly by their authors.
Language
Arabic
License
MIT
Task
Sentiment Analysis
iSarcasm
AraCovid19-SSD
Corpora
Annotated Datasets
Arabic COVID-19 Sentiment and Sarcasm Detection Dataset. A manually annotated multi-label dataset containing 5,162 tweets.
Language
Arabic
License
CC BY-NC-SA 4.0
Task
Sentiment Analysis
AraCovid19-SSD
BOLT
Corpora
Annotated Datasets
Egyptian Arabic SMS/Chat and Transliteration. 1,856 naturally occurring Arabizi conversations transliterated from the original romanized Arabizi script into standard Arabic orthography.
Language
Arabic
License
Custom Terms of Use
Task
Transliteration
BOLT
ARC-WMI
Corpora
Annotated Datasets
Arabic collection of written medicine information annotated with readability levels. Contains 4,476 sentences with over 61k words, extracted from 94 sources of Arabic written medicine information, annotated and assigned a readability level by a panel of health-care professionals.
Language
Arabic
License
CC BY-NC-SA 4.0
Task
Readability Assessment
ARC-WMI
UD_Arabic-PADT
Corpora
Annotated Datasets
The Arabic-PADT UD treebank is based on the Prague Arabic Dependency Treebank (PADT), created at the Charles University in Prague. The treebank consists of 7,664 sentences (282,384 tokens) and its domain is mainly newswire.
Language
Arabic
License
CC BY-NC-SA 3.0
Task
Dependency Treebanks, Morphological Analysis, Morphological Segmentation, Part-of-speech (POS) Tagging, Stemming and Lemmatization
UD_Arabic-PADT
Tashkeela
Corpora
Annotated Datasets
Arabic diacritization corpus. A collection of vocalized Arabic texts covering modern and classical Arabic. The data contains over 75 million fully vocalized words obtained from 97 books, structured in text files. The corpus was collected mostly from Islamic classical books [14], using a semi-automatic web crawling process. The Modern Standard Arabic texts crawled from the Internet represent 1.15% of the corpus (about 867,913 words), while most of it was collected from the Shamela Library, which represents 98.85%, with 74,762,008 words contained in 97 books.
Language
Arabic
License
GPLv2
Task
Diacritization/Vocalization
Tashkeela
Arabic Sentiment Analysis
Corpora
Annotated Datasets, Unannotated Corpora
36K tweets labeled as positive or negative. Distant supervision and self-training were used to annotate the corpus, and 8K tweets were manually annotated as a gold standard. The corpus was evaluated intrinsically by comparing it to human classification and pre-trained sentiment analysis models, and extrinsically on a sentiment analysis task, achieving an accuracy of 86%.
Language
Arabic
License
Apache License 2.0
Task
Sentiment Analysis
Arabic Sentiment Analysis
Arabic Sentiment Analysis and Cross-lingual Sentiment Resources
Corpora
Annotated Datasets, Unannotated Corpora
(1) BBN Blog Posts Sentiment Corpus - A random subset of 1,200 Levantine dialectal sentences, annotated for sentiment (both manual and automatic). (2) Syria Tweets Sentiment Corpus - A dataset of 2,000 tweets originating from Syria, annotated for sentiment (both manual and automatic).
Language
Arabic
License
Custom Terms of Use
Task
Sentiment Analysis
Arabic Sentiment Analysis and Cross-lingual Sentiment Resources
TyDiQA
Corpora
Annotated Datasets
A question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The Arabic portion contains 15,645 question-answer pairs.
Language
Arabic
License
Apache License 2.0
Task
Question Answering (QA)
TyDiQA
SenZi
Lexicons
A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi) - Lebanese dialect Arabizi sentiment lexicon, sentiment annotated datasets, and a Facebook corpus.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
SenZi
DiaLex
Evaluation, Word Embeddings
A testbank of word pairs for six syntactic and semantic relations across five important Arabic dialects was created, and used to evaluate a set of existing and new Arabic word embeddings.
Language
Arabic
License
Unknown
DiaLex
ivrit.ai Community-Driven Transcription Project
Collaborative Projects
Recorded Speech and Audio Corpora
This project harnesses the collaborative efforts of volunteers to transcribe audio recordings in Hebrew.
Language
Hebrew
Task
Speech Recognition, Speech Synthesis, Text-to-Speech (TTS), Speech-to-Text (STT)
ivrit.ai Community-Driven Transcription Project
Recital: ivrit.ai Community-Driven Recording Project
Collaborative Projects
Recorded Speech and Audio Corpora
This project leverages the collaborative efforts of volunteers to record and contribute spoken recordings of Hebrew texts, building an open and diverse speech corpus.
Language
Hebrew
Task
Speech Recognition, Speech Synthesis, Text-to-Speech (TTS), Speech-to-Text (STT)
Recital: ivrit.ai Community-Driven Recording Project
WikiQAar
Corpora
Annotated Datasets
A bilingual English-Arabic Question Answering corpus built on top of WIKIQA.
Language
Arabic
License
Unknown
Task
Question Answering (QA)
WikiQAar
Maknuune
Lexicons
A large open lexicon for the Palestinian Arabic dialect. Maknuune has over 36K entries from 17K lemmas, and 3.7K roots. All entries include diacritized Arabic orthography, phonological transcription and English glosses. Some entries are enriched with additional information.
Language
Arabic
License
CC BY-SA 4.0
Task
Diacritization/Vocalization, Transliteration
Maknuune
Arabizi-Transliteration Corpus
Dictionaries & Word Lists
The first large-scale "Arabizi to Arabic script" parallel corpus focusing on the Jordanian dialect. It consists of more than 25k pairs taken from Twitter, Facebook and ASK, carefully created and inspected by native speakers to ensure the highest quality.
Language
Arabic
License
Unknown
Task
Transliteration
Arabizi-Transliteration Corpus
Arabic Stop Words
Dictionaries & Word Lists
A list of ~750 possible stop words in Arabic.
Language
Arabic
License
MIT
Task
Stopwords Removal
Arabic Stop Words
Buckwalter’s list of Arabic roots
Dictionaries & Word Lists
Language
Arabic
License
Unknown
Task
Stopwords Removal
Buckwalter’s list of Arabic roots
NileULex
Lexicons
Nile University's Arabic sentiment lexicon. Egyptian Arabic and Modern Standard Arabic sentiment words and their polarity. Available for research; commercial use requires the authors' permission.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
NileULex
DAWQAS
Corpora
Annotated Datasets
A Dataset for Arabic Why Question Answering System. 3,205 why question-answer pairs scraped from public Arabic websites.
Language
Arabic
License
Unknown
Task
Question Answering (QA)
DAWQAS
AQAD
Corpora
Automatically Annotated
17,000+ Arabic questions and answers, collected via a fully automated data collector from a set of Arabic Wikipedia articles for the extractive question answering task.
Language
Arabic
License
Unknown
Task
Question Answering (QA)
AQAD
ARCD
Corpora
Annotated Datasets
Wikipedia open-domain Question Answering. 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of SQuAD.
Language
Arabic
License
MIT
Task
Question Answering (QA)
ARCD
ANERcorp
Corpora
300 documents annotated for entity recognition.
Language
Arabic
License
CC BY-SA 4.0
Task
Named Entity Recognition (NER)
ANERcorp
Arabic-Tashkeela-Model
Models and Tools
Other Models and Tools
A diacritization model for the Arabic language. The model was trained on Tashkeela, the Arabic diacritization corpus, on Kaggle.
Language
Arabic
License
Unknown
Task
Diacritization/Vocalization
Arabic-Tashkeela-Model
AraBERT
Models and Tools
Fine-Tuned Language Models, Pre-Trained Language Models
Transformer-based Model for Arabic Language Understanding.
Language
Arabic
License
Multiple
Task
Named Entity Recognition (NER), Question Answering (QA), Sentiment Analysis
AraBERT
CAMeLBERT
Models and Tools
Fine-Tuned Language Models, Pre-Trained Language Models
A collection of pre-trained models for Arabic NLP tasks. The models were fine-tuned for Sentiment Analysis, Dialect Identification, Poetry Classification, NER, POS Tagging.
Language
Arabic
License
MIT
Task
Dialect Identification, Named Entity Recognition (NER), Part-of-speech (POS) Tagging, Sentiment Analysis, Text Classification
CAMeLBERT
Spark NLP for Arabic
Models and Tools, Word Embeddings
Fine-Tuned Language Models, Other Models and Tools, Pipelines/Parsers, Pre-Trained Language Models
45 pre-trained models covering translation, word and sentence embeddings, Named Entity Recognition (NER), stop words removal, Part-of-Speech (POS) tagging, and lemmatization.
Language
Arabic
License
Multiple
Task
Machine Translation, Named Entity Recognition (NER), Part-of-speech (POS) Tagging, Stemming and Lemmatization, Stopwords Removal
Spark NLP for Arabic
AraELECTRA
Models and Tools
Fine-Tuned Language Models, Pre-Trained Language Models
An Arabic language representation model, pretrained using the replaced token detection objective on large Arabic text corpora. AraELECTRA’s performance is validated on three Arabic NLP tasks: question answering (QA), sentiment analysis (SA) and named-entity recognition (NER).
Language
Arabic
License
Custom Terms of Use
Task
Named Entity Recognition (NER), Question Answering (QA), Sentiment Analysis
AraELECTRA
AraGPT2
Models and Tools
Causal Language Models (CLM), Pre-Trained Language Models
Pre-Trained Transformer for Arabic Language Generation.
Language
Arabic
License
Custom Terms of Use
Task
Language Generation
AraGPT2
QARiB
Models and Tools
Pre-Trained Language Models
The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text.
Language
Arabic
License
Apache License 2.0
QARiB
ARBERT & MARBERT
Models and Tools
Fine-Tuned Language Models, Pre-Trained Language Models
Large-scale pre-trained masked language models focused on both Dialectal Arabic (DA) and MSA; fine-tuned on ArBench: Sentiment Analysis, Social Meaning, Topic Classification, Dialect Identification, Named Entity Recognition.
Language
Arabic
License
Unknown
Task
Content Moderation, Dialect Identification, Emotion Detection, Named Entity Recognition (NER), Sentiment Analysis, Topic Classification
ARBERT & MARBERT
Khoja
Models and Tools
Other Models and Tools
A root-based stemmer (heavy stemming; root extractor; rule-based). The algorithm was widely used in Arabic IR. It reduces inflected word forms to their roots by first removing the longest prefixes and suffixes. The resulting word is then matched against predefined patterns and list-driven roots, with the pattern chosen according to the length of the remaining word. Finally, the extracted root is compared to a list of roots to check its validity.
Language
Arabic
License
Apache License 2.0
Task
Stemming and Lemmatization
Khoja
Light10
Models and Tools
Other Models and Tools
A tokenizer and stemmer for Arabic based on Lucene's UTF-8 tokenizer and ArabicStemmer. The ArabicStemmer is Lucene's implementation of Larkey’s light stemmer Light10.
Language
Arabic
License
Apache License 2.0
Task
Stemming and Lemmatization, Tokenization
Light10
BAMA 2.0
Models and Tools
Other Models and Tools
Buckwalter Arabic Morphological Analyzer Version 2.0. A stem-based morphological analyzer (stemmer).
Language
Arabic
License
Custom Terms of Use
Task
Morphological Analysis, Stemming and Lemmatization
BAMA 2.0
SAMA 3.1
Models and Tools
Other Models and Tools
Standard Arabic Morphological Analyzer (SAMA) Version 3.1. The LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 is based on, and updates, Buckwalter Arabic Morphological Analyzer (BAMA) 2.0. SAMA is a software tool for the morphological analysis of Standard Arabic. It considers each Arabic word token in all possible prefix-stem-suffix segmentations, and lists all known/possible annotation solutions, with assignment of all diacritic marks, morpheme boundaries (separating clitics and inflectional morphemes from stems), and all Part-of-Speech (POS) labels and glosses for each morpheme segment. The generated output may then be reviewed by users, and the most appropriate annotation selected from among several choices. The input format, output format, and data layer of SAMA 3.1 were designed to be backward compatible with BAMA. Incremental changes to the data layer in SAMA have resulted in: 1) increased lexicon coverage in the dictionary files, 2) important changes and additions to the inventory of POS tags, and 3) more possible solutions generated for numerous word forms.
Language
Arabic
License
Custom Terms of Use
Task
Morphological Analysis, Stemming and Lemmatization
SAMA 3.1
Tashaphyne
Models and Tools
Other Models and Tools
Arabic light stemmer and segmenter. It mainly supports light stemming (removing prefixes and suffixes) and returns all possible segmentations of a given word. It uses a modified finite-state automaton to generate the segmentations and extracts all possible affixes from a word. To extract the stem, Tashaphyne removes the longest affix from the word; the affixes can then be validated against a list of valid affixes.
Language
Arabic
License
GPLv3
Task
Morphological Segmentation, Stemming and Lemmatization
Tashaphyne
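The idea of listing every prefix-stem-suffix segmentation and then taking the one with the longest affixes can be illustrated with plain Python. This sketch does not use the Tashaphyne package or its finite-state automaton; the function names, the tiny affix sets, and the minimum stem length are hypothetical choices made only for the example.

```python
from typing import List, Set, Tuple

# Hypothetical toy affix lists; the empty string means "no affix".
PREFIXES: Set[str] = {"", "و", "وال"}
SUFFIXES: Set[str] = {"", "ات"}


def all_segmentations(word: str, prefixes: Set[str], suffixes: Set[str],
                      min_stem_len: int = 2) -> List[Tuple[str, str, str]]:
    """Enumerate every (prefix, stem, suffix) split whose affixes are in the lists."""
    segs = []
    n = len(word)
    for p in range(n + 1):
        prefix = word[:p]
        if prefix not in prefixes:
            continue
        # The stem must keep at least min_stem_len characters (an assumption).
        for s in range(p + min_stem_len, n + 1):
            suffix = word[s:]
            if suffix in suffixes:
                segs.append((prefix, word[p:s], suffix))
    return segs


def light_stem(word: str, prefixes: Set[str], suffixes: Set[str]) -> str:
    """Pick the segmentation with the longest affixes, i.e. the shortest stem."""
    segs = all_segmentations(word, prefixes, suffixes)
    if not segs:
        return word
    return min(segs, key=lambda seg: len(seg[1]))[1]
```

With these toy lists, `all_segmentations("والكتابات", PREFIXES, SUFFIXES)` yields six candidate splits, and `light_stem` keeps the one whose stem is `كتاب`.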
Assem
Models and Tools
Other Models and Tools
Assem's Arabic Light Stemmer is a snowball-based stemming algorithm for Arabic aimed mainly to improve search. Assem stemmer is fast and can be generated in many programming languages through Snowball (a small string processing language designed for creating stemming algorithms to be used in IR systems). Assem stemmer offers light stemming and text normalization. It can be configured to run as root extractor or stemmer, but in two separate packages, because the Snowball framework does not support stemming and rooting at the same time.
Language
Arabic
License
Custom Terms of Use
Task
Stemming and Lemmatization, Text Normalization
Assem
MOTAZ
Models and Tools
Other Models and Tools
The Motaz stemmer provides both root extraction and light stemming. The root extraction part is an implementation of the Khoja stemmer that differs only in using a different stopword list. The light stemming part is an implementation of the Light10 Arabic light stemming algorithm proposed by Larkey and colleagues. Before applying the Light10 algorithm, the Motaz stemmer normalizes the input word by removing diacritics, replacing all forms of Hamza with ا, replacing ة with ه, and replacing ى with ي.
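The normalization step described above can be sketched in a few lines of Python. This is not the Motaz stemmer's own code. It assumes that "all forms of Hamza" means the alef variants آ, أ and إ, and that "diacritics" means the Unicode range U+064B to U+0652; the function name `normalize` is hypothetical.

```python
import re

# Assumption: diacritics are the short-vowel marks, tanween, shadda and sukun.
_DIACRITICS = re.compile("[\u064B-\u0652]")
# Assumption: the Hamza forms are alef with madda, hamza above and hamza below.
_ALEF_HAMZA = re.compile("[\u0622\u0623\u0625]")


def normalize(word):
    """Apply the four normalization steps in the order listed in the description."""
    word = _DIACRITICS.sub("", word)          # remove diacritics
    word = _ALEF_HAMZA.sub("\u0627", word)    # Hamza forms -> bare alef
    word = word.replace("\u0629", "\u0647")   # teh marbuta -> heh
    word = word.replace("\u0649", "\u064A")   # alef maksura -> yeh
    return word
```

The code points are written as escapes so that each replaced character is unambiguous regardless of font or text direction.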
Language
Arabic
License
Apache License 2.0
Task
Stemming and Lemmatization, Text Normalization
MOTAZ
Al-Stem (Darwish)
Models and Tools
Other Models and Tools
Al-Stem is a light stemmer that strips the following prefixes, tried in order from right to left (وال، فال، بال، بت، يت، لت، مت، وت، ست، نت، بم، لم، وم، كم، فم، ال، لل، في، وا، وا، فا، لا، با), and the following suffixes, also tried from right to left (ات، وا، ون، وه، ان، تي، ته، تم، كم، هم، هن، ها، ية، تك، نا، ين، يه، ة، هـ، ي، ا). Darwish and Oard used Al-Stem in their experiments to develop a technique for Arabic-English cross-language information retrieval at TREC 2002. Cross-language IR means that the query is written in a language different from that of the documents. Al-Stem was later modified by David Graff of the Linguistic Data Consortium (LDC) to strip off the suffixes (تا and ا) and the prefixes (سي and تت) from Al-Stem's affix lists.
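A minimal sketch of list-driven light stemming in the spirit of the description above, using the prefixes and suffixes it lists (the duplicated entry appears once, and هـ is written without the elongation mark). It is not the original Al-Stem code. Stripping at most one prefix and one suffix, and keeping at least `min_len` letters, are assumptions made for illustration.

```python
PREFIXES = ["وال", "فال", "بال", "بت", "يت", "لت", "مت", "وت", "ست", "نت", "بم",
            "لم", "وم", "كم", "فم", "ال", "لل", "في", "وا", "فا", "لا", "با"]
SUFFIXES = ["ات", "وا", "ون", "وه", "ان", "تي", "ته", "تم", "كم", "هم", "هن",
            "ها", "ية", "تك", "نا", "ين", "يه", "ة", "ه", "ي", "ا"]


def al_stem(word, min_len=2):
    """Strip the first matching prefix, then the first matching suffix."""
    for prefix in PREFIXES:
        # Assumed guard: never strip if fewer than min_len letters would remain.
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word
```

Because the lists are scanned in order, longer affixes such as وال are tried before shorter ones such as ال, so list order decides which affix wins when several match.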
Language
Arabic
License
Unknown
Task
Stemming and Lemmatization
Al-Stem (Darwish)
Sebawai (Darwish)
Models and Tools
Other Models and Tools
Sebawai is a root-based analyzer that relies on automatically derived rules and statistics. It has two main modules. The first module constructs a list of "word-root" pairs using a morphological analyzer called ALPNET; it then extracts a list of prefixes, suffixes, and stem templates and estimates the probability that each prefix, suffix, or stem template will occur. The second module takes a word and produces the possible combinations of prefixes, suffixes, and templates. These combinations are obtained by removing prefixes and suffixes from the word and then comparing all the resulting stems to the templates. The result is a ranked list of roots. These roots are matched automatically against a list of 10,000 roots extracted from an electronic copy of Lisan Al-Arab to confirm that they exist.
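The second module described above can be sketched as follows. This is not Sebawai's code: the probabilities, templates, and root list are hypothetical toy values standing in for what the first module would estimate, and multiplying the three probabilities is an assumed scoring rule. In a template, the Latin letter C marks a root-letter slot and any other character must match literally.

```python
# Hypothetical toy statistics; Sebawai derives these from word-root pairs.
PREFIX_P = {"": 0.6, "ال": 0.3, "و": 0.1}
SUFFIX_P = {"": 0.6, "ون": 0.2, "ة": 0.2}
# "C" is a root slot; \u0627 is alef and \u0645 is meem, matched literally.
TEMPLATE_P = {"CCC": 0.5, "C\u0627CC": 0.3, "\u0645CCC": 0.1, "CC\u0627C": 0.1}
# Stand-in for the root list taken from Lisan Al-Arab.
KNOWN_ROOTS = {"كتب", "درس", "علم"}


def match_template(stem, template):
    """Return the root letters if the stem fits the template, else None."""
    if len(stem) != len(template):
        return None
    root = []
    for ch, slot in zip(stem, template):
        if slot == "C":
            root.append(ch)
        elif ch != slot:
            return None
    return "".join(root)


def ranked_roots(word):
    """Score prefix/suffix/template combinations, rank roots, keep known ones."""
    scores = {}
    for prefix, p_pre in PREFIX_P.items():
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix, p_suf in SUFFIX_P.items():
            if not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)]
            for template, p_tpl in TEMPLATE_P.items():
                root = match_template(stem, template)
                if root is None:
                    continue
                score = p_pre * p_suf * p_tpl  # assumed scoring rule
                scores[root] = max(scores.get(root, 0.0), score)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(root, score) for root, score in ranked if root in KNOWN_ROOTS]
```

For example, `ranked_roots("الكاتب")` strips the prefix ال, fits the remaining stem to the alef-infixed template, and returns the root كتب with its score.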
Language
Arabic
License
Unknown
Task
Stemming and Lemmatization
Sebawai (Darwish)
AlKhalil
Models and Tools
Other Models and Tools
A diacritizer, POS-Tagger, root extractor, stemmer, lemmatizer, and morphosyntactic analyzer.
Language
Arabic
License
Apache License 2.0
Task
Diacritization/Vocalization, Morphological Analysis, Part-of-speech (POS) Tagging, Stemming and Lemmatization
AlKhalil
MADAMIRA
Models and Tools
Pipelines/Parsers
MADAMIRA is a morphological analyzer that provides tokenization, part-of-speech tagging, morphological disambiguation for the full range of morphological features, lemmatization, diacritization, named entity recognition, and base phrase chunking.
Language
Arabic
License
Custom Terms of Use
Task
Diacritization/Vocalization, Morphological Analysis, Named Entity Recognition (NER), Part-of-speech (POS) Tagging, Stemming and Lemmatization, Tokenization
MADAMIRA
CAMeL Tools
Models and Tools
Other Models and Tools
CAMeL Tools is a suite of open-source Arabic natural language processing tools in Python, developed by the CAMeL Lab at New York University Abu Dhabi. It currently provides utilities for pre-processing, morphological modeling, dialect identification, named entity recognition, and sentiment analysis.
Language
Arabic
License
MIT
Task
Diacritization/Vocalization, Dialect Identification, Morphological Analysis, Named Entity Recognition (NER), Part-of-speech (POS) Tagging, Sentiment Analysis, Stemming and Lemmatization, Text Normalization, Tokenization, Transliteration
CAMeL Tools
ElixirFM
Models and Tools
Pipelines/Parsers
ElixirFM is a functional morphological analyzer that uses syntactic features to distinguish a word's sense. It exploits the correlation between Arabic grammar and morphology to improve the root extraction process, using the Prague Arabic Dependency Treebank (PADT) to provide annotated syntactic features associated with the stem dictionary (the ElixirFM lexicon) as additional morphological knowledge. The ElixirFM lexicon is derived from the open-source Buckwalter lexicon.
Language
Arabic
License
Unknown
Task
Dependency Treebanks, Morphological Analysis, Morphological Inflection, Part-of-speech (POS) Tagging, Stemming and Lemmatization, Transliteration
ElixirFM
ADAM
Models and Tools
Other Models and Tools
Analyzer for Dialectal Arabic Morphology. ADAM is built on the SAMA database and can analyze both Egyptian and Levantine dialects.
Language
Arabic
License
Custom Terms of Use
Task
Morphological Analysis, Part-of-speech (POS) Tagging, Stemming and Lemmatization
ADAM
Qutuf
Models and Tools
Other Models and Tools
An Arabic morphological analyzer (including stemming and root extraction) and part-of-speech tagger built as an expert system. Qutuf is intended to be the core of a framework for Arabic NLP. It identifies and implements some new concepts, such as the First Normalization and Second Normalization text forms in the preprocessing phase and Premature and Overdue Tagging in the part-of-speech tagging task. The POS tagging is designed and implemented as a rule-based expert system, and it uses a POS tagset built on a morphological feature tagset. Morphological analysis includes both stemming (light stemming) and root extraction (heavy stemming), achieved using finite-state automata and agreement rules developed for cliticization parsing. Qutuf also uses an enriched version of the open-source AlKhalil Morpho Sys database for root extraction, pattern matching, morphological feature and POS assignment, and closed nouns. See also online: qutuf.com
Language
Arabic
License
Apache License 2.0
Task
Morphological Analysis, Part-of-speech (POS) Tagging, Stemming and Lemmatization, Text Normalization
Qutuf
Qalsadi
Models and Tools
Other Models and Tools
Arabic morphological analyzer library for Python.
Language
Arabic
License
GPL
Task
Morphological Analysis, Part-of-speech (POS) Tagging, Stemming and Lemmatization
Qalsadi
Want to suggest a new resource? Email us at hilla@webiks.com to help improve our platform!