Hebrew and Arabic NLP Resources

Arabic Stories Corpus

Unannotated Corpora

146 Arabic children stories (MSA).

Apache License 2.0

Unannotated Datasets

Open Source Arabic Corpora (MSA). 22,000 text documents, each belonging to 1 of 10 categories: Economics, History, Entertainments, etc.

Unannotated Corpora

The First Arabic COVID-19 Twitter Dataset with Propagation Networks. About 3.2M tweets in mixed dialect Arabic associated with COVID-19, from an ongoing collection that started in January 2020.

Unannotated Corpora

A Corpus of Levantine Arabic Dialects. 117,805 natural sentences from conversations in various Levantine dialects: Jordanian, Palestinian, Lebanese, and Syrian.

Apache License 2.0

Abuelkhair Corpus

Unannotated Corpora

More than 5 million newspaper articles in MSA.

Unannotated Corpora

A Crawl for Today’s Arabic Web. 150M Arabic Web pages with high coverage of dialectal Arabic, Egyptian, Gulf, Levantine (~7M) and Maghrebi, as well as MSA, from a variety of sources - Wikipedia, Alexa, ArClueWeb09, and Twitter, etc.

Annotated Datasets

A multilingual Parallel Arabic DIalectal Corpus. A parallel corpus of 6,400 sentences in multiple Arabic dialects: Algerian, Maghreb, Syrian, Palestinian and MSA, for dialect detection and machine translation.

Dialect Identification, Machine Translation

QCRI Parallel Tweets

Annotated Datasets

Bilingual Corpus of Arabic-English Parallel Tweets. 166,000 bilingual tweets in Arabic and English (parallel) for machine translation tasks.

Apache License 2.0

Machine Translation

Annotated Datasets, Unannotated Corpora

An Arabic Twitter Corpus for Crisis Events. A large-scale crisis-related multi-dialect Arabic Twitter corpus of 1,658,795 unique tweets from 22 emergency events. This corpus can be leveraged for several tasks, including crisis detection and crisis type classification.

Topic Classification, Topic Modeling

Unannotated Corpora

a multi Dialect multi National Arabic Song Lyrics Corpus. More than 30,000 Arabic song lyrics in 6 Arabic dialects (Egyptian, Levantine, etc.) for singers from 18 different Arabic countries, segmented into sentences and words and labeled with song information.

Arabic Wiki Data Dumps

Unannotated Corpora

Wikipedia, the free encyclopedia, publishes dumps of its content as XML files on a monthly basis.
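As a rough illustration of working with such XML dumps, the sketch below streams pages out of a tiny inline sample using only the Python standard library. The sample content, the namespace version in it, and the helper names `local_name` and `iter_pages` are assumptions made for this example; real dump files are far larger, are usually distributed compressed, and carry more fields per page.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file (hypothetical content, assumed namespace version).
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>قطر</title>
    <revision><text>قطر دولة عربية.</text></revision>
  </page>
  <page>
    <title>مصر</title>
    <revision><text>مصر دولة عربية.</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree attaches to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) pairs one page at a time, clearing each page after use."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            elem.clear()  # release the finished page's subtree
            title, text = None, None


pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP.encode("utf-8"))))
for page_title, page_text in pages:
    print(page_title, len(page_text))
```

Streaming with `iterparse` avoids loading the whole file into memory, which matters for full-size dumps.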

Recorded Speech and Audio Corpora

QCRI Al Jazeera Speech Resource - The largest transcribed Arabic speech corpus with around 2,000 hours with multi-layer annotation, in multi-dialect and code-switching speech, crawled from the Al Jazeera news channel, for speech recognition, dialect identification, punctuation restoration, speaker identification, speaker linking, etc.

Speech Recognition, Speech Synthesis, Speech-to-Text (STT), Text-to-Speech (TTS)

Annotated Datasets, Recorded Speech and Audio Corpora

The DiaCorpus project is a collaboration between the Data Science Institute (DSI) and the Israel Innovation Authority. Its purpose is to create a first-of-its-kind Arabic textual repository in a local dialect (Israeli / Palestinian). The project is part of the National Language Processing plan of Israel.

Coreference Resolution, Emotion Detection, Named Entity Recognition (NER), Sentiment Analysis, Summarization

Annotated Datasets

A Large Dataset of Dialectal Arabic Tweets. About 25K tweets that are annotated via crowdsourcing for 5 Arabic dialects: Egyptian, Maghrebi, Levantine, Gulf, and Iraqi.

Dialect Identification

IAHLT Named Entities Dataset (Arabic Subset)

Annotated Datasets

This dataset contains named entity annotations for Arabic texts from various sources, curated as part of the IAHLT multilingual NER project. The Arabic portion is provided here as a cleaned subset intended for training and evaluation in named entity recognition tasks.

Named Entity Recognition (NER)

Annotated Datasets

Arabic Summaries with Annotated Support (Arabic: أساس “foundation”) is a multi‑register Arabic summarization corpus designed to emphasize longer source texts and longer, higher‑quality summaries. Each summary sentence is paired with human validation and supporting evidence extracted verbatim from the source.

Apache License 2.0

Coreference Resolution, Emotion Detection, Named Entity Recognition (NER), Sentiment Analysis, Summarization

ASTD: Arabic Sentiment Tweets Dataset

Annotated Datasets

10,000 tweets classified as objective, subjective positive, subjective negative, and subjective mixed.

Sentiment Analysis

Arabic 100k Reviews

Annotated Datasets

Reviews with three classes from different services. The dataset combines reviews from hotels, books, movies, products and a few airlines. It has three classes (Mixed, Negative and Positive). Most were mapped from reviewers' ratings with 3 being mixed, above 3 positive and below 3 negative. Each row has a label and text separated by a tab (tsv). Text (reviews) were cleaned by removing Arabic diacritics and non-Arabic characters. The dataset has no duplicate reviews. The hotels and book reviews are a subset of HARD and BRAD. The rest were selected from hadyelsahar with a little over 100 airlines reviews collected manually.
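The rating-to-class mapping and the tab-separated row layout described above can be sketched as follows. This is a minimal illustration, not the dataset authors' code: the sample rows are invented, the function names are hypothetical, and it assumes the file has no header row and spells the labels exactly "Positive", "Negative" and "Mixed".

```python
import csv
import io
from collections import Counter


def rating_to_label(rating):
    """Map a star rating to a class: above 3 positive, below 3 negative, 3 mixed."""
    if rating > 3:
        return "Positive"
    if rating < 3:
        return "Negative"
    return "Mixed"


# Invented rows in the stated "label<TAB>text" layout (assumes no header row).
SAMPLE_TSV = "Positive\tفندق رائع\nNegative\tخدمة سيئة\nMixed\tالكتاب مقبول\n"


def read_reviews(stream):
    """Return a list of (label, text) tuples from a tab-separated stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


reviews = read_reviews(io.StringIO(SAMPLE_TSV))
label_counts = Counter(label for label, _text in reviews)
print(label_counts)
```

`csv.QUOTE_NONE` is used so that quotation marks inside review text are kept as ordinary characters rather than treated as field delimiters.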

Sentiment Analysis

Annotated Datasets

Hotel Arabic-Reviews Dataset. This dataset contains 93700 hotel reviews in Arabic language. The hotel reviews were collected from Booking.com website during June/July 2016. The reviews are expressed in Modern Standard Arabic as well as dialectal Arabic.

Sentiment Analysis

Annotated Datasets

Books Reviews in Arabic Dataset. This dataset contains 510,600 book reviews in Arabic. The reviews were collected from the GoodReads.com website during June/July 2016. This work extends the earlier large-scale Arabic dataset LABR, which has around 63K Arabic book reviews collected from GoodReads.com. The reviews are expressed mainly in Modern Standard Arabic, but some are in dialectal Arabic.

Sentiment Analysis

Large Multi-Domain Resources for Arabic Sentiment Analysis

Automatically Annotated

33K Automatically annotated Reviews in Domains of Movies, Hotels, Restaurants and Products

Sentiment Analysis

Annotated Datasets

The Arabic Jordanian General Tweets (AJGT) Corpus consists of 1,800 tweets, written in Modern Standard Arabic (MSA) or Jordanian dialect, annotated as positive or negative.

Sentiment Analysis

Annotated Datasets

An Arabic Speech-Act and Sentiment Corpus of Tweets. 21,000 tweets manually annotated for six different classes of speech-act labels.

Sentiment Analysis

Annotated Datasets, Unannotated Corpora

4.5 million tweets annotated as positive, negative, and neutral. A semi-supervised framework annotates a large Arabic text corpus by starting from a small portion of manually annotated tweets (15,000 tweets) and extending the labels to a large set of unlabeled tweets (34.7 million tweets), reducing human annotation effort and providing a middle ground between manual and automatic labeling of a large dataset.

Sentiment Analysis

Annotated Datasets

A Large-SCale Arabic Book Reviews Dataset. 63,000 book reviews in mixed dialect Arabic for sentiment analysis. Reviews with ratings of 4 or 5 are considered positive, and those with ratings of 1 or 2 are considered negative. Reviews with a rating of 3 are considered neutral and are not included in the polarity classification.

Sentiment Analysis

Automatically Annotated

Large Scale Arabic Dataset for Sentiment Analysis. 6 million mixed dialect Arabic tweets with a vocabulary of 602,721 distinct entities, annotated by emojis and a sentiment lexicon as subjective positive, subjective negative, and neutral, with dialectal tweets “translated” into MSA.

GNU General Public License v3.0

Sentiment Analysis

Annotated Datasets

Multi-domain Arabic Sentiment Corpus. 8,860 positive and negative reviews from different domains, in a variety of dialects, as well as a list of 3,880 positive and negative synsets annotated with their part of speech, polarity scores, dialects synsets and inflected forms

Sentiment Analysis

Automatically Annotated

A Multipurpose Arabic Corpus. 20,200 articles from the Omani newspaper Al Watan with summaries, named entities, part-of-speech tagging, and morphological analysis.

Morphological Analysis, Named Entity Recognition (NER), Part-of-speech Tagging, Text Summarization

Dialectal Arabic Datasets

Annotated Datasets

1,400 manually segmented and POS tagged tweets in four dialects, Egyptian, Levantine, Gulf, and Maghrebi

Apache License 2.0

Morphological Segmentation, Part-of-speech (POS) Tagging

The MADAR Arabic Dialect Corpus

Annotated Datasets

A collection of ~12,000 parallel sentences covering the dialects of 25 cities from the Arab World, in addition to English, French, and MSA.

Custom Terms of Use

Dialect Identification

Annotated Datasets

An open access NLP dataset for Arabic dialects. +50K tweets in five (5) national dialects, labeled for several applications: dialect detection, topic detection and sentiment analysis.

Dialect Identification, Sentiment Analysis, Topic Classification

Prague Arabic Dependency Treebank 1.0

Annotated Datasets

Language resource for Arabic natural language processing (NLP), a collection of parsed sentences annotated with syntactic structures.

Custom Terms of Use

Dependency Treebanks, Morphological Analysis, Morphological Segmentation, Part-of-speech (POS) Tagging, Stemming and Lemmatization

Annotated Datasets

Opinion Corpus for Lebanese Arabic Reviews. 3,900 Arabic customer reviews across a wide range of domains, including restaurants, hotels, hospitals, local shops, etc.

Sentiment Analysis

Arabic Twitter Corpus for Flood Detection

Annotated Datasets

4,037 human-labelled Arabic Twitter messages in Middle Eastern dialects, for four high-risk flood events that occurred in 2018, labelled based on relatedness to the crisis and information type.

Topic Classification, Topic Modeling

Satirical Fake News Dataset

Annotated Datasets

Scraped from two satirical news websites, Al-Hudood and Al-Ahram Al-Mexici, for training a fake news classifier.

Text Classification

Annotated Datasets

Opinion Mining: Analysis of Comments Written in Arabic Colloquial. 28,576 reviews representing the sentiments of 5,422 different reviewers across 27 categories, collected from the Jeeran website, in Saudi and Jordanian Arabic.

Sentiment Analysis

Annotated Shami Corpus

Lebanese Arabic corpus annotated for numerous morphological features and for orthography standardization.

Morphological Analysis

Annotated Datasets

Negation and Speculation in Arabic Review. 3K review sentences annotated with negation and speculation in Egyptian dialect.

Sentiment Analysis

Dialectal Arabic Code-Switching Dataset

Annotated Datasets

Transcribed audio in Egyptian dialect annotated at word-level for Code Switching (CS).

Dialect Identification

Annotated Datasets

The Bangor Arabic–English Code-switching (BAEC) corpus. 45,251 words manually annotated for code-switching between Saudi, Egyptian and MSA Arabic and English.

Dialect Identification

Annotated Datasets

Arabic COVID-19 Multi-label Fake News & Hate Speech Detection Dataset. 10,828 mixed dialect Arabic tweets annotated with 10 different labels concerning fake news and hate speech.

CC BY-NC-SA 4.0

Content Moderation, Text Classification

Annotated Datasets

A Levantine Twitter Dataset for Hate Speech and Abusive Language. 5,846 Syrian/Lebanese political tweets labeled as normal, abusive or hate

Content Moderation, Text Classification

Annotated Datasets

An Arabic Levantine Twitter Dataset for Misogynistic Language. 6,603 tweets in Levantine Arabic annotated as either non-misogynistic or one of seven misogynistic language categories.

Content Moderation, Text Classification

Annotated Datasets

Arabic Offensive Comments dataset from Multiple Social Media Platforms. Annotated social media comment dataset with (not) offensive language tags for Arabic social media comments collected from three different online platforms: Twitter, Facebook and YouTube.

Apache License 2.0

Content Moderation, Text Classification

A Corpus of Offensive Language in Arabic

Annotated Datasets

16,000 comments on YouTube videos from different nationalities annotated for offensive language.

Content Moderation, Text Classification

Religious Hate Speech Detection for Arabic Tweets

Annotated Datasets

Tweets in MSA and dialectal Arabic annotated for hate speech. The training dataset contains 5,569 examples, while the testing dataset contains 567 examples.

Content Moderation, Text Classification

Annotated Datasets

Bilingual (Arabic/English) COVID-19 Twitter dataset for misleading information detection. An automatically annotated Arabic/English COVID-19 Twitter dataset that uses information shared on the official websites and Twitter accounts of the WHO, UNICEF, and UN as a source of reliable information. Tweets were annotated using 13 different machine learning algorithms and 7 different feature extraction techniques.

Content Moderation, Text Classification

Adult Content Detection on Arabic Twitter

Annotated Datasets

16,000 comments on YouTube videos from different nationalities annotated for offensive language.

Content Moderation, Text Classification

Fine-Grained Hate Speech Detection on Arabic Twitter

Annotated Datasets

12,700 tweets in mixed dialect Arabic, no bias towards specific topics, genres, or dialects, each judged by 3 annotators for offensiveness classified into one of the hate speech types: Race, Religion, Ideology, Disability, Social Class, and Gender, and also judged whether a tweet has vulgar language or violence.

Content Moderation, Text Classification

Annotated Datasets

An Arabic COVID-19 Twitter dataset for misinformation detection. Contains 138 verified claims, mostly from popular fact-checking websites, and 9.4K tweets relevant to those claims, manually annotated by veracity to support research on misinformation detection.

Content Moderation, Text Classification

Journalist Questions on Twitter

Annotated Datasets

10,000 mixed dialect Arabic tweets manually annotated for question type.

Question Classification

AraFacts Dataset

Annotated Datasets

An Arabic COVID-19 Twitter dataset for misinformation detection. Contains 138 verified claims, mostly from popular fact-checking websites, and 9.4K tweets relevant to those claims, manually annotated by veracity to support research on misinformation detection.

Content Moderation, Text Classification

Annotated Datasets

A Dialectal Arabic Irony Corpus Extracted from Twitter. 5,588 tweets, written in both MSA and mixed dialectal Arabic, manually annotated for irony by two professional linguists from HBKU.

Sentiment Analysis

Annotated Datasets

Irony Detection in Arabic Tweets. ~5.5k mixed dialect Arabic tweets annotated by two native Arabic speakers, plus another 5.5k tweets randomly sampled from the original unannotated corpus.

Sentiment Analysis

Annotated Datasets

A Dataset of Intended Sarcasm. Dataset of tweets in Arabic and English labeled for sarcasm directly by their authors.

Sentiment Analysis

Annotated Datasets

Arabic COVID-19 Sentiment and Sarcasm Detection Dataset. Manually annotated multi-label Arabic COVID-19 Sentiment and Sarcasm Detection Dataset. The dataset contains 5,162 annotated tweets.

CC BY-NC-SA 4.0

Sentiment Analysis

Annotated Datasets

Egyptian Arabic SMS/Chat and Transliteration. 1,856 naturally occurring Arabizi conversations transliterated from the original romanized Arabizi script into standard Arabic orthography.

Custom Terms of Use

Transliteration

Annotated Datasets

Arabic collection of written medicine information annotated with readability levels. Contains 4476 sentences with over 61k words, extracted from 94 sources of Arabic written medicine information, annotated and assigned a readability level by a panel of health-care professionals.

CC BY-NC-SA 4.0

Readability Assessment

Annotated Datasets

The Arabic-PADT UD treebank is based on the Prague Arabic Dependency Treebank (PADT), created at the Charles University in Prague. The treebank consists of 7,664 sentences (282,384 tokens) and its domain is mainly newswire.

CC BY-NC-SA 3.0

Dependency Treebanks, Morphological Analysis, Morphological Segmentation, Part-of-speech (POS) Tagging, Stemming and Lemmatization

Annotated Datasets

Arabic diacritization corpus. A collection of vocalized Arabic texts covering modern and classical Arabic. It contains over 75 million fully vocalized words obtained from 97 books, structured in text files. The corpus was collected mostly from Islamic classical books [14] using a semi-automatic web crawling process. Modern Standard Arabic texts crawled from the Internet represent 1.15% of the corpus, about 867,913 words, while most of the corpus comes from the Shamela Library, which represents 98.85%, with 74,762,008 words contained in 97 books.
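The corpus shares quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the two word counts given in the description; the variable names are chosen for this example.

```python
# Word counts as stated in the corpus description.
msa_words = 867_913
shamela_words = 74_762_008

total_words = msa_words + shamela_words

# Percentage share of each source, rounded to two decimals.
msa_share = round(100 * msa_words / total_words, 2)
shamela_share = round(100 * shamela_words / total_words, 2)

print(total_words, msa_share, shamela_share)
```

The two counts sum to a little over 75 million words, which agrees with the total given in the description.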

Diacritization/Vocalization

Arabic Sentiment Analysis

Annotated Datasets, Unannotated Corpora

36K tweets labeled as positive or negative using distant supervision and self-training. 8K tweets were manually annotated as a gold standard. The corpus was evaluated intrinsically by comparing it to human classification and pre-trained sentiment analysis models, and extrinsically on a sentiment analysis task, achieving an accuracy of 86%.

Apache License 2.0

Sentiment Analysis

Arabic Sentiment Analysis and Cross-lingual Sentiment Resources

Annotated Datasets, Unannotated Corpora

(1) BBN Blog Posts Sentiment Corpus - A random subset of 1,200 Levantine dialectal sentences, annotated for sentiment (both manual and automatic). (2) Syria Tweets Sentiment Corpus - A dataset of 2,000 tweets originating from Syria, annotated for sentiment (both manual and automatic).

Custom Terms of Use

Sentiment Analysis

Annotated Datasets

A question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The Arabic portion contains 15,645 question-answer pairs.

Apache License 2.0

Question Answering (QA)

A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi) - Lebanese dialect Arabizi sentiment lexicon, sentiment annotated datasets, and a Facebook corpus.

Sentiment Analysis

Evaluation, Word Embeddings

A testbank of word pairs for six syntactic and semantic relations across five important Arabic dialects was created, and used to evaluate a set of existing and new Arabic word embeddings.

ivrit.ai Community-Driven Transcription Project

Collaborative Projects

Recorded Speech and Audio Corpora

This project harnesses the collaborative efforts of volunteers to transcribe audio recordings in Hebrew.

Speech Recognition, Speech Synthesis, Text-to-Speech (TTS), Speech-to-Text (STT)

Recital: ivrit.ai Community-Driven Recording Project

Collaborative Projects

Recorded Speech and Audio Corpora

This project leverages the collaborative efforts of volunteers to record and contribute spoken Hebrew texts, building an open and diverse speech corpus.

Speech Recognition, Speech Synthesis, Text-to-Speech (TTS), Speech-to-Text (STT)

Annotated Datasets

A bilingual English-Arabic Question Answering corpus built on top of WIKIQA.

Question Answering (QA)

A large open lexicon for the Palestinian Arabic dialect. Maknuune has over 36K entries from 17K lemmas, and 3.7K roots. All entries include diacritized Arabic orthography, phonological transcription and English glosses. Some entries are enriched with additional information.

Diacritization/Vocalization, Transliteration

Arabizi-Transliteration Corpus

Dictionaries & Word Lists

The first large-scale "Arabizi to Arabic script" parallel corpus focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure highest quality, taken from Twitter, Facebook and ASK

Transliteration

Arabic Stop Words

Dictionaries & Word Lists

A list of ~750 possible stop words in Arabic.

Stopwords Removal

Buckwalter’s list of Arabic roots

Dictionaries & Word Lists

Stopwords Removal

Nile University's Arabic sentiment Lexicon. Egyptian Arabic and Modern Standard Arabic sentiment words and their polarity. Available for research; commercial use requires the author's permission.

Sentiment Analysis

Annotated Datasets

A Dataset for Arabic Why Question Answering System. 3,205 why question-answer pairs scraped from public Arabic websites

Question Answering (QA)

Automatically Annotated

17,000+ Arabic Questions & Answers dataset - 17,000+ questions, collected via fully automated data collector on a set of Arabic Wikipedia articles for extractive question answering task.

Question Answering (QA)

Annotated Datasets

Wikipedia open-domain Question Answering. 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of SQuAD.

Question Answering (QA)

300 documents annotated for entity recognition.

Named Entity Recognition (NER)

Arabic-Tashkeela-Model

Models and Tools

Other Models and Tools

A diacritization model for Arabic language. This model was built/trained using the Tashkeela: the Arabic diacritization corpus on Kaggle.

Diacritization/Vocalization

Models and Tools

Fine-Tuned Language Models, Pre-Trained Language Models

Transformer-based Model for Arabic Language Understanding.

Named Entity Recognition (NER), Question Answering (QA), Sentiment Analysis

Models and Tools

Fine-Tuned Language Models, Pre-Trained Language Models

A collection of pre-trained models for Arabic NLP tasks. The models were fine-tuned for Sentiment Analysis, Dialect Identification, Poetry Classification, NER, POS Tagging.

Dialect Identification, Named Entity Recognition (NER), Part-of-speech (POS) Tagging, Sentiment Analysis, Text Classification

Spark NLP for Arabic

Models and Tools, Word Embeddings

Fine-Tuned Language Models, Other Models and Tools, Pipelines/Parsers, Pre-Trained Language Models

45 pre-trained models covering translation, word and sentence embeddings, named entity recognition (NER), stop word removal, part-of-speech (POS) tagging, and lemmatization.

Machine Translation, Named Entity Recognition (NER), Part-of-speech (POS) Tagging, Stemming and Lemmatization, Stopwords Removal

Models and Tools

Fine-Tuned Language Models, Pre-Trained Language Models

An Arabic language representation model, pretrained using the replaced token detection objective on large Arabic text corpora. ARAELECTRA’s performance is validated on three Arabic NLP tasks i.e. question answering (QA), sentiment analysis (SA) and named-entity recognition (NER).

Custom Terms of Use

Named Entity Recognition (NER), Question Answering (QA), Sentiment Analysis

Models and Tools

Causal Language Models (CLM), Pre-Trained Language Models

Pre-Trained Transformer for Arabic Language Generation.

Custom Terms of Use

Language Generation

Models and Tools

Pre-Trained Language Models

The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text.

Apache License 2.0

ARBERT & MARBERT

Models and Tools

Fine-Tuned Language Models, Pre-Trained Language Models

A large-scale pre-trained masked language model focused on both Dialectal Arabic (DA) and MSA; fine-tuned on ArBench: Sentiment Analysis, Social Meaning, Topic Classification, Dialect Identification, Named Entity Recognition.

Content Moderation, Dialect Identification, Emotion Detection, Named Entity Recognition (NER), Sentiment Analysis, Topic Classification

Models and Tools

Other Models and Tools

A root-based stemmer (heavy stemming; root extractor; rule-based). The algorithm was widely used in Arabic IR. It first removes the longest prefixes and suffixes from an inflected word. The resulting word is then matched against predefined patterns and list-driven roots, with the selected pattern depending on the length of the remaining word. Finally, the extracted root is compared to a list of roots to check its validity.
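The three stages described above (affix removal, length-based pattern matching, root validation) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the stemmer's actual implementation: the affix lists, patterns, and root list are tiny hypothetical samples, and the function names are invented for the example.

```python
# Toy resources (hypothetical samples, far smaller than any real stemmer's lists).
PREFIXES = ["وال", "بال", "فال", "ال"]  # checked in order, longest first
SUFFIXES = ["ون", "ين", "ات", "ة"]      # checked in order, longest first
PATTERNS = {4: ["فاعل"], 5: ["مفعول"]}   # patterns grouped by stem length
ROOTS = {"كتب", "درس", "لعب"}            # list used to validate extracted roots


def strip_affixes(word):
    """Remove the first matching prefix and suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: len(word) - len(suffix)]
            break
    return word


def match_pattern(stem, pattern):
    """Return the letters aligned with the ف/ع/ل slots, or None on mismatch."""
    if len(stem) != len(pattern):
        return None
    root = []
    for stem_char, pattern_char in zip(stem, pattern):
        if pattern_char in "فعل":
            root.append(stem_char)      # slot letter: part of the root
        elif stem_char != pattern_char:
            return None                 # fixed pattern letter does not match
    return "".join(root)


def extract_root(word):
    """Strip affixes, match a pattern by length, then validate against ROOTS."""
    stem = strip_affixes(word)
    if len(stem) == 3:
        candidates = [stem]
    else:
        candidates = [match_pattern(stem, p) for p in PATTERNS.get(len(stem), [])]
    for root in candidates:
        if root in ROOTS:
            return root
    return None


for sample in ["الكاتب", "مكتوب", "والدارسون"]:
    print(sample, extract_root(sample))
```

Words whose stem matches no pattern, or whose extracted letters are not in the root list, return `None` in this sketch.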

Apache License 2.0

Stemming and Lemmatization

Models and Tools

Other Models and Tools

A tokenizer and stemmer for Arabic based on Lucene's UTF-8 tokenizer and ArabicStemmer. The ArabicStemmer is Lucene's implementation of Larkey’s light stemmer Light10.

Apache License 2.0

Stemming and Lemmatization, Tokenization

Models and Tools

Other Models and Tools

Buckwalter Arabic Morphological Analyzer Version 2.0. A stem-based morphological analyzer (stemmer).

Custom Terms of Use

Morphological Analysis, Stemming and Lemmatization

Models and Tools

Other Models and Tools

Standard Arabic Morphological Analyzer (SAMA) Version 3.1. The LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 is based on, and updates, Buckwalter Arabic Morphological Analyzer (BAMA) 2.0. SAMA is a software tool for the morphological analysis of Standard Arabic. It considers each Arabic word token in all possible prefix-stem-suffix segmentations, and lists all known/possible annotation solutions, with assignment of all diacritic marks, morpheme boundaries (separating clitics and inflectional morphemes from stems), and all Part-of-Speech (POS) labels and glosses for each morpheme segment. The generated output may then be reviewed by users, and the most appropriate annotation selected from among several choices. The input format, output format, and data layer of SAMA 3.1 were designed to be backward compatible with BAMA. Incremental changes to the data layer in SAMA have resulted in: 1) increased lexicon coverage in the dictionary files, 2) important changes and additions to the inventory of POS tags, and 3) more possible solutions generated for numerous word forms.

Custom Terms of Use

Morphological Analysis, Stemming and Lemmatization

Models and Tools

Other Models and Tools

Arabic light stemmer and segmenter. It mainly supports light stemming (removing prefixes and suffixes) and uses a modified finite-state automaton to extract all possible affixes from a word and generate all possible segmentations. To extract the stem, Tashaphyne removes the longest affix from the word; the affixes can then be validated against a list of valid affixes.
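To make the idea of "all possible segmentations" concrete, the sketch below enumerates every (prefix, stem, suffix) split allowed by two small affix lists and then picks the split with the longest affixes. It is plain Python written for illustration and does not use or reproduce the Tashaphyne API; the affix lists, the minimum stem length, and the function names are assumptions.

```python
# Hypothetical affix lists; the empty string means "no affix".
PREFIXES = ["", "و", "ال", "وال"]
SUFFIXES = ["", "ات", "ها", "ون"]
MIN_STEM_LENGTH = 2  # assumed lower bound on what may remain as a stem


def segmentations(word):
    """Return every (prefix, stem, suffix) split permitted by the affix lists."""
    results = []
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in SUFFIXES:
            if not rest.endswith(suffix):
                continue
            stem = rest[: len(rest) - len(suffix)]
            if len(stem) >= MIN_STEM_LENGTH:
                results.append((prefix, stem, suffix))
    return results


def light_stem(word):
    """Pick the split with the longest affixes, i.e. the shortest stem."""
    candidates = segmentations(word)
    if not candidates:
        return word
    return min(candidates, key=lambda seg: len(seg[1]))[1]


for prefix, stem, suffix in segmentations("الطالبات"):
    print(repr(prefix), repr(stem), repr(suffix))
print(light_stem("الطالبات"))
```

Because only affixes from the lists are ever tried, every segmentation produced here is already validated against the list of allowed affixes.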

Morphological Segmentation, Stemming and Lemmatization

Models and Tools

Other Models and Tools

Assem's Arabic Light Stemmer is a snowball-based stemming algorithm for Arabic aimed mainly to improve search. Assem stemmer is fast and can be generated in many programming languages through Snowball (a small string processing language designed for creating stemming algorithms to be used in IR systems). Assem stemmer offers light stemming and text normalization. It can be configured to run as root extractor or stemmer, but in two separate packages, because the Snowball framework does not support stemming and rooting at the same time.

Custom Terms of Use

Stemming and Lemmatization, Text Normalization

Models and Tools

Other Models and Tools

Motaz stemmer provides both root extraction and light stemming. The root extraction part is an implementation of Khoja stemmer with the only difference being using another stopwords list. For the light stemming part, it is an implementation of the Light10 Arabic light stemming algorithm proposed by Larkey and colleagues. Before applying the Light10 algorithm, Motaz stemmer normalizes the input word by removing diacritics, replacing all the forms of Hamza with ا, replacing ة with ه and replacing ى with ي.
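The normalization step described above can be sketched with a regular expression and a translation table. This is an illustrative reading of the description, not the stemmer's own code. It assumes that "diacritics" means the marks from U+064B to U+0652 and that "all the forms of Hamza" means the three alef-with-hamza/madda letters; whether other hamza carriers are also replaced is not stated in the description. The name `normalize` is hypothetical.

```python
import re

# Assumed diacritic range: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")

# Letter replacements, written as escapes to keep code points unambiguous.
CHAR_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0629": "\u0647",  # teh marbuta           -> heh
    "\u0649": "\u064A",  # alef maksura          -> yeh
})


def normalize(word):
    """Remove diacritics, then unify alef forms, teh marbuta and alef maksura."""
    return DIACRITICS.sub("", word).translate(CHAR_MAP)


print(normalize("مَدْرَسَةٌ"))
print(normalize("إلى"))
```

Normalizing before stemming means that spelling variants of the same word reach the affix-stripping rules in a single form.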

Apache License 2.0

Stemming and Lemmatization, Text Normalization

Al-Stem (Darwish)

Models and Tools

Other Models and Tools

Al-stem is a light stemmer that chops off the following prefixes, in order from right to left (وال، فال، بال، بت، يت، لت، مت، وت، ست، نت، بم، لم، وم، كم، فم، ال، لل، في، وا، وا، فا، لا،با), as well as the following suffixes, also from right to left (ات، وا، ون، وه، ان، تي، ته، تم، كم، هم، هن، ها، ية، تك، نا، ين، يه، ة، هـ، ي، ا). Darwish and Oard used Al-stem in their experiment to develop a technique for Arabic-English cross-language information retrieval at TREC 2002. Cross-language IR means that the query is written in a language different from that of the documents. Later, Al-Stem was modified by David Graff from the Linguistic Data Consortium (LDC) to strip off the suffixes (تا and ا) and the prefixes (سي and تت) from the list of suffixes in Al-Stem.

Stemming and Lemmatization

Sebawai (Darwish)

Models and Tools

Other Models and Tools

A root-based analyzer based on automatically derived rules and statistics. Sebawai has two main modules. The first module constructs a list of “word-root” pairs using a morphological analyzer called ALPNET. It then extracts a list of prefixes, suffixes and stem templates, and estimates the probability that each prefix, suffix or stem template would occur. The second module takes a word and produces the possible combinations of prefixes, suffixes and templates. These combinations are obtained by removing prefixes and suffixes from the word and then comparing all the resulting stems to the templates. As a result, a list of ranked roots is produced. These roots are matched automatically against the list of the 10,000 roots extracted from an electronic copy of Lisan Al-Arab to confirm their existence.
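The second module's behaviour, generating affix and template combinations, ranking the resulting roots, and confirming them against a root list, might look roughly like the sketch below. Everything concrete here is an assumption made for illustration: the probability values are invented, the score is taken to be the product of the three probabilities (the description does not say how they are combined), and the root list is a three-item stand-in for the 10,000-root list.

```python
# Invented probabilities for a few affixes and templates (illustration only).
PREFIX_P = {"": 0.6, "و": 0.1, "ال": 0.3}
SUFFIX_P = {"": 0.7, "ات": 0.2, "ة": 0.1}
TEMPLATE_P = {"فعل": 0.5, "فاعل": 0.4, "مفعول": 0.1}
KNOWN_ROOTS = {"ولد", "كتب", "درس"}  # tiny stand-in for the real root list


def match_template(stem, template):
    """Return the letters in the ف/ع/ل slots, or None if the stem does not fit."""
    if len(stem) != len(template):
        return None
    root = []
    for stem_char, template_char in zip(stem, template):
        if template_char in "فعل":
            root.append(stem_char)
        elif stem_char != template_char:
            return None
    return "".join(root)


def candidate_roots(word):
    """Score every prefix/suffix/template combination; best score first."""
    scores = {}
    for prefix, p_prefix in PREFIX_P.items():
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix, p_suffix in SUFFIX_P.items():
            if not rest.endswith(suffix):
                continue
            stem = rest[: len(rest) - len(suffix)]
            for template, p_template in TEMPLATE_P.items():
                root = match_template(stem, template)
                if root is None:
                    continue
                score = p_prefix * p_suffix * p_template  # assumed combination rule
                scores[root] = max(score, scores.get(root, 0.0))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def ranked_valid_roots(word):
    """Keep only the ranked candidates that appear in the known root list."""
    return [(root, score) for root, score in candidate_roots(word) if root in KNOWN_ROOTS]


print(candidate_roots("والد"))
print(ranked_valid_roots("والد"))
```

For the sample word, two candidate roots are produced from different splits, and the root-list check discards the one that is not a known root.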

Stemming and Lemmatization

Models and Tools

Other Models and Tools

A diacritizer, POS-Tagger, root extractor, stemmer, lemmatizer, and morphosyntactic analyzer.

Apache License 2.0

Diacritization/Vocalization, Morphological Analysis, Part-of-speech (POS) Tagging, Stemming and Lemmatization

Models and Tools

Pipelines/Parsers

MADAMIRA is a morphological analyzer that provides tokenization, part-of-speech tagging, morphological disambiguation for the full range of morphological features, lemmatization, diacritization, named entity recognition, and base phrase chunking.

Custom Terms of Use

Diacritization/Vocalization, Morphological Analysis, Named Entity Recognition (NER), Part-of-speech (POS) Tagging, Stemming and Lemmatization, Tokenization

Models and Tools

Other Models and Tools

CAMeL Tools is a suite of open-source Arabic natural language processing tools in Python, developed by the CAMeL Lab at New York University Abu Dhabi. It currently provides utilities for pre-processing, morphological modeling, dialect identification, named entity recognition, and sentiment analysis.

Diacritization/Vocalization, Dialect Identification, Morphological Analysis, Named Entity Recognition (NER), Part-of-speech (POS) Tagging, Sentiment Analysis, Stemming and Lemmatization, Text Normalization, Tokenization, Transliteration

Models and Tools

Pipelines/Parsers

ElixirFM is a functional morphological analyzer that utilizes syntactic features to distinguish a word's sense. ElixirFM uses the correlation between Arabic grammar and morphology to improve the root extraction process; it uses Prague Arabic Dependency Treebank (PADT) to provide annotated syntactic features associated with stem dictionary (ElixirFM lexicon) for additional morphological knowledge. The lexicon of ElixirFM is derived from the open source Buckwalter lexicon.

Dependency Treebanks, Morphological Analysis, Morphological Inflection, Part-of-speech (POS) Tagging, Stemming and Lemmatization, Transliteration

Models and Tools

Other Models and Tools

Analyzer for Dialectal Arabic Morphology. ADAM is built based on the SAMA database, and can analyze both Egyptian and Levantine dialects.

Custom Terms of Use

Morphological Analysis, Part-of-speech (POS) Tagging, Stemming and Lemmatization

Models and Tools

Other Models and Tools

An Arabic Morphological Analyzer (Including Stemming and Root Extraction) and Part-Of-Speech Tagger as an Expert System. Qutuf aims to be the core of a framework for Arabic NLP. It identifies and implements some new concepts, such as First Normalization and Second Normalization text forms in the preprocessing phase, and Premature and Overdue Tagging in the part-of-speech tagging task. The POS tagging is designed and implemented as a rule-based expert system, using a POS tagset built on a morphological feature tagset. Morphological analysis includes both stemming (light stemming) and root extraction (heavy stemming), achieved using finite state automata and agreement rules developed for cliticization parsing. Qutuf also uses the AlKhalil Morpho Sys open source database, after enriching it, for root extraction, pattern matching, morphological feature and POS assignment, and closed nouns. See also online: qutuf.com

Apache License 2.0

Morphological Analysis, Part-of-speech (POS) Tagging, Stemming and Lemmatization, Text Normalization

Want to suggest a new resource? Email us at hilla@webiks.com to help improve our platform!
