Hebrew and Arabic NLP Resources
Legal HeBERT
Models and Tools
Pre-Trained Language Models, Fine-Tuned Language Models
A BERT model for the Hebrew legal and legislative domains, intended to advance legal NLP research and tool development in Hebrew. Two versions are available: (1) HeBERT fine-tuned on legal and legislative documents, and (2) a BERT model trained from scratch following HeBERT's architecture guidelines.
Language
Hebrew
License
Unknown
Task
Language Modeling, Text Classification
Legal HeBERT
Neural Sentiment Analyzer for Modern Hebrew
Models and Tools
Other Models and Tools
This code and dataset provide an established benchmark for neural sentiment analysis for Modern Hebrew.
Language
Hebrew
License
MIT
Task
Sentiment Analysis
Neural Sentiment Analyzer for Modern Hebrew
AlephBERT
Models and Tools
Pre-Trained Language Models
A large, publicly available pre-trained language model for Modern Hebrew, pre-trained on OSCAR, Hebrew tweets, and all of Hebrew Wikipedia. Published by the OnlpLab team.
Language
Hebrew
License
Apache License 2.0
Task
Morphological Analysis, Named Entity Recognition (NER), Part-of-speech (POS) Tagging, Sentiment Analysis
AlephBERT
HeBERT
Models and Tools
Pre-Trained Language Models
HeBERT is a Hebrew pre-trained language model for polarity analysis and emotion recognition, published by Dr. Inbal Yahav Shenberger and Avichay Chriqui. It is based on Google's BERT architecture and uses the BERT-Base configuration. HeBERT was trained on three datasets: OSCAR, a Hebrew dump of Wikipedia, and Emotion User Generated Content (UGC) data collected for the purpose of this study. The model was evaluated on two downstream tasks: emotion recognition (the HebEMO model) and sentiment analysis.
Language
Hebrew
License
MIT
Task
Emotion Detection, Sentiment Analysis
HeBERT
AlephBERTGimmel
Models and Tools
Pre-Trained Language Models
A Hebrew pre-trained language model trained on the same dataset as AlephBERT, the previous SOTA Hebrew PLM (approximately 2 billion words of text), but with a substantially larger vocabulary of 128,000 word pieces. Published as a collaboration between the OnlpLab team and Dicta.
Language
Hebrew
License
CC0 1.0
Task
Morphological Analysis, Morphological Segmentation, Named Entity Recognition (NER), Part-of-speech (POS) Tagging, Sentiment Analysis, Tokenization
AlephBERTGimmel
Hebrew Psychological Lexicons (Code)
Models and Tools
Other Models and Tools
An easy-to-use Python interface for Hebrew clinical psychology text analysis. Useful for various psychology applications such as detecting emotional state, well-being, and relationship quality in conversation, identifying topics (e.g., family, work), and many more.
Language
Hebrew
License
Apache License 2.0
Task
Emotion Detection, Topic Classification
Hebrew Psychological Lexicons (Code)
DictaBERT
Models and Tools
Pre-Trained Language Models
A pre-trained BERT model for Modern Hebrew, trained with the masked-language-modeling objective.
Language
Hebrew
License
CC BY 4.0
DictaBERT
DictaBERT-seg
Models and Tools
Fine-Tuned Language Models
A fine-tuned model for the prefix segmentation task.
Language
Hebrew
License
CC BY 4.0
Task
Morphological Segmentation
DictaBERT-seg
DictaBERT-morph
Models and Tools
Fine-Tuned Language Models
A fine-tuned model for the morphological tagging task.
Language
Hebrew
License
CC BY 4.0
Task
Morphological Analysis, Part-of-speech (POS) Tagging
DictaBERT-morph
BGU NLP - LemLDA: an LDA Package for Hebrew
Models and Tools
Other Models and Tools
The package is based on Heinrich's Java implementation of collapsed Gibbs sampling, with an extra variable to model the generative nature of lemmas in Hebrew.
Language
Hebrew
License
GPL
Task
Topic Modeling
BGU NLP - LemLDA: an LDA Package for Hebrew
Neural Modeling for Named Entities and Morphology (NEMO2)
Models and Tools
Other Models and Tools
OnlpLab's code and models for neural modeling of Hebrew NER. Described in the TACL paper Neural Modeling for Named Entities and Morphology (NEMO2).
Language
Hebrew
License
Apache License 2.0
Task
Named Entity Recognition (NER)
Neural Modeling for Named Entities and Morphology (NEMO2)
MDTEL (Code)
Models and Tools
Other Models and Tools
Yonatan Bitton's code that recognizes medical entities in a Hebrew text.
Language
Hebrew
License
Unknown
Task
Named Entity Recognition (NER)
MDTEL (Code)
HebSpacy
Models and Tools
Pipelines/Parsers
A custom spaCy pipeline for Hebrew text including a transformer-based multitask NER model that recognizes 16 entity types in Hebrew, including GPE, PER, LOC and ORG.
Language
Hebrew
License
MIT
Task
Named Entity Recognition (NER)
HebSpacy
HebSafeHarbor
Models and Tools
Pipelines/Parsers
A de-identification toolkit for clinical text in Hebrew. Demo: https://hebsafeharbor-demo.azurewebsites.net/
Language
Hebrew
License
MIT
Task
Named Entity Recognition (NER), Temporal Information Extraction
HebSafeHarbor
HebSafeHarbor – CLALIT Validation
Models and Tools
Other Models and Tools
A de-identification toolkit for clinical text in Hebrew. An improved version of Microsoft's HebSafeHarbor project.
Language
Hebrew
License
MIT
Task
Named Entity Recognition (NER), Temporal Information Extraction
HebSafeHarbor – CLALIT Validation
Text Fabric
Models and Tools
Other Models and Tools
A Python package for browsing and processing ancient corpora, focused on the Hebrew Bible Database.
Language
Hebrew
License
CC BY-NC 4.0
Task
Optical Character Recognition (OCR)
Text Fabric
word2word (Code)
Models and Tools
Other Models and Tools
Easy-to-use Python interface for accessing top-k word translations and for building a new bilingual lexicon from a custom parallel corpus.
Language
Arabic, Hebrew
License
Apache License 2.0
Task
Machine Translation
word2word (Code)
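As a rough illustration of what "building a bilingual lexicon from a parallel corpus" and "top-k word translations" can mean, the sketch below counts sentence-level co-occurrences and ranks candidates with the Dice coefficient. This is not word2word's API or its actual scoring method; the function name `build_lexicon`, the toy sentence pairs, and the choice of Dice are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Toy English-Hebrew sentence pairs (hypothetical data, whitespace-tokenized).
PAIRS = [
    ("cat eats", "חתול אוכל"),
    ("dog eats", "כלב אוכל"),
    ("cat sleeps", "חתול ישן"),
]


def build_lexicon(pairs, k=1):
    """Return {source_word: [top-k target words]} ranked by Dice coefficient."""
    src_count, tgt_count, co_count = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in pairs:
        # Count each word once per sentence pair.
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        co_count.update(product(src_words, tgt_words))

    candidates = {}
    for (src, tgt), both in co_count.items():
        # Dice coefficient: 2 * |A and B| / (|A| + |B|).
        dice = 2 * both / (src_count[src] + tgt_count[tgt])
        candidates.setdefault(src, []).append((dice, tgt))

    # Sort by descending score, breaking ties alphabetically for determinism.
    return {
        src: [tgt for _, tgt in sorted(cands, key=lambda c: (-c[0], c[1]))[:k]]
        for src, cands in candidates.items()
    }


lexicon = build_lexicon(PAIRS, k=1)
```

Raw co-occurrence scores like this tend to favor frequent words on larger corpora, which is one reason real tools apply further corrections.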
Universal Language Model Fine-tuning for Text Classification (ULMFiT) in Hebrew
Models and Tools
Fine-Tuned Language Models, Multilingual Models
The weights (i.e., a trained model) for a Hebrew version of Howard and Ruder's ULMFiT model. Trained on the Hebrew Wikipedia corpus.
Language
Hebrew
License
Unknown
Task
Text Classification
Universal Language Model Fine-tuning for Text Classification (ULMFiT) in Hebrew
TaatikNet
Models and Tools
Fine-Tuned Language Models
Sequence-to-sequence learning for Hebrew transliteration (converting between Hebrew text and Latin transliteration).
Language
Hebrew
License
CC BY-SA 3.0
Task
Transliteration
TaatikNet
SPMRL to UD
Models and Tools
Other Models and Tools
Converts YAP's output from the SPMRL scheme to UD v2.
Language
Hebrew
License
Apache License 2.0
SPMRL to UD
Hebrew GPT neo
Models and Tools
Causal Language Models (CLM)
Doron Adler's Hebrew text generation model based on EleutherAI's gpt-neo.
Language
Hebrew
License
MIT
Task
Language Generation
Hebrew GPT neo
DICTA
Commercial and Online Services
Analytical tools for Jewish texts. They also have a GitHub organization: https://github.com/Dicta-Israel-Center-for-Text-Analysis.
Language
Hebrew
License
CC BY-SA 4.0
Task
Diacritization/Vocalization, Optical Character Recognition (OCR), Text Classification
DICTA
wordfreq 3.0.3
Commercial and Online Services
A Python library for looking up the frequencies of words in 44 languages, including Hebrew. The Hebrew data is based on Wikipedia, OPUS OpenSubtitles 2018 and SUBTLEX, Google Books Ngrams 2012, Web text from OSCAR and Twitter.
Language
Hebrew
License
MIT
wordfreq 3.0.3
Eyfo
Commercial and Online Services
A commercial engine for search and entity tagging in Hebrew.
Language
Hebrew
License
Unknown
Task
Named Entity Recognition (NER)
Eyfo
Melingo's ICA (Intelligent Content Analysis)
Commercial and Online Services
A text analysis and textual categorized entity extraction API for Hebrew, Arabic and Farsi texts.
Language
Hebrew
License
Unknown
Task
Morphological Analysis, Named Entity Recognition (NER), Sentiment Analysis, Text Classification
Melingo's ICA (Intelligent Content Analysis)
Genius
Commercial and Online Services
Automatic analysis of free text in Hebrew.
Language
Hebrew
License
Unknown
Genius
AlmaReader
Commercial and Online Services
Online text-to-speech service for Hebrew.
Language
Hebrew
License
Unknown
Task
Text-to-Speech (TTS)
AlmaReader
Amnon The Transcriber
Commercial and Online Services
A WhatsApp bot that receives a voice note and transcribes it to text.
Language
Hebrew
License
Unknown
Task
Speech-to-Text (STT)
Amnon The Transcriber
Callee
Commercial and Online Services
A WhatsApp bot that receives a voice note and transcribes it to text.
Language
Hebrew
License
Unknown
Task
Speech-to-Text (STT), Text Summarization
Callee
Verbit
Commercial and Online Services
Transcription.
Language
Arabic, Hebrew
License
Unknown
Task
Speech-to-Text (STT)
Verbit
Text Analytics for health containers
Commercial and Online Services
Language
Arabic, Hebrew
License
Unknown
Text Analytics for health containers
Hebrew-NLP
Commercial and Online Services
Language
Hebrew
License
Unknown
Task
Morphological Analysis, Morphological Segmentation, Named Entity Recognition (NER), Stemming and Lemmatization, Text Normalization, Tokenization
Hebrew-NLP
LightTag
Annotation Tools
A tool for managing annotation projects. Handles right-to-left and part-of-word marking. Tutorial video: https://www.youtube.com/watch?v=eTlrTC_n_yg
Language
Arabic, Hebrew
License
Unknown
LightTag
Recogito
Annotation Tools
A tool for linked data annotation.
Language
Arabic, Hebrew
License
Apache License 2.0
Recogito
Doccano
Annotation Tools
An open-source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequence-to-sequence tasks, so you can create labeled data for sentiment analysis, named entity recognition, text summarization and so on.
Language
Arabic, Hebrew
License
MIT
Task
Named Entity Recognition (NER), Sentiment Analysis, Text Classification, Text Summarization
Doccano
Hebrew SimLex-999
Evaluation
A Hebrew version of the Simlex-999 resource for the evaluation of models that learn the meaning of words and concepts.
Language
Hebrew
License
Unknown
Hebrew SimLex-999
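SimLex-style resources are commonly used by correlating a model's similarity scores for word pairs with the human ratings. The sketch below shows that evaluation loop using cosine similarity and Spearman rank correlation. The vectors, word pairs and ratings are made-up toy values (not taken from Hebrew SimLex-999), and the function names are hypothetical.

```python
import math

# Hypothetical 2-d word vectors standing in for a real embedding model.
VECTORS = {
    "חכם": (1.0, 0.1),
    "נבון": (0.9, 0.2),
    "קל": (0.1, 1.0),
    "קשה": (-0.2, 0.9),
}

# (word1, word2, human rating) with invented ratings on a 0-10 scale.
RATED_PAIRS = [
    ("חכם", "נבון", 9.0),
    ("קל", "קשה", 1.0),
    ("חכם", "קל", 2.5),
]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, rated_pairs):
    model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in rated_pairs]
    human_scores = [rating for _, _, rating in rated_pairs]
    return spearman(model_scores, human_scores)


rho = evaluate(VECTORS, RATED_PAIRS)
```

In this toy setup the antonym pair gets a high cosine score but a low human rating, which pulls the correlation down; that gap between relatedness and similarity is the kind of behavior such a benchmark is meant to expose.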
CATMA
Annotation Tools
A web-based tool for research and collaboration over text data. Handles right-to-left and part-of-word marking.
Language
Arabic, Hebrew
License
Unknown
CATMA
WebAnno
Annotation Tools
Web-based. Supports RTL and project management.
Language
Arabic, Hebrew
License
Apache License 2.0
WebAnno
openNLP
Annotation Tools
OpenNLP has a tagging tool.
Language
Arabic, Hebrew
License
Apache License 2.0
openNLP
opeNER
Annotation Tools
opeNER has a tagging tool.
Language
Arabic, Hebrew
License
Unknown
Task
Named Entity Recognition (NER), Sentiment Analysis
opeNER
Arethusa: Annotation Environment
Annotation Tools
A backend-independent client-side annotation framework.
Language
Arabic, Hebrew
License
MIT
Arethusa: Annotation Environment
rasa-nlu-trainer
Annotation Tools
A tool to edit training examples for rasa NLU. Handles right-to-left and part-of-word marking.
Language
Arabic, Hebrew
License
MIT
rasa-nlu-trainer
brat
Annotation Tools
An online environment for collaborative text annotation. Does not support right-to-left.
Language
Arabic, Hebrew
License
MIT
brat
pybossa
Annotation Tools
A framework for crowdsourcing of data analysis and enrichment tasks.
Language
Arabic, Hebrew
License
AGPL-3.0
pybossa
TextThresher
Annotation Tools
A crowdsourced text annotator. Built with React and Redux (possibly also with pybossa).
Language
Arabic, Hebrew
License
Unknown
TextThresher
SHEBANQ
Annotation Tools
System for HEBrew Text: ANnotations for Queries and Markup. SHEBANQ is an online environment for studying the Hebrew Bible.
Language
Hebrew
License
Unknown
SHEBANQ
The ONLP Lab at Bar Ilan University
Labs & Researchers
The ONLP Lab team researches Natural Language Processing (NLP) foundations, algorithms and applications, and develops morphological, syntactic and semantic models.
The ONLP Lab at Bar Ilan University
Prof. Reut Tsarfaty
Labs & Researchers
Prof. Reut Tsarfaty is an Associate Professor in the Computer Science department at Bar-Ilan University, and the head of the ONLP Lab. She is interested in developing models for analyzing, understanding and generating utterances in natural language, and in applying natural language processing to downstream tasks such as natural language programming, natural language navigation, and the analysis and generation of user content in social media. Her research is funded by an ISF grant (1739/26) and an ERC Starting Grant (677352).
Prof. Reut Tsarfaty
Dan Bareket
Labs & Researchers
Data Scientist - the ONLP Lab.
Dan Bareket
The Natural Language Processing Lab at Bar Ilan University
Labs & Researchers
The Natural Language Processing Lab at Bar Ilan University
Prof. Ido Dagan
Labs & Researchers
The founder of the Natural Language Processing (NLP) Lab at Bar-Ilan.
Prof. Ido Dagan
Prof. Yoav Goldberg
Labs & Researchers
Professor at Bar Ilan University's Computer Science Department, and the Research Director of the Israeli branch of the Allen Institute for Artificial Intelligence.
Prof. Yoav Goldberg
Dr. Avi Shmidman
Labs & Researchers
Lecturer, Bar Ilan University, The Department of the Literature of the Jewish People
Dr. Avi Shmidman
Prof. Moshe Koppel
Labs & Researchers
Prof. Moshe Koppel of the Department of Computer Science conducts research on a variety of machine learning applications including text categorization, image processing, speaker recognition and automated game playing.
Prof. Moshe Koppel
The Open Media and Information Lab (OMILab) at the Open University of Israel
Labs & Researchers
An interdisciplinary center for research and for teaching in new media and related areas, such as big data, information science, network cultures and digital sociology.
The Open Media and Information Lab (OMILab) at the Open University of Israel
Dr. Vered Silber-Varod
Labs & Researchers
Director of the Open Media and Information Lab (OMILab). Research interests and publications focus on various aspects of speech sciences, with expertise in speech prosody, acoustic phonetics, and speech communication and text analytics.
Dr. Vered Silber-Varod
Dr. Anat Lerner
Labs & Researchers
Interested in speech prosody analyses, combinatorial auctions and computer networks (especially ad-hoc, mobile and cellular networks).
Dr. Anat Lerner
Natural Language Processing Lab at Ben Gurion University
Labs & Researchers
Research topics: EasyFirst Syntactic Dependency Parsing, Hebrew NLP, Medical NLP, Text Summarization, Natural Language Generation (NLG).
Natural Language Processing Lab at Ben Gurion University
Prof. Michael Elhadad
Labs & Researchers
Professor at the Department of Computer Science, Ben-Gurion University of the Negev. His research interests are in Computational Linguistics, Natural Language Generation and Intelligent User Interfaces.
Prof. Michael Elhadad
Dr. Meni Adler
Labs & Researchers
Teaching fellow and Researcher, Ben Gurion University, Israel. Area of interest: Computational Linguistics, Morphology, Hebrew.
Dr. Meni Adler
Dr. Oren Tzur
Labs & Researchers
Dr. Oren Tzur is an Assistant Professor (Senior Lecturer) at the department of Software & Information Systems Engineering (SISE) at Ben Gurion University, and the head of the NLP and Social dynamics lab (NAS-LAB).
Dr. Oren Tzur
Prof. Shuly Wintner
Labs & Researchers
Prof. Shuly Wintner is a professor of Computer Science at the University of Haifa, Israel. His main areas of interest are computational linguistics and natural language processing. Specific research topics include linguistic formalisms, formal grammar, computational morphology and syntax, processing of Hebrew, machine translation and computational approaches to language acquisition.
Prof. Shuly Wintner
Dr. Einat Minkov
Labs & Researchers
Works on Information Extraction and Semantics, as well as other Natural Language Processing applications. Also interested in Machine Learning and the application of learning to NLP problems.
Dr. Einat Minkov
Prof. Jonathan Berant
Labs & Researchers
Prof. Jonathan Berant is Associate Professor at the Blavatnik School of Computer Science, Tel-Aviv University. He works on Natural Language Understanding problems such as Semantic Parsing, Question Answering, Paraphrasing, Reading Comprehension, and Textual Entailment.
Prof. Jonathan Berant
Prof. Joseph (Yossi) Keshet
Labs & Researchers
Prof. Joseph (Yossi) Keshet is an Associate Professor at the Andrew and Erna Viterbi Faculty of Electrical and Computer Engineering. He is the director of the Speech, Language, and Deep Learning Lab and affiliated with the Signal and Image processing Lab (SIPL).
Prof. Joseph (Yossi) Keshet
Dr. Yonatan Belinkov
Labs & Researchers
Assistant Professor at the faculty of Computer Science. Focus: interpretability and robustness.
Dr. Yonatan Belinkov
Prof. Alon Itai
Labs & Researchers
(retired)
Prof. Alon Itai
Prof. Roi Reichart
Labs & Researchers
An Assistant Professor at the faculty of Industrial Engineering and Management of the Technion, working on Natural Language Processing (NLP). Interested in language learning in its context and in designing models that integrate domain and world knowledge with data-driven methods.
Prof. Roi Reichart
Prof. Ronen Feldman
Labs & Researchers
Feldman's main areas of research are natural language processing, entity extraction and text relations, text sentiment analysis, and language processing for algorithmic trading. He is one of the founders of the discipline of text mining.
Prof. Ronen Feldman
Prof. Ari Rappoport
Labs & Researchers
Prof. Rappoport's main contribution is in the area of Neuroscience, where he developed a comprehensive theory of the brain. His Computer Science area of interest is language (Computational Linguistics, Natural Language Processing (NLP)), from cognitive science and machine learning perspectives.
Prof. Ari Rappoport
Prof. Omri Abend
Labs & Researchers
His fields of interest are Computational Linguistics and Natural Language Processing. Specifically, he conducts research on semantic (meaning) representation from a computational perspective. His research is tightly linked to statistical learning, language technology (such as Machine Translation and Information Extraction), and computational modeling of child language acquisition.
Prof. Omri Abend
Prof. Dafna Shahaf
Labs & Researchers
Prof. Shahaf's research focuses on helping people make sense of the world. She designs algorithms that help people understand the underlying structure of complex topics, and connect the dots between different pieces. She also likes to formalize intuitive notions; see recent work on Computational Humor.
Prof. Dafna Shahaf
Dr. Yael Netzer
Labs & Researchers
Dr. Yael Netzer has a PhD in Computer Science and MA studies in Hebrew Literature at Ben Gurion University. She is a teaching fellow at Ben Gurion University and Haifa University, teaching Digital Humanities for Computer Science and for the Humanities, and at Tel Aviv University, teaching Digital Humanities for archivists. She works at Dicta, the Israeli Center for Text Analysis, and as a DH consultant in the Digital Humanities lab at Haifa University. In recent years, Netzer has developed and implemented methods for digital personal archives, and is most interested in knowledge representation for archives, libraries and the humanities.
Dr. Yael Netzer
The Neurolinguistics Laboratory at the Edmond and Lily Safra Center for Brain Sciences (ELSC)
Labs & Researchers
Studies the neural bases of linguistic knowledge and processing.
The Neurolinguistics Laboratory at the Edmond and Lily Safra Center for Brain Sciences (ELSC)
Prof. Yosef Grodzinsky
Labs & Researchers
Research fields: functional anatomy of language, linguistic theory (syntax, semantics), language acquisition, aphasia, individual variation.
Prof. Yosef Grodzinsky
The NLPH Facebook Group
Courses, Presentations and Meetups
Language
Hebrew
The NLPH Facebook Group
The Israeli Natural Language Processing Meetup
Courses, Presentations and Meetups
Language
Arabic, Hebrew
The Israeli Natural Language Processing Meetup
Bar Ilan University's NLP course
Courses, Presentations and Meetups
Bar Ilan University's NLP course
ONLP April 2019 Meetup lecture slides
Courses, Presentations and Meetups
ONLP April 2019 Meetup lecture slides
Big DataNights NLP 2020
Courses, Presentations and Meetups
Big DataNights NLP 2020
Arabic Stories Corpus
Corpora
Unannotated Corpora
146 Arabic children's stories (MSA).
Language
Arabic
License
Apache License 2.0
Arabic Stories Corpus
OSAC
Corpora
Unannotated Corpora
Open Source Arabic Corpora (MSA). 22,000 text documents, each belonging to 1 of 10 categories: Economics, History, Entertainments, etc.
Language
Arabic
License
Unknown
OSAC
ArCOV-19
Corpora
Unannotated Corpora
The First Arabic COVID-19 Twitter Dataset with Propagation Networks. About 3.2M tweets in mixed-dialect Arabic associated with COVID-19, an ongoing collection starting in January 2020.
Language
Arabic
License
Unknown
ArCOV-19
Shami
Corpora
Unannotated Corpora
A Corpus of Levantine Arabic Dialects. 117,805 natural sentences from conversations in various Levantine dialects: Jordanian, Palestinian, Lebanese and Syrian.
Language
Arabic
License
Apache License 2.0
Shami
Abuelkhair Corpus
Corpora
Unannotated Corpora
More than 5 million newspaper articles in MSA.
Language
Arabic
License
Unknown
Abuelkhair Corpus
ArabicWeb16
Corpora
Unannotated Corpora
A Crawl for Today’s Arabic Web. 150M Arabic Web pages with high coverage of dialectal Arabic, Egyptian, Gulf, Levantine (~7M) and Maghrebi, as well as MSA, from a variety of sources - Wikipedia, Alexa, ArClueWeb09, and Twitter, etc.
Language
Arabic
ArabicWeb16
PADIC
Corpora
Annotated Datasets
A multilingual Parallel Arabic DIalectal Corpus. A parallel corpus of 6,400 sentences in multiple Arabic dialects: Algerian, Maghreb, Syrian, Palestinian and MSA, for dialect detection and machine translation.
Language
Arabic
License
GPLv3
Task
Dialect Identification, Machine Translation
PADIC
QCRI Parallel Tweets
Corpora
Annotated Datasets
Bilingual Corpus of Arabic-English Parallel Tweets. 166,000 bilingual tweets in Arabic and English (parallel) for machine translation tasks.
Language
Arabic
License
Apache License 2.0
Task
Machine Translation
QCRI Parallel Tweets
Kawarith
Corpora
Annotated Datasets, Unannotated Corpora
An Arabic Twitter Corpus for Crisis Events. A large-scale crisis-related multi-dialect Arabic Twitter corpus of 1,658,795 unique tweets from 22 emergency events. This corpus can be leveraged for several tasks, including crisis detection and crisis type classification.
Language
Arabic
License
CC BY-NC 4.0
Task
Topic Classification, Topic Modeling
Kawarith
Habibi
Corpora
Unannotated Corpora
A Multi-Dialect Multi-National Arabic Song Lyrics Corpus. More than 30,000 Arabic song lyrics in 6 Arabic dialects (Egyptian, Levantine, etc.) by singers from 18 different Arab countries, segmented into sentences and words and labeled with song information.
Language
Arabic
License
Unknown
Habibi
Arabic Wiki Data Dumps
Corpora
Unannotated Corpora
Wikipedia, the free encyclopedia, publishes dumps of its content as XML files on a monthly basis.
Language
Arabic
Arabic Wiki Data Dumps
QASR
Corpora
Recorded Speech and Audio Corpora
QCRI Al Jazeera Speech Resource - The largest transcribed Arabic speech corpus with around 2,000 hours with multi-layer annotation, in multi-dialect and code-switching speech, crawled from the Al Jazeera news channel, for speech recognition, dialect identification, punctuation restoration, speaker identification, speaker linking, etc.
Language
Arabic
License
Unknown
Task
Speech Recognition, Speech Synthesis, Speech-to-Text (STT), Text-to-Speech (TTS)
QASR
DiaCorpus
Corpora
Annotated Datasets, Recorded Speech and Audio Corpora
The DiaCorpus project is a collaboration between the Data Science Institute (DSI) and the Israel Innovation Authority. The purpose of the project is to create a first-of-its-kind Arabic textual repository in a local dialect (Israeli / Palestinian). This project is part of the National Language Processing plan of Israel.
Language
Arabic
License
CC BY-SA 4.0
Task
Coreference Resolution, Emotion Detection, Named Entity Recognition (NER), Sentiment Analysis, Summarization
DiaCorpus
DART
Corpora
Annotated Datasets
A Large Dataset of Dialectal Arabic Tweets. About 25K tweets that are annotated via crowdsourcing for 5 Arabic dialects: Egyptian, Maghrebi, Levantine, Gulf, and Iraqi.
Language
Arabic
License
Unknown
Task
Dialect Identification
DART
ASTD: Arabic Sentiment Tweets Dataset
Corpora
Annotated Datasets
10,000 tweets classified as objective, subjective positive, subjective negative, and subjective mixed.
Language
Arabic
License
GPLv2
Task
Sentiment Analysis
ASTD: Arabic Sentiment Tweets Dataset
Arabic 100k Reviews
Corpora
Annotated Datasets
Reviews in three classes from different services. The dataset combines reviews of hotels, books, movies, products and a few airlines. It has three classes (Mixed, Negative and Positive). Most labels were mapped from reviewers' ratings, with 3 being mixed, above 3 positive and below 3 negative. Each row has a label and text separated by a tab (TSV). The review text was cleaned by removing Arabic diacritics and non-Arabic characters. The dataset has no duplicate reviews. The hotel and book reviews are a subset of HARD and BRAD. The rest were selected from hadyelsahar, with a little over 100 airline reviews collected manually.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
Arabic 100k Reviews
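The preparation steps described for Arabic 100k Reviews (rating-to-label mapping, text cleaning, tab-separated rows) can be sketched roughly as follows. This is not the dataset authors' code: the function names are hypothetical, the Unicode ranges are an assumption about what counts as "diacritics" and "non-Arabic characters", and the label-then-text column order is assumed from the description.

```python
import csv
import io
import re

# Assumption: "diacritics" means the tashkeel marks U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Assumption: "Arabic characters" means U+0621..U+064A plus whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")


def rating_to_label(rating):
    """Map a numeric rating to a class: above 3 positive, below 3 negative, 3 mixed."""
    if rating > 3:
        return "Positive"
    if rating < 3:
        return "Negative"
    return "Mixed"


def clean_review(text):
    """Strip diacritics, blank out non-Arabic characters, collapse whitespace."""
    text = DIACRITICS.sub("", text)
    text = NON_ARABIC.sub(" ", text)
    return " ".join(text.split())


def read_rows(tsv_text):
    """Parse 'label<TAB>text' lines into (label, text) tuples, skipping malformed rows."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) == 2]


# Hypothetical two-row sample in the described format.
SAMPLE = "Positive\tفندق ممتاز\nNegative\tخدمة سيئة\n"
rows = read_rows(SAMPLE)
```

`csv.QUOTE_NONE` is used so that quotation marks inside reviews are treated as ordinary characters rather than field delimiters.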
HARD
Corpora
Annotated Datasets
Hotel Arabic-Reviews Dataset. This dataset contains 93,700 hotel reviews in Arabic. The reviews were collected from the Booking.com website during June/July 2016 and are expressed in Modern Standard Arabic as well as dialectal Arabic.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
HARD
BRAD
Corpora
Annotated Datasets
Books Reviews in Arabic Dataset. This dataset contains 510,600 book reviews in Arabic. The reviews were collected from the GoodReads.com website during June/July 2016. This work extends LABR, an earlier large-scale Arabic dataset of around 63K Arabic book reviews collected from GoodReads.com. The reviews are expressed mainly in Modern Standard Arabic, but there are reviews in dialectal Arabic as well.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
BRAD
Large Multi-Domain Resources for Arabic Sentiment Analysis
Corpora
Automatically Annotated
33K automatically annotated reviews in the domains of movies, hotels, restaurants and products.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
Large Multi-Domain Resources for Arabic Sentiment Analysis
AJGT
Corpora
Annotated Datasets
The Arabic Jordanian General Tweets (AJGT) corpus consists of 1,800 tweets annotated as positive or negative, written in Modern Standard Arabic (MSA) or Jordanian dialect.
Language
Arabic
License
Unknown
Task
Sentiment Analysis
AJGT
Want to suggest a new resource? Email us at hilla@webiks.com to help improve our platform!