Hebrew and Arabic NLP Resources

Models and Tools

Fine-Tuned Language Models, Pre-Trained Language Models

State-of-the-art Longformer language model for Hebrew.

Text Classification

Models and Tools

Other Models and Tools

A language-modeling approach to Hebrew text-to-speech (TTS) that does not require diacritics ('Niqqud'). The model operates on discrete speech representations and is conditioned on a word-piece tokenizer.

Apache License 2.0

Speech Synthesis, Text-to-Speech (TTS)

Models and Tools

Pre-Trained Language Models

A new pretrained dedicated BERT model, dubbed MsBERT (short for: Manuscript BERT), designed from the ground up to handle Hebrew manuscript text. MsBERT substantially outperforms all existing Hebrew BERT models regarding the prediction of missing words in fragmentary Hebrew manuscript transcriptions in multiple genres, as well as regarding the task of differentiating between quoted passages and exegetical elaborations. We provide MsBERT for free download and unrestricted use, and we also provide an interactive and user-friendly website to allow manuscript scholars to leverage the power of MsBERT in their scholarly work of reconstructing fragmentary Hebrew manuscripts.

ivrit.ai Free and Open Transcription Tool

Models and Tools

Other Models and Tools

This website enables users to easily transcribe Hebrew audio files for free using ivrit.ai models, fostering accessibility and innovation in language technology.

Speech-to-Text (STT)

ivrit-ai Whisper Large v3 turbo ct2

Models and Tools

Fine-Tuned Language Models, Other Models and Tools

This is ivrit.ai's faster-whisper model, based on the ivrit-ai/whisper-large-v3-turbo Whisper model. Training data includes 295 hours of volunteer-transcribed speech from the ivrit-ai/crowd-transcribe-v5 dataset, as well as 93 hours of professionally transcribed speech from other sources.

Apache License 2.0

Speech-to-Text (STT)

Models and Tools

Pipelines/Parsers

A morphological analyzer, disambiguator and dependency parser. The morphological analyzer relies on the BGU Lexicon. Demo: https://nlp.biu.ac.il/~rtsarfaty/onlp/hebrew/

Apache License 2.0

Dependency Treebanks, Morphological Analysis, Morphological Segmentation, Part-of-speech (POS) Tagging, Stemming and Lemmatization, Tokenization

Other Models and Tools

An open-source effort to make Hebrew properly searchable by various IR software libraries. Includes Hebrew Analyzer for Lucene.

Morphological Analysis, Part-of-speech (POS) Tagging, Stemming and Lemmatization, Tokenization

Dicta-LM 2.0 Collection

Models and Tools

Pre-Trained Language Models

Generative language models specifically optimized for Hebrew.

Apache License 2.0

Language Generation

Hebrew-Mistral-7B-200K

Models and Tools

Pre-Trained Language Models

An open-source Large Language Model (LLM) pretrained in Hebrew and English, created by Yam Peleg. It has 7 billion parameters and a 200K context length, and is based on Mistral-7B-v1.0 from Mistral. It has an extended Hebrew tokenizer with 64,000 tokens and is continuously pretrained from Mistral-7B on tokens in both English and Hebrew. The resulting model is a general-purpose language model suitable for a wide range of natural language processing tasks, with a focus on Hebrew language understanding and generation.

Apache License 2.0

Language Generation

Models and Tools

Other Models and Tools

A tool for automatic and semi-automatic Nikud (diacritization) of Hebrew texts. Avi Shmidman, Shaltiel Shmidman, Moshe Koppel, and Yoav Goldberg. 2020. Nakdan: Professional Hebrew diacritizer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 197–203, Online. Association for Computational Linguistics.

Diacritization/Vocalization

Models and Tools

Other Models and Tools

A free Hebrew linguistic project that includes a spell checker and a morphological analyzer. A Python wrapper for Hspell is available at https://github.com/eranroz/HspellPy/

Morphological Analysis, Part-of-speech (POS) Tagging, Spell Checking and Correction

Models and Tools

Other Models and Tools

A Wikiproject for correcting grammar mistakes. (Heuristic) positive annotations can be derived from this query: https://quarry.wmflabs.org/query/21957

Spell Checking and Correction

Models and Tools

Other Models and Tools

A Hebrew diacritizer. Elazar Gershuni and Yuval Pinter: Restoring Hebrew Diacritics Without a Dictionary. Demo on Replicate: https://replicate.com/elazarg/nakdimon/ Paper: https://arxiv.org/abs/2105.05209/ Code: https://github.com/elazarg/nakdimon/ Data: https://github.com/elazarg/hebrew-diacritize/

Diacritization/Vocalization
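
To make the diacritization task concrete: a diacritizer adds Nikud marks to bare Hebrew letters, and the reverse direction (stripping the marks) is a simple Unicode operation. The sketch below is an illustration only and is not taken from any tool listed here; the function name `strip_nikud` is hypothetical. It assumes the marks to remove are the nonspacing combining characters in the range U+0591 to U+05C7 (cantillation marks and vowel points).

```python
import unicodedata


def strip_nikud(text: str) -> str:
    """Remove Hebrew combining marks (Nikud and cantillation) from text.

    Hypothetical helper for illustration. Punctuation in the same Unicode
    range (for example maqaf, U+05BE) is kept because its category is not Mn.
    """
    return "".join(
        ch
        for ch in text
        if not ("\u0591" <= ch <= "\u05C7" and unicodedata.category(ch) == "Mn")
    )


# "shalom" with Nikud: shin + shin dot + qamats, lamed, vav + holam, final mem
vocalized = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
bare = "\u05E9\u05DC\u05D5\u05DD"

print(strip_nikud(vocalized) == bare)  # True
```

Stripping vocalized text this way is one plausible way to build (undotted, dotted) pairs for checking a diacritizer's output, though the tools listed here may prepare their data differently.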

Models and Tools

Other Models and Tools

Morris Alper's open-source tool for adding vowel signs (Nikud) to Hebrew text. It uses no rule-based logic and is built with a CANINE transformer network. An interactive demo is available at https://huggingface.co/spaces/malper/unikud and a blog post at https://towardsdatascience.com/unikud-adding-vowels-to-hebrew-text-with-deep-learning-powered-by-dagshub-56d238e22d3f

Diacritization/Vocalization

Hebrew OCR with Nikud

Models and Tools

Other Models and Tools

A program that converts Hebrew text files without Nikud into text files with the correct Nikud. Developed by Adi Oz and Vered Shani.

Diacritization/Vocalization, Optical Character Recognition (OCR)

Hebrew Punctuation Model

Models and Tools

Fine-Tuned Language Models

A fine-tuned version of AlephBERT, designed to restore punctuation in Hebrew spoken language transcripts. It is specifically trained as a post-processing step for Automatic Speech Recognition (ASR) outputs, where punctuation is often missing in raw transcriptions.

Apache License 2.0

Diacritization/Vocalization

NeMo-text-processing

Models and Tools

Other Models and Tools

Verbit's extension of the NeMo-text-processing Python package with WFST-based Hebrew inverse text normalization (ITN). ITN is part of the Automatic Speech Recognition (ASR) post-processing pipeline: it converts spoken-form text to written form to improve readability.

Apache License 2.0

Text Normalization
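
As a rough illustration of what inverse text normalization does, the sketch below rewrites spoken-form Hebrew number words as digits. It is a toy dictionary lookup, not the WFST grammars of NeMo-text-processing and not that package's API; the names `SPOKEN_TO_WRITTEN` and `inverse_normalize` are hypothetical, and the table covers only a few number words.

```python
# Toy spoken-form -> written-form table (hypothetical and far from complete).
SPOKEN_TO_WRITTEN = {
    "אחת": "1",
    "שתיים": "2",
    "שלוש": "3",
    "ארבע": "4",
    "חמש": "5",
}


def inverse_normalize(text: str) -> str:
    """Replace each known spoken-form token with its written form.

    Tokens are split on whitespace; unknown tokens pass through unchanged.
    """
    return " ".join(SPOKEN_TO_WRITTEN.get(token, token) for token in text.split())


# "I have three questions" -> the number word becomes the digit 3
print(inverse_normalize("יש לי שלוש שאלות") == "יש לי 3 שאלות")  # True
```

A real ITN system also has to handle multi-word numbers, dates, currency and context-dependent words, which is why grammar-based approaches such as WFSTs are used instead of a flat lookup.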

Models and Tools

Pre-Trained Language Models, Fine-Tuned Language Models

A BERT model for the Hebrew legal and legislative domains, intended to improve legal NLP research and tool development in Hebrew. There are two versions: (1) HeBERT fine-tuned on legal and legislative documents, and (2) a BERT model trained from scratch following HeBERT's architecture guidelines.

Language Modeling, Text Classification

Neural Sentiment Analyzer for Modern Hebrew

Models and Tools

Other Models and Tools

This code and dataset provide an established benchmark for neural sentiment analysis for Modern Hebrew.

Sentiment Analysis

Models and Tools

Pre-Trained Language Models

A large, publicly available pre-trained language model for Modern Hebrew, pre-trained on OSCAR, Hebrew tweets and all of Hebrew Wikipedia. Published by the OnlpLab team.

Apache License 2.0

Morphological Analysis, Named Entity Recognition (NER), Part-of-speech (POS) Tagging, Sentiment Analysis

Models and Tools

Pre-Trained Language Models

HeBERT is a Hebrew pretrained language model for polarity analysis and emotion recognition, published by Dr. Inbal Yahav Shenberger and Avichay Chriqui. It is based on Google's BERT architecture and uses the BERT-Base configuration. HeBERT was trained on three datasets: OSCAR, a Hebrew dump of Wikipedia, and emotion user-generated content (UGC) data collected for the purpose of this study. The model was evaluated on two downstream tasks: emotion recognition (the HebEMO model) and sentiment analysis.

Emotion Detection, Sentiment Analysis

AlephBERTGimmel

Models and Tools

Pre-Trained Language Models

A Hebrew pre-trained language model trained on the same dataset as AlephBERT, the previous state-of-the-art Hebrew PLM. The dataset consists of approximately 2 billion words of text; the vocabulary is substantially larger, at 128,000 word pieces. Published as a collaboration between the OnlpLab team and Dicta.

Morphological Analysis, Morphological Segmentation, Named Entity Recognition (NER), Part-of-speech (POS) Tagging, Sentiment Analysis, Tokenization

Hebrew Psychological Lexicons (Code)

Models and Tools

Other Models and Tools

An easy-to-use Python interface for Hebrew clinical psychology text analysis. Useful for various psychology applications such as detecting emotional state, well-being and relationship quality in conversation, identifying topics (e.g., family, work) and many more.

Apache License 2.0

Emotion Detection, Topic Classification

Models and Tools

Pre-Trained Language Models

A pre-trained BERT model for Modern Hebrew, trained with the masked-language-modeling objective.

Models and Tools

Fine-Tuned Language Models

A model fine-tuned for the prefix segmentation task.

Morphological Segmentation

DictaBERT-morph

Models and Tools

Fine-Tuned Language Models

A model fine-tuned for the morphological tagging task.

Morphological Analysis, Part-of-speech (POS) Tagging

Models and Tools

Pre-Trained Language Models

Designed specifically for identifying suffixed verbal forms in Modern Hebrew.

Morphological Analysis, Morphological Inflection, Morphological Segmentation, Stemming and Lemmatization

Models and Tools

Multilingual Language Models, Pre-Trained Language Models

A bilingual BERT for Arabic and Hebrew, pretrained on the respective parts of the OSCAR corpus.

Machine Translation

Models and Tools

Pre-Trained Language Models

A BERT-style language model for Hebrew, based on the BERT-base architecture with a character-level tokenizer. The model was released with this 2025 W-NUT paper: Avi Shmidman and Shaltiel Shmidman, "Restoring Missing Spaces in Scraped Hebrew Social Media", The 10th Workshop on Noisy and User-generated Text (W-NUT), 2025. This is the base model, pretrained with the masked-language-modeling objective.

Information Retrieval, Morphological Segmentation, Tokenization

BGU NLP - LemLDA: an LDA Package for Hebrew

Models and Tools

Other Models and Tools

The package is based on Heinrich's Java implementation of collapsed Gibbs sampling, with an extra variable to model the generative nature of lemmas in Hebrew.

Neural Modeling for Named Entities and Morphology (NEMO2)

Models and Tools

Other Models and Tools

OnlpLab's code and models for neural modeling of Hebrew NER. Described in the TACL paper Neural Modeling for Named Entities and Morphology (NEMO2).

Apache License 2.0

Named Entity Recognition (NER)

Models and Tools

Other Models and Tools

Yonatan Bitton's code that recognizes medical entities in a Hebrew text.

Named Entity Recognition (NER)

Models and Tools

Pipelines/Parsers

A custom spaCy pipeline for Hebrew text including a transformer-based multitask NER model that recognizes 16 entity types in Hebrew, including GPE, PER, LOC and ORG.

Named Entity Recognition (NER)

Models and Tools

Pipelines/Parsers

A de-identification toolkit for clinical text in Hebrew. Demo: https://hebsafeharbor-demo.azurewebsites.net/

Named Entity Recognition (NER), Temporal Information Extraction

HebSafeHarbor – CLALIT Validation

Models and Tools

Other Models and Tools

A de-identification toolkit for clinical text in Hebrew. An improved version of Microsoft's HebSafeHarbor project.

Named Entity Recognition (NER), Temporal Information Extraction

Models and Tools

Other Models and Tools

A Python package for browsing and processing ancient corpora, focused on the Hebrew Bible Database.

Optical Character Recognition (OCR)

word2word (Code)

Models and Tools

Other Models and Tools

Easy-to-use Python interface for accessing top-k word translations and for building a new bilingual lexicon from a custom parallel corpus.

Apache License 2.0

Machine Translation

Universal Language Model Fine-tuning for Text Classification (ULMFiT) in Hebrew

Models and Tools

Fine-Tuned Language Models, Multilingual Models

The weights (i.e., a trained model) for a Hebrew version of Howard and Ruder's ULMFiT model. Trained on the Hebrew Wikipedia corpus.

Text Classification

Models and Tools

Fine-Tuned Language Models

Sequence-to-sequence learning for Hebrew transliteration (converting between Hebrew text and Latin transliteration).

Transliteration

Models and Tools

Other Models and Tools

Converts YAP's output from the SPMRL scheme to UD v2.

Apache License 2.0

Models and Tools

Causal Language Models (CLM)

Doron Adler's Hebrew text generation model based on EleutherAI's gpt-neo.

Language Generation

Commercial and Online Services

Analytical tools for Jewish texts. They also have a GitHub organization: https://github.com/Dicta-Israel-Center-for-Text-Analysis.

Diacritization/Vocalization, Optical Character Recognition (OCR), Text Classification

Commercial and Online Services

A Python library for looking up the frequencies of words in 44 languages, including Hebrew. The Hebrew data is based on Wikipedia, OPUS OpenSubtitles 2018 and SUBTLEX, Google Books Ngrams 2012, web text from OSCAR, and Twitter.

Commercial and Online Services

A commercial engine for search and entity tagging in Hebrew.

Named Entity Recognition (NER)

Melingo's ICA (Intelligent Content Analysis)

Commercial and Online Services

An API for text analysis and categorized entity extraction for Hebrew, Arabic and Farsi texts.

Morphological Analysis, Named Entity Recognition (NER), Sentiment Analysis, Text Classification

Commercial and Online Services

Automatic analysis of free text in Hebrew.

Commercial and Online Services

Online text-to-speech service for Hebrew.

Text-to-Speech (TTS)

Amnon The Transcriber

Commercial and Online Services

A WhatsApp bot that receives a voice note and transcribes it to text.

Speech-to-Text (STT)

Commercial and Online Services

A WhatsApp bot that receives a voice note and transcribes it to text.

Speech-to-Text (STT), Text Summarization

Commercial and Online Services

Speech-to-Text (STT)

Text Analytics for health containers

Commercial and Online Services

Commercial and Online Services

Morphological Analysis, Morphological Segmentation, Named Entity Recognition (NER), Stemming and Lemmatization, Text Normalization, Tokenization

Annotation Tools

A tool for managing annotation projects. Handles right-to-left and part-of-word marking. Tutorial video: https://www.youtube.com/watch?v=eTlrTC_n_yg

Annotation Tools

A tool for linked data annotation.

Apache License 2.0

Annotation Tools

An open-source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequence-to-sequence tasks, so you can create labeled data for sentiment analysis, named entity recognition, text summarization and so on.

Named Entity Recognition (NER), Sentiment Analysis, Text Classification, Text Summarization

Hebrew SimLex-999

A Hebrew version of the Simlex-999 resource for the evaluation of models that learn the meaning of words and concepts.

Annotation Tools

A web-based tool for research and collaboration over text data. Handles right-to-left and part-of-word marking.

Annotation Tools

Web-based. Supports right-to-left (RTL) text and project management.

Apache License 2.0

Annotation Tools

OpenNLP has a tagging tool.

Apache License 2.0

Annotation Tools

opeNER has a tagging tool.

Named Entity Recognition (NER), Sentiment Analysis

Arethusa: Annotation Environment

Annotation Tools

A backend-independent client-side annotation framework.

rasa-nlu-trainer

Annotation Tools

A tool to edit training examples for rasa NLU. Handles right-to-left and part-of-word marking.

Annotation Tools

An online environment for collaborative text annotation. Does not support right-to-left.

Annotation Tools

A framework for crowdsourcing of data analysis and enrichment tasks.

Annotation Tools

A crowdsourced text annotator. Built with React and Redux (possibly also with pybossa).

Annotation Tools

System for HEBrew Text: ANnotations for Queries and Markup. SHEBANQ is an online environment for studying the Hebrew Bible.

The ONLP Lab at Bar Ilan University

Labs & Researchers

The ONLP Lab team researches Natural Language Processing (NLP) foundations, algorithms and applications, and develops morphological, syntactic and semantic models.

Prof. Reut Tsarfaty

Labs & Researchers

Prof. Reut Tsarfaty is Associate Professor at the Computer Science department at Bar-Ilan University, and the head of the ONLP Lab. She is interested in developing models for analyzing, understanding and generating utterances in natural language, and in applying natural language processing to downstream applications such as natural language programming, natural language navigation, and the analysis and generation of user content in social media. Her research is funded by an ISF grant (1739/26) and an ERC Starting Grant (677352).

Labs & Researchers

Data Scientist - the ONLP Lab.

The Natural Language Processing Lab at Bar Ilan University

Labs & Researchers

Prof. Ido Dagan

Labs & Researchers

The founder of the Natural Language Processing (NLP) Lab at Bar-Ilan.

Prof. Yoav Goldberg

Labs & Researchers

Professor at Bar Ilan University's Computer Science Department, and the Research Director of the Israeli branch of the Allen Institute for Artificial Intelligence.

Dr. Avi Shmidman

Lecturer, Bar Ilan University, The Department of the Literature of the Jewish People

Prof. Moshe Koppel

Labs & Researchers

Prof. Moshe Koppel of the Department of Computer Science conducts research on a variety of machine learning applications including text categorization, image processing, speaker recognition and automated game playing.

The Open Media and Information Lab (OMILab) at the Open University of Israel

Labs & Researchers

An interdisciplinary center for research and for teaching in new media and related areas, such as big data, information science, network cultures and digital sociology.

Dr. Vered Silber-Varod

Labs & Researchers

Director of the Open Media and Information Lab (OMILab). Research interests and publications focus on various aspects of speech sciences, with expertise in speech prosody, acoustic phonetics, and speech communication and text analytics.

Dr. Anat Lerner

Labs & Researchers

Interested in speech prosody analyses, combinatorial auctions and computer networks (especially ad hoc, mobile and cellular networks).

Natural Language Processing Lab at Ben Gurion University

Labs & Researchers

Research topics: EasyFirst Syntactic Dependency Parsing, Hebrew NLP, Medical NLP, Text Summarization, Natural Language Generation (NLG).

Prof. Michael Elhadad

Labs & Researchers

Professor at the Department of Computer Science, Ben-Gurion University of the Negev. His research interests are in Computational Linguistics, Natural Language Generation and Intelligent User Interfaces.

Labs & Researchers

Teaching fellow and Researcher, Ben Gurion University, Israel. Area of interest: Computational Linguistics, Morphology, Hebrew.

Labs & Researchers

Dr. Oren Tzur is an Assistant Professor (Senior Lecturer) at the department of Software & Information Systems Engineering (SISE) at Ben Gurion University, and the head of the NLP and Social dynamics lab (NAS-LAB).

Prof. Shuly Wintner

Labs & Researchers

Prof. Shuly Wintner is a professor of Computer Science at the University of Haifa, Israel. His main areas of interest are computational linguistics and natural language processing. Specific research topics include linguistic formalisms, formal grammar, computational morphology and syntax, processing of Hebrew, machine translation and computational approaches to language acquisition.

Dr. Einat Minkov

Labs & Researchers

Works on information extraction and semantics, as well as other natural language processing applications. Also interested in machine learning and its application to NLP problems.

Prof. Jonathan Berant

Labs & Researchers

Prof. Jonathan Berant is Associate Professor at the Blavatnik School of Computer Science, Tel-Aviv University. He works on Natural Language Understanding problems such as Semantic Parsing, Question Answering, Paraphrasing, Reading Comprehension, and Textual Entailment.

Prof. Joseph (Yossi) Keshet

Labs & Researchers

Prof. Joseph (Yossi) Keshet is an Associate Professor at the Andrew and Erna Viterbi Faculty of Electrical and Computer Engineering. He is the director of the Speech, Language, and Deep Learning Lab and affiliated with the Signal and Image processing Lab (SIPL).

Dr. Yonatan Belinkov

Labs & Researchers

Assistant Professor at the faculty of Computer Science. Focus: interpretability and robustness.

Prof. Alon Itai

Labs & Researchers

Prof. Roi Reichart

Labs & Researchers

An Assistant Professor at the faculty of Industrial Engineering and Management of the Technion, working on Natural Language Processing (NLP). Interested in language learning in its context and in designing models that integrate domain and world knowledge with data-driven methods.

Prof. Ronen Feldman

Labs & Researchers

Feldman's main areas of research are natural language processing, entity extraction and text relations, text sentiment analysis, and language processing for algorithmic trading. He is one of the founders of the discipline of text mining.

Prof. Ari Rappoport

Labs & Researchers

Prof. Rappoport's main contribution is in neuroscience, where he developed a comprehensive theory of the brain. His computer science interest is language (Computational Linguistics, Natural Language Processing (NLP)), approached from cognitive science and machine learning perspectives.

Prof. Omri Abend

Labs & Researchers

Prof. Abend's fields of interest are Computational Linguistics and Natural Language Processing. Specifically, he conducts research on semantic (meaning) representation from a computational perspective. His research is tightly linked to statistical learning, language technology (such as Machine Translation and Information Extraction), and computational modeling of child language acquisition.

Prof. Dafna Shahaf

Labs & Researchers

Prof. Shahaf's research focuses on helping people make sense of the world. She designs algorithms that help people understand the underlying structure of complex topics, and connect the dots between different pieces. She also likes to formalize intuitive notions; see recent work on Computational Humor.

Dr. Yael Netzer

Labs & Researchers

Dr. Yael Netzer has a PhD in Computer Science and MA studies in Hebrew Literature at Ben Gurion University. She is a teaching fellow at Ben Gurion University and Haifa University, teaching Digital Humanities for Computer Science and for the Humanities, and at Tel Aviv University, teaching Digital Humanities for archivists. She works at Dicta, the Israeli Center for Text Analysis, and as a DH consultant in the Digital Humanities lab at Haifa University. In recent years, Netzer has developed and implemented methods for digital personal archives, and is most interested in knowledge representation for archives, libraries and the humanities.

The Neurolinguistics Laboratory at the Edmond and Lily Safra Center for Brain Sciences (ELSC)

Labs & Researchers

Studies the neural bases of linguistic knowledge and processing.

Prof. Yosef Grodzinsky

Labs & Researchers

Research fields: functional anatomy of language, linguistic theory (syntax, semantics), language acquisition, aphasia, individual variation.

The NLPH Facebook Group

Courses, Presentations and Meetups

The Israeli Natural Language Processing Meetup

Courses, Presentations and Meetups

Bar Ilan University's NLP course

Courses, Presentations and Meetups

ONLP April 2019 Meetup lecture slides

Courses, Presentations and Meetups

Big DataNights NLP 2020

Courses, Presentations and Meetups

Want to suggest a new resource? Email us at hilla@webiks.com to help improve our platform!
