Appending additional OCR language dictionaries

The language dictionaries provided within the installation package are:

ara (Arabic)
deu (German)
eng (English)
fra (French)
heb (Hebrew)
ita (Italian)
nld (Dutch; Flemish)
por (Portuguese)
spa (Spanish; Castilian)
vie (Vietnamese)

The OCR engine is not restricted to these languages and can recognize many more.
If the language you wish to recognize is not in the above list, please download the complete OCR language pack.
It includes more than 120 languages and can be downloaded from https://www.gdpicture.com/download/tesseract_ocr_4x_language_pack.zip
You can also try other language files provided by the Tesseract team here: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017

This 326 MB archive contains most of the languages that are currently supported. We strongly recommend using these dictionary files.

Once the download is complete, simply extract the archive contents into the folder where your OCR dictionaries are already installed.
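If you prefer to script the extraction step, the following Python sketch shows one possible way to do it. It is an illustration only, not part of the product: the function name install_language_pack and both of its parameters are hypothetical, and it assumes that each dictionary is a Tesseract-style <code>.traineddata file stored at the top level of the archive.

```python
import zipfile
from pathlib import Path


def install_language_pack(archive_path, dictionaries_dir):
    """Extract a language pack archive into the OCR dictionaries folder.

    Hypothetical helper for illustration. Returns the sorted language
    codes found directly in the folder afterwards, assuming each
    dictionary is a Tesseract-style <code>.traineddata file.
    """
    target = Path(dictionaries_dir)
    # Create the folder if it does not exist yet; existing files are kept.
    target.mkdir(parents=True, exist_ok=True)

    # Extract every member of the archive into the dictionaries folder.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)

    # Report the language codes now available at the top level of the folder.
    return sorted(p.stem for p in target.glob("*.traineddata"))
```

You would call it with the path of the downloaded archive and the path of your own dictionaries folder, for example install_language_pack("tesseract_ocr_4x_language_pack.zip", your_folder). If the archive turns out to wrap its files in a subfolder, they will not appear in the returned list and need to be moved up one level manually.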

To obtain language names from language codes, please visit this page: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-201
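For the dictionaries bundled with the installation package, the code-to-name lookup can also be kept in your own application. The Python sketch below is a hypothetical helper, not a product API: the names BUNDLED_LANGUAGES, language_name and needs_language_pack are assumptions made for illustration, and the table contains only the ten languages listed in this topic.

```python
# Codes and names of the dictionaries shipped with the installation package,
# as listed in this topic. Any other code requires an additional download.
BUNDLED_LANGUAGES = {
    "ara": "Arabic",
    "deu": "German",
    "eng": "English",
    "fra": "French",
    "heb": "Hebrew",
    "ita": "Italian",
    "nld": "Dutch; Flemish",
    "por": "Portuguese",
    "spa": "Spanish; Castilian",
    "vie": "Vietnamese",
}


def language_name(code):
    """Return the language name for a bundled code, or None if not bundled."""
    return BUNDLED_LANGUAGES.get(code.strip().lower())


def needs_language_pack(code):
    """Return True when the code is not among the bundled dictionaries."""
    return language_name(code) is None
```

A result of True from needs_language_pack only means that the code is not in the bundled list. Whether a dictionary exists for that code still has to be checked against the page linked above.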

If for any reason you want to use the previous language data files (for use without the LSTM engine), you can download the complete pack from this link: https://www.gdpicture.com/download/tesseract_ocr_304_language_pack.zip