This article presents a way of extracting text from an image file using Tesseract in C# .NET.
Optical character recognition (OCR) is the process of extracting textual data from an image. It has applications in pattern recognition, artificial intelligence, computer vision, and related fields. tesseract-ocr is an OCR engine with high character-recognition accuracy, and it ships with prepared trained data sets for 39 languages. The original Tesseract open-source OCR engine was developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley, Colorado. The Tesseract package used in this article is a .NET wrapper around that open-source engine.
Fire up a Console Application and, from the NuGet Package Manager Console, issue the command below:
Install-Package Tesseract -Version 184.108.40.206
If everything goes as expected, the console will report that the package was installed successfully.
We also need to download the language data files for Tesseract from here. The folder that holds these files is the data path we pass to the engine later.
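Before creating the engine, it can be handy to confirm that the language data is where we expect it. The sketch below assumes the English data is a single file named eng.traineddata directly inside the data folder, which is the usual layout for Tesseract 3.x and later. TessDataCheck and HasTrainedData are hypothetical names for illustration and are not part of the Tesseract package.

```csharp
using System;
using System.IO;

static class TessDataCheck
{
    // Hypothetical helper (not part of the Tesseract package).
    // Assumption: language data is stored as <dataPath>/<language>.traineddata.
    public static bool HasTrainedData(string dataPath, string language)
    {
        string file = Path.Combine(dataPath, language + ".traineddata");
        return File.Exists(file);
    }
}
```

Calling TessDataCheck.HasTrainedData(dataPath, "eng") before constructing the engine lets us print a clear message when the file is missing, instead of relying on the exception raised during engine initialization.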
Next, let us create an image containing some text (we used MS Paint).
Let us first write the code below:
// Requires: using System; using Tesseract;
static void Main(string[] args)
{
    var testImagePath = @"[YOUR IMAGE PATH]";
    var dataPath = @"[YOUR DATA PATH]";
    try
    {
        using (var tEngine = new TesseractEngine(dataPath, "eng", EngineMode.Default)) // create the Tesseract OCR engine with English as the language
        using (var img = Pix.LoadFromFile(testImagePath)) // load the image file into a Pix object, a wrapper for the Leptonica PIX structure
        using (var page = tEngine.Process(img)) // process the specified image
        {
            var text = page.GetText(); // get the image's content as plain text
            Console.WriteLine(text); // display the text
            Console.WriteLine(page.GetMeanConfidence()); // display the mean confidence of the recognized text
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("Unexpected Error: " + e.Message);
    }
}
First, we create a new instance of TesseractEngine with the Default engine mode and English as the language. Next, we load the image file using the Pix object, which is a wrapper for the Leptonica PIX structure. The tEngine.Process(img) call accepts the image as input, processes it, and returns a Page. Once we get the text from the image, we display it on the console. To get the confidence, we use the GetMeanConfidence() method of the Page class.
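The raw number printed for the confidence is easier to read as a percentage. The sketch below assumes that GetMeanConfidence() returns a fraction between 0 and 1, which is our understanding of the .NET wrapper; please verify this against the version you install. ConfidenceReport, Describe, and the 0.8 threshold are hypothetical choices for illustration only.

```csharp
using System;
using System.Globalization;

static class ConfidenceReport
{
    // Hypothetical helper (not part of the Tesseract package).
    // Assumption: meanConfidence is a fraction in the range [0, 1].
    // The default threshold of 0.8 is an arbitrary illustrative value.
    public static string Describe(float meanConfidence, float threshold = 0.8f)
    {
        // Format with the invariant culture so the decimal separator is always "."
        string percent = (meanConfidence * 100).ToString("F1", CultureInfo.InvariantCulture);
        string verdict = meanConfidence >= threshold ? "accepted" : "needs review";
        return "Mean confidence: " + percent + "% (" + verdict + ")";
    }
}
```

In the Main method above, this could be used as Console.WriteLine(ConfidenceReport.Describe(page.GetMeanConfidence())); in place of printing the raw value.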
Now let us run the application.
The console shows the recognized text followed by the mean confidence, which indicates that we are able to read the text from the image.
Hope this helps you get started with the Tesseract library. If you find this interesting, please try more test cases. Thanks for reading. Zipped file attached.