This article presents a way of extracting data from an image file using Tesseract in C#.NET.
Introduction
Optical character recognition (OCR) is the process of extracting textual data from an image. It is also applied in fields such as pattern recognition, artificial intelligence, and computer vision. tesseract-ocr offers high-accuracy character recognition and ships with trained data sets for 39 languages. The original Tesseract Open Source OCR Engine was developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley, Colorado. The Tesseract NuGet package is a .NET wrapper for the open-source OCR assembly that uses the Tesseract engine.
Environment Setup
Fire up a Console Application and, from the NuGet Package Manager Console, issue the command below:
Install-Package Tesseract -Version 2.4.1.0
If everything goes as expected, we will see the output below

We also need to download the language data files for Tesseract.
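Before creating the engine, it can help to confirm that the language data is where the program expects it. The following is a minimal sketch, assuming the data folder directly contains a file named after the language code with a .traineddata extension (for English, eng.traineddata, the usual Tesseract naming convention). TessDataCheck and HasLanguageData are hypothetical helper names, not part of the Tesseract package.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the Tesseract package.
static class TessDataCheck
{
    // Returns true if "<dataPath>/<language>.traineddata" exists.
    // Assumption: dataPath points at the folder that directly contains
    // the downloaded language files.
    public static bool HasLanguageData(string dataPath, string language)
    {
        if (string.IsNullOrEmpty(dataPath) || string.IsNullOrEmpty(language))
        {
            return false;
        }

        var trainedDataFile = Path.Combine(dataPath, language + ".traineddata");
        return File.Exists(trainedDataFile);
    }
}
```

Calling TessDataCheck.HasLanguageData(dataPath, "eng") before constructing the engine lets the program print a clear message when the data files are missing, rather than relying on the exception raised during engine creation.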
Let us also create an image (we used MS Paint) as shown below

Using Code
Let us first write the following code:
using System;
using Tesseract;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Placeholders: replace with the path to your image and to your language data folder
            var testImagePath = @"[YOUR IMAGE PATH]";
            var dataPath = @"[YOUR DATA PATH]";

            try
            {
                // Create the Tesseract OCR engine with English as the language
                using (var tEngine = new TesseractEngine(dataPath, "eng", EngineMode.Default))
                {
                    // Load the image file into a Pix object, a wrapper for the Leptonica PIX structure
                    using (var img = Pix.LoadFromFile(testImagePath))
                    {
                        // Process the specified image
                        using (var page = tEngine.Process(img))
                        {
                            var text = page.GetText(); // Get the image's content as plain text
                            Console.WriteLine(text); // Display the text
                            Console.WriteLine(page.GetMeanConfidence()); // Display the mean confidence of the recognized text
                        }
                    }
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("Unexpected Error: " + e.Message);
            }

            // Wait for a key press so the output (or the error message) stays visible
            Console.ReadKey();
        }
    }
}
First we create a new instance of TesseractEngine with the Default engine mode and English as the language. Next we load the image file using the Pix class, which is a wrapper for the Leptonica PIX structure. tEngine.Process(img) accepts the image as input, processes it, and returns a Page. Once we get the text from the Page, we display it on the console. To get the confidence, we use the GetMeanConfidence() method of the Page class.
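The three steps described above (create the engine, load the image, process it) can also be wrapped in a reusable method. This is only a sketch under stated assumptions: OcrResult and OcrHelper are hypothetical names introduced for illustration, and the method relies solely on the Tesseract calls already used in this article.

```csharp
using System;
using Tesseract;

// Hypothetical container for the two values the article prints.
class OcrResult
{
    public string Text { get; set; }
    public float MeanConfidence { get; set; }
}

// Hypothetical helper, not part of the Tesseract package.
static class OcrHelper
{
    // Runs the same steps as Main: create the engine, load the image,
    // process it, then read the text and the mean confidence.
    public static OcrResult Recognize(string dataPath, string imagePath)
    {
        using (var engine = new TesseractEngine(dataPath, "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(img))
        {
            return new OcrResult
            {
                Text = page.GetText(),
                MeanConfidence = page.GetMeanConfidence()
            };
        }
    }
}
```

Stacking the using statements keeps the disposal order of the original nested version (page, then image, then engine) while reducing indentation. Note that this sketch creates a new engine on every call; when processing many images, it may be preferable to create the engine once and reuse it.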
Now let us run the application

This indicates that we are able to read the text from the image.
References
tesseract
Conclusion
I hope this helps you get started with the Tesseract library. If you find this interesting, please add more test cases. Thanks for reading. Zipped file attached.