Extract Text from Image using Tesseract in C#

Rajnilari2015

This article presents a way of extracting text from an image file using Tesseract in C#.



Introduction

Optical character recognition (OCR) is a process for extracting textual data from an image. It also finds applications in fields such as pattern recognition, artificial intelligence, and computer vision. tesseract-ocr offers high character recognition accuracy and ships with prepared trained data sets for 39 languages. The original Tesseract Open Source OCR Engine was developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado. The Tesseract NuGet package is a .NET wrapper for the open source Tesseract OCR engine. This article shows how to use it to extract text from an image file.

Environment Setup

Create a new Console Application and, from the NuGet Package Manager Console, issue the command below:

Install-Package Tesseract -Version 2.4.1.0

If everything goes as expected, the Package Manager Console reports that the package was installed successfully.

We also need to download the language data files for Tesseract (the tessdata folder). The folder that holds these files is the data path we pass to the engine later.
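A missing or misplaced language file is a common reason for the engine failing to start, so it can help to check the data folder before creating the engine. The following is a minimal sketch, not part of the original sample. It assumes the data path points directly at the folder holding the language files and that the English file is named eng.traineddata. The TessDataCheck class, the HasLanguageData method, and the "./tessdata" default are hypothetical names chosen for illustration.

```csharp
using System;
using System.IO;

static class TessDataCheck
{
    // Returns true when <dataPath>/<language>.traineddata exists.
    // Assumption: language files follow the "<language>.traineddata" naming.
    public static bool HasLanguageData(string dataPath, string language)
    {
        if (string.IsNullOrWhiteSpace(dataPath) || string.IsNullOrWhiteSpace(language))
        {
            return false;
        }

        var trainedDataFile = Path.Combine(dataPath, language + ".traineddata");
        return File.Exists(trainedDataFile);
    }
}

class Program
{
    static void Main(string[] args)
    {
        // Hypothetical default location; pass the real folder as the first argument.
        var dataPath = args.Length > 0 ? args[0] : "./tessdata";

        if (TessDataCheck.HasLanguageData(dataPath, "eng"))
        {
            Console.WriteLine("Found eng.traineddata in " + dataPath);
        }
        else
        {
            Console.WriteLine("eng.traineddata is missing from " + dataPath);
        }
    }
}
```

Running this check first turns a less obvious engine initialization error into a clear message about the data folder.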

Let us also create a test image that contains some text (we used MS Paint).

Using Code

Let us first write the following code:

using System;
using Tesseract;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var testImagePath = @"[YOUR IMAGE PATH]"; // full path of the image to read
            var dataPath = @"[YOUR DATA PATH]"; // folder that holds the language data files

            try
            {
                // Create the Tesseract OCR engine with English as the language
                using (var tEngine = new TesseractEngine(dataPath, "eng", EngineMode.Default))
                {
                    // Load the image file into a Pix object, a wrapper for the Leptonica PIX structure
                    using (var img = Pix.LoadFromFile(testImagePath))
                    {
                        // Process the specified image
                        using (var page = tEngine.Process(img))
                        {
                            var text = page.GetText(); // get the image's content as plain text
                            Console.WriteLine(text); // display the text
                            Console.WriteLine(page.GetMeanConfidence()); // display the mean confidence of the recognized text
                            Console.ReadKey();                            
                        }
                    }
                }
            }
            catch (Exception e)
            {                
                Console.WriteLine("Unexpected Error: " + e.Message);
            }
        }
    }
}

First, we create a new instance of TesseractEngine with the Default engine mode and English ("eng") as the language. Next, we load the image file using the Pix object, which is a wrapper for the Leptonica PIX structure. The tEngine.Process(img) call accepts the image as input, processes it, and returns a Page. We then call GetText() on the Page to get the recognized text and display it on the console. To get the confidence of the recognition, we use the GetMeanConfidence() method of the Page class.
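The raw confidence value is easier to read when shown as a percentage and compared against a cut-off. The sketch below is an illustration only and does not call the Tesseract library. It assumes GetMeanConfidence() returns a fraction between 0 and 1, which should be verified against the installed package version. The ConfidenceFormatter class, its methods, and the 0.8 threshold are hypothetical.

```csharp
using System;
using System.Globalization;

static class ConfidenceFormatter
{
    // Converts a confidence value to a percentage string with one decimal place.
    // Assumption: the input is a fraction in [0, 1]; out-of-range values are clamped.
    public static string ToPercent(float meanConfidence)
    {
        var clamped = Math.Max(0f, Math.Min(1f, meanConfidence));
        return (clamped * 100f).ToString("0.0", CultureInfo.InvariantCulture) + "%";
    }

    // Returns true when the confidence reaches the caller's chosen threshold.
    public static bool IsReliable(float meanConfidence, float threshold)
    {
        return meanConfidence >= threshold;
    }
}

class Program
{
    static void Main(string[] args)
    {
        // Stand-in value; in the sample above this would come from page.GetMeanConfidence().
        float meanConfidence = 0.5f;

        Console.WriteLine("Mean confidence: " + ConfidenceFormatter.ToPercent(meanConfidence));

        // 0.8 is an arbitrary cut-off chosen for illustration.
        if (!ConfidenceFormatter.IsReliable(meanConfidence, 0.8f))
        {
            Console.WriteLine("Low confidence: consider a cleaner or larger image.");
        }
    }
}
```

A threshold like this is a judgment call; a suitable value depends on the images being processed.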

Now let us run the application. The console displays the recognized text followed by the mean confidence, which indicates that we are able to read the text from the image.

Conclusion

Hope this helps you get started with the Tesseract library. If you find this interesting, please try more test cases. Thanks for reading. Zipped file attached.


About the Author

Rajnilari2015
Full Name: Niladri Biswas (RNA Team)

Comments or Responses

Posted by: Sheonarayan on: 1/4/2016 | Points: 25
Wow, so easy that too it seems like free to use plugin.
Posted by: Itocr on: 6/1/2017 | Points: 25
You need to download the tessdata folder, it corresponds to the datapath.
Posted by: Wenbuyi on: 8/3/2017 | Points: 25
tesseract ocr is good, there is a another ocr control using tesseract 3 engine, and provide higher accuracy: http://www.xspdf.com/guide-ocr/text-recognition-from-image/
