How to extract text from pdf...

Posted by Manoj323 under C# on 3/14/2013 | Points: 10 | Views : 3091 | Status : [Member] | Replies : 8
I want to extract plain text from a PDF, without the images. Can anyone help me?




Responses

Posted by: Raja_89 on: 3/15/2013 [Member] Starter | Points: 25

Hi,
You can try third-party DLLs such as ASPOSE [http://www.aspose.com/total-component-suite.aspx], Docotic.Pdf [http://bitmiracle.com/pdf-library/help/extract-images.aspx], etc.


Posted by: Johnw on: 3/27/2013 [Member] Starter | Points: 25

Hello,

Here is my code:

// Namespaces needed (assuming Spire.PDF for .NET, the component linked below)
using System.IO;
using System.Text;
using Spire.Pdf;

// Load PDF
PdfDocument document = new PdfDocument();
document.LoadFromFile("TEST.pdf");

// Extract text from every page
StringBuilder buffer = new StringBuilder();
foreach (PdfPageBase page in document.Pages)
{
    buffer.Append(page.ExtractText());
}
document.Close();

// Save text
string fileName = "TextInPdf.txt";
File.WriteAllText(fileName, buffer.ToString());


I use a .NET PDF component (http://www.e-iceblue.com/Introduce/pdf-for-net-introduce.html) in my solution.

Hope this helps.


Not what, but how


Posted by: Arronlee on: 5/1/2013 [Member] Starter | Points: 25

More precisely, if your PDF is an all-graphics file (you can tell because the text cannot be highlighted), you will need to read it into character recognition (OCR) software.

The results, of course, depend on your OCR software and the settings you apply before recognition.

In any case, the procedure is likely to involve a lot of work and only pays off if the text contains lots of repetitions and you can use a PDF SDK (http://www.yiigo.com/net-document-image-sdk/) afterwards. Otherwise, just use a printout and type the translation into Word.



Posted by: Bronte on: 5/7/2013 [Member] Starter | Points: 25

I happen to have just finished some PDF programming work, which includes how to extract text from PDF in .NET. I can share it with everyone. I have tested the program and it works very well. Hope it helps you.
http://www.rasteredge.com/how-to/csharp-imaging/pdf-text-extract/


Posted by: Rebecca Yang on: 9/27/2013 [Member] Starter | Points: 25

You can use a third-party DLL such as Spire.PDF for .NET from E-iceblue. Extracting text from a PDF with it is easy. Download here:
http://pdfapi.codeplex.com/


Posted by: Jbella on: 2/13/2014 [Member] Starter | Points: 25

If I understood you correctly, you need to extract text from PDF files that already contain text, not images. If that is the case, you do NOT need OCR like others here suggested, but a library or class that can parse the PDF and extract the text directly from it. While OCR would work, it would be far less efficient and accurate than a solution that gets the text directly. One such library I tried was Leadtools, which I remember had a class for parsing PDF, but I can't quite remember that class's name because I used it more than a year ago.
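
One rough way to decide between the two approaches is to try direct extraction first and fall back to OCR only when almost nothing comes back. The snippet below is just a sketch of that check: the class name PdfTextCheck, the method LooksLikeRealText, and the threshold of 20 characters are all made up for illustration and are not part of any PDF library.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of any PDF library: decides whether the
// string returned by a direct text-extraction call looks usable, or whether
// the page is probably a scanned image that would need OCR instead.
public static class PdfTextCheck
{
    // minLetterOrDigitCount is an assumed threshold; tune it for your documents.
    public static bool LooksLikeRealText(string extracted, int minLetterOrDigitCount = 20)
    {
        if (string.IsNullOrWhiteSpace(extracted))
        {
            return false; // nothing came back: likely an image-only page
        }

        // Count only letters and digits so that stray whitespace or
        // punctuation does not count as content.
        int count = extracted.Count(c => char.IsLetterOrDigit(c));
        return count >= minLetterOrDigitCount;
    }
}
```

You would pass in whatever string your PDF library's extraction call returns for a page, and only send the page to OCR when the method returns false.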


Posted by: Thackeray on: 3/4/2014 [Member] Starter | Points: 25

Usually, we use an OCR scanning tool to extract and export images or text from PDF documents as well as other document files, such as TIFF, Word, Excel, or PowerPoint. A mature OCR reader supports both full-page and zonal analysis via an OCR control and is capable of detecting words, fonts, line size, and location.
http://www.rasteredge.com/how-to/vb-net-imaging/ocr-sdk/



Posted by: Thomas128 on: 3/12/2014 [Member] Starter | Points: 25

To extract the text from a PDF document, you're going to need OCR technology. Here is an OCR guide which might help:
http://www.yiigo.com/guides/vbnet/how-to-ocr.shtml


Today is a gift. That's why it's called the Present!

