How to Read Content from this file [Resolved]

Posted by Jayakumars under .NET Framework on 1/22/2015 | Points: 10 | Views : 1693 | Status : [Member] [MVP] | Replies : 3
Hi,
I have a PDF file. How can I extract its content as HTML, with the formatting converted to tags?

For example, if the word "Asp.net" is bold in the PDF, I need the output to contain <b>Asp.net</b>.




Responses

Posted by: Solomanrakesh on: 7/7/2015 [Member] Starter | Points: 50


Resolved
One way to do this is to extract the text and images from the PDF and insert them into an HTML file yourself, adding the appropriate tags.
The following thread discusses options for extracting text and image objects from a PDF:
http://www.c-sharpcorner.com/Forums/Thread/304574/pdf-how-to-extract-text-and-images.aspx
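
To illustrate the "proper tags" step, here is a minimal sketch that assumes the extraction stage has already produced a list of text runs, each with the name of the font it was drawn in. PDF text carries a font rather than a bold attribute, so the sketch guesses boldness from the font name, which is a heuristic and will miss fonts that are bold without saying so in their name. TextRun, RunsToHtml, LooksBold and Convert are hypothetical names made up for this example and are not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical helper type for illustration: one piece of text plus its font name.
public class TextRun
{
    public string Text;
    public string FontName;

    public TextRun(string text, string fontName)
    {
        Text = text;
        FontName = fontName;
    }
}

public static class RunsToHtml
{
    // Heuristic (an assumption, not a PDF rule): treat a run as bold when
    // its font name contains "Bold", e.g. "ABCDEF+Helvetica-Bold".
    public static bool LooksBold(string fontName)
    {
        return fontName != null &&
               fontName.IndexOf("Bold", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Concatenates the runs, opening <b> when a bold stretch starts
    // and closing it when the stretch ends.
    public static string Convert(IEnumerable<TextRun> runs)
    {
        StringBuilder html = new StringBuilder();
        bool inBold = false;

        foreach (TextRun run in runs)
        {
            bool bold = LooksBold(run.FontName);
            if (bold && !inBold) html.Append("<b>");
            if (!bold && inBold) html.Append("</b>");
            inBold = bold;

            // Escape <, >, & and quotes so the PDF text cannot break the markup.
            html.Append(WebUtility.HtmlEncode(run.Text));
        }

        if (inBold) html.Append("</b>");
        return html.ToString();
    }
}
```

With the three runs "This word " (Helvetica), "Asp.net" (ABCDEF+Helvetica-Bold) and " in pdf file" (Helvetica), Convert returns: This word <b>Asp.net</b> in pdf file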

Hope this helps



Posted by: Nismeh on: 6/4/2015 [Member] Starter | Points: 25

You can try the code below:

using System;
using System.IO;
using iTextSharp.text.pdf;
using System.Text.RegularExpressions;

namespace PDFReader
{
    /// <summary>
    /// Parses a PDF file and extracts the text from it.
    /// </summary>
    public class PDFParser
    {
        // PDF content stream operators relevant here:
        //   BT    = beginning of a text object
        //   ET    = end of a text object
        //   Td    = move to the start of the next line
        //   5 Ts  = superscript
        //   -5 Ts = subscript

        #region Fields

        #region _numberOfCharsToKeep
        /// <summary>
        /// The number of characters to keep, when extracting text.
        /// </summary>
        private static int _numberOfCharsToKeep = 15;
        #endregion

        #endregion

        #region ExtractText
        /// <summary>
        /// Extracts the text from a PDF file and writes it to a UTF-8 text file.
        /// </summary>
        /// <param name="inFileName">the full path to the pdf file.</param>
        /// <param name="outFileName">the output file name.</param>
        /// <returns>true if the text was extracted and written; false on any error</returns>
        public bool ExtractText(string inFileName, string outFileName)
        {
            PdfReader reader = null;
            StreamWriter outFile = null;
            try
            {
                // Create a reader for the given PDF file
                reader = new PdfReader(inFileName);
                outFile = new StreamWriter(outFileName, false, System.Text.Encoding.UTF8);

                Console.Write("Processing: ");

                // Progress bar: totalLen '#' characters spread across all pages
                int totalLen = 68;
                float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
                int totalWritten = 0;
                float curUnit = 0;

                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    outFile.Write(ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ");

                    // Write the progress.
                    if (charUnit >= 1.0f)
                    {
                        for (int i = 0; i < (int)charUnit; i++)
                        {
                            Console.Write("#");
                            totalWritten++;
                        }
                    }
                    else
                    {
                        curUnit += charUnit;
                        if (curUnit >= 1.0f)
                        {
                            for (int i = 0; i < (int)curUnit; i++)
                            {
                                Console.Write("#");
                                totalWritten++;
                            }
                            curUnit = 0;
                        }
                    }
                }

                // Fill the rest of the bar once every page has been processed
                if (totalWritten < totalLen)
                {
                    for (int i = 0; i < (totalLen - totalWritten); i++)
                    {
                        Console.Write("#");
                    }
                }
                return true;
            }
            catch
            {
                return false;
            }
            finally
            {
                if (outFile != null) outFile.Close();
                if (reader != null) reader.Close();
            }
        }
        #endregion

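        #region ExtractTextFromPDFBytes
        /// <summary>
        /// Processes an uncompressed PDF content stream (for example the bytes
        /// returned by PdfReader.GetPageContent) and extracts the text in it.
        /// </summary>
        /// <param name="input">the uncompressed content stream bytes</param>
        /// <returns>the extracted text, or an empty string on failure</returns>
        public string ExtractTextFromPDFBytes(byte[] input)
        {
            if (input == null || input.Length == 0) return "";

            try
            {
                System.Text.StringBuilder result = new System.Text.StringBuilder();

                // Flag showing whether we are currently inside a text object
                bool inTextObject = false;

                // Flag showing whether the next character is literal,
                // e.g. '\\' to get a '\' character or '\(' to get '('
                bool nextLiteral = false;

                // () bracket nesting level. Text appears inside ()
                int bracketDepth = 0;

                // Keep the most recent characters so that operators can be recognized
                char[] previousCharacters = new char[_numberOfCharsToKeep];
                for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';

                for (int i = 0; i < input.Length; i++)
                {
                    char c = (char)input[i];
                    // Byte 213 is mapped to an apostrophe
                    if (input[i] == 213)
                        c = '\'';

                    if (inTextObject)
                    {
                        // Position the text
                        if (bracketDepth == 0)
                        {
                            if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
                            {
                                result.Append("\r\n");
                            }
                            else if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
                            {
                                result.Append("\n");
                            }
                            else if (CheckToken(new string[] { "Tj" }, previousCharacters))
                            {
                                result.Append(" ");
                            }
                        }

                        // End of a text object: leave it and add a separating space
                        if (bracketDepth == 0 &&
                            CheckToken(new string[] { "ET" }, previousCharacters))
                        {
                            inTextObject = false;
                            result.Append(" ");
                        }
                        // Start outputting text
                        else if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                        {
                            bracketDepth = 1;
                        }
                        // Stop outputting text
                        else if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                        {
                            // The posted listing is cut off at this point. Everything from
                            // here to the end of the file is a reconstruction, not the
                            // poster's code: it assumes the method goes on to copy the
                            // characters inside (...) and to track recent characters.
                            bracketDepth = 0;
                        }
                        else if (bracketDepth == 1)
                        {
                            if (c == '\\' && !nextLiteral)
                            {
                                // Backslash escapes the next character
                                nextLiteral = true;
                            }
                            else
                            {
                                // Keep printable characters only
                                if ((c >= ' ' && c <= '~') || (c >= 128 && c < 255))
                                {
                                    result.Append(c);
                                }
                                nextLiteral = false;
                            }
                        }
                    }

                    // Shift the buffer left and store the current character
                    for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
                    {
                        previousCharacters[j] = previousCharacters[j + 1];
                    }
                    previousCharacters[_numberOfCharsToKeep - 1] = c;

                    // Start of a text object
                    if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
                    {
                        inTextObject = true;
                    }
                }
                return result.ToString();
            }
            catch
            {
                return "";
            }
        }
        #endregion

        #region CheckToken
        // CheckToken is called above but not shown in the post. This version is an
        // assumption: a token matches when it sits at the end of the buffer, followed
        // by one whitespace character and preceded by one.
        private bool CheckToken(string[] tokens, char[] recent)
        {
            int last = recent.Length - 1;
            if (!IsDelimiter(recent[last])) return false;

            foreach (string token in tokens)
            {
                int start = last - token.Length; // index of the token's first character
                if (start < 1 || !IsDelimiter(recent[start - 1])) continue;

                bool match = true;
                for (int k = 0; k < token.Length; k++)
                {
                    if (recent[start + k] != token[k]) { match = false; break; }
                }
                if (match) return true;
            }
            return false;
        }

        private static bool IsDelimiter(char ch)
        {
            return ch == ' ' || ch == '\r' || ch == '\n';
        }
        #endregion
    }
}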
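
The parser above scans raw content stream bytes, so it returns plain text only and knows nothing about fonts. Since the question is about detecting bold text, a shorter route may be iTextSharp's own extraction classes. The sketch below assumes iTextSharp 5.x (the iTextSharp.text.pdf.parser namespace); FontNameStrategy and FontNameExtractor are hypothetical names made up for this example. It records each text chunk together with the PostScript name of its font, and a name containing "Bold" is usually, though not always, a sign of bold text.

```csharp
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical strategy: collects every text chunk with the name of its font.
// Key = the chunk's text, Value = the PostScript font name.
public class FontNameStrategy : ITextExtractionStrategy
{
    public readonly List<KeyValuePair<string, string>> Chunks =
        new List<KeyValuePair<string, string>>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    // Called by iTextSharp for each piece of text it renders on the page.
    public void RenderText(TextRenderInfo renderInfo)
    {
        string fontName = renderInfo.GetFont().PostscriptFontName;
        Chunks.Add(new KeyValuePair<string, string>(renderInfo.GetText(), fontName));
    }

    // Plain text of everything collected so far, in rendering order.
    public string GetResultantText()
    {
        StringBuilder text = new StringBuilder();
        foreach (KeyValuePair<string, string> chunk in Chunks)
        {
            text.Append(chunk.Key);
        }
        return text.ToString();
    }
}

public static class FontNameExtractor
{
    // Returns the (text, font name) chunks of one page; pages are numbered from 1.
    public static List<KeyValuePair<string, string>> ReadPage(string path, int page)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            FontNameStrategy strategy = new FontNameStrategy();
            PdfTextExtractor.GetTextFromPage(reader, page, strategy);
            return strategy.Chunks;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Chunks arrive in the order the PDF draws them, which is not guaranteed to be reading order, so real documents may need sorting by position before the text is turned into HTML.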



Posted by: Haves66 on: 12/1/2015 [Member] Starter | Points: 25

This sample from the MSDN Code Gallery might help: https://code.msdn.microsoft.com/Extracting-text-and-image-d47ac957

Alternatively, you could use a third-party component for reading and converting documents, such as this one: http://www.gemboxsoftware.com/document/overview
