In today's digital age, Portable Document Format (PDF) files have become a standard for document exchange due to their platform-independent nature and consistent formatting. For developers working with C#, iTextSharp is a powerful library for interacting with PDFs. In this article, we will walk through reading PDF files using iTextSharp in C#, covering the essential steps along the way.
iText 7, formerly known as iTextSharp, is a powerful and versatile Java and .NET library for creating, manipulating, and extracting content from PDF documents. It provides a comprehensive set of features, including text and image handling, form filling, digital signatures, and watermarking. Whether you’re generating invoices, reports, or interactive forms, iText 7 empowers developers to work with PDFs efficiently.
Let's look at some examples of reading PDF files in C#. To get started, you'll need to add the iTextSharp library to your project.
Open your C# project in Visual Studio. In the top menu, go to "Tools", then "NuGet Package Manager", and select "Package Manager Console." This will open the Package Manager Console at the bottom of the Visual Studio window.
In the Package Manager Console, ensure that the "Default project" dropdown is set to the project where you want to install the iTextSharp package.
Run the following command to install iText 7, the current version of the iTextSharp library:
Install-Package itext7
This command fetches the latest version of the itext7 package from the NuGet package repository and installs it in your project. Wait for the installation process to complete. The Package Manager Console will display information about the installation progress.
I will use the following PDF document as input for this example.
Before you begin, add the following namespaces:
using System;
using System.IO;
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
Imports iText.Kernel.Pdf
Imports iText.Kernel.Pdf.Canvas.Parser
Imports System.Text
The following code will read the above PDF file, extract the content, and print the extracted content to the console.
public static void Main(string[] args)
{
    // Accumulates the text extracted from every page
    StringBuilder text = new StringBuilder();
    string fileName = @"D:/What_is_pdf.pdf";
    if (File.Exists(fileName))
    {
        // The reader and the document are disposed when their using blocks end
        using (PdfReader pdfReader = new PdfReader(fileName))
        {
            using (PdfDocument pdfDocument = new PdfDocument(pdfReader))
            {
                // iText page numbers start at 1
                for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
                {
                    string currentText = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(page));
                    // Round-trips the string through Encoding.Default and back via UTF-8
                    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                    text.Append(currentText);
                }
            }
        }
    }
    Console.WriteLine(text.ToString());
}
Public Shared Sub Main(ByVal args() As String)
    Dim text As New StringBuilder()
    Dim fileName As String = "D:/What_is_pdf.pdf"
    If File.Exists(fileName) Then
        Using pdfReader As New PdfReader(fileName)
            Using pdfDocument As New PdfDocument(pdfReader)
                Dim page As Integer = 1
                Do While page <= pdfDocument.GetNumberOfPages()
                    Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(page))
                    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)))
                    text.Append(currentText)
                    page += 1
                Loop
            End Using
        End Using
    End If
    Console.WriteLine(text.ToString())
End Sub
The above source code reads a PDF file, extracts the text from each page, converts it to UTF-8, and then prints the entire text content to the console. It’s a basic example of how to extract text from a PDF file using the iTextSharp library in C#.
The code starts by declaring a StringBuilder named text to accumulate the extracted text from the PDF. It also defines a string variable fileName with the path of the document location. In this case, the PDF file is located at "D:/What_is_pdf.pdf".
The if (File.Exists(fileName)) condition checks whether the specified file exists. If the file exists, the subsequent code block is executed.
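One consequence of the check above is that a missing file produces no output and no error. A minimal sketch of an alternative, assuming you would rather fail loudly: the helper name RequireExistingFile is hypothetical (it is not part of iText or IronPDF) and uses only System.IO.

```csharp
using System;
using System.IO;

public static class PdfPathCheck
{
    // Hypothetical helper: returns the full path when the file exists,
    // and throws a descriptive exception when it does not.
    public static string RequireExistingFile(string fileName)
    {
        if (!File.Exists(fileName))
        {
            throw new FileNotFoundException("PDF file not found: " + fileName, fileName);
        }
        return Path.GetFullPath(fileName);
    }

    public static void Main()
    {
        try
        {
            // Same sample path as the article; adjust it for your machine
            string path = RequireExistingFile(@"D:/What_is_pdf.pdf");
            Console.WriteLine("Found: " + path);
        }
        catch (FileNotFoundException ex)
        {
            // Report the problem instead of printing an empty result
            Console.Error.WriteLine(ex.Message);
        }
    }
}
```

The returned path can then be passed to the PdfReader constructor in place of fileName.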
Inside the if block, it opens the PDF file using a PdfReader object. Then, it creates a PdfDocument instance from the PdfReader. The for loop iterates through each page of the PDF document, starting at page 1.
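For each page, it extracts the text content by calling PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(page)). The method returns an ordinary .NET string, which is already Unicode (UTF-16) in memory.
The code then passes the text through Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText))). On .NET Core and .NET 5+, where Encoding.Default is UTF-8, this round trip leaves well-formed text unchanged; on .NET Framework, where Encoding.Default is the system ANSI code page, characters outside that code page can be replaced with '?'. The step can therefore usually be omitted. The resulting text is then appended to the StringBuilder.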
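To see what the conversion step does to a string, here is a minimal sketch that uses only System.Text. The RoundTrip helper is hypothetical and takes the "default" encoding as a parameter, so the effect can be observed regardless of which runtime you are on; Encoding.ASCII stands in for a legacy code page that cannot represent every character.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Hypothetical helper that mirrors the conversion in the iTextSharp example,
    // with the "default" encoding supplied by the caller.
    public static string RoundTrip(string input, Encoding assumedDefault)
    {
        byte[] legacyBytes = assumedDefault.GetBytes(input);
        byte[] utf8Bytes = Encoding.Convert(assumedDefault, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // "café €5" written with escapes so the source file encoding does not matter
        string sample = "caf\u00E9 \u20AC5";

        // UTF-8 can represent every character, so the text comes back unchanged
        Console.WriteLine(RoundTrip(sample, Encoding.UTF8) == sample);  // True

        // ASCII cannot represent the accented letter or the euro sign, so they are lost
        Console.WriteLine(RoundTrip(sample, Encoding.ASCII) == sample); // False
    }
}
```

In other words, the round trip either changes nothing or loses characters, which is why appending the string returned by GetTextFromPage directly is usually enough.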
Finally, it prints the accumulated text using the Console.WriteLine() method.
The extracted PDF text output is as follows:
In this way, we can read the content of the PDF file. This approach works, but it needs several objects (a PdfReader, a PdfDocument, and an explicit page loop) and a fair amount of code. Let's explore an alternative that requires less code.
IronPDF is a versatile and efficient C# library designed to simplify and enhance the creation, manipulation, and rendering of PDF documents within .NET applications. IronPDF enables developers to seamlessly integrate PDF-related functionalities into their projects with a focus on ease of use and feature-rich capabilities. The library supports a wide range of PDF operations, including the creation of PDF documents from scratch, the conversion of HTML content to PDF, and the extraction of text and images from existing PDF files. IronPDF's intuitive API provides developers with a user-friendly experience, allowing them to generate dynamic and interactive PDFs effortlessly. Whether it's adding watermarks, annotations, or encrypting documents, IronPDF empowers developers to tailor PDFs to their specific requirements. As a reliable solution, IronPDF proves instrumental in applications ranging from report generation and document management to web development, offering a comprehensive set of tools to streamline PDF-related tasks in the .NET environment.
Download IronPDF into your project using the NuGet Package Manager Console with the following command.
Install-Package IronPdf
This command will download and install the IronPDF NuGet package, along with its dependencies, into your project.
Alternatively, open the NuGet Package Manager in Visual Studio, search for "IronPdf" in the Browse tab, and click Install.
Now, let's read the same PDF file using IronPDF. The following code will extract text from the input PDF document.
using System;
using IronPdf;
public static void Main(string[] args)
{
    // Load the PDF from disk
    var pdfDocument = PdfDocument.FromFile(@"D:/What_is_pdf.pdf");
    // Extract the text of every page as a single string
    string text = pdfDocument.ExtractAllText();
    Console.WriteLine(text);
}
Imports IronPdf
Public Shared Sub Main(ByVal args() As String)
    Dim pdfDocument = PdfDocument.FromFile("D:/What_is_pdf.pdf")
    Dim text As String = pdfDocument.ExtractAllText()
    Console.WriteLine(text)
End Sub
The above code reads a PDF file named “What_is_pdf.pdf,” extracts all the text content from it, and displays the extracted text in the console.
The code starts by loading a PDF document from a file named "What_is_pdf.pdf". It uses the PdfDocument.FromFile() method to create a PdfDocument object from the specified file.
Next, it extracts all the text content from the loaded PDF document. The pdfDocument.ExtractAllText() method returns the entire text from the PDF as a single string.
The extracted text is stored in the text variable. Finally, the code prints it to the console using the Console.WriteLine(text) method.
IronPDF also provides a way to extract text from a PDF file page by page.
The following code will read a PDF document, page by page using IronPDF.
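using System;
using System.Text;
using IronPdf;
public static void Main(string[] args)
{
    // Accumulates the text extracted from every page
    StringBuilder sb = new StringBuilder();
    // The using declaration disposes the document when Main returns
    using PdfDocument pdf = PdfDocument.FromFile(@"D:/What_is_pdf.pdf");
    // Page indexes start at 0 here
    for (int index = 0; index < pdf.PageCount; index++)
    {
        sb.Append(pdf.ExtractTextFromPage(index));
    }
    Console.WriteLine(sb.ToString());
}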
Imports System.Text
Imports IronPdf
Public Shared Sub Main(ByVal args() As String)
    Dim sb As New StringBuilder()
    Using pdf As PdfDocument = PdfDocument.FromFile("D:/What_is_pdf.pdf")
        For index As Integer = 0 To pdf.PageCount - 1
            sb.Append(pdf.ExtractTextFromPage(index))
        Next index
        Console.WriteLine(sb.ToString())
    End Using
End Sub
The above code reads a PDF file named “What_is_pdf.pdf,” extracts the text content from each page, and prints the combined text to the console.
A StringBuilder named sb is created to accumulate the extracted text from the PDF. A PdfDocument object named pdf is created by loading the file at "D:/What_is_pdf.pdf" with the PdfDocument.FromFile method, and the using statement ensures the document is disposed once it is no longer needed.
The for loop iterates through each page of the loaded PDF document. For each page (indexed by index), it extracts the text content using pdf.ExtractTextFromPage(index). The extracted text is appended to the StringBuilder using sb.Append().
Finally, the accumulated text is converted to a single string using sb.ToString(). The entire extracted text is printed to the console using Console.WriteLine() method.
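Because the loop above appends each page directly after the previous one, the last word of one page can run into the first word of the next. A minimal sketch of one way to keep pages apart: the JoinPages helper and its separator format are hypothetical, and the extractPage delegate stands in for a call such as pdf.ExtractTextFromPage(index), so the sketch runs without any PDF library.

```csharp
using System;
using System.Text;

public static class PageTextJoiner
{
    // Hypothetical helper: builds one string with a header line before each page.
    // extractPage receives a zero-based page index and returns that page's text.
    public static string JoinPages(int pageCount, Func<int, string> extractPage)
    {
        var sb = new StringBuilder();
        for (int index = 0; index < pageCount; index++)
        {
            sb.AppendLine($"--- Page {index + 1} ---");
            sb.AppendLine(extractPage(index));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Stand-in data; with IronPDF you would pass pdf.PageCount and
        // index => pdf.ExtractTextFromPage(index) instead.
        string[] fakePages = { "First page text", "Second page text" };
        Console.Write(JoinPages(fakePages.Length, i => fakePages[i]));
    }
}
```

Passing the extractor as a delegate keeps the joining logic independent of the PDF library, so the same helper could wrap the iText loop as well (with the page number shifted by one).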
In conclusion, both libraries can extract text from PDF files in C#. The first example, using iTextSharp, is functional but needs a reader, a document object, and an explicit page loop, while the IronPDF examples achieve the same result with a single method call or a short loop. In both cases the library handles low-level PDF structures such as the cross-reference table, the page dictionary, and the document information dictionary, so you do not have to work with them directly.
IronPDF is aimed at developers who want value and efficiency in their PDF-related tasks, making it a compelling choice for those in search of a reliable and feature-packed PDF library.
For more information on how to use IronPDF, please refer to this documentation link.