We will be using the following third-party DLLs to get our work done:

  1. PdfBox: This third-party NuGet package will be used to read a PDF file.
  2. DocX: This package will be used to write a Word document.

Step 1

The first step is to get the PdfBox package using the NuGet Package Manager. You can reach it as follows.

Right-click on the solution and select the “Manage NuGet Packages” option.

[Screenshot: the “Manage NuGet Packages” option]

Now, select the “Online” option from the left side menu and search for “PdfBox” in the right side panel. After searching, click the “Install” button next to “PdfBox”, as in the following:

[Screenshot: PdfBox in the NuGet search results]

Once it is installed, you will see some DLLs have been added to the project as in the following:

[Screenshot: references added to the project]

Step 2

Now, import the following namespaces into your .cs file:

    using org.apache.pdfbox.pdmodel;
    using org.apache.pdfbox.util;

Make sure you follow this step; otherwise, you will not be able to read the PDF document.

Step 3

The third step will be to install the DocX NuGet Package from the NuGet Package Manager:

[Screenshot: DocX in the NuGet search results]

Doing this will add some more DLLs to your solution.

Step 4

Let's read a PDF file and try to get the text from it.

We will use the PdfBox package to do so. The code for it is as in the following:

[Image: code listing that uses the PdfBox package]
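A minimal sketch of this step, assuming the PdfBox NuGet package exposes the PDFBox 1.x Java API through the two namespaces imported in Step 2. The class name PdfTextReader, the method name ReadPdfText and the parameter pdfPath are hypothetical names chosen for illustration, not taken from the original listing.

```csharp
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// Hypothetical helper class; the name is illustrative only.
public static class PdfTextReader
{
    // Loads the PDF at pdfPath and returns the text of all its pages.
    public static string ReadPdfText(string pdfPath)
    {
        PDDocument doc = null;
        try
        {
            // Open the PDF file from disk.
            doc = PDDocument.load(pdfPath);

            // PDFTextStripper walks the pages and extracts their text.
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            // Release the file handle even if extraction fails.
            if (doc != null)
            {
                doc.close();
            }
        }
    }
}
```

A call such as `string text = PdfTextReader.ReadPdfText(path);` would then give you the string that Step 5 writes to a Word document. Text extraction only recovers the characters; layout, fonts and images are not carried over.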

Step 5

The next part of the code will be to read this string and write it to a Word document.

You need to import the following two namespaces in the .cs file:

    using Novacode;
    using System.Diagnostics;

The Novacode namespace lets you use the DocX package included in the solution.

The System.Diagnostics namespace lets us open the new Word document automatically. This is done by the code: “Process.Start("WINWORD.EXE", fn);”.

The code for it will be:

[Image: code listing that writes the text to a Word document]
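A minimal sketch of this step, assuming the DocX package exposes DocX.Create, InsertParagraph and Save under the Novacode namespace. The class name WordWriter, the method name WriteTextToWord and the parameter text are hypothetical; fn is the file name variable used in the Process.Start call above. The path is wrapped in quotes here, which is an addition to that call, so that file names containing spaces still open.

```csharp
using System.Diagnostics;
using Novacode;

// Hypothetical helper class; the name is illustrative only.
public static class WordWriter
{
    // Writes text into a new Word document at fn and opens it in Word.
    public static void WriteTextToWord(string text, string fn)
    {
        // Create the .docx file, add the extracted text, and save it.
        using (DocX document = DocX.Create(fn))
        {
            document.InsertParagraph(text);
            document.Save();
        }

        // Open the new document in Microsoft Word (Word must be installed).
        // The quotes keep a path with spaces as a single argument.
        Process.Start("WINWORD.EXE", "\"" + fn + "\"");
    }
}
```

The whole extracted string goes into a single paragraph here. If you need one Word paragraph per line of the PDF text, you could split the string on line breaks and call InsertParagraph once per piece.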

In this way, the PDF document is now available as a Word document file.

I hope this helped.

Thanks,
Vipul
