Friday, 8 March 2013

How to load data from PDF file in SSIS package using script component

I have come across the case where I have to read PDF document, parse it and write content to database using SSIS. PDF document is not usual source for any data so there is no component available to read PDF file from SSIS.  I have used iTextSharp library in Script Component to read the PDF file from SSIS package.  I also found very good link where I found good example on how to use iTextsharp library.
Using iTextsharp library it is very easy to process PDF document. In my case it was simple case of reading PDF document page by page and parses the content using custom logic. You can do much more than just reading PDF document using this library.

To read data from PDF file and write to database, I have used Script Component as data source, which reads the PDF document and return result and output record of Script Component. The simply used OLE DB destination to write it to database. You might be aware that any third party dlls can be used in SSIS Script component by referencing it. But to add it as reference and use it in script component you have to register dll in GAC. It can be referenced without adding it but dll might not be available on the production environment at the same location. So better add them to GAC. In my case I was not allowed to add dll to production server directly, so I used .Net reflection to load the dll dynamically and added dll location as configuration in SSIS package.

Below is my code

//Below two variable are used to pass ITextSharp dll location and PDF file path.
//Which ultimately supplied from dts config file.
String configITextPath = Variables.itextdllfilepath;
String configSourceDataFile = Variables.filepath;

//Load ITextSharp library           
Assembly assembly = Assembly.LoadFrom(configITextPath);

//Get PdfReader object and create its instance
Type readerType = assembly.GetType("iTextSharp.text.pdf.PdfReader");
object reader = Activator.CreateInstance(readerType, new object[] { configSourceDataFile });

Type extractorType = assembly.GetType("iTextSharp.text.pdf.parser.PdfTextExtractor");

Type extractionStrategyType = assembly.GetType("iTextSharp.text.pdf.parser.LocationTextExtractionStrategy");

object extractionStrategy = Activator.CreateInstance(extractionStrategyType);

//Get number of pages in PDF document
Int32 pages = (Int32)reader.GetType().GetProperty("NumberOfPages").GetValue(reader, null);

//Read PDF document page by page and process it and generate Output rows for script component.
String pageText;
List<PepRecord> pepRecords = new List<PepRecord>();
for (int page = configStartPageNumber; page <= pages; page++)
    reader = Activator.CreateInstance(readerType, new object[] { configSourceDataFile });
    extractionStrategy = Activator.CreateInstance(extractionStrategyType);
    pageText = (String)extractorType.GetMethods()[0].Invoke(reader, new Object[] { reader, page, extractionStrategy });

   //code to process PDF file content and create Output rows for script component   
I know better way to do this is to add dll in GAC and refer from there. But in my case I have to quickly found the solution to process few PDF document suing SSIS package, so I have used reflection to load it.