Tuesday, April 22, 2014

Copy tables from PDF for further processing

Have you ever tried to extract table data from a PDF to process it further in MS Excel or another spreadsheet application? That can be a painful job.

You might use one of the many PDF-to-Excel conversion tools, but most of them cannot be used without submitting your PDF to an online service or buying a commercial license. In addition, you have to evaluate the quality of the results, especially for table data.

A few days ago I stumbled upon the tool Tabula. It is still marked as experimental, but the results are already pretty useful, at least for data-oriented tables.

Example

The PDF "The Mobile Economy 2013" contains a table on page 56 which you want to process in Excel:

Follow these steps to extract the table as CSV:

  • Download the PDF to your local disk.
  • Install and start the tool, following the instructions on the Tabula homepage.
  • Upload the PDF and select "Submit".
  • Navigate to the page with the table and select the table:
  • Choose "Repeat this selection" if you also want to select the tables on the following pages using the same coordinates.
  • Choose "Download all data" and you get:
  • Choose "Download data" to get a CSV file with the extracted tables. This file can be opened with MS Excel or any other application that reads the CSV format for further processing.
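
If you would rather process the downloaded CSV in a script than in a spreadsheet, here is a minimal Python sketch of how that could look. It is not part of Tabula. The function names `to_number` and `parse_table` are made up for this illustration, the sample data is invented, and the cleanup assumes that numbers use a comma as thousands separator and may end with a percent sign.

```python
import csv
import io


def to_number(cell):
    """Return the cell as a float if it looks numeric, otherwise the stripped text.

    Assumption: commas are thousands separators and a trailing % can be dropped.
    """
    cleaned = cell.strip().replace(",", "").rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        return cell.strip()


def parse_table(csv_text):
    """Parse CSV text into a list of rows, skipping rows that are entirely empty."""
    reader = csv.reader(io.StringIO(csv_text))
    return [
        [to_number(cell) for cell in row]
        for row in reader
        if any(cell.strip() for cell in row)
    ]


# Invented sample data, standing in for the content of the downloaded file.
sample = 'Region,Connections,Share\n"Europe","1,234",12.5%\n\n'
for row in parse_table(sample):
    print(row)
```

To use it on the real download, read the file as text (for example with `open(path, newline="", encoding="utf-8")`) and pass the content to `parse_table`. Check a few rows by hand first, since the extraction quality depends on the table.
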
The results are very useful, and it works for this kind of data as well:

Tabula won't work with data that is embedded in the PDF as a graphic, but that is a topic for another story.