Extracting data from PDFs remains, unfortunately, a common data wrangling task. This post reviews various tools and services for doing this, with a focus on free (and preferably open source) options.
The tools we can consider fall into three categories:
- Extracting text from PDF
- Extracting tables from PDF
- Extracting data (text or otherwise) from PDFs where the content is not text but is images (for example, scans)
The last case is really a job for OCR (optical character recognition), so we’re going to ignore it here. We may do a follow-up post on this.
The Paris Climate Agreement text was published as PDF. Some of the tools described here – plus the usual blood, sweat and tears – were used to turn it back into usable HTML for our Paris COP21 Climate Treaty Texts site.
A classic example of an important government report published as PDF only
Generic (PDF to text)
- PDFMiner - PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
  - Pure Python
  - In our trials PDFMiner has performed excellently and we rate it as one of the best tools out there.
- pdftohtml - pdftohtml is a utility which converts PDF files into HTML and XML formats. Based on xpdf. One of the better tools for tables, though we have found PDFMiner somewhat better for a while now. Command-line, Linux.
- pdftoxml - command line utility to convert PDF to XML built on poppler.
- docsplit - part of DocumentCloud. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages…)
- pypdf2xml - convert PDF to XML. Built on pdfminer. Started as an alternative to poppler’s pdftoxml, which didn’t properly decode CID Type2 fonts in PDFs.
- pdf2htmlEX - Convert PDF to HTML without losing text or format. C++. Fast. Primarily focused on producing HTML that exactly resembles the original PDF. Of limited use for straightforward text extraction, as it generates CSS-heavy HTML that replicates the exact look of a PDF document.
- pdf.js - you probably want a fork like pdf2json or node-pdfreader that integrates this better with Node. We have not tried this on tables though …
  - Max Ogden has this list of Node libraries and tools for working with PDFs: https://gist.github.com/maxogden/5842859
  - Here’s a gist showing how to use pdf2json: https://gist.github.com/rgrp/5944247
- Apache Tika - Java library for extracting metadata and content from many document types, including PDF.
- Apache PDFBox - Java library specifically for creating, manipulating and getting content from PDFs.
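To make the point about text locations more concrete, here is a minimal sketch of listing text blocks with their positions. It assumes the maintained `pdfminer.six` fork is installed (not necessarily the PDFMiner release we trialled) and uses its `extract_pages` helper and `LTTextContainer` layout class. The function name `text_boxes` is purely illustrative.

```python
def text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for each text block in a PDF.

    Assumption: pdfminer.six is installed. The imports are kept inside the
    function so that merely defining it does not require the library.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            # Only text-bearing layout objects are of interest here.
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in PDF points.
                yield page_number, element.bbox, element.get_text().strip()


if __name__ == "__main__":
    import sys

    # Usage: python text_boxes.py some-report.pdf
    for page_number, bbox, text in text_boxes(sys.argv[1]):
        print(page_number, bbox, repr(text))
```

Having the bounding boxes to hand is what makes it possible to rebuild headings, columns or tables afterwards, rather than working from one undifferentiated stream of text.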
Tables from PDF
- Tabula - open-source, designed specifically for tabular data. Now easy to install. Ruby-based.
- https://github.com/okfn/pdftables - open-source. Created by ScraperWiki, but the original is now closed-source and powers PDFTables, so here is a fork.
- pdftohtml - one of the better tools for tables, but we have not used it for a while
- https://github.com/liberit/scraptils/blob/master/scraptils/tools/pdf2csv.py - AGPLv3+, Python. scraptils has other useful tools as well; pdf2csv needs pdfminer==20110515
- Using ScraperWiki + pdftoxml - see this recent tutorial, “Get Started With Scraping – Extracting Simple Tables from PDF Documents”
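Several of the approaches above boil down to the same idea: convert the PDF to XML in which every text fragment carries its page coordinates, then group fragments that share a vertical position into rows. Below is a rough sketch of that grouping step. It assumes XML shaped like the output of `pdftohtml -xml` (`<page>` elements containing `<text top="…" left="…">` elements). The sample document, the function names and the 3-unit tolerance are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample in the assumed shape of `pdftohtml -xml` output.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="40" height="14" font="0">Name</text>
<text top="101" left="200" width="40" height="14" font="0">Value</text>
<text top="120" left="200" width="10" height="14" font="0">1</text>
<text top="120" left="50" width="10" height="14" font="0"><b>A</b></text>
</page>
</pdf2xml>"""


def rows_from_pdf2xml(xml_text, tolerance=3):
    """Group <text> elements into rows by their 'top' coordinate."""
    table = []
    for page in ET.fromstring(xml_text).iter("page"):
        cells = sorted(
            (int(el.get("top")), int(el.get("left")), "".join(el.itertext()).strip())
            for el in page.iter("text")
        )
        rows = []  # each entry: (top of first cell in the row, [(left, text), ...])
        for top, left, text in cells:
            if rows and abs(top - rows[-1][0]) <= tolerance:
                rows[-1][1].append((left, text))
            else:
                rows.append((top, [(left, text)]))
        # Within a row, order the cells from left to right.
        for _, row in rows:
            table.append([text for _, text in sorted(row)])
    return table


def to_csv(table):
    """Serialise the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(rows_from_pdf2xml(SAMPLE_XML)))
```

This only recovers rows. Real tables usually also need the `left` values clustered into columns, so that empty cells do not shift the remaining values sideways.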
Existing open services
- http://givemetext.okfnlabs.org/ - Give me Text is a free, easy-to-use, open source web service that extracts text from PDFs and other documents using Apache Tika (built by Labs member Matt Fullerton)
- http://pdfx.cs.man.ac.uk/ - has a nice command line interface
  - Is this open? The bottom of the usage page says it is powered by http://www.utopiadocs.com/
  - Note that as of 2016 this seems more focused on conversion to structured XML for scientific articles, but it may still be useful
- ScraperWiki - https://views.scraperwiki.com/run/pdf-to-html-preview-1/ and this tutorial - no longer working as of 2016
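Since Give me Text is built on Apache Tika, a similar kind of extraction can also be run against a Tika server of your own. The sketch below does not use Give me Text's own API, which is not described here. It assumes a Tika server listening locally on its default port 9998 and uses that server's `PUT /tika` endpoint with an `Accept: text/plain` header. The function names are illustrative.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, url=TIKA_URL):
    """Build (but do not send) a PUT request asking Tika for plain text."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def extract_text_via_tika(pdf_path, url=TIKA_URL):
    """Send a PDF to the Tika server and return the extracted text."""
    with open(pdf_path, "rb") as handle:
        request = build_tika_request(handle.read(), url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the server replies with UTF-8 encoded text.
        return response.read().decode("utf-8")


if __name__ == "__main__":
    import sys

    # Usage: python tika_text.py some-report.pdf
    print(extract_text_via_tika(sys.argv[1]))
```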
Existing proprietary free or paid-for services
There are many online – just do a search – so we do not propose a comprehensive list. Two that we have tried and that seem promising are:
- http://www.newocr.com/ - free, with an API; a very bare-bones site, but quite good results based on our limited testing
- https://pdftables.com/ - pay-per-page service focused on tabular data extraction from the folks at ScraperWiki
We also note that Google App Engine used to do this, but unfortunately it seems to have been discontinued.