Parsing PDF files with Python and PDFMiner

This time I needed a time series of cash biodiesel prices in Iowa (yes, I did…).

Weekly biodiesel prices in Iowa are available in the National Weekly Ag Energy Round-Up report released every Friday by the Agricultural Marketing Service (AMS/USDA). This report is published in PDF format, whereas we recently parsed HTML with Python and BeautifulSoup.

Below is a Python snippet using the PDFMiner library. It works through a number of AMS's online PDF files and extracts the desired data.

One can't say it is robust, but it does the job quite well. We wrote a basic error-handling mechanism so that the program carries on whatever happens. Indeed, there are Fridays when the weekly report was not released, e.g. due to banking holidays, and some reports of the US agriculture agency are missing because of the October 2013 partial US Government shutdown. Any other error is likewise ignored, and NAs are output instead.
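
The bulletins follow a predictable naming scheme, so each Friday's URL can be derived from its date; for instance, the report for Fri, Mar 07, 2014 is expected at http://search.ams.usda.gov/mndms/2014/03/LS20140307WAGENERGY.PDF (see the URL construction inside the loop below).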

#############################################################
# Edouard TALLENT @ TaGoMa.Tech, March 2014                 #
# Parse a series of online PDF documents to build           #
# historical time series                                    #
# QuantCorner @ https://quantcorner.wordpress.com           #
#############################################################

# Some sources
# http://www.unixuser.org/~euske/python/pdfminer/programming.html
# http://denis.papathanasiou.org/2010/08/04/extracting-text-images-from-pdf-files/
# http://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text

# Required headers
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter  # , XMLConverter, HTMLConverter
from StringIO import StringIO
import urllib2
from urllib2 import Request
import datetime
import re

# Define a PDF parser function
def parsePDF(url):

    # Open the url provided as an argument to the function and read the content
    open = urllib2.urlopen(Request(url)).read()

    # Cast to a StringIO object (StringIO is imported above)
    memory_file = StringIO(open)

    # Create a PDF parser object associated with the StringIO object
    parser = PDFParser(memory_file)

    # Create a PDF document object that stores the document structure
    document = PDFDocument(parser)

    # Define parameters for the PDF device object
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    codec = 'utf-8'

    # Create a PDF device object
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Process each page contained in the document, then collect the text
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
    data = retstr.getvalue()

    # Get values for Iowa B100 prices; the low-high range follows a
    # '---' separator line in the extracted text
    reg = r'(?<=\n---\n\n)\d\.\d{2}-\d\.\d{2}'
    matches = re.findall(reg, data)     # Our data are contained in matches[0]

    # Extract the low and high values from the previous match
    low = re.search(r'\d\.\d{2}(?=-)', matches[0])
    high = re.search(r'(?<=-)\d\.\d{2}', matches[0])

    # Cast string variables to float type
    low_val = float(low.group(0))
    high_val = float(high.group(0))

    # Calculate the average of the low-high range
    ave = (high_val + low_val) / 2

    # Search for the date of the report
    reg = r'\w{3},\s\w{3}\s\d{2},\s\d{4}'
    match = re.search(reg, data)        # Result is contained in match.group(0)
    dat = match.group(0)

    # The date string could also be cast to a Python date object with
    # datetime.datetime.strptime(dat, '%a, %b %d, %Y')

    # http://stackoverflow.com/questions/9752958/how-can-i-return-two-values-from-a-function-in-python
    return (dat, ave)

# The date of the latest weekly price bulletin
start_date = raw_input("Enter the date of the latest weekly price bulletin (dd/mm/yyyy): ")

# Convert start_date string to Python date format
dat = datetime.datetime.strptime(start_date, '%d/%m/%Y')

# Time series length
back_weeks = raw_input("How many weeks back in time: ")

# A bit of order onto the screen
print '\n'
print 'Date as read in PDF' + '\t' + 'Formatted date' + '\t' + 'Value'

# Loop through the dates
for weeks in xrange(0, int(back_weeks)):

    # Basic exception handling mechanism
    try:
        wk = datetime.timedelta(weeks=weeks)
        date_back = dat - wk

        # Construct the url
        url = 'http://search.ams.usda.gov/mndms/' + str(date_back.year) + \
              '/' + str(date_back.month).zfill(2) + '/LS' + str(date_back.year) + \
              str(date_back.month).zfill(2) + str(date_back.day).zfill(2) + \
              'WAGENERGY.PDF'

        # Call to function
        fun = parsePDF(url)

        # Information we are after
        res = str(fun[0]) + '\t' + str(date_back.day).zfill(2) + '/' + \
              str(date_back.month).zfill(2) + '/' + str(date_back.year) + \
              '\t' + str(fun[1])

    # Build an NA line when the report is missing or cannot be parsed
    except Exception:
        res = 'NA\t\t\t' + str(date_back.day).zfill(2) + '/' + \
              str(date_back.month).zfill(2) + '/' + str(date_back.year) + \
              '\t' + 'NA'

    # Output onto the screen
    print res

'''
Result example:

>>> 

Date as read in PDF Formatted date  Value
Fri, Mar 07, 2014   07/03/2014  3.67
Fri, Feb 28, 2014   28/02/2014  3.57
Fri, Feb 21, 2014   21/02/2014  3.475
Fri, Feb 14, 2014   14/02/2014  3.225
Fri, Feb 07, 2014   07/02/2014  3.175
Fri, Jan 31, 2014   31/01/2014  3.15
Fri, Jan 24, 2014   24/01/2014  3.295
Fri, Jan 17, 2014   17/01/2014  3.24
Fri, Jan 10, 2014   10/01/2014  3.2
Fri, Jan 03, 2014   03/01/2014  3.825
Fri, Dec 27, 2013   27/12/2013  3.845
Fri, Dec 20, 2013   20/12/2013  3.855
Fri, Dec 13, 2013   13/12/2013  3.885
Fri, Dec 06, 2013   06/12/2013  4.025
Fri, Nov 29, 2013   29/11/2013  4.05
Fri, Nov 22, 2013   22/11/2013  4.025
Fri, Nov 15, 2013   15/11/2013  4.075
Fri, Nov 08, 2013   08/11/2013  4.325
Fri, Nov 01, 2013   01/11/2013  4.365
Fri, Oct 25, 2013   25/10/2013  4.54

'''
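
Note that the snippet above is Python 2 (urllib2, StringIO, raw_input, print statements). On Python 3, where those no longer exist, the download-and-parse step would look roughly like the sketch below; it assumes the pdfminer.six fork, which keeps the same module layout, and is an untested adaptation rather than a drop-in replacement.

# A minimal Python 3 sketch of the download-and-parse step,
# assuming the pdfminer.six fork of PDFMiner
from io import BytesIO, StringIO
from urllib.request import urlopen

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter

def pdf_to_text(url):
    # PDF bytes go into a BytesIO buffer, extracted text into a StringIO
    memory_file = BytesIO(urlopen(url).read())
    parser = PDFParser(memory_file)
    document = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
    return retstr.getvalue()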

7 thoughts on “Parsing PDF files with Python and PDFMiner”

  1. Kieren Tinning

    Great sample, do you know how to extract certain elements though? And to know which elements to extract? I.e., if I have a PDF that has a form and I need one element out of the form. I know which element it is, but how does one get PDFMiner to extract it?

    Thanks for the help

    1. édouard Post author

      Hello Kieren,

      Thanks for stopping here.

      I guess that the ‘best’ strategy depends on that very PDF document of yours.

      It may be more efficient (easier) to get everything out of the PDF document and then extract the relevant pieces.
      Alternatively, one may extract only the information required. Once again, that depends on the structure (‘complexity’) of the document.

      Edouard.
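
      A minimal sketch of the first strategy (dump everything, then filter), assuming a hypothetical full_text() helper that returns the whole document as a string, built like the body of parsePDF above, and a known 'Total due:' label next to the wanted field:

      # Extract everything, then pull the one known field out with a regex.
      # full_text() is a hypothetical helper returning the document's text,
      # e.g. the body of parsePDF above up to retstr.getvalue()
      import re

      text = full_text('my_form.pdf')
      match = re.search(r'Total due:\s*([\d,.]+)', text)
      if match:
          print(match.group(1))    # the single wanted element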

  2. Pantelis Karamolegkos

    May I ask why you are using the keyword ‘open’ as the variable’s name where the result of the urlopen.read() is stored?

    open = urllib2.urlopen(Request(url)).read()

    Wouldn’t just any other valid python variable name do?
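
    Indeed, any valid name would do; 'open' merely shadows Python's built-in open() for the rest of the function. A minimal sketch with a neutral name instead:

    # Same call, without shadowing the built-in open()
    content = urllib2.urlopen(Request(url)).read()
    memory_file = StringIO(content)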

  3. Pingback: Bookmarks for September 10th | Chris's Digital Detritus
