Scraping data from the internet with Python and BeautifulSoup

I needed a daily time series of crude soybean oil prices in Central Illinois (yes, I did …). After a bit of searching, I found that the data I needed were available on the Iowa Farm Bureau website.

More specifically, no ready-made time series is available, but the data can be reached by playing a bit with the URLs. A case for Python and BeautifulSoup!
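Concretely, each trading day is served by the same PHP script with an incrementing id parameter, so the history can be enumerated by counting the id down. A minimal sketch of the pattern (BASE and page_urls are my own names, not from the site):

```python
# Hypothetical helper illustrating the URL pattern used by the site;
# BASE and page_urls are illustrative names, not part of the original post.
BASE = 'http://markets.iowafarmbureau.com/pages/usdacash.php?id=%d'

def page_urls(newest_id, n):
    """Yield the URLs of the n most recent daily pages, newest first."""
    for page_id in range(newest_id, newest_id - n, -1):
        yield BASE % page_id
```

The full script below does exactly this inline, inside its main loop.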

The code snippet provided below is straightforward and can easily be modified to suit specific needs. There is no optimization or exception handling; it just does the job.

[Screenshot: Iowa Farm Bureau market page]


#####################################################
# Edouard TALLENT @TaGoMa . Tech, March, 2014       #
# Scraping Central Illinois crude soyoil prices     #
# from http://markets.iowafarmbureau.com/           #
# QuantCorner @ https://quantcorner.wordpress.com   #
#####################################################

# Required headers
import urllib2                          # Read webpages
from bs4 import BeautifulSoup           # bs4 functions
import time                             # Time, time elapsed
import re                               # Regex, removing characters

# Arrays that will contain the desired data
#d = []      # Dates
#l = []      # Lows
#h = []      # Highs

# URLs
start_urls = 4539   # Most recent webpage to start parsing
nb_quotes = 200     # Number of quotes desired

for urls in range(start_urls, start_urls - nb_quotes, -1):
    # Start time
    start_time = time.time()

    # Construct the URL string
    url = 'http://markets.iowafarmbureau.com/pages/usdacash.php?id=' + str(urls)

    # Read the HTML page content
    page = urllib2.urlopen(url)

    # Create a beautifulsoup object
    soup = BeautifulSoup(page)

    # Search the table to be parsed in the whole HTML code
    tables = soup.findAll('table')
    tab = tables[1]                 # This is the table to be parsed

    # Search the date
    # <option value='4539'>Mar 03, 2014</option>
    date = str(soup.find('option', {'value' : str(urls)}).string)

    # Pick up the content of the desired cells in tab
    # http://www.briancarpio.com/2012/12/02/website-scraping-with-python-and-beautiful-soup/
    '''
    <td>Crude Soybean Oil</td>
    <td>Processor</td>
    <td>+40.01</td>
    <td>+40.36</td>
    <td>    
    '''
    low_tmp = str(tab.findAll('tr')[8].findAll('td')[2].string)     # Low price
    low = re.sub('[+]', '', low_tmp)                                # Remove the '+' sign
    high_tmp = str(tab.findAll('tr')[8].findAll('td')[3].string)    # High price
    high = re.sub('[+]', '', high_tmp)                              # Remove the '+' sign

    # Stop time
    stop_time = time.time()

    # Print out to the screen
    print date, '\t', low , '\t', high, '(%0.1f s)' % (stop_time - start_time)

    ## Store values parsed in arrays for later use
    #d.append(date)
    #l.append(low)
    #h.append(high)
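The script above targets Python 2 (urllib2, the print statement). For readers on Python 3, a minimal equivalent is sketched below, with the parsing factored into its own function so it can be exercised on a saved page without hitting the site. parse_quote and fetch_quote are my own names, and the table/row/cell indices simply mirror those used above.

```python
import re
import urllib.request
from bs4 import BeautifulSoup

def parse_quote(html, page_id):
    """Extract (date, low, high) from one usdacash.php page, per the
    layout described above (second table, ninth row, cells 2 and 3)."""
    soup = BeautifulSoup(html, 'html.parser')
    # <option value='4539'>Mar 03, 2014</option>
    date = str(soup.find('option', {'value': str(page_id)}).string)
    cells = soup.find_all('table')[1].find_all('tr')[8].find_all('td')
    low = re.sub(r'[+]', '', str(cells[2].string))    # strip the '+' sign
    high = re.sub(r'[+]', '', str(cells[3].string))
    return date, low, high

def fetch_quote(page_id):
    """Download one page and parse it; no retries, as in the original."""
    url = 'http://markets.iowafarmbureau.com/pages/usdacash.php?id=%d' % page_id
    with urllib.request.urlopen(url) as page:
        return parse_quote(page.read(), page_id)
```

As in the original, there is no exception handling; wrapping fetch_quote in a try/except would be the obvious hardening step.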
