MPOB’s Malaysian crude palm oil production data: another scraping exercise with Python and BeautifulSoup

We needed up-to-date Malaysian crude palm oil (CPO) production data, which the Malaysian Palm Oil Board (MPOB) publishes on its website. All that remained was to write an ugly but reasonably reliable Python snippet; Python does a lot in only a few lines of code. We relied on the BeautifulSoup library, which we have already used here and here.

Anyone who needs an additional year of data (namely 2005) can easily modify the original code to fetch it from the MPOB’s website.
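Going back one more year only changes the start of the year range used to build the report URLs. Here is a minimal sketch of that URL construction, written for Python 3 rather than the Python 2 of the script below. `BASE_URL` and `report_url` are names made up for this illustration, and we have not checked that the site serves a 2005 page under the same pattern.

```python
# Hypothetical helper names; the URL pattern is the one used in the script below.
BASE_URL = 'http://bepi.mpob.gov.my/stat/web_report1.php?val='


def report_url(year):
    """Return the report URL for one year (the script appends '44' to the year)."""
    return BASE_URL + str(year) + '44'


oldest = 2005  # one year earlier than the script's 2006 (assumed to be available)
newest = 2014
urls = [report_url(year) for year in range(oldest, newest + 1)]

print(urls[0])
print(len(urls), 'pages to fetch')
```

In the script itself, the equivalent change would be setting `oldest = 2005`.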

The data are printed to the screen so that they can easily be copied and pasted into MS Excel (in a CSV-like format).
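For readers who would rather not glue the fields together by hand, the same comma-separated layout can probably be produced with the standard `csv` module. The sketch below is Python 3 and is not what the script does; it uses only the first three state columns and the first two rows of the sample output shown further down, and the variable names are ours.

```python
import csv
import io

# A cut-down header and two rows taken from the sample output of the script.
header = ['date', 'johor', 'kedah', 'kelantan']
rows = [
    ['Jan-2006', '145637', '10794', '12146'],
    ['Feb-2006', '193908', '16102', '14455'],
]

# Write to an in-memory buffer; sys.stdout or an open file would work the same way.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text, end='')
```

The writer takes care of quoting, which matters as soon as a field itself contains a comma.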

#####################################################
# Edouard TALLENT @TaGoMa.Tech                      #
# Scraping MPOB's data on CPO                       #
# Website http://bepi.mpob.gov.my/                  #
# QuantCorner @ https://quantcorner.wordpress.com    #
# April, 2014                                       #
#####################################################

# Required imports
import urllib2                          # Read webpages
from bs4 import BeautifulSoup           # bs4 functions
import re                               # Regex, for removing those bloody commas

# Arrays that will contain the desired data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul',  'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
data = []
dates = []

# URLs
oldest = 2006 # Data are available back to 2006
newest = 2014 # Most recent year for the data

for year in range(oldest, newest + 1):
    # Construct the URL string for that year
    url = 'http://bepi.mpob.gov.my/stat/web_report1.php?val=' + str(year) + '44'

    # Read the HTML page content
    page = urllib2.urlopen(url)

    # Create a beautifulsoup object
    soup = BeautifulSoup(page)

    # Search the table to be parsed in the whole HTML code
    tables = soup.findAll('table')
         
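    # Rows 2 to 15 of each table hold the 14 series (states and aggregates)
    for y in range(2, 16):
        # Each of the two tables supplies six monthly figures (every second cell)
        for table in range(0, 2):
            for x in range(2, 13, 2):
                data.append(re.sub(',', '', tables[table].findAll('tr')[y].findAll('td')[x].string))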

# Construct the date series (%b-%Y format)
for year in range(oldest, newest + 1):
    for month in months:
        dates.append(str(month) + '-' + str(year)) # Construct the %b-%Y series

# Rearrange the data: one list per series, filled 12 months at a time
ind = 0
arr = [[] for _ in range(14)]

for dat in range(0, len(data)):
    arr[ind].append(data[dat])
    if (dat + 1) % 12 == 0:
        ind += 1
    if ind > 13:
        ind = 0

# Print out to screen
print 'date,johor,kedah,kelantan,negeri_sembilan,pahang,perak,selangor,terengganu,\
other_states,p_malaysia,sabah,sarawak,sabah_sarawak,malaysia'
for x in range (0, len(arr[0])):
    print str(dates[x]) + ',' + arr[0][x] + ',' + arr[1][x] + ',' + arr[2][x] + ',' + arr[3][x] + ',' + \
    arr[4][x] + ',' + arr[5][x] + ',' + arr[6][x] + ',' + arr[7][x] + ',' + arr[8][x] + ',' + \
    arr[9][x] + ',' + arr[10][x] + ',' + arr[11][x] + ',' + arr[12][x] + ',' + arr[13][x]

'''
# Output
date,johor,kedah,kelantan,negeri_sembilan,pahang,perak,selangor,terengganu,other_states,p_malaysia,sabah,sarawak,sabah_sarawak,malaysia
Jan-2006,145637,10794,12146,22321,108659,100251,34323,21896,8353,464380,376082,96131,472213,936593
Feb-2006,193908,16102,14455,33604,137622,130834,43014,23719,12684,605942,346248,99714,445962,1051904
...
...
Oct-2014, , , , , , , , , , , , , , 
Nov-2014, , , , , , , , , , , , , , 
Dec-2014, , , , , , , , , , , , , , 
'''
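The blank fields for the months not yet published are worth handling before doing any arithmetic on the series. Below is a small Python 3 sketch of one way to do it. It assumes each raw cell is either a whole number with comma thousands separators (for example '145,637') or blank; `clean_cell` is a name invented here, not part of the script.

```python
def clean_cell(text):
    """Return an int for a figure such as '145,637', or None for a blank cell.

    Assumption: cells hold whole numbers with comma separators, or nothing.
    """
    if text is None:
        return None
    stripped = text.replace(',', '').strip()
    return int(stripped) if stripped else None


raw_cells = ['145,637', '10,794', ' ', None]
cleaned = [clean_cell(cell) for cell in raw_cells]
print(cleaned)
```

Returning `None` rather than 0 keeps an unpublished month distinguishable from a genuine zero.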
