Webscrape

A Web 'Screen Scraper' 

mailto:info@webscrape.com


To get the Euro / US Dollar Exchange Rate

The web page http://www.x-rates.com has a conversion table which contains some of the main currencies.  We shall get PageScrape to web-clip the current Euro/Dollar exchange rate, and then write a small Perl script which converts a Dollar value to its Euro equivalent.

Looking at the main page we see the various conversion rates arranged in a table.  The Dollar / Euro exchange rate is in a table cell in the first column, beside the EU flag.  Looking at the source code we see that the required exchange rate is the Value of an <A> Tag within the table cell.  The associated href gives us an easy way to anchor our search, as its URL contains the text EUR/USD.  Using this anchor, and the fact that the required rate is the Value of the <A> Tag, we can design a search.

The basic target HTML format is:

<A some attributes>the exchange rate</A>

The href attribute contains the anchor we will use in our search:

<A some stuff EUR/USD some more stuff>the exchange rate</A>

So, in a non-greedy search, we could use the following Regular Expression:

EUR/USD.*>(.*)<

As it is better to avoid the use of .* (it literally means 'anything', including the > and < characters that delimit tags, and can lead to strange results) we rephrase the expression to:

EUR/USD[^>]*>([^<]+)<

This expression is safer and can be happily used in a greedy search.  It can be read as:

EUR/USD followed by anything but > followed by > followed by anything but < followed by <
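The difference between the two expressions can be checked with a few lines of Perl. This is only a sketch: it assumes Perl's Regular Expression syntax is close enough to PageScrape's for this purpose, it relies on Perl's .* being greedy by default, and the HTML fragment is made up for illustration (it is not the real x-rates.com markup).

```perl
use strict;
use warnings;

# Hypothetical fragment: two links in one row, not the real x-rates.com markup.
my $html = '<td><a href="/d/EUR/USD/graph.html">0.834168</a></td>'
         . '<td><a href="/d/EUR/GBP/graph.html">1.47038</a></td>';

# Greedy .* runs towards the end of the input, far past the link we want.
my ($loose) = $html =~ m{EUR/USD.*>(.*)<};

# Negated classes cannot cross a tag boundary, so the match stays in the first link.
my ($strict) = $html =~ m{EUR/USD[^>]*>([^<]+)<};

print "loose:  '$loose'\n";
print "strict: '$strict'\n";
```

With this fragment the loose expression captures an empty string from between the closing tags at the end of the row, while the stricter expression captures 0.834168.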

The exchange rate is a decimal number, so we can modify the Regular Expression to accept only sign, decimal point and digit characters, and thereby produce an even more discriminating search:

EUR/USD[^>]*>([+\-.0-9]+)<
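To see what the numeric class buys us, the following Perl sketch runs both expressions against two made-up links that share the EUR/USD anchor, where only the second has a number as its Value. The markup and file names are assumptions for illustration, not taken from x-rates.com.

```perl
use strict;
use warnings;

# Hypothetical anchors: a text link first, then the rate link.
my $html = '<a href="/d/EUR/USD/hist.html">history</a>'
         . '<a href="/d/EUR/USD/graph.html">0.834168</a>';

# The general expression accepts any link text, so it stops at the first link.
my ($any) = $html =~ m{EUR/USD[^>]*>([^<]+)<};

# The numeric class rejects the text link and moves on to the rate.
my ($num) = $html =~ m{EUR/USD[^>]*>([+\-.0-9]+)<};

print "any: $any\n";
print "num: $num\n";
```

Here the general expression returns history, while the numeric one returns 0.834168.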

The associated PageScrape command line is:

pscrape -u"www.x-rates.com" -e"EUR/USD[^>]*>([+\-.0-9]+)<"

 

Retrieving all of the Euro Exchange Rates

If we want to get all of the Euro exchange rates we can specify EUR/... rather than EUR/USD and use the -m option to tell PageScrape to search for multiple matches.  EUR/... specifies EUR/ followed by any three characters; to be more precise we could specify EUR/[A-Z]{3}, which allows only three upper-case letters.

pscrape -u"www.x-rates.com" -e"/d/EUR/[A-Z]{3}[^>]*>([+\-.0-9]+)<" -m

This returns a list of rates, but it is not obvious to which currency each rate applies.  To return each currency name as well as each exchange rate we can put [A-Z]{3} in brackets, so that it is captured too, and add a format option:

pscrape -u"www.x-rates.com" -e"/d/EUR/([A-Z]{3})[^>]*>([+\-.0-9]+)<" -f"\$1 \$2" -m

This returns a list which contains each exchange rate along with its currency identifier.  It should look something like:

USD 0.834168
GBP 1.47038
CAD 0.708362
AUD 0.624293

The -f option tells PageScrape how to format the output; \$1 refers to the first Regular Expression buffer/register while \$2 refers to the second, so using -f"\$1 \$2" results in PageScrape outputting the currency name followed by the rate for each currency.
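The same idea of multiple matches with two buffers can be sketched in plain Perl. This is an analogy only, not a description of how PageScrape works internally: the /g loop plays the role of -m, and $1 and $2 play the role of \$1 and \$2. The HTML fragment is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical markup for illustration only.
my $html = '<a href="/d/EUR/USD/graph.html">0.834168</a>'
         . '<a href="/d/EUR/GBP/graph.html">1.47038</a>';

my @lines;
# With /g, each pass of the loop continues from the end of the previous match.
while ($html =~ m{/d/EUR/([A-Z]{3})[^>]*>([+\-.0-9]+)<}g) {
    # $1 holds the currency code, $2 holds the rate.
    push @lines, "$1 $2";
}

print "$_\n" for @lines;
```

For this fragment the loop collects two lines, USD 0.834168 and GBP 1.47038.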

Perl Script to Convert from Dollars to Euro

Using the above, a simple Perl script to convert a Dollar amount into its Euro equivalent could look like the following:

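use strict;
use warnings;

# Dollar amount to convert, taken from the command line
my $dollarAmount = $ARGV[0];
die "Usage: $0 dollar-amount\n" unless defined $dollarAmount;

my $url  = "http://www.x-rates.com";
# Single quotes keep the backslash in \- intact
my $expr = 'EUR/USD[^>]*>([+\-.0-9]+)<';

# Build the Commandline...
my $cmd = "pscrape -u\"$url\" -e\"$expr\"";

# Execute the command, storing the result in $rate
my $rate = `$cmd`;
die "pscrape returned no rate\n" unless defined $rate && $rate =~ /\S/;
chomp $rate;    # remove the trailing newline from the output

my $euroAmount = $dollarAmount * $rate;

print "Euro amount is $euroAmount\n";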
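
The multi-currency list shown earlier can be used in the same way.  The sketch below assumes the output has the "currency rate" layout produced by -f"\$1 \$2" -m, and uses the sample values from that listing as fixed text; a real script would capture the output of the pscrape command instead.  The variable names are our own.

```perl
use strict;
use warnings;

# Sample output copied from the listing earlier on this page.
my $output = "USD 0.834168\nGBP 1.47038\nCAD 0.708362\nAUD 0.624293\n";

# Build a lookup table: currency code => Euro rate.
my %rate;
for my $line (split /\n/, $output) {
    my ($code, $value) = split ' ', $line;
    $rate{$code} = $value;
}

# Convert 100 units of one currency into Euro.
my $euro = 100 * $rate{GBP};
printf "100 GBP = %.2f EUR\n", $euro;
```

Keeping the rates in a hash means one call to pscrape can serve conversions for every currency in the table.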