Webscrape
A Web 'Screen Scraper'

To get the Temperature in Barcelona

Let's query AOL for the current temperature in Barcelona. Investigating with a browser, we find that the following URL brings us to the required page:

    http://my.aol.com/weather/index.psp?city=SPXX0015

On this page the temperature is found somewhere between 'Temperature:' and 'Feels'. The following non-greedy search does the trick:

    pscrape -u"my.aol.com/weather/index.psp?city=SPXX0015" -e"Temperature:.*(\d+)\D.*Feels"

\d is another way of writing [0-9], that is, any digit from zero to nine, and \D is equivalent to [^0-9], anything but a digit. We must include the \D because this is a non-greedy search. If we left it out:

    pscrape -u"my.aol.com/weather/index.psp?city=SPXX0015" -e"Temperature:.*(\d+).*Feels"

PageScrape would return only one digit of the possible two (or three) digit temperature value. In a non-greedy search \d+ is satisfied as soon as it has matched one digit, and .* soaks up the remaining digits. Specifying \D gets around this problem, because the captured digits must then be followed immediately by a non-digit.

Another way to avoid this problem is to perform a greedy search as follows:

    pscrape -g -u"my.aol.com/weather/index.psp?city=SPXX0015" -e"Temperature:.*(\d+).*Feels"

In this case \d+ will try to match as many digits as possible. However, the use of .* in a greedy search is normally very dangerous and may lead to very unexpected results!
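The role of the trailing \D in the non-greedy search can be reproduced outside PageScrape. The following is only a sketch using Python's re module as a stand-in: it assumes that PageScrape's non-greedy mode corresponds to writing ? after each quantifier, and the page fragment is invented for illustration, since the real AOL markup is not shown here.

```python
import re

# Invented page fragment; the real AOL page layout is an assumption.
page = "<b>Temperature:</b> 75&deg;F <i>Feels like 77&deg;F</i>"

# Non-greedy search without the trailing \D: \d+? is satisfied by a single
# digit and the following .*? absorbs the rest of the number.
without_guard = re.search(r"Temperature:.*?(\d+?).*?Feels", page)
one_digit = without_guard.group(1)

# Non-greedy search with \D: the captured digits must be followed directly
# by a non-digit, so \d+? is forced to extend over the whole number.
with_guard = re.search(r"Temperature:.*?(\d+?)\D.*?Feels", page)
full_value = with_guard.group(1)

print(one_digit)   # prints 7
print(full_value)  # prints 75
```

Python's engine is not PageScrape's, so this shows the matching principle rather than the tool's exact behaviour.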
With the greedy (-g) search, if 'Feels' appears anywhere later on the page, the search will break.

Now let's try to achieve something similar using www.yahoo.com. Some browsing for the temperature in Barcelona yields the following URL:

    http://weather.yahoo.com/forecast/SPXX0015_f.html

From what we learned in the previous searches, a first stab at a regular expression might look something like:

    pscrape -u"weather.yahoo.com/forecast/SPXX0015_f.html" -e"Currently:.*(\d+)\D"

But this does not work at all: there is too much stuff, especially numbers, between 'Currently:' and the value we want, so the wrong number is returned. We need a way to express the fact that after 'Currently:' and before the temperature value there are some HTML tags which we want to ignore.

So how do we express a tag? As we are performing a non-greedy search we can use <.*> (note: this will not work if we encounter nested comments!). We expect a few of these tags before the temperature value, which can be expressed as (<.*>)*. Then we expect the number, possibly after a few non-numeric characters. More specifically, to avoid another use of .*, we allow zero or more characters which are not <, written [^<]*, before the number. So now we have Currently:(<.*>)*[^<]*(\d+)\D.

    pscrape -u"weather.yahoo.com/forecast/SPXX0015_f.html" -e"Currently:(<.*>)*[^<]*(\d+)\D" -b2

The -b2 parameter tells PageScrape to return the contents of the second buffer. The first buffer is the (<.*>) atom and the second buffer is (\d+), which is what we want.

Instead of passing the -b2 argument to PageScrape we could use the format string argument -f as follows:

    pscrape -u"weather.yahoo.com/forecast/SPXX0015_f.html" -e"Currently:(<.*>)*[^<]*(\d+)\D" -f"The Temperature in Barcelona is \$2 Fahrenheit"

In the format string \$2 refers to the second buffer. On a successful match PageScrape inserts the contents of this buffer into the format string at that point.
The result should look something like the following:

    The Temperature in Barcelona is 75 Fahrenheit
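The tag-skipping expression and the use of the second buffer can be tried out in the same way. This is a sketch under stated assumptions, not PageScrape itself: Python's re module stands in for the tool, non-greedy matching is written with ? after each quantifier, group(2) plays the role of the second buffer, and the page fragment (including the size=2 attribute) is invented to show how a stray number inside a tag misleads the first attempt.

```python
import re

# Invented page fragment; the real Yahoo markup is an assumption.
page = "<b>Currently:</b><br><font size=2>75&deg; F</font>"

# First attempt: the lazy .*? stops at the first digit it meets, which here
# is the 2 inside the font tag, so the wrong number is captured.
naive = re.search(r"Currently:.*?(\d+?)\D", page).group(1)

# Tag-aware attempt: skip whole tags, then any characters that are not <,
# and only then capture the digits (second group, like buffer 2).
match = re.search(r"Currently:(<.*?>)*?[^<]*?(\d+?)\D", page)
temperature = match.group(2)

# Equivalent of the -f format string with \$2 replaced by the second buffer.
message = f"The Temperature in Barcelona is {temperature} Fahrenheit"

print(naive)    # prints 2
print(message)  # prints The Temperature in Barcelona is 75 Fahrenheit
```

The first group only holds the last tag that was skipped, which is why the second one is the value to report.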