Webscrape
A Web 'Screen Scraper'

Stock Quote Example

To get Microsoft's (ticker MSFT) current stock price from MSN we could use the following:

pscrape -i -u"moneycentral.msn.com/scripts/webquote.dll?iPage=lqd&Symbol=MSFT" -e"real-time quotes.*(\d+\.\d+)<"

This example might look a bit daunting at first, so let's go through it. It is important to remember that for a given search there is no one 'correct' regular expression; the same search can normally be achieved in a myriad of different ways.

So where do we start? The first step is always to go through the process of getting to the required web data manually using your browser. So first let's go to:

http://moneycentral.msn.com

From this page all we have to do is enter the ticker symbol (MSFT) and hit the 'Go' button; doing so yields a web page with loads of stock information. Take a look at the corresponding URL as displayed by the browser: when we call PageScrape this will be the -u parameter. Everything after the '?' makes up the GET request's parameters, each separated by '&'.

Now we just have to figure out a Regular Expression which will pull out the stock price from the response page. For the time being we could try the following just to check that we are on the right track:

pscrape -i -u"moneycentral.msn.com/scripts/webquote.dll?iPage=lqd&Symbol=MSFT" -e"<title>(.*)</title>"

The page's title should be returned; this is the text between the HTML 'title' tags in the page's HTML source. The Regular Expression (provided by the -e argument) can be read as "<title>Anything</title>". The brackets () denote a regular expression buffer, and it is the text placed into this buffer that we are interested in. The -i parameter tells PageScrape to ignore case so that the Regular Expression will match either <TITLE> or <title>.

So far so good, but we want to pull out the stock price. Looking at the page, the text somewhere after 'Real-time quotes' may well contain the value that we want. At this stage it might be handy to look at the page's source: either use the browser to view it or get PageScrape to log it to a file by rerunning it with the -l option:

pscrape -i -l"quote.log" -u"moneycentral.msn.com/scripts/webquote.dll?iPage=lqd&Symbol=MSFT" -e"<title>(.*)</title>"

This writes all of the source to quote.log (as well as getting the title again!). Once the page has been written to file we can test further Regular Expressions without having to be online by specifying 'file://' in the URL. As we already have the page, it is no longer necessary to include the GET parameters:

pscrape -i -u"file://quote.log" -e"<title>(.*)</title>"

Anyway, looking at the source, we see that the value we want is placed after a hyperlink called "Real-time quotes". If we assume that the stock price is always of the form "one or more digits, a decimal point and then one or more digits", we can modify our expression to:

pscrape -u"moneycentral.msn.com/scripts/webquote.dll?iPage=lqd&Symbol=MSFT" -e"Real-time quotes.*([0-9]+\.[0-9]+)<"

Now PageScrape should have returned the stock price. This time the Regular Expression can be read as "Real-time quotes, then any stuff, then one or more digits between 0 and 9, then a . (escaped as \.), then one or more digits between 0 and 9, followed immediately by <". Again, the brackets indicate which part of the matched Regular Expression should be returned (the number).

To make things a bit neater, we can use \d rather than [0-9] (\d is short-hand for [0-9]), and the expression becomes:

pscrape -u"moneycentral.msn.com/scripts/webquote.dll?iPage=lqd&Symbol=MSFT" -e"Real-time quotes.*(\d+\.\d+)<"

In a regular expression, a . (dot) means 'any character', so if we are actually looking for a . we have to escape it as \.
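
The expressions above can also be tried outside PageScrape. The following Python sketch runs the title and price patterns against a made-up HTML fragment; the markup and the price 25.32 are assumptions for illustration, not the real MSN page, and Python's re module is only a stand-in, since PageScrape's own regex engine may behave differently. In Python, at least, the greedy .* swallows the leading digits of the price, so anchoring the number on the preceding > is the safer form.

```python
import re

# Hypothetical fragment standing in for the MSN quote page source.
SAMPLE_HTML = (
    '<TITLE>MSFT quote</TITLE>'
    '<a href="#">Real-time quotes</a><td>25.32</td>'
)

def first_group(pattern, text, flags=0):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text, flags)
    return match.group(1) if match else None

# re.IGNORECASE plays the role of PageScrape's -i switch.
title = first_group(r"<title>(.*)</title>", SAMPLE_HTML, re.IGNORECASE)

# Greedy .* takes as much as it can, leaving \d+ only the last digit before the dot.
greedy_price = first_group(r"Real-time quotes.*(\d+\.\d+)<", SAMPLE_HTML)

# Requiring a '>' just before the number forces the whole price into the group.
anchored_price = first_group(r"Real-time quotes.*>(\d+\.\d+)<", SAMPLE_HTML)

print(title, greedy_price, anchored_price)
```

If a scrape returns a price with its leading digits missing, this greedy behaviour is a likely cause worth checking first.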

Formatting the output

We can format this data into a string as follows:

pscrape -u"moneycentral.msn.com/scripts/webquote.dll?iPage=lqd&Symbol=MSFT" -e"Real-time quotes.*(\d+\.\d+)<" -f"Microsoft Stock Price = \$1"

This should output something like:

Microsoft Stock Price = 20

In the format string, the sequence \$1 tells PageScrape to insert the contents of the first Regular Expression buffer into the output; in this case it is the bit between the brackets ().
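
To make the buffer substitution concrete, here is a rough Python imitation of what the -f option appears to do. The helper apply_format is hypothetical (it is not part of PageScrape), the page fragment is invented, and the pattern adds a > before the number so that Python's greedy .* keeps the whole price in the buffer.

```python
import re

def apply_format(fmt, match):
    """Replace each \\$N in fmt with capture group N of match (group 0 is the whole match)."""
    return re.sub(r"\\\$(\d+)", lambda m: match.group(int(m.group(1))), fmt)

# Hypothetical page fragment; the real MSN markup will differ.
sample = '<a href="#">Real-time quotes</a><td>25.32</td>'

match = re.search(r"Real-time quotes.*>(\d+\.\d+)<", sample)
line = apply_format(r"Microsoft Stock Price = \$1", match)
print(line)
```

Each \$N placeholder is looked up by number, so the same helper would handle \$0, \$2 and so on without change.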

Retrieving the day's High and Low Prices

Let's say that as well as the current stock price we also want to get the day's high and low values. One way of doing this would be to perform three separate scrapes using a different Regular Expression each time, but it can also be achieved using a single scrape. In this case the Regular Expression will be a bit longer but still quite easy to put together.

Looking at the page's source, the real-time quote is followed after a bit by the High and Low values in that order; this makes sense as they appear on the actual web page in that order too! We have already figured out how to get the stock quote:

pscrape -u"moneycentral.msn.com/scripts/webquote.dll?iPage=lqd&Symbol=MSFT" -e"Real-time quotes.*(\d+\.\d+)<"

This bit doesn't have to change; we just have to add bits onto the end to get the other values. In the extended Regular Expression we will use two more sets of brackets ( ) to declare two more buffers that will hold the values we want when we have a successful match. We will use the -f format output to access the text in these buffers and combine them into a handy output string.

In the example above we referenced the first buffer as \$1; likewise, to reference the second buffer we use \$2 and so on. Using \$0 in the format string causes the complete matched text to be inserted. Try:

pscrape -u"moneycentral.msn.com/scripts/webquote.dll?iPage=lqd&Symbol=MSFT" -e"Real-time quotes.*(\d+\.\d+)<" -f"Complete Match: \$0\nStock Quote: \$1"

The \n just inserts a new-line into the output to make it a bit easier to read.

So it follows that if we want to retrieve three values we have to use three buffers that will be referenced in our format string as \$1, \$2 and \$3. Let's take a really simplistic but hopefully reassuring first stab at an expression: we know that the high value comes somewhere after the text 'High' and before the text 'Low', and similarly we know that the low value comes after the text 'Low'.

pscrape -u"moneycentral.msn.com/scripts/webquote.dll?iPage=lqd&Symbol=MSFT" -e"quotes.*>(\d+\.\d+)<.*High.*>(\d+\.\d+)<.*Low.*>(\d+\.\d+)<" -f"Current: \$1, High: \$2, Low: \$3"

Although this is very simplistic, it does seem to work OK, at least when the market's in session.
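
As a quick offline check of the three-buffer expression, the Python sketch below applies the same pattern to an invented one-line fragment. The markup and all three prices are assumptions, not real MSN output; with real multi-line source, Python would additionally need re.DOTALL for . to cross line breaks, and PageScrape's engine may differ on that point.

```python
import re

# Hypothetical one-line fragment: quote first, then High, then Low.
SAMPLE_HTML = (
    '<a href="#">Real-time quotes</a><td>25.32</td>'
    '<td>High</td><td>25.80</td>'
    '<td>Low</td><td>24.95</td>'
)

# Same expression as the -e argument above; each (...) is one buffer.
PATTERN = re.compile(
    r"quotes.*>(\d+\.\d+)<.*High.*>(\d+\.\d+)<.*Low.*>(\d+\.\d+)<"
)

match = PATTERN.search(SAMPLE_HTML)
if match:
    # groups() returns the buffers in order: \$1, \$2, \$3.
    current, high, low = match.groups()
    summary = f"Current: {current}, High: {high}, Low: {low}"
else:
    summary = "no match"

print(summary)
```

The 'High' and 'Low' literals act as signposts that stop each greedy .* from running past the value it is meant to precede.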