Webscrape
A Web 'Screen Scraper'

To Screenscrape Words from a Thesaurus

If you read the example called Get Google to Suggest Correct Spelling, you will know that 'Thesaurus' is a very difficult word to spell correctly, but you will also know that we can easily webscrape the correct spelling from Google. So now that we have the correct spelling, let's use it to do some crafty thesaurus-based screenscraping.

We will webscrape from http://thesaurus.reference.com. If you visit the site and search for a word, you will see how the GET requests are formed. For example, to search for the word 'Arcadian' the following request is submitted:

http://thesaurus.reference.com/search?q=arcadian

Looking at the results of a successful search, we see that each alternative word appears after the text 'Main Entry:', so we can use this text to anchor our search. From the page source we see that the word we want to screenscrape is enclosed by > and < shortly after 'Main Entry:'. We just have to soak up the markup between 'Main Entry:' and the target word (using .*), then capture the word itself with (\w[^<]+). We can use the -m option to get PageScrape to return all of the alternative words. The following works quite well:

pscrape -u"http://thesaurus.reference.com/search?q=arcadian" -e"Main Entry:.*>(\w[^<]+)<" -m

This should return a list of words like:

country
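To see what the expression is doing without running PageScrape, here is a minimal Python sketch that applies the same pattern to a small piece of markup. The sample HTML, the extra words in it, and the names build_search_url and scrape_main_entries are all made up for illustration; the real page source will differ, and the all_matches switch only imitates what the text above says -m does.

```python
import re
from urllib.parse import urlencode

# The same expression that is passed to pscrape with -e.
PATTERN = re.compile(r"Main Entry:.*>(\w[^<]+)<")

# Hypothetical markup, invented for this sketch (not copied from the site).
SAMPLE_HTML = """\
<tr><td><b>Main Entry:</b></td><td><a href="/search?q=country">country</a></td></tr>
<tr><td><b>Main Entry:</b></td><td><a href="/search?q=pastoral">pastoral</a></td></tr>
<tr><td><b>Main Entry:</b></td><td><a href="/search?q=rustic">rustic</a></td></tr>
"""


def build_search_url(word):
    """Form the GET request described above for a given search word."""
    return "http://thesaurus.reference.com/search?" + urlencode({"q": word})


def scrape_main_entries(html, all_matches=True):
    """Return the captured word(s) that follow 'Main Entry:' in html.

    all_matches=True imitates the -m behaviour described in the text
    (return every match); False returns only the first match, if any.
    """
    if all_matches:
        return PATTERN.findall(html)
    first = PATTERN.search(html)
    return [first.group(1)] if first else []


if __name__ == "__main__":
    print(build_search_url("arcadian"))
    for word in scrape_main_entries(SAMPLE_HTML):
        print(word)
```

In Python the dot does not match a newline by default, so the greedy .* stays on the line containing 'Main Entry:' and backs up to the last > on that line that is directly followed by a word. If the real page puts the word on a different line from 'Main Entry:', the pattern would need adjusting.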