Tutorial 2: Page 7: Scraping Pages from Scripts
For each details link we're going to scrape the corresponding details page. This is a common scenario in screen-scraping: given a search results page, you need to extract details for each product, which means following each of the product details links. For each details page you'll likely want to extract pieces of information about the product.

Let's start by creating a scrapeable file for the details page. We could create it from the proxy session, but it's pretty simple, so let's just create it from scratch. Click on the "Shopping Site" scraping session, the "General" tab, then click the "Add Scrapeable File" button. Give the scrapeable file the name "Details page", and the following URL:

    http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=~#PRODUCTID#~

You'll notice that this time we're leaving all of the parameters embedded in the URL. Sometimes with shorter URLs it's more convenient to take this approach rather than breaking them out under the "Parameters" tab. As before, when the scraping session runs, the ~#PRODUCTID#~ token will be replaced by the value of the "PRODUCTID" session variable.

At this point, click the "This scrapeable file will be invoked manually from a script" checkbox. If we didn't do this, screen-scraper would invoke this scrapeable file in sequence (after the search results page), which we don't want. Instead, we're going to tell screen-scraper to invoke this scrapeable file from a script.

In screen-scraper, links are generally followed by invoking a script after an extractor pattern finds matches. Let's go over this in more detail. First, create a new script and call it "Scrape details page". If you're using Interpreted Java enter the following code:

    session.scrapeFile( "Details page" );

If you're using VBScript enter the following:

    Call session.ScrapeFile( "Details page" )

OK, this is where the logic may get a little tricky.
For each product ID that our "Product details link" extractor pattern extracts, we want to scrape the product details page using that PRODUCTID. Go to the "Product details link" extractor pattern by clicking the "Search results" scrapeable file, then the "Extractor Patterns" tab. Note the "Scripts" pane under the extractor pattern. Click the "Add Script" button. This will allow us to have a script execute as the pattern finds matches. Under the "Script Name" column, if it isn't already selected, select our "Scrape details page" script. Leave the "Sequence" as is, and, under the "When to Run" column, select "After each pattern application".

Let's walk through this a bit more slowly. After the search results page is requested, the "Product details link" extractor pattern will be applied to the HTML in the page. Remember that this particular extractor pattern will match 10 times, once for each product details link. Each time it matches it will grab a different product ID and save the value of that product ID into the PRODUCTID session variable. The "Scrape details page" script will get invoked after each of these matches, and each time the PRODUCTID session variable will hold a different product ID. As such, each time the "Details page" gets scraped its URL will point to a different product. For example, the first time the extractor pattern matches, the PRODUCTID session variable will hold "8", and the URL will be:

    http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=8

The next time the product ID will be 34, yielding the URL:

    http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=34

If it helps, think again about the spreadsheet analogy. You can imagine screen-scraper walking through each row in the spreadsheet. It encounters a row, saves any needed data in session variables (the product ID, in this case), then invokes the "Scrape details page" script.
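To make the token replacement concrete, here is a minimal Java sketch of the idea. It is not screen-scraper's own code, and it runs as ordinary Java rather than as a screen-scraper script. The class name TokenSubstitutionSketch, its resolve method, and the plain Map standing in for session variables are all hypothetical names invented for this illustration. Leaving unknown tokens untouched is also just an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not how screen-scraper is implemented.
class TokenSubstitutionSketch {
    // Matches tokens written as ~#NAME#~, the form used in the "Details page" URL.
    private static final Pattern TOKEN = Pattern.compile("~#([A-Za-z0-9_]+)#~");

    // Replaces each ~#NAME#~ token with the value stored under NAME.
    // Assumption: a token with no stored value is left in place unchanged.
    static String resolve(String template, Map<String, String> sessionVariables) {
        Matcher matcher = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = sessionVariables.get(matcher.group(1));
            String replacement = (value != null) ? value : matcher.group(0);
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "http://www.screen-scraper.com/shop/index.php"
                + "?main_page=product_info&products_id=~#PRODUCTID#~";
        Map<String, String> sessionVariables = new HashMap<>();

        // Each match of the extractor pattern overwrites PRODUCTID before the
        // script runs, so the same template yields a different URL each time.
        for (String productId : new String[] {"8", "34"}) {
            sessionVariables.put("PRODUCTID", productId);
            System.out.println(resolve(template, sessionVariables));
        }
    }
}
```

Running main prints the two URLs shown above, one ending in products_id=8 and one ending in products_id=34.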
Because it just matched a specific product ID, and saved its value in a session variable, when the "Details page" scrapeable file gets invoked by the script, the current product ID in the PRODUCTID session variable will be used. Once it's finished invoking the "Details page" scrapeable file, it will go on to the next row (or DataRecord) in the spreadsheet (or DataSet). Again, it will save the next product ID in a session variable, then execute the "Scrape details page" script, which in turn invokes the "Details page" scrapeable file. Because we indicated that the script should be invoked "After each pattern application", this will occur 10 times, once for each search result. If we had designated "After pattern is applied", the script would only have been executed once, after it traversed the spreadsheet and reached the very end.

Hopefully that's not too repetitive :) This is another area that people new to screen-scraper find confusing, so it's probably worth it to slow down a bit and ensure you understand what's going on.

Now would be a good time to try out the whole scraping session again. Do that like you did before by clearing out the log for the scraping session, then clicking the "Run Scraping Session" button. You'll see each details page getting requested one by one. Note especially the URLs, each of which ends with a different product ID. If you'd prefer not to wait for the entire session to run you can click the "Stop Scraping Session" button. As before, it would be a good idea to go through the log carefully to ensure that you understand what it's doing.

At this point we still need to deal with the "Next" page link. We already have an extractor pattern to grab the page number of the next page. Let's create a script to scrape the search results page again for each "Next" link. Create a new script and call it "Scrape search results".
If you're using Interpreted Java enter the following:

    if( dataSet.getNumDataRecords() > 0 )
    {
        session.scrapeFile( "Search results" );
    }

If you're using VBScript enter the following code (again, be sure to select "VBScript" from the "Language" drop-down box):

    If dataSet.getNumDataRecords > 0 Then
        Call session.ScrapeFile( "Search results" )
    End If

You'll notice that the script makes use of a "dataSet" variable. When the script is invoked screen-scraper will automatically create a variable corresponding to the current DataSet. This variable allows you to get access to all of the information that was extracted by the current extractor pattern. You can read more about objects available in scripts and their scope in our documentation, at the Using Scripts and API Documentation pages.

In this particular case, the script first checks the number of records in the current DataSet. That is, it looks at the number of DataRecords (or rows) in the DataSet (or spreadsheet). This effectively just checks to see if any "Next" link was found in the page. If so, it tells screen-scraper to scrape the "Search results" scrapeable file.

After creating the script return to the "Next link" extractor pattern, then click the "Add Script" button. Select the "Scrape search results" script. This time there's something slightly different we'll need to do under the "When to Run" column. First, click the "Apply Pattern to Last Scraped Data" button. You'll notice that the pattern matches twice. The problem is that we only want to follow one of the "Next" links (that is, we don't want to scrape the second page twice). This is easily dealt with by selecting "After pattern is applied" under the "When to Run" column. In other words, the script will only get invoked once, after the extractor pattern has matched as many times as it can. Note, though, that because we're saving the value of the ~@PAGE@~ extractor pattern token in a session variable it will still hold the correct value when the page gets scraped.
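The difference between the two "When to Run" settings can be sketched in plain Java. This is only a simulation under stated assumptions, not screen-scraper's implementation: the class WhenToRunSketch, the method applyNextLinkPattern, and the enum constants are hypothetical names, a Map stands in for the session variables, and both "Next" matches are assumed to carry the same page number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation for illustration; no pages are actually requested.
class WhenToRunSketch {
    enum WhenToRun { AFTER_EACH_PATTERN_APPLICATION, AFTER_PATTERN_IS_APPLIED }

    // Pretends to apply the "Next link" pattern. Each match saves its page
    // number in the PAGE session variable. Returns the search results
    // requests that the attached script would trigger.
    static List<String> applyNextLinkPattern(List<String> matchedPages,
                                             WhenToRun whenToRun,
                                             Map<String, String> session) {
        List<String> requests = new ArrayList<>();
        for (String page : matchedPages) {
            session.put("PAGE", page);
            if (whenToRun == WhenToRun.AFTER_EACH_PATTERN_APPLICATION) {
                // The script would run once per match.
                requests.add("Search results page " + session.get("PAGE"));
            }
        }
        if (whenToRun == WhenToRun.AFTER_PATTERN_IS_APPLIED) {
            // The script runs once, and its own check mirrors
            // dataSet.getNumDataRecords() > 0 from the tutorial script.
            if (matchedPages.size() > 0) {
                requests.add("Search results page " + session.get("PAGE"));
            }
        }
        return requests;
    }

    public static void main(String[] args) {
        // Assumption: the "Next" link appears twice, both pointing at page 2.
        List<String> matches = Arrays.asList("2", "2");
        for (WhenToRun setting : WhenToRun.values()) {
            Map<String, String> session = new HashMap<>();
            System.out.println(setting + " -> "
                    + applyNextLinkPattern(matches, setting, session));
        }
    }
}
```

With two matches, the per-match setting yields two requests for page 2, while "After pattern is applied" yields one. With no matches at all, neither setting produces a request, which is what the record-count check in the script is for.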
Because we indicated that the script is to be invoked "After pattern is applied", the "dataSet" variable will be in scope. See the Variable scope section in our documentation for more detail on which variables are in scope depending on when a given script is run.

OK, run the scraping session once more. Clear the scraping session log, then click the "Run Scraping Session" button again. If you let it run for a while you'll notice that it will request each details page for the products found on the first search results page, request the second search results page, then request each of the details pages for that page.
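The order of requests you should see in the log can be sketched as a small Java simulation. Everything here is a hypothetical stand-in rather than screen-scraper code: the class ScrapingFlowSketch and its methods are invented names, no HTTP requests are made, and the pretend site has only two search results pages with two products each. Product IDs 8 and 34 come from the earlier example; 101 and 102 are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation for illustration only.
class ScrapingFlowSketch {
    // Pretend site: search results page number -> product IDs on that page.
    private final Map<Integer, List<String>> site = new LinkedHashMap<>();

    // Every simulated request, in the order it would show up in the log.
    final List<String> requestLog = new ArrayList<>();

    ScrapingFlowSketch() {
        site.put(1, Arrays.asList("8", "34"));
        site.put(2, Arrays.asList("101", "102")); // invented IDs
    }

    // Stands in for the "Search results" scrapeable file and its patterns.
    void scrapeSearchResults(int page) {
        requestLog.add("Search results page " + page);

        // "Product details link" pattern: the "Scrape details page" script
        // runs after each match, so every product ID triggers one request.
        for (String productId : site.get(page)) {
            scrapeDetailsPage(productId);
        }

        // "Next link" pattern: it matches twice when a next page exists, but
        // the "Scrape search results" script runs once, after all matches.
        List<Integer> nextLinkMatches = new ArrayList<>();
        if (site.containsKey(page + 1)) {
            nextLinkMatches.add(page + 1);
            nextLinkMatches.add(page + 1);
        }
        if (nextLinkMatches.size() > 0) { // mirrors dataSet.getNumDataRecords() > 0
            scrapeSearchResults(nextLinkMatches.get(nextLinkMatches.size() - 1));
        }
    }

    // Stands in for the "Details page" scrapeable file.
    void scrapeDetailsPage(String productId) {
        requestLog.add("Details page products_id=" + productId);
    }

    public static void main(String[] args) {
        ScrapingFlowSketch sketch = new ScrapingFlowSketch();
        sketch.scrapeSearchResults(1);
        for (String line : sketch.requestLog) {
            System.out.println(line);
        }
    }
}
```

The log lists page 1, its two details pages, then page 2 and its two details pages. On the last page the "Next link" pattern finds nothing, so the record-count check stops the session from requesting a page that doesn't exist.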