Tutorial 2: Page 8: Extracting Product Details
At this point we're able to scrape the details page for each of the products. We're now ready to extract the information we're really interested in: data about each DVD. To do this we're going to use sub-extractor patterns. Again, this is a point in the tutorial where you may want to slow down a bit, since sub-extractor patterns are another important concept that can be confusing at first.

Sub-extractor patterns allow us to define a small region within a larger HTML page from which we'll extract individual snippets of information. This eliminates most of the HTML we're not interested in, allowing us to be more precise about the data we'd like to extract. It also makes our extractor patterns more resilient to future changes in the HTML page, because we need to include less of the surrounding HTML.

If you let the scraping session run through to completion, the last URL in the scraping session log will be the following:

http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=7

If that's not the exact one you have, don't worry; it won't make a difference for our extractor patterns. We'll need to examine the HTML for this page in order to generate the extractor patterns for it. Do this by clicking on the "Details page" scrapeable file in the tree on the left, then on the "Last Response" tab. You'll remember that screen-scraper records the HTML from the last time each page was requested.

Bring up the URL above in your web browser. We'll be extracting the DVD title, price, model, shipping weight, and manufacturer. Examining the page makes it apparent that most of its elements aren't of interest to us. For example, we don't care about the header, the footer, or any of the boxes along the sides of the page. We'll first define a region that surrounds just the elements we're interested in.
Here is that full region:

<tr>

That might seem like a large chunk of HTML, but it's actually a relatively small percentage of the entire page. Before defining sub-extractor patterns we first define an extractor pattern with a special ~@DATARECORD@~ token in it. If you're familiar with computer programming in general, the ~@DATARECORD@~ token can be thought of as a "reserved word". That is, it's a token with a special meaning: it defines the sub-region of the HTML page containing the data elements we're interested in. You'll always use the ~@DATARECORD@~ token when using sub-extractor patterns.

Here's the extractor pattern we'll use:

<tr>

Notice that we simply replaced most of the middle portion of the large block of HTML with a ~@DATARECORD@~ token. If you look at the text before and after ~@DATARECORD@~ you can see that the same text is also found at the beginning and end of the large HTML block. The basic idea here is to include only as much HTML around the sub-region as necessary to uniquely identify it in the page. Any of the HTML covered by the ~@DATARECORD@~ token will be picked up by screen-scraper, and will define the sub-region that we'll be extracting the individual pieces of data from.

Create a new extractor pattern using the text given above (remember we're still using the "Details page" scrapeable file), then give it the name "PRODUCTS". Now click the "Apply Pattern to Last Scraped Data" button. In the window that appears, copy the text from the "DATARECORD" column and paste it into your text editor. The easiest way to select all of the text in that box is to triple-click it, use the keyboard to copy the text (Ctrl-C in Windows and Linux), then paste it into your text editor.
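To make the idea of the ~@DATARECORD@~ token more concrete, here is a minimal Python sketch of how a pattern with a token in it can be treated as "literal text, then a captured region, then more literal text". This is only an illustration of the concept, not screen-scraper's actual implementation. The `pattern_to_regex` helper, the sample `page`, and the `region_pattern` are all hypothetical and are not the HTML or pattern from the tutorial's shop site.

```python
import re

# Matches tokens written in the ~@NAME@~ style used by the tutorial.
TOKEN = re.compile(r"~@([A-Z_]+)@~")


def pattern_to_regex(pattern, token_regex=r".*?"):
    """Hypothetical helper: turn an extractor-pattern-like string into a regex.

    Literal text is escaped so it must match exactly; each token becomes a
    named capture group.
    """
    parts, pos = [], 0
    for m in TOKEN.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))
        parts.append(f"(?P<{m.group(1)}>{token_regex})")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts), re.DOTALL)


# Made-up page: a header and footer we don't care about around a product table.
page = (
    "<html><body><div id='header'>Shop</div>"
    "<table id='product'><tr><td><h1>Example Title</h1></td></tr>"
    "<tr><td>Model: EX-1</td></tr></table>"
    "<div id='footer'>Contact</div></body></html>"
)

# Only enough surrounding HTML to identify the region uniquely.
region_pattern = "<table id='product'>~@DATARECORD@~</table>"

match = pattern_to_regex(region_pattern).search(page)
datarecord = match.group("DATARECORD")
print(datarecord)
```

Back in screen-scraper, the text you copied from the "DATARECORD" column plays the same role as `datarecord` does in this sketch: it is the smaller region that the individual pieces of data will be pulled from.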
The copied DATARECORD text should look like this:

" valign="top"><h1>You've Got Mail</h1></td></tr><tr><td align="center" valign="top" class="smallText" rowspan="2"><script language="javascript" type="text/javascript"><!--document.write( '<a href="javascript:popupWindow(\'http://www.screen-scraper.com/shop/index.php?main_page=popup_image &pID=7\')"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You\'ve Got Mail" title=" You\'ve Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />larger image<\/a>'); //--></script> <noscript><a href="http://www.screen-scraper.com/shop/index.php? main_page=images/dvd/youve_got_mail.gif" target="_blank"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You've Got Mail" title=" You've Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />larger image</a></noscript> </td><td class="main" align="center" valign="top"> Model: DVD-YGEM</td></tr><tr><td class="main" align="center"></td></tr><tr><td align="center" class="pageHeading">$34.99</td><td class="main" align="center">Shipping Weight: 7.00 lbs.</td> </tr><tr><td> </td><td class="main" align="center">10 Units in Stock</td></tr> <tr><td class="main" align="center">Manufactured by: Warner</td><td align="center"> <table border="0" width="150px" cellspacing="2" cellpadding="2"><tr><td align="center" class="cartBox">

This is the HTML we're after, but it's all in one large block. This occurs because screen-scraper strips out unnecessary white space when extracting information in order to make the extraction process more efficient. This can make sifting through the HTML a little more difficult, but the search feature in your text editor should make it relatively straightforward. You could also work with the HTML found directly in the "Last Response" tab. You'd just have to be sure that you're only grabbing portions of the page that would be covered by the ~@DATARECORD@~ extractor pattern token.

First off, we're interested in the DVD title.
In your text editor do a search for the first word in the title of the DVD whose page you're viewing (e.g., if you're viewing the HTML for the last DVD in the search results you'll search for "You've"). This should highlight the first word in the title. In order to extract this piece of information we'll use a small sub-extractor pattern:

<h1>~@TITLE@~</h1>

Once again, we include only as much HTML around the piece of data we're interested in as is necessary. If we do this just right we'll still be able to extract information even if the web site itself makes minor changes.

On our "PRODUCTS" extractor pattern, click the "Sub-Extractor Patterns" tab, then the "Add Sub-Extractor Pattern" button. In the text box that appears, paste the text for the sub-extractor pattern we've included above. Edit the ~@TITLE@~ extractor pattern token by double-clicking it, click the "Regular Expression" tab, then select "Non-HTML tags" from the drop-down list (as a side note, "Non-HTML tags" is probably the most common regular expression you'll use). Click the "Apply Sub-Extractor Pattern to Last Scraped Data" button to try it out. You should see a DataSet with a single row and columns for the DATARECORD and TITLE tokens.

Next, create the following sub-extractor patterns for the remaining data elements we want to extract (note that each line of text will be a separate sub-extractor pattern):

>$~@PRICE@~<
>Model: ~@MODEL@~<
>Shipping Weight: ~@SHIPPING_WEIGHT@~<
>Manufactured by: ~@MANUFACTURED_BY@~<

Give each token in these sub-extractor patterns the "Non-HTML tags" regular expression, as you did for the ~@TITLE@~ token.

As sub-extractor patterns match data, they aggregate the pieces into a single data record. That is, when our PRODUCTS extractor pattern is applied along with its sub-extractor patterns, a single data record is produced that holds the TITLE, PRICE, MODEL, SHIPPING_WEIGHT, and MANUFACTURED_BY values.
You can see this by clicking the "Apply Pattern to Last Scraped Data" button. If you'd like, at this point try running the scraping session again by clearing the log and hitting the "Run Scraping Session" button. If you examine the log while the session runs you'll see that it extracts the details for each of the DVDs.
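The way the sub-extractor patterns aggregate into one data record can be sketched in Python as follows. This is an approximation for illustration only, not how screen-scraper works internally. It makes several assumptions: the `pattern_to_regex` helper is hypothetical, the "Non-HTML tags" option is approximated by the regular expression `[^<>]*`, the `datarecord` string is an abridged excerpt of the DATARECORD text shown earlier, and the white space after each `>` is trimmed before matching.

```python
import re

TOKEN = re.compile(r"~@([A-Z_]+)@~")
NON_HTML_TAGS = r"[^<>]*"  # assumed stand-in for the "Non-HTML tags" option


def pattern_to_regex(pattern, token_regex=NON_HTML_TAGS):
    """Hypothetical helper: literal text is escaped, tokens become named groups."""
    parts, pos = [], 0
    for m in TOKEN.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))
        parts.append(f"(?P<{m.group(1)}>{token_regex})")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


# Abridged excerpt of the DATARECORD text from the tutorial.
datarecord = (
    '" valign="top"><h1>You\'ve Got Mail</h1></td></tr>'
    '<tr><td class="main" align="center" valign="top"> Model: DVD-YGEM</td></tr>'
    '<tr><td align="center" class="pageHeading">$34.99</td>'
    '<td class="main" align="center">Shipping Weight: 7.00 lbs.</td></tr>'
    '<tr><td class="main" align="center">Manufactured by: Warner</td>'
)
# Assumption: drop white space right after a tag so "> Model:" reads ">Model:".
datarecord = re.sub(r">\s+", ">", datarecord)

sub_patterns = [
    "<h1>~@TITLE@~</h1>",
    ">$~@PRICE@~<",
    ">Model: ~@MODEL@~<",
    ">Shipping Weight: ~@SHIPPING_WEIGHT@~<",
    ">Manufactured by: ~@MANUFACTURED_BY@~<",
]

# Each sub-pattern contributes its token values to the same record.
record = {}
for sub in sub_patterns:
    m = pattern_to_regex(sub).search(datarecord)
    if m:
        record.update(m.groupdict())

for name, value in record.items():
    print(f"{name}: {value}")
```

Because each sub-pattern is matched independently against the same region, a sub-pattern that fails to match simply leaves its column out of the record in this sketch, while the others are still extracted.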