Tutorial 2: Page 6: Creating Extractor Patterns for Links
This part of the tutorial covers important principles that people often find confusing at first. If you've been speeding through the tutorial up to this point, it would be a good idea to slow down a bit and read more carefully.

We're now going to create a couple of extractor patterns to extract information for the "Next" link and the product details links. Remember that an extractor pattern is a block of text (usually HTML) containing special tokens that match the pieces of data you're interested in extracting.

When creating extractor patterns we recommend that you always use the HTML from the "Last Response" tab in screen-scraper. By default, after screen-scraper requests a page it "tidies" the HTML found in it, which makes it differ from the HTML you would get by viewing the source in your web browser (and also makes it more consistent, facilitating extraction).

Click on the "Search results" scrapeable file in the tree on the left, then on the "Last Response" tab. The text box contains HTML because we just ran the scraping session. Copy all of the HTML and paste it into a text editor, such as Notepad or TextMate. If you click either the "Render HTML" or "Display Response in Browser" button in screen-scraper you'll see a page closely resembling the search results page in your web browser.

We're going to extract a portion of each of the product details links so that we can subsequently request each details page and extract information from it. The first details link corresponds to the "A Bug's Life" DVD. Find it in the text editor where you just pasted the HTML (specifically, search for the text "A Bug's Life").
Here is the block of HTML representing this product:

<tr class="productListing-odd">

This may seem like a bit of a mess, but if we look closely we can pick out the details link:

<td class="productListing-data"> <a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&amp;products_id=8">A Bug's Life</a> </td>

Breaking it down a bit more we get the URL:

http://www.screen-scraper.com/shop/index.php?main_page=product_info&amp;products_id=8

By the way, you might notice that the typical & symbols in the URL have been replaced by &amp;. Don't be alarmed; it's just part of the tidying process screen-scraper applies to the HTML.

Again, if we examine the parameters in the URL we can guess that the important one is "products_id", which likely identifies the product whose details we're interested in. We'll guess that "products_id" is the only one we'll need to extract. This will give us enough information to request a details page.

At this point, click on the "Search results" scrapeable file in the tree on the left, then click on the "Extractor Patterns" tab. We'll create an extractor pattern to grab the product ID from each link. Here's the extractor pattern we'll use:

<td class="productListing-data"> <a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&amp;products_id=~@PRODUCTID@~">~@PRODUCT_TITLE@~</a> </td>

Create the extractor pattern by clicking on the "Add Extractor Pattern" button, then copying and pasting the text above into the resulting box. Also, give the extractor pattern the name "Product details link".

Remember that extractor pattern tokens (delineated by the ~@ @~ markers) indicate data points we're interested in extracting. In this case, we want to extract the ID of the product (embedded in the URL) and the title of the product. Double-click the ~@PRODUCTID@~ token (or select the text between the ~@ @~ delimiters, right-click it and select "Edit token"), and, in the box that appears, check the "Save in session variable" checkbox.
Click on the "Regular Expression" tab, and select "Non-double quotes". You'll notice that when you do that the text [^"]* shows up in the text box just above the drop-down list. This is the regular expression that we'll be using. You could also edit it manually, but generally won't need to.

Let's slow down at this point and go over what we just did to the ~@PRODUCTID@~ extractor pattern token. You might remember from the second tutorial that by checking the "Save in session variable" box we're telling screen-scraper to preserve the value for us so that we can use it at a later point. We'll get to that in a bit. This time we also selected a regular expression for it to use. In most cases you'll want to designate a regular expression for extractor pattern tokens. If you're not very familiar with regular expressions, don't worry. In the vast majority of cases you can simply use the regular expressions found in that drop-down list.

Let's go over what effect designating a regular expression has. By choosing the "Non-double quotes" regular expression we're saying that we want that token to match any character except a double quote (i.e., the " character). You'll notice in our extractor pattern that a double-quote character immediately follows our ~@PRODUCTID@~ extractor pattern token. By using a regular expression we limit what the token will match so that we can ensure we get only what we want. You might think of it as putting a little fence around the token. We want it to match any characters in the position of the ~@PRODUCTID@~ extractor pattern token, up to (but not including) the double-quote character.

A line from that last paragraph is worth repeating: in most cases you'll want to designate a regular expression for extractor pattern tokens. Using regular expressions also makes extractor patterns more resilient to changes in the web site.
That is, if the web site makes minor changes to its HTML (e.g., altering a font style or color), your extractor patterns will often still match if you've been using regular expressions. Regular expressions also let us indicate more precisely what the data our tokens match will look like, so we can often reduce the amount of HTML we include at the beginning and end of our extractor patterns. In general, the less HTML your extractor patterns contain, and the more of their tokens use regular expressions, the more resilient the patterns will be to changes in the HTML of the pages.

Now close the "Edit Token" box, which saves our settings.

Now let's alter the settings for the ~@PRODUCT_TITLE@~ token. We're not interested in saving the value for this token in a session variable, but we include it since it will differ for each section of HTML we want to match. Double-click the ~@PRODUCT_TITLE@~ extractor pattern token to bring up the "Edit token" dialogue box. Click on the "Regular Expression" tab, then select "Non-HTML tags". Again, take a look at the characters on the left and right sides of our ~@PRODUCT_TITLE@~ extractor pattern token. By using this regular expression we tell the token not to match any greater than (>) or less than (<) symbols. This way we create a boundary for the token so that we can ensure it matches only what we want it to.

Why even include an extractor pattern token for data we don't want to save? This is another important principle. By using extractor pattern tokens for data we don't necessarily want to save, we make the extractor pattern more resilient to changes in the HTML. These extra tokens help "future proof" our extractor patterns against changes the site owners might make down the road.
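To make the idea of tokens with regular-expression "fences" more concrete, here is a minimal Python sketch of how the "Product details link" pattern could be modeled as an ordinary regular expression. It only illustrates the concept and is not how screen-scraper is implemented. The helper name pattern_to_regex, the second product row (ID 99, "Hypothetical Title"), and the expression [^<>]* standing in for "Non-HTML tags" are all assumptions made for the example.

```python
import re

# Matches screen-scraper style tokens such as ~@PRODUCTID@~.
TOKEN = re.compile(r"~@([A-Z_]+)@~")


def pattern_to_regex(extractor_pattern, token_regexes):
    """Turn an extractor pattern into a regex with one named group per token.

    Literal text between tokens is escaped so it must match exactly;
    each token is replaced by the regular expression assigned to it.
    """
    parts = []
    pos = 0
    for match in TOKEN.finditer(extractor_pattern):
        parts.append(re.escape(extractor_pattern[pos:match.start()]))
        name = match.group(1)
        parts.append("(?P<%s>%s)" % (name, token_regexes[name]))
        pos = match.end()
    parts.append(re.escape(extractor_pattern[pos:]))
    return re.compile("".join(parts))


# Template for one details link, as it appears in the tidied HTML.
LINK = ('<td class="productListing-data"> <a href="http://www.screen-scraper.com'
        '/shop/index.php?main_page=product_info&amp;products_id={id}">{title}</a> </td>')

# The extractor pattern: the same HTML with tokens where the data varies.
PATTERN = LINK.format(id="~@PRODUCTID@~", title="~@PRODUCT_TITLE@~")

# Sample page: the first row is from the tutorial, the second is hypothetical.
SAMPLE_HTML = "\n".join([
    LINK.format(id="8", title="A Bug's Life"),
    LINK.format(id="99", title="Hypothetical Title"),
])

regex = pattern_to_regex(PATTERN, {
    "PRODUCTID": r'[^"]*',       # "Non-double quotes"
    "PRODUCT_TITLE": r"[^<>]*",  # assumed form of "Non-HTML tags"
})

matches = [m.groupdict() for m in regex.finditer(SAMPLE_HTML)]
for row in matches:
    print(row["PRODUCTID"], row["PRODUCT_TITLE"])
```

Because PRODUCT_TITLE is a token rather than literal text, the same pattern matches both rows even though the titles differ, which is the reason for including a token whose value we never save.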
There are also often situations (such as the present one) where data points adjacent to the data we want to extract will differ for each pattern match. Here we only want the product ID, but we also include the product title because of its proximity to the data we want to extract, and because its value will differ each time the extractor pattern matches.

If those last few paragraphs strike you as a little confusing, don't worry. As you get more experience using screen-scraper you'll see why they're important. For now, just take our word for it that you'll generally want to use regular expressions with extractor pattern tokens, and that it's often a good idea to use extractor pattern tokens to match data points you don't necessarily want to save. With experience it will become more apparent when to do so.

Let's give our new extractor pattern a try. Click the "Apply Pattern to Last Scraped Data" button. You should see a window come up that shows the extracted data.

Again, let's slow down a moment and review what this window contains. When an extractor pattern matches, it produces a DataSet. You can think of a DataSet as a spreadsheet: it contains rows, columns, and cells. Each row in a DataSet is called a DataRecord, which is analogous to a row in a spreadsheet. In this particular case our DataSet has three columns. Two of them should be familiar, since they correspond to the PRODUCT_TITLE and PRODUCTID extractor pattern tokens. The "Sequence" column indicates the order in which each row was extracted. You'll notice that the sequence is zero-based, meaning the first DataRecord in the DataSet is referenced with an index of 0. You'll also notice that the DataSet has 10 records, one for each product found in the search results page.
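The spreadsheet analogy can be sketched in a few lines of Python. This is only an analogy and does not use screen-scraper's own DataSet or DataRecord classes: here a DataSet is modeled as a list, and each DataRecord as a dictionary with the same three columns shown in the window. The second row is hypothetical.

```python
# Extracted (PRODUCTID, PRODUCT_TITLE) pairs, in the order they were matched.
# The first pair comes from the tutorial; the second is made up for illustration.
extracted = [
    ("8", "A Bug's Life"),
    ("99", "Hypothetical Title"),
]

# Model the DataSet as a list of rows; enumerate() supplies the
# zero-based "Sequence" value for each DataRecord.
data_set = [
    {"Sequence": index, "PRODUCTID": product_id, "PRODUCT_TITLE": title}
    for index, (product_id, title) in enumerate(extracted)
]

# The first DataRecord is at index 0, matching the zero-based Sequence column.
first_record = data_set[0]
print(first_record["Sequence"], first_record["PRODUCTID"], first_record["PRODUCT_TITLE"])
print("records:", len(data_set))
```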
Later on, when we start talking more about DataSets and DataRecords, just remember the spreadsheet analogy: a DataSet is like the entire spreadsheet, and a DataRecord is like a single row in the spreadsheet.

Another good habit to get into is applying your extractor patterns frequently to ensure they correctly match the text you want extracted. Go ahead and close the "DataSet" window now.

Now for our "Next" link. In the text editor where you pasted the full HTML from the web page, search for the text "Next". Around that area you'll find the HTML for the link:

<a href="http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&amp;keyword=dvd&amp;sort=2a&amp;page=2" title=" Next Page ">[Next >>]</a> </td>

Fortunately, we're already familiar with the URL, and we know that the only parameters we need to worry about are "keyword" and "page". Create a new extractor pattern, call it "Next link", and use the following to grab the values of those parameters:

<a href="http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&amp;keyword=~@KEYWORD@~&amp;sort=2a&amp;page=~@PAGE@~" title=" Next Page ">[Next >>]</a> </td>

As with the previous extractor pattern, double-click the ~@PAGE@~ token and, in the box that appears, check the "Save in session variable" checkbox. Click on the "Regular Expression" tab, and select "Number" from the "Select" drop-down list. Close the "Edit Token" box to save your settings. If you're interested, the "Number" regular expression \d* simply indicates that we only want the PAGE token to match digits (\d signifies a digit, and the * signifies "zero or more").

Next, double-click the ~@KEYWORD@~ extractor pattern token to edit it. Click on the "Regular Expression" tab, then select "URL GET parameter" from the "Select" drop-down list. This indicates that the ~@KEYWORD@~ extractor pattern token should match only characters that would be found in a "GET" parameter of a URL.
We could have used the "Non-double quotes" regular expression as we did above, but we chose this one instead because it is more specific about what we do and don't want the token to match. You'll notice that we didn't check the box to save the ~@KEYWORD@~ extractor pattern token in a session variable. We already have that value in a session variable, so we don't bother getting it again.

Try out the extractor pattern by clicking the "Apply Pattern to Last Scraped Data" button. Excellent! We have two matches, one for each "Next" link on the page (at the top and bottom of the page).

Now would be a good time to save your work. Do that by selecting "Save" from the "File" menu or by clicking the floppy disk icon.

OK, let's try out the whole thing once more. Click on the "Shopping Site" scraping session in the tree on the left, then on the "Log" tab. Click the "Clear Log" button, since we're going to run the session again and we don't want to get confused by the log text from the last run. As before, click on the "Run Scraping Session" button to get it going. You'll see quite a bit more text in the log this time. Take a minute to look through it to ensure you understand what's going on.
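For readers who would like to see the "Next link" pattern outside of screen-scraper, the following Python sketch shows roughly what its two tokens do. It is a simplified model under stated assumptions: the expression [^&"]* stands in for the "URL GET parameter" choice (the exact expression screen-scraper uses may differ), the sample page is a made-up stand-in containing the link twice, and the session_variables dictionary merely imitates the idea of a saved session variable.

```python
import re

# Template for the "Next" link as it appears in the tidied HTML.
LINK = ('<a href="http://www.screen-scraper.com/shop/index.php?'
        'main_page=advanced_search_result&amp;keyword={keyword}'
        '&amp;sort=2a&amp;page={page}" title=" Next Page ">[Next >>]</a>')

# Split the template around the two token positions and escape the literal text.
before_keyword, rest = LINK.split("{keyword}")
between, after_page = rest.split("{page}")

NEXT_LINK = re.compile(
    re.escape(before_keyword)
    + r'(?P<KEYWORD>[^&"]*)'   # assumed form of "URL GET parameter"
    + re.escape(between)
    + r"(?P<PAGE>\d*)"         # "Number": zero or more digits
    + re.escape(after_page)
)

# Hypothetical page with the link at the top and bottom of the listing.
next_link = LINK.format(keyword="dvd", page="2")
page_html = next_link + "\n<p>product listing goes here</p>\n" + next_link

matches = list(NEXT_LINK.finditer(page_html))

# Only PAGE is kept, mirroring the "Save in session variable" checkbox.
session_variables = {"PAGE": matches[0].group("PAGE")}

print(len(matches), "matches")
print("KEYWORD:", matches[0].group("KEYWORD"))
print("PAGE:", session_variables["PAGE"])
```

Both tokens stop at the first character their expression excludes, so KEYWORD cannot run past the following &amp; and PAGE cannot run past the closing double quote.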