Tutorial 1: Page 6: Generating an Extractor Pattern
This is probably the trickiest part of the tutorial, so if you've been skimming up to this point you'll probably want to read this page a little more carefully.

An extractor pattern is a block of text (usually HTML) that contains special tokens that will match pieces of data you're interested in extracting. These tokens are text labels surrounded by the delimiters ~@ and @~.

You can think of an extractor pattern like a stencil. A stencil is an image in cut-out form, often made of thin cardboard. As you place a stencil over a piece of paper, apply paint to it, then remove the stencil, the paint remains only where there were holes in the stencil. Analogously, you can think of placing an extractor pattern over the HTML of a web page where the tokens correspond to the holes where the paint would pass through. After an extractor pattern is applied it reveals the portions of the web page you'd like to extract.

Take a look at the HTML from the page we scraped by clicking on the "Form submission" scrapeable file, then on the "Last Response" tab. If you click the "Render HTML" button you should see a screen resembling the page you saw in your browser. Consider this snippet of HTML from the page:

You typed: Hello world!

As we're interested in extracting the string "Hello world!", our extractor pattern would look like this:

<table align="center">

The string "~@FORM_SUBMITTED_TEXT@~" is the token that corresponds to the data we're interested in, and, after this extractor pattern is applied, would hold the string "Hello world!". Returning to our stencil analogy, the "~@FORM_SUBMITTED_TEXT@~" token is analogous to the hole in the stencil where the paint would pass through. In a bit we'll look at how we might make use of the data extracted by that token.

We'll now create an extractor pattern that will extract the "Hello world!" text you typed into the HTML form.
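Before building the pattern in screen-scraper, it may help to see the stencil idea as a rough Python sketch. This is not how screen-scraper works internally; it only illustrates the concept by turning each ~@TOKEN@~ into a named regular-expression group. The `html` string and the shortened `pattern` below are assumptions made for illustration (the exact markup of the tutorial page is not reproduced here), and the function names are hypothetical.

```python
import re

# Matches tokens such as ~@FORM_SUBMITTED_TEXT@~ inside pattern text.
TOKEN = re.compile(r"~@(\w+)@~")


def pattern_to_regex(pattern_text):
    """Build a regex: literal text is escaped, each token becomes a named group."""
    parts = []
    pos = 0
    for match in TOKEN.finditer(pattern_text):
        # Text between tokens must match the page exactly (the solid part of the stencil).
        parts.append(re.escape(pattern_text[pos:match.start()]))
        # The token is the "hole": capture as little text as possible.
        parts.append("(?P<%s>.*?)" % match.group(1))
        pos = match.end()
    parts.append(re.escape(pattern_text[pos:]))
    return re.compile("".join(parts), re.DOTALL)


def apply_pattern(pattern_text, html):
    """Return a dict of token name -> extracted text, or None if nothing matched."""
    match = pattern_to_regex(pattern_text).search(html)
    return match.groupdict() if match else None


# Assumed markup, for illustration only.
html = '<table align="center"><tr><td>You typed: Hello world!</td></tr></table>'
pattern = "<td>You typed: ~@FORM_SUBMITTED_TEXT@~</td>"

result = apply_pattern(pattern, html)
print(result)
```

Under these assumptions the token captures exactly the text between "You typed: " and the closing tag. The sketch requires each token name to appear only once in a pattern and matches whitespace literally, which is stricter than a real tool is likely to be.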
Under the "Form submission" scrapeable file, click on the "Extractor Patterns" tab, then click on the "Add Extractor Pattern" button. Give your extractor pattern the identifier "Form data", and in the "Pattern text" box enter the extractor pattern shown above. Your screen should now show the new extractor pattern with its identifier and pattern text filled in.

Go ahead and give the extractor pattern a try by clicking on the "Apply Pattern to Last Scraped Data" button. A window will appear, displaying the text that our extractor pattern extracted from the page.

Looks like our extractor pattern has matched the snippet of text we were after. The "Apply Pattern to Last Scraped Data" button is another invaluable tool you'll use often to make sure you're getting the right data. It simply takes the HTML from the "Last Response" tab and applies the extractor pattern to it.

Before we continue we need to take a look at one more thing. Extractor pattern tokens have properties, one of which we'll need to modify. To modify the properties of our "~@FORM_SUBMITTED_TEXT@~" extractor pattern token, double-click it (that is, double-click on the text FORM_SUBMITTED_TEXT found between the ~@ and @~ delimiters in the "Pattern text" box), or select it, right-click it (or Control-click in Mac OS X), then select "Edit token". The "Edit Token" window will appear.

screen-scraper makes use of session variables, which allow you to save and persist objects throughout the life of a scraping session. This means that screen-scraper will save the extracted data in memory so that it can be used later in scripts and such. In this case we'd like to save the text that our "~@FORM_SUBMITTED_TEXT@~" extractor pattern token extracts. Indicate this now by clicking the "Save in session variable?" checkbox, then closing the "Edit Token" window.
In other words, when screen-scraper runs this scraping session and extracts the text for this extractor pattern token, it will save that text (e.g., "Hello world!") in a session variable so that we can do something with it later. Next we'll make use of the data we extract...
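As a rough mental model of what the checkbox does, the following Python sketch keeps extracted values in an in-memory store for the life of a session. The `ScrapingSession` class, its methods, and `save_extracted` are hypothetical names invented for this illustration; they are not screen-scraper's scripting API.

```python
class ScrapingSession:
    """Hypothetical stand-in for a scraping session with in-memory variables."""

    def __init__(self):
        self._variables = {}

    def set_variable(self, name, value):
        self._variables[name] = value

    def get_variable(self, name):
        # Returns None when the variable was never saved.
        return self._variables.get(name)


def save_extracted(session, extracted, tokens_to_save):
    """Copy only the tokens flagged for saving into session variables."""
    for name, value in extracted.items():
        if name in tokens_to_save:
            session.set_variable(name, value)


session = ScrapingSession()

# Pretend an extractor pattern produced these values; only the first token
# has its "save in session variable" flag set.
extracted = {"FORM_SUBMITTED_TEXT": "Hello world!", "UNSAVED_TOKEN": "ignored"}
save_extracted(session, extracted, tokens_to_save={"FORM_SUBMITTED_TEXT"})

# Later in the same session, the saved value is still available.
print(session.get_variable("FORM_SUBMITTED_TEXT"))
```

The point of the sketch is the lifetime: a value saved this way stays available to later steps in the same session, while tokens without the flag are simply not stored.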