Tutorial 1: Hello World!

This tutorial will walk you step-by-step through the process generally used to scrape information from web pages using screen-scraper. It should take you about 20 to 30 minutes to complete, and will familiarize you with the basic principles you'll need to scrape information from web sites. To get the most from this tutorial you should have at least a basic knowledge of HTML and HTTP (really just the way web browsers interact with web servers). This tutorial also assumes that you've successfully downloaded and installed screen-scraper.

If you don't have a lot of experience working with web technologies, or if you'd just like a refresher, you might find these sites helpful:

This is intended to be a very basic tutorial, and, as such, we'll be extracting the words "Hello World" from a web page and writing them to a file. While this is a simple example of pulling a single snippet of text off of a page, you would use a very similar approach for something like a stock quote or product price.

We'll try to keep the pace of the tutorial such that (hopefully) you won't get bored or frustrated. Along the way, if you'd like more information on a topic, try the links at the bottom of each screen.

The scraping session you are about to create (choose Interpreted Java or VBScript):

Attachment                                              Size
Hello World (Scraping Session--Interpreted Java).xml    3.78 KB
Hello World (Scraping Session--VBScript).xml            4.28 KB

Tutorial 1: Page 2: Screen-Scraping Overview

In many ways working with screen-scraper is like working with a database, such as MySQL or SQL Server. With databases, you'll generally use an interface (often a graphical user interface) to create objects such as tables, columns, and indexes. Once you've set up the database you'll often write programming code to populate it with data as well as to pull information from it. Likewise, with screen-scraper you'll use its graphical user interface to create the objects needed to extract information from web sites. Once you've set up these objects you'll write programming code to interact with screen-scraper and make use of the data it extracts.

Extracting information from web sites using screen-scraper typically involves four main steps:

1. Use the proxy server to determine the exact files that need to be requested in order to get the information you're after.
2. Create a scraping session with scrapeable files that define the sequence of pages screen-scraper will request.
3. Generate extractor patterns to define the exact information you need screen-scraper to grab from each page.
4. Write small scripts or programming code to invoke screen-scraper and/or work with the data it extracts. If you don't do much programming, don't worry. Generally the scripts you'll need to write to work with screen-scraper are small and simple, and you can usually just modify the example scripts we provide.

We'll now walk through each of these steps in detail.

Tutorial 1: Page 3: Proxy Server Setup

An HTTP proxy server is basically just a program that sits in between a web browser and a web server, passing bits between each. screen-scraper contains a proxy server that allows you to view all requests that your web browser sends, and the corresponding responses that web servers send in return. The proxy server records all of the pages requested by your browser as you surf so that they can be easily scraped by screen-scraper at a later point.



OK, enough talk; it's time to fire up screen-scraper. If you're running Windows this is done by selecting the appropriate link from the "Start" menu. On Unix/Linux or Mac OS X use the "screen-scraper" link that was created when you installed screen-scraper.

Once screen-scraper has fully loaded you'll see a tree on the left which will contain the objects we'll be creating. Right now we need to set up screen-scraper's proxy server.

In screen-scraper you'll generally use a proxy session for each web site you'd like to extract information from. A proxy session holds all of the HTTP requests and responses recorded from your browser for the period of time you run it. Create a proxy session now by clicking the "New Proxy Session" button (looks like a globe) or by selecting "New Proxy Session" from the "File" menu. screen-scraper should now look like this:






Give the proxy session a name by typing "Hello World" into the "Name" field. The "Port" field determines the port number that your web browser will use when communicating with screen-scraper's proxy server. The bottom checkbox causes the proxy server to ignore binary files (which are generally not very interesting when you're scraping text-based data). For now we're only concerned with the "Port" field, which you should be able to leave as 8777.

Next we need to set up your web browser so that it will use screen-scraper as a proxy server. If you have two web browsers installed on your computer we recommend using one of them to continue through the tutorial and the other to interact with the proxy server. For example, if you have Internet Explorer and Firefox installed you may want to view the tutorial pages using Firefox and use Internet Explorer with the proxy server. Odds are you're using Internet Explorer as your primary browser, so we'll give detailed instructions on setting it up. If you're using a different web browser try one of the following links: Firefox, Opera, Mozilla, or Netscape

Open up Internet Explorer, then click on "Internet Options" from the "Tools" menu. You should get a dialog box like this:






From here click on the "Connections" tab, then on the "LAN Settings" button. Click on the checkbox beginning with "Use a proxy server for...", then on the "Advanced..." button. The dialog box should now look like this:






In the "HTTP" and "Secure" fields type "localhost" under the "Proxy address to use" column, and "8777" under "Port" (assuming you haven't changed the default port number from 8777). Hit the "OK" button a few times till you get back to your web browser. NOTE: Depending on your operating system, instead of "localhost" you may need to use either "127.0.0.1" or the IP address of the machine. If you have trouble connecting to screen-scraper's proxy with your web browser, please see this FAQ.

At this point your browser is set up such that any time you click on a link or submit a form the request will first go to screen-scraper, where it will be recorded, and then get sent to the web server it was intended for. The web server will respond back to screen-scraper, which will record the response, then send it along to your web browser.
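
To make that relay a little more concrete, here is a minimal Java sketch of a client that, like the browser you just configured, would send its requests through a proxy on localhost port 8777. The class and method names are made up for illustration; only the host name and port come from the steps above, and nothing here is part of screen-scraper itself.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

class ProxyClientSketch {
    // Hypothetical helper: describes an HTTP proxy listening on this machine.
    static Proxy localProxy(int port) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress("localhost", port));
    }

    public static void main(String[] args) {
        // 8777 is the default port used for the proxy session in this tutorial.
        Proxy proxy = localProxy(8777);
        System.out.println("Requests would be relayed through: " + proxy.address());
        // While the proxy session is running, a request could be routed through it
        // with new java.net.URL("http://...").openConnection(proxy).
    }
}
```

The sketch only builds the proxy description and prints it, so it runs whether or not the proxy server has been started.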

If you're running Mac OS X, and are using screen-scraper Professional or Enterprise Edition, there's one more step you'll need to take. In screen-scraper, click the wrench icon to bring up the "Settings" dialog box. Click on the "Servers" button in the panel on the left, then remove any text contained in the "Hosts to allow to connect" text box. Because of the way Mac OS X handles IP addresses, we do this so that screen-scraper will accept connections from your web browser.

At this point we can get the proxy server running. Do this now in screen-scraper by clicking on the "Start Proxy Server" button for your proxy session. After this, click on the "Progress" tab, which will display all of the requests and responses recorded by the proxy server.

You're now ready to have screen-scraper record a few pages for you...

Tutorial 1: Page 4: Recording Pages with the Proxy Server

Return now to your web browser and go to the following URL:

http://www.screen-scraper.com/tutorial/basic_form.php

If you take a look at screen-scraper you'll notice that it recorded this page in the "HTTP Transactions" table. If you click on the first row in the table information related to your browser's request and response will appear in the lower pane:





If you didn't see your page show up in the "HTTP Transactions" table, or if your browser seems to have trouble, take a look at this FAQ for help.

The lower pane shows the details of the HTTP request your browser made--the request line, any HTTP headers (including cookies), as well as POST data (if any was sent). You can view the corresponding response from the server by clicking on the "Response" tab. Don't worry if a lot of what you're seeing doesn't make much sense; for the most part screen-scraper takes care of these kinds of details for you (such as keeping track of cookies).
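
If you'd like a feel for what those pieces look like as plain text, the following Java sketch assembles a bare-bones GET request for the tutorial page: a request line, a couple of headers, then a blank line. It is only an illustration with a hypothetical class name; the request your browser actually sent will carry more headers than this, and the exact text recorded by the proxy may differ.

```java
class HttpRequestSketch {
    // Builds the text of a simple GET request: the request line first,
    // then the headers, then a blank line. A GET request has no POST data.
    static String buildGetRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    public static void main(String[] args) {
        System.out.print(buildGetRequest("www.screen-scraper.com", "/tutorial/basic_form.php"));
    }
}
```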

At this point, in your web browser, type "Hello world!" (without the quotes) into the form text box and click the "Submit" button. This simply submits the form using the GET method to this same page, and displays what you typed in. We now have all of the pages we need recorded, so click on the "General" tab in screen-scraper then click on the "Stop Proxy Server" button. Now might also be a good time to adjust your web browser so that it no longer uses screen-scraper as a proxy server.
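
Submitting a form with the GET method means the browser appends the field name and the URL-encoded value to the page address. The sketch below shows that encoding with Java's standard URLEncoder. The field name text_string is taken from the resolved URL that appears in the scraping session log later in this tutorial; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FormGetSketch {
    // URL-encodes one form value: spaces become "+", and characters such as "!"
    // become percent escapes.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    // Appends a single field to the page URL the way a GET form submission does.
    static String buildGetUrl(String pageUrl, String field, String value) {
        return pageUrl + "?" + encode(field) + "=" + encode(value);
    }

    public static void main(String[] args) {
        String url = buildGetUrl("http://www.screen-scraper.com/tutorial/basic_form.php",
                "text_string", "Hello world!");
        System.out.println(url);
        // prints http://www.screen-scraper.com/tutorial/basic_form.php?text_string=Hello+world%21
    }
}
```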

Tutorial 1: Page 5: Generating a Scrapeable File

At this point we're ready to start creating the objects that screen-scraper will use to extract data from the page. We start by creating a scraping session. A scraping session is simply a container for all of the files and other objects that will allow us to extract data from a given web site. Either click the "New Scraping Session" button (looks like a gear) or click on the "File" menu, then select "New Scraping Session". After the scraping session appears rename it to "Hello World" (note that if you imported the scraping session at the beginning of the tutorial you'll want to name it something else--perhaps "My Hello World"). Your window should now look like this:



Now return to our "Hello World" proxy session by clicking on it in the tree on the left (the one with the globe by it), then click on the "Progress" tab. Click on the second (that is, the last) row in the "HTTP Transactions" table, which is the request made when you submitted the form. In the lower pane make sure "Hello World" is selected from the drop-down list labeled "Generate scrapeable file in:", then click the "Go" button. A scrapeable file is a web page that contains information we're interested in extracting. First off, let's rename our scrapeable file "Form submission". Your screen should now look like this:



Just to make sure things are good so far let's run a quick test. Run the "Hello World" scraping session by clicking on it in the tree on the left, then clicking the "Run Scraping Session" button. Now click on the "Log" tab. It should just take a moment to run, after which the log should show the following:

Starting scraper.
Running scraping session: Hello World
Processing scripts before scraping session begins.
Scraping file: "Form Submission"
Form Submission: Preliminary URL: http://www.screen-scraper.com/tutorial/basic_form.php
Form Submission: Using strict mode.
Form Submission: Resolved URL: http://www.screen-scraper.com/tutorial/basic_form.php?text_string=Hello+...
Form Submission: Sending request.
Processing scripts after scraping session has ended.
Scraping session "Hello World" finished.

The log is an invaluable tool for debugging scraping sessions, and you'll want to use it often. In this case it shows that screen-scraper requested the only scrapeable file in our scraping session ("Form submission"). You can view the text of the file that was scraped by clicking on "Form submission" in the tree on the left, then clicking the "Last Response" tab. Click the "Display Response in Browser" button to ensure that the page looks like the one in your browser (it may not look exactly like it, but should resemble it closely). It's often helpful to view the last response for a scrapeable file after running a scraping session so that you can ensure that screen-scraper requested the right page.

!!!!QUICK TIP!!!!
A good principle of software design is to run code often as you make changes. Likewise, with screen-scraper it is a good idea to run your scraping session frequently and watch the log and last responses to ensure that things are working as you intend them to.

Now would be a good time to save your work. Click the "Save" button (looks like a disk) or select the "Save" option from the "File" menu.

Tutorial 1: Page 6: Generating an Extractor Pattern

This is probably the trickiest part of the tutorial, so if you've been skimming up to this point you'll probably want to read this page a little more carefully. An extractor pattern is a block of text (usually HTML) that contains special tokens that will match pieces of data you're interested in extracting. These tokens are text labels surrounded by the delimiters ~@ and @~.

You can think of an extractor pattern like a stencil. A stencil is an image in cut-out form, often made of thin cardboard. As you place a stencil over a piece of paper, apply paint to it, then remove the stencil, the paint remains only where there were holes in the stencil. Analogously, you can think of placing an extractor pattern over the HTML of a web page where the tokens correspond to the holes where the paint would pass through. After an extractor pattern is applied it reveals the portions of the web page you'd like to extract.

Take a look at the HTML from the page we scraped by clicking on the "Form submission" scrapeable file, then on the "Last Response" tab. If you click the "Render HTML" button you should see a screen resembling the page you saw in your browser. Consider this snippet of HTML from the page:


<table align="center">
<tr>
<td><span style="color: red">You typed: Hello world!</span> </td>
</tr>
</table>


As we're interested in extracting the string "Hello world!" our extractor pattern would look like this:

<table align="center">
<tr>
<td><span style="color: red">You typed: ~@FORM_SUBMITTED_TEXT@~</span> </td>
</tr>
</table>

The string "~@FORM_SUBMITTED_TEXT@~" is the token that corresponds to the data we're interested in, and, after this extractor pattern is applied, would hold the string "Hello world!". Returning to our stencil analogy, the "~@FORM_SUBMITTED_TEXT@~" token is analogous to the hole in the stencil where the paint would pass through. In a bit we'll look at how we might make use of the data extracted by that token.
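
For readers who think in code, the stencil idea can be approximated in a few lines of Java. This is only a rough sketch of the concept, not a description of how screen-scraper applies patterns internally: it assumes each ~@LABEL@~ token can be treated as a "hole" that captures as little text as possible, and that everything else in the pattern must appear in the page exactly as written. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractorPatternSketch {
    // Matches a token such as ~@FORM_SUBMITTED_TEXT@~ and captures its label.
    private static final Pattern TOKEN = Pattern.compile("~@(\\w+)@~");

    // Applies the pattern text to the page and returns label -> extracted text
    // for the first match, or an empty map if the pattern does not match.
    static Map<String, String> extract(String patternText, String page) {
        List<String> labels = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher tokens = TOKEN.matcher(patternText);
        int literalStart = 0;
        while (tokens.find()) {
            // Text between tokens is the solid part of the stencil.
            regex.append(Pattern.quote(patternText.substring(literalStart, tokens.start())));
            // Each token is a hole that lets the page text through.
            regex.append("(.*?)");
            labels.add(tokens.group(1));
            literalStart = tokens.end();
        }
        regex.append(Pattern.quote(patternText.substring(literalStart)));

        Map<String, String> extracted = new LinkedHashMap<>();
        Matcher match = Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(page);
        if (match.find()) {
            for (int i = 0; i < labels.size(); i++) {
                extracted.put(labels.get(i), match.group(i + 1));
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        String page = "<td><span style=\"color: red\">You typed: Hello world!</span> </td>";
        String pattern = "<span style=\"color: red\">You typed: ~@FORM_SUBMITTED_TEXT@~</span>";
        System.out.println(extract(pattern, page)); // prints {FORM_SUBMITTED_TEXT=Hello world!}
    }
}
```

Note that this sketch demands an exact match on whitespace and only reports the first match, so it is far less forgiving than a real extractor pattern is likely to be.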

We'll now create an extractor pattern that will extract the "Hello world!" text you typed in to the HTML form. Under the "Form submission" scrapeable file, click on the "Extractor Patterns" tab, then click on the "Add Extractor Pattern" button. Give your extractor pattern the identifier "Form data", and in the "Pattern text" box enter the extractor pattern shown above. Your screen should now look like this:



Go ahead and give the extractor pattern a try by clicking on the "Apply Pattern to Last Scraped Data" button. The following window will appear, displaying the text that our extractor pattern extracted from the page:



Looks like our extractor pattern has matched the snippet of text we were after. The "Apply Pattern to Last Scraped Data" button is another invaluable tool you'll use often to make sure you're getting the right data. It simply takes the HTML from the "Last Response" tab and applies the extractor pattern to it.

!!!!QUICK TIP!!!!
When creating extractor patterns, always be sure to use the HTML from screen-scraper's "Last Response" tab rather than the HTML source shown in your web browser. Before screen-scraper applies an extractor pattern to an HTML page, it "tidies" up the HTML to facilitate extraction. This will generally cause the HTML to be slightly different from the HTML you'd get directly from your web browser.

Before we continue we need to take a look at one more thing. Extractor pattern tokens have properties, one of which we'll need to modify. To modify the properties for our "~@FORM_SUBMITTED_TEXT@~" extractor pattern token, double-click it (that is, double-click on the text FORM_SUBMITTED_TEXT found between the ~@ and @~ delimiters in the "Pattern text" box) or select it, right-click it (or Control-click in Mac OS X), then select "Edit token". You'll see the following box:



screen-scraper makes use of session variables, which allow you to save and persist objects throughout the life of a scraping session. This means that screen-scraper will save the extracted data in memory so that it can be used later in scripts and such. In this case we'd like to save the text that our "~@FORM_SUBMITTED_TEXT@~" extractor pattern token extracts. Indicate this now by clicking the "Save in session variable?" checkbox, then closing the "Edit Token" window. In other words, when screen-scraper runs this scraping session and extracts the text for this extractor pattern it will save that text (e.g., "Hello world!") in a session variable so that we can do something with it later. Next we'll make use of the data we extract...
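
Conceptually, you can picture session variables as a table of names and values that lives as long as the scraping session does. The Java sketch below is a hypothetical stand-in written for this explanation, not screen-scraper's own session object: it simply stores a value under the token's name and reads it back later.

```java
import java.util.HashMap;
import java.util.Map;

class SessionVariablesSketch {
    // Hypothetical stand-in for the variables held by one scraping session.
    private final Map<String, Object> variables = new HashMap<>();

    void setVariable(String name, Object value) {
        variables.put(name, value);
    }

    // Returns null if no value was ever saved under this name.
    Object getVariable(String name) {
        return variables.get(name);
    }

    public static void main(String[] args) {
        SessionVariablesSketch session = new SessionVariablesSketch();
        // Roughly what saving an extracted token in a session variable amounts to.
        session.setVariable("FORM_SUBMITTED_TEXT", "Hello world!");
        // Anything that runs later in the same session can read the value back.
        System.out.println(session.getVariable("FORM_SUBMITTED_TEXT")); // prints Hello world!
    }
}
```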

Tutorial 1: Page 7: Overview of Writing a Simple Script

We'll now do something with the data we've extracted by writing a simple script. A screen-scraper script is a block of code that will get executed when a certain event occurs. For example, you might have a script that gets invoked at the beginning of a scraping session that initializes variables. Another script might get invoked each time a row in a list of search results is extracted from a site so that the information in that search result can be inserted into a database. You can think of this as being analogous to "event handling" mechanisms in other programming languages. For example, in an HTML page you might associate a JavaScript method call with the "onLoad" event for the body tag. In Visual Basic you'll often create a sub-routine that gets invoked when a button is clicked. In the same way, screen-scraper scripts will get invoked when certain events occur related to requesting web pages and extracting data from them.

If you don't have much experience programming, don't worry; scripts written in screen-scraper are generally short and simple. The script we'll be creating will simply write out the text we extract to a file.

In preparation for writing our script click the "New Script" button (looks like a pencil and paper) or select "New Script" from the "File" menu, and give it the identifier "Write extracted data to a file". Your screen should now look like this:






screen-scraper supports scripting in Interpreted Java, JavaScript, and Python when running on any operating system, and JScript, Perl, and VBScript when running on Windows. At this point, depending on the language you prefer, you can continue on with an explanation of scripting in Interpreted Java or VBScript, using one of the links below.

Tutorial 1: Page 8: Writing a Simple Script in Interpreted Java

screen-scraper uses the BeanShell library to allow for scripting in Java. If you've done some programming in C or JavaScript you'll probably find BeanShell's syntax familiar.

Let's get right to it. Copy and paste the following text into the box labeled "Script Text":

// Output a message to the log so we know that we'll be writing the text out to a file.
session.log( "Writing data to a file." );

// Create a FileWriter object that we'll use to write out the text.
out = new FileWriter( "form_submitted_text.txt" );

// Write out the text.
out.write( session.getVariable( "FORM_SUBMITTED_TEXT" ) );

// Close the file.
out.close();

Hopefully it's obvious what's going on, based on the comments in the script. We simply create an object used to write out the text (a "FileWriter"), write it out, then close up the file. Note the session.getVariable( "FORM_SUBMITTED_TEXT" ) method call, which retrieves the value of the "FORM_SUBMITTED_TEXT" session variable. This method call is able to get the value because we indicated earlier that the value for the "FORM_SUBMITTED_TEXT" token was to be saved in a session variable (i.e., when we checked the "Save in session variable?" box).
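
If you'd like to try the same open, write, close sequence outside of screen-scraper, here is a small standalone Java sketch. Because there is no session object outside of screen-scraper, the text is passed in directly; the class and method names are hypothetical. It also assumes that a variable that was never saved would come back as null, and guards against that case.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;

class WriteExtractedTextSketch {
    // Writes the value to the named file, replacing any existing contents.
    static void writeText(String fileName, Object value) {
        // Guard against a missing value so we write an empty file instead of failing.
        String text = (value == null) ? "" : value.toString();
        // try-with-resources closes the file even if the write fails.
        try (FileWriter out = new FileWriter(fileName)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Creates form_submitted_text.txt in the current working directory.
        writeText("form_submitted_text.txt", "Hello world!");
    }
}
```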

If you haven't done much programming, this is where things might seem a little confusing. If so, you may consider trying a basic tutorial on Java or JavaScript, which will hopefully introduce you to the basics of programming. You'll especially want to get an introduction to object-oriented programming.

Tutorial 1: Page 9: Invoking a Script

A script is executed in screen-scraper by associating it with some event, such as before or after an extractor pattern is applied to the text of a web page.

The script we've just written needs to be executed after screen-scraper has requested the web page and extracted the data we need from it.

At this point return to the extractor pattern we just created by clicking on the "Form submission" scrapeable file in the tree on the left, then on the "Extractor Patterns" tab. In the lower part of your screen click on the "Add Script" button. Select "Write extracted data to a file" in the column on the left, and select "After pattern is applied" in the third column. Your screen should now look like this:






Our "Write extracted data to a file" script will be invoked after screen-scraper has applied the "Form data" extractor pattern to the web page. That is, once the extractor pattern has been applied as many times as it needs to be (which is only once, in this case), it will invoke the script.

The curious might be wondering a bit more about the difference between "After pattern is applied" and "After each pattern application". Consider a web page that contains a table with 10 rows. We might create an extractor pattern that matches a single row in the table. The extractor pattern would match 10 times--once for each row in the table. If we associated a script with the extractor pattern and told it to run "After pattern is applied", the script would only get executed one time (i.e., after the pattern has matched as many times as it needs to). If we had indicated that the script should run "After each pattern application", it would get executed 10 times--one time for each match the pattern makes. In the current case, the pattern only matches one time, so it doesn't make a big difference whether we indicate "After pattern is applied" or "After each pattern application".
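
The 10-row example can be simulated in a few lines of Java. This sketch only models the timing described above, with made-up names: one callback stands in for a script set to run "After each pattern application" and the other for a script set to run "After pattern is applied".

```java
class ScriptTimingSketch {
    // Simulates applying a pattern that matches matchCount times.
    static void applyPattern(int matchCount, Runnable afterEachApplication,
                             Runnable afterPatternApplied) {
        for (int i = 0; i < matchCount; i++) {
            // Runs once per match, e.g. once per table row.
            afterEachApplication.run();
        }
        // Runs a single time, after all of the matches have been made.
        afterPatternApplied.run();
    }

    public static void main(String[] args) {
        int[] runs = new int[2];
        applyPattern(10, () -> runs[0]++, () -> runs[1]++);
        System.out.println("After each pattern application: " + runs[0] + " runs");
        System.out.println("After pattern is applied: " + runs[1] + " run");
    }
}
```

With 10 matches the first counter ends at 10 and the second at 1; with a single match, as in our "Form data" pattern, both end at 1.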

Tutorial 1: Page 10: Running the Completed Scraping Session

Finally, we have everything in place to run our scraping session. Click on the "Hello World" scraping session in the tree on the left, then click on the "Log" tab. If there is existing text in the "Log" get rid of it by clicking the "Clear Log" button. Now click on the "Run Scraping Session" button. After it finishes running, take a look at the contents of the "form_submitted_text.txt" file, which will be located in the screen-scraper installation directory (e.g., C:\Program Files\screen-scraper professional edition\).
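
If you're wondering why the file lands there, our script passed a bare file name to FileWriter, and Java resolves a relative name like that against the working directory of the running program. Presumably screen-scraper's working directory is its installation directory, which would explain the location. The short sketch below (with a hypothetical class name) shows where a relative name would resolve for whatever program runs it.

```java
import java.io.File;

class OutputLocationSketch {
    // Returns the full path that a relative file name resolves to.
    static String resolve(String relativeName) {
        return new File(relativeName).getAbsolutePath();
    }

    public static void main(String[] args) {
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("File would be written to: " + resolve("form_submitted_text.txt"));
    }
}
```

If you'd rather have the file somewhere else, passing a full path to FileWriter in the script avoids depending on the working directory.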

Tutorial 1: Page 11: Where to Go From Here

Congratulations! You now have the basic core knowledge you need to scrape screens with screen-scraper. While this was a very simple example of a scraping session, we did cover most of the main principles you need to start your own project. If you have the time, we'd highly recommend continuing on to Tutorial 2: Scraping an E-commerce Site, as well as Tutorial 3: Extending Hello World. Otherwise, you may want to consider reading through some of the existing documentation as you work on your own project.