Tutorial 5: Saving Scraped Data to a Database

Saving Scraped Data to a Database

We continue on with our e-commerce site in this tutorial by inserting the data we scrape into a database. Generally once you've extracted data from a web site you want to save it out either to a file or a database. We already went over saving the data to a file in Tutorial 2, so here we'll cover inserting the information into a database.

If you haven't already gone through Tutorial 2, you'll probably want to take some time to do that before continuing on with this tutorial.

If you decided not to go through Tutorial 2, or don't still have the scraping session and scripts you created in it, you can download and load them into screen-scraper by following these steps:

  1. Download the zip file located here and unzip it. You should now have an "interpreted_java" directory and a "vbscript" directory.
  2. If you're running Windows, and prefer to program in VBScript, import the "Shopping Site (Scraping Session).sss" scraping session located in the "vbscript" directory; otherwise, import the one located in the "interpreted_java" directory. You'll also want to import the "Shopping Site--start session (Script).sss" file. Instructions on importing objects into screen-scraper can be found here.

Once you've got the scraping session and script imported into screen-scraper you're ready to roll. Click on the "Tutorial Details" link below to get going.

Tutorial 5: Page 2: Tutorial Details

Tutorial Details

There are a number of ways to insert scraped data into a database, which we outline in this FAQ. Take a minute now to look through that. We'll be giving an example of the last option mentioned, which is one of the easier methods to implement.

If you're using the Enterprise Edition of screen-scraper, you should be aware of screen-scraper's ability to handle scraped data in real time (available only in the Enterprise Edition). As of right now, this has been implemented in the Java and PHP drivers for screen-scraper. If you're running the Enterprise Edition, and want to interact with screen-scraper using either of those languages, read over the "Handling Scraped Data in Real Time" section of either our Invoking screen-scraper from Java or Invoking screen-scraper from PHP pages for details on this. The current tutorial doesn't cover this approach, but it's quite a bit easier and cleaner to implement than the method that will be described here. Later on we'll likely create a tutorial for Enterprise Edition users that makes use of this approach.

The basic idea in this tutorial is that we'll have a special scrapeable file that will POST data to a PHP file, which will handle inserting the data into a database. The flow of events will look like this:

  1. Extract the data from the site, saving each value (product title, price, etc.) in session variables.
  2. Invoke a "Save product" scrapeable file, which will POST the extracted data to a PHP file.
  3. The PHP file accepts and validates the data. If the data was incomplete it returns an error message. It then attempts to insert the data into the database. If everything went well it returns a "success" message; otherwise, it returns an error message.
  4. In screen-scraper we use an extractor pattern to check the success/failure status of the error message.

We'll start by modifying our existing "Shopping Site" scraping session a bit, adding to it the scrapeable file that will POST the data to our PHP file.

Tutorial 5: Page 3: Setting Up the Scraping Session

Setting Up the Scraping Session

We'll first modify our existing scraping session a bit to get it ready to save the scraped data to our database. First, click on the "Details page" scrapeable file in the tree on the left, then on the "Extractor Patterns" tab, then click the "Sub-Extractor Patterns" tab for our "DETAILS" extractor pattern. We're going to update each of our extractor pattern tokens so that they save their extracted values in a session variable. Do this by double-clicking each of them (e.g., on ~@TITLE@~) or right-clicking (control-clicking on Mac OS X) and selecting "Edit token". In the "Edit Token" box click the "Save in session variable?" check box, then close the "Edit Token" window. Do that for each extractor pattern token (~@TITLE@~, ~@PRICE@~, etc.).

We need to save the values in a session variable so that we can use them as POST parameters in the scrapeable file that POSTS's to our PHP file.

Let's create that scrapeable file now. Click on the "Shopping Site" scrapeable file in the tree on the left, then click the "Add Scrapeable File" button, found on the "General" tab. Once the scrapeable file appears give it the name "Save product". In the URL field enter:

http://www.screen-scraper.com/support/tutorials/tutorial5/db/save_product.php

Check the box labeled "This scrapeable file will be invoked manually from a script".

Click on the "Parameters" tab for the new scrapeable file, and give it five POST parameters, as shown in the screen-shot below:


You might remember that the ~# #~ delimiters indicate that the value of the corresponding session variable should be substituted in. For example, in our case the value of the TITLE session variable (e.g., "A Bug's Life") will be substituted in for the ~#TITLE#~ token. This value will be the one that gets submitted to the PHP file so that it can be inserted into the database.

Finally, we need to create a simple script that will invoke our new scrapeable file. Click on the "New Script" button (looks like a pencil and paper) in the button bar. Give the script the name "Save product", and give it the "Script Text":

session.scrapeFile( "Save product" );

The script simply tells screen-scraper to invoke the "Save product" scrapeable file.

Now we need to tell screen-scraper when to invoke the scrapeable file. We need it invoked for each product, so that they all get saved to the database. As such, we'll invoke the script after the "Details page" is requested. Do this by clicking on the "Details page" scrapeable file in the tree on the left, then on the "Scripts" tab. Click the "Add Script" button, and in the "Script Name" column select "Save product". Under the "When to Run" column select "After file is scraped".

Okay, we're done setting up screen-scraper, so we're ready give our scraping session a run. Before we invoke it, let's make one minor tweak so that the session doesn't take quite so long to run. In the "Shopping Site--initialize session" script, change the value for the "SEARCH" session variable from "dvd" to "bug". We'll get the two "Bug's Life" DVD's rather than every DVD in the system. Once you've done that click on the "Shopping Site" scraping session in the tree on the left, then on the "Run Scraping Session" button.

Once the scraping session has run it's course click on the "Save product" scrapeable file, then on the "Last Response" tab. You should see something like this for the response:

<?xml version="1.0" encoding="UTF-8"?>
<result>
<status>Success</status>
<product>
<title>A Bug\'s Life \"Multi Pak\"</title>
<price>35.99</price>
<manufactured_by>Warner</manufactured_by>
<model>DVD-ABUG</model>
<shipping_weight>7.00 lbs.</shipping_weight>
</product>
</result>

Which indicates that the last product was successfully inserted.

Now it's time to take a closer look at the PHP file...

Tutorial 5: Page 4: The PHP File

The PHP File

Hopefully this is obvious, but we could just as easily use a web script written in ASP, Cold Fusion, or anything else that can be accessed via HTTP. In this tutorial we use PHP as an example simply because it's one of the most commonly used languages. Don't worry if you're not familiar with PHP, though, most of what we'll be going through is simple pseudo-code.

First try accessing the PHP directly here: http://www.screen-scraper.com/support/tutorials/tutorial5/db/save_product.php.

First, you'll notice that when the PHP script is accessed directly (i.e., not via a POST request) it simply displays a form that you can use to test it. Go ahead and try that now. Enter in some bogus information, then submit the form. If you entered in at least a title and a price you should get a small XML document that resembles the one you saw in screen-scraper. Go back to the form, leave the "Price" field blank, then re-submit the form. This time you get a message indicating that the data is incomplete.

When you submit the form you're taking exactly the same action that screen-scraper does when it invokes its "Save product" scrapeable file. The data is submitted to the PHP file via a POST request, validated, then inserted into the database.

We've simplified our example some to mostly pseudo-code. Take a look over the code for the PHP file here: http://www.screen-scraper.com/support/tutorials/tutorial5/db/save_product.php.txt. It's pretty heavily commented, so hopefully you can follow it even if you don't know PHP. Your code will obviously vary depending on the database you're using and any data validation you want to perform.

Having your web application return some kind of status message allows you to handle error conditions and such within screen-scraper. In this case you would probably want to set up an extractor pattern for the "Save product" scrapeable file that might look something like this:

<status>~@STATUS@~</status>

You might then write a script that does something special in the case of an error.

Tutorial 5: Page 5: Where to Go From Here

Where to Go From Here

The best way to proceed would probably be to try this on your own project. If you run into any glitches don't hesitate to drop us a line so that we can lend a hand. You can find links to helpful items on our Support Page.