Running the Scraping Session
The last item we need to take care of is creating the text file that will contain our search terms. Let's keep it simple. Fire up your favorite text editor and create a file called "search_terms.txt" inside of screen-scraper's installation folder (e.g., "C:\Program Files\screen-scraper professional edition\search_terms.txt"). Add the following three lines to the text file:
bug
speed
blade
Those search terms should yield at least a few DVDs we can add to our collection.
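If you'd rather create the file from a script than from a text editor, here is a minimal Python sketch. The function name `write_search_terms` and its `folder` argument are made up for illustration, and you'd need to pass in the path to your own screen-scraper installation folder.

```python
from pathlib import Path


def write_search_terms(folder, terms=("bug", "speed", "blade")):
    """Write one search term per line to search_terms.txt inside `folder`."""
    path = Path(folder) / "search_terms.txt"
    # One term per line, matching the three-line file described above.
    path.write_text("\n".join(terms) + "\n", encoding="utf-8")
    return path


# Example call (the folder is an assumption; use your own install folder):
# write_search_terms(r"C:\Program Files\screen-scraper professional edition")
```

Writing to a folder under "Program Files" may require elevated permissions on Windows, so creating the file by hand works just as well.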
All right, now's the moment of truth. Run the new scraping session by clicking on it in screen-scraper and clicking the "Run Scraping Session" button. After that, click on the "Log" tab to watch it do its thing. If all goes well, once it's done, you should have a "dvds.txt" file in screen-scraper's install folder containing scraped data for all of the search terms.
Look carefully through the log. If it all seems to make sense, you're done. If not, read on so that we can walk through it a bit more carefully.
The flow of events goes like this, once you hit the "Run Scraping Session" button:
- The scraping session starts up, and immediately invokes the "Read search terms" script.
- The "Read search terms" script creates a few objects, then reads in the first line of the "search_terms.txt" file: "bug".
- The "Read search terms" script sets the "SEARCH" session variable to "bug". You'll remember from the earlier tutorial that the "SEARCH" session variable is used to perform each search. Check the "URL" field of the "Search results" scrapeable file for a reminder of where it's used.
- The "Read search terms" script initializes the "PAGE" session variable to "1". It turns out that this is probably unnecessary in this particular case, but you'll want to remember it for future projects. For each search term we're performing a completely separate search, so we need to make sure we start on the first page.
- The "Read search terms" script invokes the "Search results" scrapeable file. This is essentially the same thing as clicking the "Search" button on the search form with the current search term ("bug", in this case).
- The "Search results" scrapeable file makes the HTTP request, then applies the "PRODUCT" extractor pattern to the HTML in order to get all of the "details" links.
- For each match by the "PRODUCT" extractor pattern, the "Scrape details page" script gets invoked.
- At this point screen-scraper will loop zero or more times. It will scrape the "Details page" scrapeable file for each link found on the search results page.
- Each time the "Details page" scrapeable file is invoked, it requests the page, extracts the data we want, then invokes the "Write data to a file" script, which writes the extracted data to the "dvds.txt" file.
- Once screen-scraper has finished performing the search for "bug", control flows back to our original "Read search terms" script, which moves on to the next search term in the file: "speed". From there the process repeats from the third step above, where the "SEARCH" session variable is set.
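If it helps to see the nesting, the flow above can be sketched as a small standalone Python simulation. This is not screen-scraper's API or its scripting language. The functions, the `session` dictionary, and the `FAKE_RESULTS` links are all hypothetical stand-ins that only mirror the order in which the scripts and scrapeable files run.

```python
# Stand-in data: made-up details links per search term (not real site output).
FAKE_RESULTS = {"bug": ["details-1", "details-2"], "speed": [], "blade": ["details-3"]}


def search_results(session):
    """Stand-in for the "Search results" scrapeable file and its PRODUCT matches."""
    return FAKE_RESULTS.get(session["SEARCH"], [])


def details_page(link):
    """Stand-in for the "Details page" scrapeable file: returns one extracted record."""
    return {"link": link}


def run_session(search_terms):
    """Walk the search terms in order and collect what would be written to dvds.txt."""
    session = {}   # plays the role of screen-scraper's session variables
    log = []       # plays the role of the "Log" tab
    written = []   # plays the role of dvds.txt
    for term in search_terms:                 # "Read search terms" script
        session["SEARCH"] = term
        session["PAGE"] = "1"                 # every term starts on page one
        log.append(f"Searching for {term}, page {session['PAGE']}")
        for link in search_results(session):  # loops zero or more times
            record = details_page(link)       # "Scrape details page" script
            written.append(record)            # "Write data to a file" script
            log.append(f"Wrote record for {link}")
    return log, written
```

Calling `run_session(["bug", "speed", "blade"])` runs the outer loop once per search term and the inner loop once per details link, which is the same shape you should see when reading through the real log.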
Remember that the "Log" tab is key to understanding the flow of events in screen-scraper. If you're still a bit fuzzy on how things are working, try looking more carefully through the log to piece together how the site is being scraped.