Global Array or Accessing a variable from 2 Scrapeables

Ok... So I'm still trying to scrape that forum. The reason I'd like to do it is that the Yahoo group I am subscribed to isn't set up like phpBB. Namely, all posts are in a big long list, instead of being broken into topics.

So at the bottom of the individual posts it has links to the responses in the thread. (Including a link back to itself.) I set up one scrapeable file POST to grab the main post and the numbers that represent the threads. I then set up another scrapeable file THREADS to follow the threads. (This is invoked after the EXTRACTOR PATTERN.)

However, when I go back to my original scrapeable file POST, I don't want to re-scrape any pages that have already been scraped. I want to ignore the posts that have already been scraped by either the POST or the THREADS scrapeable file.

So I want to set up a global array to keep track of these. When I scrape a file I would set ARRAY(THREAD_NUMBER) = 1 and ARRAY(POST_NUMBER) = 1

Then I would only scrape THREAD_NUMBER or POST_NUMBER if the value in that ARRAY() slot is not equal to 1.

I am using VBscript and I've been struggling to access the ARRAY that I set up in my start file. Any ideas?

Are there such beasts as global arrays, or is there another way to tackle this? Until then, I'm completely stumped!

Great product BTW. I know there must be a way to do this, I'm just not seeing it!

Cheers,
McFly

Global Array or Accessing a variable from 2 Scrapeables

Great news! Thanks for sharing the end result. That's fabulous that you were able to get it to work.

Todd

Global Array or Accessing a variable from 2 Scrapeables

OK so I finished scraping, and I am still amazed at the power of this program! Yay! It took me a while to work out all of the kinks (in my own programming) but I eventually got it to work. You can see the results here:

http://www.quickheads.com/archive

Now the Yahoo! Group had over 27,000 posts, and since Yahoo! also has a server setting that prevents you from downloading more than 1000 pages in a single session, I had to download all 27,000 posts to my local hard drive over about a week using another software program.

When I had that completed I began using the screen scraping scripts suggested by Todd above. However, although this got me pointed in the correct direction using arrays, the solution posted above ultimately didn't work for me. This may be because I didn't fully understand what was going on, but it looked like Todd's script initialized an array of unknown size, and when you scraped a page it put a value of TRUE at the top of the stack and increased the size of the array by 1.

This would have worked great if I were scraping in numerical order, but since I would scrape page 1, then page 287, then page 4305, I needed a way to initialize an array of a certain size and then say ARRAY[207]=TRUE after I scraped page 207.

Since I knew that the Yahoo! Group had fewer than 28000 pages, I used the script below to initialize the array:

// SETUP ARRAY
postIDs = new ArrayList();
session.setVariable( "POST_IDS_ARRAY", postIDs );

for (int i=0; i<=28000; i++)
{
    postIDs.add( 0 );
}

This created an array with 28001 slots in it (indices 0 through 28000), each set to 0. After I scraped a page I set the slot whose index matched the page number to 1.
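For readers who want to try the idea outside screen-scraper, here is a minimal standalone sketch of the same preallocated list. It is only an illustration: the class name PostFlags and the method newFlags are made up for this example, it uses plain Java rather than Interpreted Java, and it leaves out the session variable entirely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of screen-scraper.
class PostFlags {
    // Builds a list with indices 0..maxPostId, every slot set to 0 (not yet scraped).
    static List<Integer> newFlags(int maxPostId) {
        return new ArrayList<>(Collections.nCopies(maxPostId + 1, 0));
    }

    public static void main(String[] args) {
        List<Integer> flags = newFlags(28000);
        flags.set(207, 1);                  // mark post 207 as scraped
        System.out.println(flags.size());   // prints 28001
        System.out.println(flags.get(207)); // prints 1
        System.out.println(flags.get(208)); // prints 0
    }
}
```

Because every slot exists up front, any post number up to the maximum can be marked in any order, which is the property needed when pages are scraped out of sequence.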

I used the following script to scrape the 27000+ Pages:

// LOOP THROUGH YAHOO! GROUP and scrape files
for (int i=1; i<=27525; i++)
{
    // MID is used in the URL field of the scraping file
    session.setVariable( "MID", i );

    // This checks to see if the page has yet to be scraped - if not, scrape it!
    if( postIDs.get(i)==0 )
    {
        session.scrapeFile( "Yahoo Group - scrapeable file" );
    }

}

After I wrote the scraped info to a file I used the following code to set the array slot to 1

currentPostID = session.getVariable( "THREADED_POSTS" );
postIDs = session.getVariable( "POST_IDS_ARRAY" );
postIDs.set( Integer.parseInt(currentPostID) , "1" );

I know this is a little confusing, but if you have questions about the method, or if you have a better way to do it, I'd be delighted to see it.

Again very impressive software! Thanks again.

Cheers,
McFly

Global Array or Accessing a variable from 2 Scrapeables

Thanks again Todd,
I'll play around with the scripts this evening and report my results. Again I appreciate your time. :D

-McFly

Global Array or Accessing a variable from 2 Scrapeables

Hi,

I can think of a few ways you could go about this, and it sounds like your idea to simply check for duplicates would work as well as any.

If you're about equally familiar with VBScript and Java, I'd highly recommend writing your scripts in Interpreted Java. It plays quite a bit better with screen-scraper than VBScript.

At the very beginning of your scraping session I would create a script that looks something like this:

postIDs = new ArrayList();
session.setVariable( "POST_IDS_ARRAY", postIDs );

That script will initialize an empty array you can use to hold the various IDs of the postings.

I would then create a script that gets invoked once you've just scraped a posting, but before you want to scrape the related posts. It might look something like this:

currentPostID = session.getVariable( "CURRENT_POST_ID" );
postIDs = session.getVariable( "POST_IDS_ARRAY" );
postIDs.add( currentPostID );

That will add the current post ID to the array, so that you can check later for duplicates.

You might currently have a script that contains a line like this:

session.scrapeFile( "Posting" );

which simply scrapes the file that corresponds to a posting. I would change that to look something like this:

// Here I'm assuming that CURRENT_POST_ID corresponds
// to the ID of the POST you're about to scrape, and not the
// one you already did.
currentPostID = session.getVariable( "CURRENT_POST_ID" );
postIDs = session.getVariable( "POST_IDS_ARRAY" );

// If this evaluates to true, that means it doesn't contain
// the post ID you might scrape, so it should be safe to
// scrape it.
if( postIDs.indexOf( currentPostID )==-1 )
{
  session.scrapeFile( "Posting" );
}

I haven't tested any of those code snippets, but I'm pretty confident they'll work. The basic idea is to maintain the list of IDs in the array, then check it before you're about to scrape a posting to see if you already have it.
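The same check-then-record idea can also be sketched with a set instead of a list. This is a standalone illustration in plain Java, not screen-scraper code: the class ScrapedPosts and the method markIfNew are hypothetical names, the IDs are assumed to be strings, and the session variable is left out.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of screen-scraper.
class ScrapedPosts {
    private final Set<String> seen = new HashSet<>();

    // Records the ID and returns true the first time it is offered;
    // returns false on every later call with the same ID.
    boolean markIfNew(String postId) {
        return seen.add(postId);
    }

    public static void main(String[] args) {
        ScrapedPosts posts = new ScrapedPosts();
        System.out.println(posts.markIfNew("2555")); // prints true: safe to scrape
        System.out.println(posts.markIfNew("2555")); // prints false: already scraped
        System.out.println(posts.markIfNew("2578")); // prints true
    }
}
```

A set combines the lookup and the insert in one call, so there is no window in which an ID has been checked but not yet recorded.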

Kind regards,

Todd

Global Array or Accessing a variable from 2 Scrapeables

Thanks for responding Todd,
I have some experience programming, but very limited experience in VBscript and Java, so I'm kind of learning as I go.

The pages on the yahoo group are set up something like this:

 
<POST #1>

   <AUTHOR>AUTHOR NAME</AUTHOR>
   <DATE>DATE AND TIME</DATE>
   <BODY>BODY TEXT</BODY>

   <FOOTER>
        <RELATED POSTS>
             <POST #1>
             <POST #2>
             <POST #2555>
             <POST #2578>
       </RELATED POSTS>
   </FOOTER>

</POST #1>

So after I scrape POST #1, I grab the info I want: AUTHOR, DATE AND TIME, and BODY TEXT.

Then I want to scrape the RELATED POSTS and write them to the same file as POST #1 directly after it

However, since POST #1 appears again in RELATED POSTS I don't want to scrape it again.

I only want to scrape POST #2, POST #2555, and POST #2578

On top of that, when my original scrapeable file increments to POST #2 I want to ignore that now too, since I already scraped it as a RELATED POST.
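To make the intended order concrete, here is a rough standalone simulation of that behavior in plain Java. Nothing in it touches screen-scraper: the class ThreadWalk and the method scrapeOrder are hypothetical names, and the related-post links are supplied as an in-memory map instead of being scraped from pages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical simulation, not part of screen-scraper.
class ThreadWalk {
    // Walks post IDs 1..maxId in order. Each post not yet scraped is
    // recorded, followed by any of its related posts not yet scraped.
    static List<Integer> scrapeOrder(int maxId, Map<Integer, List<Integer>> related) {
        Set<Integer> scraped = new HashSet<>();
        List<Integer> order = new ArrayList<>();
        for (int id = 1; id <= maxId; id++) {
            if (!scraped.add(id)) {
                continue; // already scraped earlier as a related post
            }
            order.add(id);
            for (int rel : related.getOrDefault(id, Collections.<Integer>emptyList())) {
                if (scraped.add(rel)) {
                    order.add(rel); // skips the post's link back to itself
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> related = new HashMap<>();
        related.put(1, Arrays.asList(1, 2, 2555, 2578));
        // Prints [1, 2, 2555, 2578, 3]: post 2 is not scraped a second time.
        System.out.println(scrapeOrder(3, related));
    }
}
```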

I hope this makes my situation clearer. Although I'm afraid I made it more confusing. Please let me know.

In the meantime, I'll begin dusting off my JAVA skills :?

Can you post an ARRAY OBJECT example and how to access it from the scripts? I really appreciate your time.

Best regards,
McFly

Global Array or Accessing a variable from 2 Scrapeables

Hi,

I'm not sure I follow entirely, but if you're wanting to work with global arrays, that could be tricky in VBScript. The best route would be to create an instance of an ArrayList (a Java object), then use that to track your numbers. You'd store this ArrayList in a session variable, then access it wherever it's needed in any of your scripts.

Hope that helps...

Kind regards,

Todd Wilson