SS 3.0 Strip HTML inserting <cr><lf> in output t

Hi:

The "Strip HTML" option evidently is doing a bit of interpreting as well. The output stream includes a number of cr-lf pairs.

Evidently the option is interpreting

 <p>- </p>

and

    <tr>
    </tr>

and possibly other tokens as line breaks, and is inserting a hard-coded cr-lf pair to account for output spacing.

The problem with this is that if you output the results to a tab-delimited text file for import into Excel, Excel can no longer field the data properly: by default it puts the field data on separate lines, breaking the record structure. Even though you can specify the exact delimiter in Excel's Text Import Wizard (a tab, in this case), it evidently treats a cr-lf as a delimiter as well, at least to the extent of placing the input on the next record line.

Is there a way to prevent this, or to use some sort of regexp after "Strip HTML" to convert the cr-lf to another string?

-- Roy Zider

SS 3.0 Strip HTML inserting <cr><lf> in output t

Whoaa!! That's some translation table!

I spent a few hours last night researching the use of code pages and character translations in Java, just to avoid having to use a table like this!

Yes, I can use this -- thank you for the code.

-- Roy Zider

SS 3.0 Strip HTML inserting <cr><lf> in output t

Roy,

Ah! I had a similar problem when scraping a site in Dutch. I added to the prepareStringForOutput function in the following way. If I'm missing a character you want to replace, go ahead and add it.

Sometimes there are issues with editing special characters. For your reference, the encoding I use to view the special characters is Windows-1252 (Western European).

String prepareStringForOutput( String value )
{
    if (value != null)
    {
        value = value.replaceAll("\"", "\'");
        value = value.replaceAll("&amp;", "&");
        value = value.replaceAll("

SS 3.0 Strip HTML inserting <cr><lf> in output t

Scott:

Yes, that's what I ended up doing, as you've suggested here: post-processing the scrape result to make the substitutions.

Of course, it is not a literal "cr-lf" string substitution that is involved here. I inspected the output and saw that it was inserting multiple newline chars, so I used the regexp to make the substitution and inserted the tilde to preserve the location.


s2 = s.replaceAll("\n+", " ~ ");
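
A slightly more defensive variant of the one-liner above, as a minimal sketch: it assumes the stripped output may contain carriage returns as well as newlines, and it collapses any run of either into a single marker. The class and method names (NewlineCollapse, collapseLineBreaks) are made up for illustration and are not part of screen-scraper.

```java
import java.util.regex.Matcher;

class NewlineCollapse {
    // Hypothetical helper: replaces each run of CR and/or LF characters
    // with a single marker so a tab-delimited record stays on one line.
    public static String collapseLineBreaks(String value, String marker) {
        if (value == null) {
            return null;
        }
        // quoteReplacement keeps '$' or '\' in the marker from being
        // treated as a group reference by replaceAll.
        return value.replaceAll("[\\r\\n]+", Matcher.quoteReplacement(marker));
    }

    public static void main(String[] args) {
        String scraped = "Line one\r\n\r\nLine two\nLine three";
        // prints: Line one ~ Line two ~ Line three
        System.out.println(collapseLineBreaks(scraped, " ~ "));
    }
}
```

Matching "\n+" alone would leave any "\r" characters in place, which is why the character class covers both.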

The more general problem is that the pages I'm working with at the moment have some European content, so there are many more substitutions to be done than just three or five.
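
When the list of substitutions grows, one option is to keep them in a table and loop over it rather than writing one replaceAll call per character. The following is only a sketch under assumptions: it presumes the European characters arrive as HTML entities, the handful of entries are illustrative (they are not the table posted in this thread), and the names CharacterTable and applyTable are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CharacterTable {
    // Illustrative entries only; extend with whatever the scraped pages need.
    // LinkedHashMap preserves insertion order, so "&amp;" is handled last.
    static final Map<String, String> TABLE = new LinkedHashMap<>();
    static {
        TABLE.put("&eacute;", "\u00e9"); // e with acute accent
        TABLE.put("&euml;", "\u00eb");   // e with diaeresis
        TABLE.put("&uuml;", "\u00fc");   // u with diaeresis
        TABLE.put("&amp;", "&");         // last, so "&amp;eacute;" is not decoded twice
    }

    // Applies every entry as a literal (non-regex) replacement.
    public static String applyTable(String value) {
        if (value == null) {
            return null;
        }
        for (Map.Entry<String, String> entry : TABLE.entrySet()) {
            value = value.replace(entry.getKey(), entry.getValue());
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(applyTable("Caf&eacute; &amp; Co"));
    }
}
```

String.replace is used instead of replaceAll because the keys are plain text, so no regex escaping is needed. This is written for a standard Java runtime and may need adapting to run inside a screen-scraper script.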

SS 3.0 Strip HTML inserting <cr><lf> in output t

Roy,

Yes, you could replace the cr-lf before writing your data to the file. One of our programmers came up with a function to handle certain kinds of code we don't want written to our delimited output files. I've inserted your [cr-lf]'s as the last snippet to be stripped.

prepareStringForOutput Function:

String prepareStringForOutput( String value )
{
    if (value != null)
    {
        value = value.replaceAll("\"", "\'");
        value = value.replaceAll("&amp;", "&");
        value = value.replaceAll("