Parsing HTML without retrieving external resources

Hi, thanks for reading this, hope you can help.

I am putting together a site that uses screen scraping to extract results from a number of search engines. The HTML downloads fine, and I can extract the search engine results relatively easily using the jQuery .find() function.

My problem is that when the HTML is parsed, the browser fires requests for external resources (e.g. image and video files) that are referenced within the parsed HTML. My site does not need these resources (I only extract the text results and don't want to display images), so the requests just waste the user's bandwidth.

Is it possible to parse the HTML without evaluating embedded resources?

From using Fiddler and Firebug I have determined that the requests are being made during execution of this function:

function getResultSetFromDocument(document) {
    // object that will be returned containing result objects
    var results = new ResultSet(that.engineName);
    // loop through all relevant elements to extract results
    $(document).find('#res ol li').each(function(i) {
        var link = $(this).find('h3 a');
        var text = link.text();
        var href = link.attr('href');
        var description = $(this).text();
        // create new result object to add to result set for later ranking etc.
        if (href && text) {
            var result = new WebResult(text, href, i + 1);
            result.setDescription(description);
            results.results.push(result);
        }
    });
    return results;
}
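
One possible workaround, sketched here only as an illustration: rewrite the downloaded markup before it reaches $(), so that resource attributes no longer carry names the browser acts on. The helper name neutralizeResources is hypothetical (it is not part of jQuery), and it assumes the page is available as a plain string. It is a regex-based approximation, not a real HTML parser: it only covers src, srcset and poster, and it can be fooled by a ">" inside an attribute value.

```javascript
// Hypothetical helper, not part of jQuery: renames resource attributes
// (src, srcset, poster) to data-* equivalents inside opening tags, so the
// browser has no URL to fetch when the string is later parsed.
function neutralizeResources(html) {
    // Visit each opening tag; text between tags is left untouched.
    return html.replace(/<[a-zA-Z][^>]*>/g, function (tag) {
        // Only rename attributes that are preceded by whitespace and
        // followed by "=", e.g. ' src=' becomes ' data-src='.
        return tag.replace(/(\s)(src|srcset|poster)(\s*=)/gi, '$1data-$2$3');
    });
}

// Assumed usage: `html` is the downloaded page as a string.
// var results = getResultSetFromDocument(neutralizeResources(html));
```

Links (href on a elements) are not touched, so the h3 a lookup in the function above would still find its text and href. Resources referenced in other ways (for example inline styles or link elements) are outside what this sketch handles.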

Topic Participants

thomaschristian

admwiggin

it

Kevin Boudloche