Harvesting Rendered HTML

lylec's picture

Hello All,

I would like to create a mashup that includes rendered HTML content from other pages. The purpose is to create a sort of collage composed of other pages.

Definitions for purposes of this post: Typically, web browsers read HTML from a web server, construct a Document Object Model (DOM) from the HTML, and then render the page in a browser window. If the page includes JavaScript code, the browser modifies the DOM as the page loads or when various events occur. I refer to the HTML from the web server as the raw HTML and the HTML of the rendered page with JavaScript changes plus CSS styles applied as the rendered HTML.

Now I know I can use the Scraper object along with XPath/XQuery/XSLT processing to harvest the text content (i.e. raw HTML) of an arbitrary web page. However, Scraper is neither CSS aware nor JavaScript aware. As several web sites render their content dynamically, the final (i.e. rendered) HTML varies greatly from the raw HTML that Scraper and the underlying WebHarvest framework can provide.

Presently I am exploring the option of using The Lobo Project/Cobra headless HTML rendering engine. The Cobra framework is CSS and JavaScript aware, and initial tests yield promising results.

Before I go ahead and incorporate Cobra into my WSO2 mashup server as a hosted object, I was curious to learn if there might be another (easier?) way of harvesting rendered HTML content.

Thanks in advance for your time and assistance,

Lyle "Rendering with Style"

 

tyrell's picture

I'm afraid harvesting using the Scraper host object or the simple HTTP client we recently introduced will not solve your problem, since neither is a rendering engine. So we might have to use a tool-kit such as Cobra.

However, I just noticed that Web Harvest has released version 2.0 with a plug-in mechanism that enables developers to extend it with their own processors. As you may already know, Web Harvest is the base library of our Scraper host object. I didn't get a chance to look into this new plug-in API. But if you can extend Web Harvest to delegate page processing at some point to Cobra, the Scraper host object will be JS/CSS aware. This will be much more elegant than adding yet another host object.

WDYT?

Tyrell
lylec's picture

Announcing The Squeege

Hello Tyrell, Thanks for your prompt and helpful response. I needed to understand the limitations of Scraper before going ahead. You heard it here first :-) Announcing Squeege, a JS/CSS aware Host object composed of Scraper and Cobra components. Lyle "Squeege Swather"
tyrell's picture

Very cool. Will this new Host Object be contributed to the community? Tyrell
lylec's picture

Musings...

Hello Tyrell,

I took a look at WebHarvest v2.0 beta 1 at http://web-harvest.sourceforge.net/index.php and here is what I found. The plug-in mechanism (PIM) is for Java and does not support dynamic hooks, so it looks like there might be some recompiling involved. Essentially the PIM allows me to define my own config tags that invoke custom functionality, e.g.**

(squeege url='http://myFavouriteSite.com' domNodeToHarvest='//theNode')

Plan A, or No SAX in the WebHarvest plug-in mechanism: I was hoping this plug-in mechanism would perform SAX-like parsing and fire my handler as Scraper parsed each HTML tag. Nope! What it does give me is access to the parsed data in its entirety, i.e. after the HttpProcessor has completely fetched the url. This raw HTML is not suitable for feeding Cobra for CSS recognition, as that HTML is not infused with style. This finding disqualifies config syntax like:

(squeege domNodeToGetStyleFor='//theStylizedNode') (http url='http://myFavouriteSite.com' /)(/squeege)

Why? Because by the time the SqueegeProcessor is invoked, the HttpProcessor has already fetched the url, without the accompanying CSS attributes. So on to...

Plan B: Replace the HttpProcessor with Cobra, that is to say, use Cobra to fetch the url. In this manner the SqueegeProcessor would substitute for the HttpProcessor and: fetch the url, fetch related CSS, harvest nodes of interest, then pass the raw html to the next processor, thus mirroring http processor output. So in place of

(http url='http://myFavouriteSite.com' /)

one might use:

(squeege url='http://myFavouriteSite.com' domNodeToGetStyleFor='//theStylizedNode' /)

Hmmm! So at this point here is what I have: It does not look like the Scraper plug-in architecture facilitates in-situ deployments. Any extension to Scraper would mean re-integrating the "new" Scraper back into WSO2, with recompiling involved :-( Even so, it seems like this might be the simplest path to harvesting CSS (short of double tapping a url; I am reluctant to go down that path).

Plan C: Refining Plan B. If recompiling is involved anyway, maybe just go all out and employ Cobra (instead of HttpProcessor) to fetch the url and infuse style as properties of the HTML tags. This format would be compatible with existing processors, e.g. HttpToXmlProcessor and the XPath/XQuery/XSLT processors, so those tools can still be used to retrieve particular nodes of interest. The syntax for the WebHarvest config would remain unchanged from the present, the difference being that HTML elements would now possess a myComputedStyle (or similarly named) attribute.

Your turn, WDYT?

Lyle.

**Note: I am using '(' and ')' to represent the less-than and greater-than signs respectively.
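To make the Plan C output format concrete, here is a minimal sketch that uses only the JDK's DOM and XPath classes. It is not Web Harvest or Cobra code: the class name ComputedStyleAnnotator and the styleOf callback are hypothetical, and the callback merely stands in for whatever computed-style lookup a rendering engine would provide. It shows that once each element carries a myComputedStyle attribute, an ordinary XPath query can read the style back.

```java
import java.io.StringReader;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper illustrating Plan C; not part of Web Harvest or Cobra. */
final class ComputedStyleAnnotator {

    /** Parses well-formed XHTML into a DOM tree (stands in for the fetch step). */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }

    /**
     * Walks the tree and copies each element's style into a myComputedStyle
     * attribute. styleOf is an assumed callback; a real implementation would
     * ask the rendering engine for the computed style instead.
     */
    static void annotate(Node node, Function<Element, String> styleOf) {
        if (node instanceof Element) {
            Element el = (Element) node;
            String style = styleOf.apply(el);
            if (style != null && !style.isEmpty()) {
                el.setAttribute("myComputedStyle", style);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            annotate(child, styleOf);
        }
    }

    /** Evaluates an XPath expression and returns its string value. */
    static String query(Document doc, String xpath) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>hi</p><div>x</div></body></html>");
        // Toy style rule: only <p> elements get a style.
        annotate(doc, el -> "p".equals(el.getTagName()) ? "color: red" : "");
        System.out.println(query(doc, "//p/@myComputedStyle")); // prints "color: red"
    }
}
```

Because the style travels as a plain attribute, downstream processors need no knowledge of the rendering engine, which is the compatibility argument behind Plan C.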
lylec's picture

Progress

Hello Tyrell,

I need to pick your brain some more please... I am at the stage now where I have extended the Web Harvest framework with CSS/JS awareness through the Cobra framework. Now I am trying to incorporate this as a host object into WSO2 Mashup Server, and here is where I am having some difficulties.

I would like to continue to use the existing Scraper syntax. Please consider the following fragment:

    var myConfig =
        <config>
            <var-def name="responseHtml">
                <html-to-xml>
                    <http method="get" url="http://giver.thruhere.net" />
                </html-to-xml>
            </var-def>
        </config>;
    var scrape = new Scraper(myConfig);
    return scrape.responseHtml;

My difficulty is in the final line: how do I map the JavaScript responseHtml variable defined in the (config) block to a Java function?

I have looked at the tutorial at http://wso2.org/library/tutorials/writing-custom-hostobject and I understand the SimpleHttpClient example. That example does show how I can create a Java function with signature public String jsGet_responseHtml(){...}, and I understand the Rhino engine binds this function to the JavaScript myObject.responseHtml property at compile-time, i.e. early binding. What I am not clear on is how one dynamically creates the JavaScript-to-Java binding based on the run-time parameter responseHtml that is defined in an XML block, as listed in the code fragment above.

I also took a look at the WSO2 source distribution from http://dist.wso2.org/products/mashup/2.0.2/wso2mashup-2.0.2-src.zip to try to understand how the Web Harvest (v1) Scraper object crosses the JS/Java context boundary, but I am thus far still hunting.

I appreciate any light you can shed on this and thank you for your time;

Lyle.
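One possible direction for the late-binding question, offered as an assumption and not as a description of how the shipped Scraper host object works: Rhino's Scriptable interface resolves property reads through get(String name, Scriptable start), so a host object can answer for names that are only known at run time instead of relying on jsGet_ methods. The sketch below mimics that name-based lookup with plain JDK classes so it runs without Rhino on the classpath; DynamicProperties, define and NOT_FOUND are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a host object whose properties are only known at run time. */
final class DynamicProperties {

    /** Sentinel for "no such property", playing the role of Rhino's Scriptable.NOT_FOUND. */
    static final Object NOT_FOUND = new Object();

    private final Map<String, Object> values = new LinkedHashMap<>();

    /** Would be called once per var-def found in the config, after the scrape has run. */
    void define(String name, Object value) {
        values.put(name, value);
    }

    /**
     * Name-based lookup. In a real Rhino host object the analogous hook is an
     * override of get(String name, Scriptable start); that mapping is an
     * assumption of this sketch.
     */
    Object get(String name) {
        return values.containsKey(name) ? values.get(name) : NOT_FOUND;
    }

    public static void main(String[] args) {
        DynamicProperties scrape = new DynamicProperties();
        scrape.define("responseHtml", "<html/>");
        System.out.println(scrape.get("responseHtml"));            // prints "<html/>"
        System.out.println(scrape.get("missing") == NOT_FOUND);    // prints "true"
    }
}
```

With this shape, the set of readable properties is driven by the var-def names in the config, so nothing has to be known when the host object class is compiled.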
lylec's picture

Progress ver0.9a

Hello Tyrell,

I created an application composed of WebHarvest (v2b1) and Cobra (v0.98.4) and this works satisfactorily as a standalone application. I can even wrap this app in a Rhino Host Object and compile it OK to an OSGi bundle using the BND tool. I can deploy this bundle to the server OK, but when I invoke the web service, I am greeted with a "class definition not found" exception related to commons.mail. I have included the beginning of the exception dump below...

INFO | jvm 1 | 2010/03/26 10:08:48 | Instantiating new SqueegePlugIn()...
INFO | jvm 1 | 2010/03/26 10:08:48 | The new SPI is:class org.wso2.carbon.mashup.javascript.hostobjects.squeege.SqueegePlugIn
INFO | jvm 1 | 2010/03/26 10:08:49 | [2010-03-26 10:08:49,315] ERROR - org/apache/commons/mail/EmailException
INFO | jvm 1 | 2010/03/26 10:08:49 | org.apache.axis2.AxisFault: org/apache/commons/mail/EmailException
INFO | jvm 1 | 2010/03/26 10:08:49 | at org.apache.axis2.AxisFault.makeFault(AxisFault.java:430)

I traced the offender to WebHarvest initialization code where it tries to init the mail plug-in.

Here is what I have been able to determine: any changes made to sources in my project's org.webharvest.* package are not available to my code once deployed to the server. If I refactor my package to org.webharvestLC .org then everything works great! So I am led to conclude that there must exist a package by the same name in the WSO2 context that has scope precedence over my similarly named package.

I am using the BND tool to create my bundle, and I have tried everything I can think of (plus some of what other people think too), including (a) exporting my package under a different version, (b) declaring my package private, etc., but to no avail.

As a consequence of refactoring my Web Harvest package I am required to include the dependency chain for WebHarvest. Ideally I would like to keep the same package name so as to avoid bloating my package.

So my multi-part question is: given user code, i.e. a bundle that references a class in a package in the same bundle, and given that this package name already exists on the server, how can I have my bundle code reference its bundled package class instead of the server package code?

Appreciate any help you can provide,

Lyle "Package Progress"
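When two copies of a package may be visible, it can help to confirm at run time which copy actually supplied a class. The following is a minimal diagnostic sketch using only standard JDK calls (Class.getClassLoader and ProtectionDomain.getCodeSource); the ClassOrigin name is hypothetical, and what the class loader's toString() reveals inside an OSGi container depends on the framework, so treat the output as a hint and not as proof.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic helper: reports which loader and location supplied a class. */
final class ClassOrigin {

    static String describe(Class<?> type) {
        // A null class loader means the class came from the JVM's bootstrap loader.
        ClassLoader loader = type.getClassLoader();
        // The code source (jar or directory) may be null, e.g. for JDK classes.
        CodeSource source = type.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "unknown location"
                : source.getLocation().toString();
        return type.getName()
                + " loaded by " + (loader == null ? "bootstrap" : loader.toString())
                + " from " + location;
    }

    public static void main(String[] args) {
        // A JDK class, then this class itself, for comparison.
        System.out.println(describe(String.class));
        System.out.println(describe(ClassOrigin.class));
    }
}
```

Calling describe on a class from the org.webharvest package inside the deployed host object, and logging the result, would indicate whether the bundled copy or the server's copy was picked up.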
tyrell's picture

Hi Lyle,

First of all, really sorry for not noticing your replies since the 16th. Somehow my forum subscriptions were messed up and I didn't receive any notifications.

Back to your work :) This is great progress. I think the class loading issue you have come across is due to two versions of the Web Harvest library classes being in the server. The default one we ship is created as an orbit bundle (SVN: https://svn.wso2.org/repos/wso2/trunk/carbon/orbit/webharvest). You can try creating a new orbit bundle with your modified Web Harvest code and version it. For instance, the current Web Harvest orbit bundle version in trunk is 1.0.0.wso2v1.

The important thing is that your new bundle's pom.xml exports your classes versioned as below (I assume we are versioning as 1.1.0):

    <Export-Package>
        org.webharvest.*;version="1.1.0"
    </Export-Package>

Once this is done, ensure that your host object's pom.xml imports this version explicitly, as given below:

    <Import-Package>
        org.webharvest.*;version="1.1.0"
    </Import-Package>

Now drop your new Web Harvest orbit bundle and host object into the server, and class loading will happen without interference from the older Web Harvest version. Let me know how it goes :)

Tyrell
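Why the explicit version helps: in OSGi, a single version on Import-Package is treated as an inclusive minimum, so an import of 1.1.0 cannot be wired to an export of 1.0.0.wso2v1. The sketch below is a simplified model of that comparison (major, minor, micro, then qualifier), written with plain JDK classes; it is not the framework's resolver, and PackageVersion and satisfiesMinimum are hypothetical names.

```java
/** Simplified, hypothetical model of OSGi-style version ordering. */
final class PackageVersion implements Comparable<PackageVersion> {
    private final int major;
    private final int minor;
    private final int micro;
    private final String qualifier;

    private PackageVersion(int major, int minor, int micro, String qualifier) {
        this.major = major;
        this.minor = minor;
        this.micro = micro;
        this.qualifier = qualifier;
    }

    /** Parses "major[.minor[.micro[.qualifier]]]"; missing parts default to 0 or "". */
    static PackageVersion parse(String text) {
        String[] parts = text.trim().split("\\.", 4);
        return new PackageVersion(
                Integer.parseInt(parts[0]),
                parts.length > 1 ? Integer.parseInt(parts[1]) : 0,
                parts.length > 2 ? Integer.parseInt(parts[2]) : 0,
                parts.length > 3 ? parts[3] : "");
    }

    @Override
    public int compareTo(PackageVersion other) {
        if (major != other.major) return Integer.compare(major, other.major);
        if (minor != other.minor) return Integer.compare(minor, other.minor);
        if (micro != other.micro) return Integer.compare(micro, other.micro);
        return qualifier.compareTo(other.qualifier);
    }

    /** True when an exported version meets an import that names a minimum version. */
    static boolean satisfiesMinimum(String exported, String minimum) {
        return parse(exported).compareTo(parse(minimum)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(satisfiesMinimum("1.0.0.wso2v1", "1.1.0")); // prints "false"
        System.out.println(satisfiesMinimum("1.1.0", "1.1.0"));        // prints "true"
    }
}
```

Note that a bare minimum has no upper bound, so any later export of org.webharvest would also satisfy the import.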
lylec's picture

Different Class Load Exceptions

Hello Tyrell,

Sorry for my tardy response. I am in the middle of moving residences and I'm not fully online until April 28th owing to an ISP scheduling backlog :-(

I tried building an OSGi bundle and exporting it as a different version, as suggested in your last message. That fixed that particular problem, but then there were other class load failures (I am not at that development system, for the above reason )-: I suspect these are related to dependencies of WebHarvest v2.0, perhaps not found on the Carbon platform because the Carbon platform hosts WebHarvest v1.0?

While I have been able to instantiate the Squeege Host Object (HO) from the WSO2 scripting environment and scrape a url complete with CSS/JS properties, the Squeege HO throws a class load exception upon using any of the included processors, e.g. html-to-xml, xpath, file, etc.

My present solution to overcome this is to assign the Squeege HO scrape result to a JavaScript variable. I use this variable as the data source for a (bundled WebHarvest v1.0) Scraper HO, which does the processing using the included processors. It would be nice to use the Squeege HO for scraping as well as processing, rather than having to process using a separate Scraper config.

I will post the project shortly.

Lyle.
lylec's picture

Squeege - beta

Here is the link for the Squeege Host Object: http://chamarette.podzone.net/mediawiki/index.php/Squeege_Project
tyrell's picture

This is great news. I will try this out and announce to the community. Tyrell