Scraper beginner :)

solakov's picture
Hi everybody! I recently installed the WSO2 Mashup Server and have been trying the examples from the screencasts. While making some changes I got to a point where I don't know what to do :) What I want is to pass the URL as a parameter of the function being invoked and use it when creating the config variable for the Scraper object. I tried ${urlAddress}, but it does not work that way. This is probably a stupid question for most of you, but I'm just a beginner who got off to a really jumpy start with WSO2 :) Can somebody help me, please? Thanks in advance!

I removed < and > from the config part, but I'm sure you will understand the code.

    function getArticle ( urlAddress ) {
        var config =
            config
                var-def name="article"
                    html-to-xml
                        http method="get" url="${urlAddress}" /
                    /html-to-xml
                /var-def
            /config;
        var s = new Scraper(config);
        var html = new XML(s.article.substring(s.article.indexOf("?>") + 2));
        var downloads = html.body.form.div.div.div.div.div.div[2].toString();
        return downloads;
    }
keith's picture

The problem is that you used ${urlAddress} as the variable. That notation refers to variables defined within the config itself. In this case you are referring to a JavaScript variable, so the $ is not necessary. Here is a working sample that I hacked up in a minute.

    fetchData.inputTypes="string";
    function fetchData(address) {
        var config =
            <config>
                <var-def name='response'>
                    <html-to-xml>
                        <http method='get' url={address}/>
                    </html-to-xml>
                </var-def>
            </config>;

        var scraper = new Scraper(config);
        return scraper.response;
    }

Thanks,
Keith.
Blog : http://www.keith-chapman.org/

Note : In order to paste XML use the Rich Text Editor.
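To make the difference between the two notations concrete, here is a minimal sketch in plain JavaScript. It is only a stand-in: E4X is not available in current JavaScript engines outside Rhino-style hosts, so the helpers escapeAttribute and buildConfig are hypothetical names that imitate with string concatenation what url={address} does, and the escaping is simplified. The point is that the JavaScript value is already filled in before WebHarvest reads the config, so no ${...} remains for WebHarvest to resolve. Whether Scraper accepts a string config is not established in this thread, so treat the output as something to look at rather than to pass on.

```javascript
// Simplified stand-in for the escaping E4X applies to attribute values.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;");
}

// Builds the same config as Keith's sample, as a plain string.
// The address is inserted at the JavaScript level, which is what
// url={address} does in E4X.
function buildConfig(address) {
  return [
    "<config>",
    "  <var-def name='response'>",
    "    <html-to-xml>",
    "      <http method='get' url=\"" + escapeAttribute(address) + "\"/>",
    "    </html-to-xml>",
    "  </var-def>",
    "</config>"
  ].join("\n");
}

// Hypothetical address, used only for illustration.
console.log(buildConfig("http://example.org/page?a=1&b=2"));
```

Printing the config like this before scraping is also a quick way to check that the value landed where expected.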
solakov's picture

Thanks!!

Thanks A LOT Keith! You'll hear from me again :)
lylec's picture

Injecting JavaScript variable into an XQuery

Hello Keith, I am trying to use an XQuery to perform a keyword search. If I replace the "where" clause (highlighted in the code section) with

    where contains($courseDesc,"Java")

then I can see all courses that contain the word Java, and the code compiles and executes fine. However, I would like to use the JavaScript variable strCourse as a parameter so that the user can specify the search keyword. I have systematically tried each of the following, with similarly unsuccessful results:

    where contains($courseDesc, course)
    where contains($courseDesc, {course})
    where contains($courseDesc, ${course})
    where contains($courseDesc, {$course})

/////////  This is the error message  /////////////////

    Fault: Error executing XQuery expression (XQuery = [declare variable $doc as node() external;
    for $x in $doc//div[@class='item-container']/table/tbody
    let $courseName := data($x//td[@bgcolor="#004b97"]/span/font/span[1])
    let $courseNum := data($x//td[@bgcolor="#004b97"]/span/font/span[2])
    let $courseTitle := data($x//td[@bgcolor="#004b97"]/span/font/span[3])
    let $courseDesc := data($x//tr[5]//span[@class="course-desc"])
    where contains($courseDesc,{strCourse})
    return <course> {$courseName}{$courseNum}{$courseTitle} {$courseDesc} </course>])!

/////////////   This is my code fragment //////////////////////

    getCourses.documentation="Fetches course descriptions from LU Calendar";
    getCourses.inputTypes={"strCourse" : "xs:string"};
    function getCourses(strCourse){
        var myConfig =
            <config>
                <var-def name="response">
                    <xquery>
                        <xq-param name="doc">
                            <html-to-xml>
                                <http method="get" url="http://mycoursecalendar.lakeheadu.ca/pg144.html"/>
                            </html-to-xml>
                        </xq-param>
                        <xq-expression><![CDATA[
                            declare variable $doc as node() external;
                            for $x in $doc//div[@class='item-container']/table/tbody
                            let $courseName  := data($x//td[@bgcolor="#004b97"]/span/font/span[1])
                            let $courseNum   := data($x//td[@bgcolor="#004b97"]/span/font/span[2])
                            let $courseTitle := data($x//td[@bgcolor="#004b97"]/span/font/span[3])
                            let $courseDesc  := data($x//tr[5]//span[@class="course-desc"])
                            where contains($courseDesc,{strCourse})
                            return
                                <course>
                                    {$courseName}{$courseNum}{$courseTitle}
                                    {$courseDesc}
                                </course>
                        ]]></xq-expression>
                    </xquery>
                </var-def>
            </config>;

        var scrape = new Scraper(myConfig);
        var myResult = scrape.response;
        if (myResult != null)
            myResult = myResult.toString();
        else
            myResult = "no matches found!";
        return myResult;
    }

/////////////   Code fragment ends here. //////////////////////

Thanks in advance for your assistance,
Lyle.
jonathan.wso2.com's picture

Tricky bit.

Hi Lyle,

If my guess is correct, this is indeed a tricky bit that I stumbled over when I started too. The problem is that there are multiple levels of template processing going on. E4X, WebHarvest, and XQuery all use curly brackets for this notation, and it's a bit hard to keep them all straight. This is documented (briefly) at http://wso2.org/project/mashup/1.5.2/docs/index.html: select "Hosted Objects: Scraper" and look at section 2.0, point 3.

Starting with the raw XQuery, you can use {$courseName} etc. to insert XQuery values into the result. However, when you put that into a WebHarvest configuration, ${} is interpreted as inserting WebHarvest variables. On top of that layer, E4X uses {} to insert an E4X value.

So in this case (no WebHarvest variables) I think you could either drop the CDATA escaping, escape all of your angle brackets, and also escape all curly braces with &#x7B; and &#x7D; to hide them from the WebHarvest template. Then you insert {strCourse} and let E4X do the substitution. Or you might be able to end the CDATA just before the {strCourse} and start it up again right after, to ensure E4X treats these curly brackets as a kind of "markup" and not as plain CDATA text. A little experimentation might be necessary.

To debug this, I'd do a simple "return config;" prior to scraping and see whether the resulting XML looks correct and whether the E4X-level substitution of {strCourse} is working.

Hope that helps,
- Jonathan
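One detail of the layering can be sketched outside the Mashup Server. If the keyword is substituted at the JavaScript level, XQuery only ever sees the resulting text, so the value has to arrive as a quoted XQuery string literal rather than as a bare word. The plain JavaScript below is a minimal sketch of that idea; toXQueryStringLiteral and buildWhereClause are hypothetical helper names, not part of the Scraper or WebHarvest APIs, and the sketch assumes the finished expression ends up inside a CDATA section so that no further XML escaping applies.

```javascript
// Turns a JavaScript string into an XQuery string literal.
// In an XQuery string literal a bare & is not allowed, and a double
// quote inside a double-quoted literal is written as two double quotes.
function toXQueryStringLiteral(text) {
  return '"' + String(text).replace(/&/g, "&amp;").replace(/"/g, '""') + '"';
}

// Builds the where clause from the thread with the keyword already
// filled in, as XQuery would see it after JavaScript-level substitution.
function buildWhereClause(keyword) {
  return "where contains($courseDesc, " + toXQueryStringLiteral(keyword) + ")";
}

console.log(buildWhereClause("Java"));
// prints: where contains($courseDesc, "Java")
```

Note that $courseDesc passes through untouched because nothing at the JavaScript level interprets it; only the keyword is substituted.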
lylec's picture

Success!

Hello Jonathan,

Thanks for your prompt and relevant response. The key to my understanding was your point that the architecture consists of stacked templates (E4X, WebHarvest, XQuery), as mentioned in the docs under "Hosted Objects: Scraper", section 2.0, point 3. My solution uses <xq-param> to send the E4X variable to the XQuery, and I attach my working solution here for posterity.

Thanks again for your support,
Lyle.

//////////// Start of code fragment ///////////////////

    this.documentation = "Search LU courses using XQuery.";
    getCourses.documentation="Fetches course descriptions from LU Calendar";
    getCourses.inputTypes={"strCourse" : "Java | C++ | algorithm"};
    function getCourses(strCourse){
        var myConfig =
            <config>
                <var-def name="response">
                    <xquery>
                        <xq-param name="searchKeyword" type="string">{strCourse}</xq-param>
                        <xq-param name="doc">
                            <html-to-xml>
                                <http method="get" url="http://mycoursecalendar.lakeheadu.ca/pg144.html"/>
                            </html-to-xml>
                        </xq-param>
                        <xq-expression><![CDATA[
                            declare variable $doc as node() external;
                            declare variable $searchKeyword as xs:string external;
                            for $x in $doc//div[@class='item-container']/table/tbody
                            let $courseName  := data($x//td[@bgcolor="#004b97"]/span/font/span[1])
                            let $courseNum   := data($x//td[@bgcolor="#004b97"]/span/font/span[2])
                            let $courseTitle := data($x//td[@bgcolor="#004b97"]/span/font/span[3])
                            let $courseDesc  := data($x//tr[5]//span[@class="course-desc"])
                            where contains($courseDesc, $searchKeyword)
                            return
                                <course>
                                    {$courseName}{$courseNum}{$courseTitle}
                                    {$courseDesc}
                                </course>
                        ]]></xq-expression>
                    </xquery>
                </var-def>
            </config>;

        var scrape = new Scraper(myConfig);
        var myResult = scrape.response;
        if (myResult != null)
            myResult = myResult.toString();
        else
            myResult = "no matches found!";
        return myResult;
    }

//////////// End of code fragment ///////////////////