Friday, 15 August 2014

R - Simple XML parse -




I ran the code below:

url.df_1 = htmlTreeParse(url_1, useInternalNodes = TRUE)

and got the HTML tree below:

<!-- ******************* related ******************* -->
<div class="more-related-box">
<div id="app_related">
<h3>customers bought</h3>
<ul>
<li><a href="/app/ios/flick-golf/" title="flick golf!"><img src="http://a2.mzstatic.com/us/r1000/067/purple/v4/25/a8/91/25a891df-fed4-9dc4-0d86-1c8f5acf893f/mzl.fcctkywr.75x75-65.jpg" class="app_icon"><span class="app_name">flick golf!</span><span class="category">games</span></a></li>
<li><a href="/app/ios/minecraft-pocket-edition/" title="minecraft – pocket edition"><img src="http://a1.mzstatic.com/us/r1000/070/purple2/v4/3f/56/07/3f56074b-af27-8ba3-7ef8-c97314c13ee7/mzl.rfhcaysw.75x75-65.jpg" class="app_icon"><span class="app_name">minecraft – pocket edition</span><span class="category">games</span></a></li>

What I want to grab from the above is "flick-golf" and "minecraft-pocket-edition". (The above is only part of the HTML tree; I want to grab these names and eventually put them in a list or data frame.)

So far I have tried this (and a bunch of other things):

getNodeSet(url.df_1, "//div[@id = 'app_related']//h3")

but I ended up getting

[[1]]
<h3>customers bought</h3>
attr(,"class")

Any advice? Thank you!

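First you need to make sure the XML is well formed; I am assuming you take care of that. After that you need the right XPath argument, in this case //li/a/@title. Your expression ends in //h3, so it only matches the heading and never reaches the links inside the list items.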

> library(XML)
> str <- '<div class="more-related-box">
+ <div id="app_related">
+ <h3>customers bought</h3>
+ <ul>
+ <li>
+ <a href="/app/ios/flick-golf/" title="flick golf!">
+ <img src="http://a2.mzstatic.com/us/r1000/067/purple/v4/25/a8/91/25a891df-fed4-9dc4-0d86-1c8f5acf893f/mzl.fcctkywr.75x75-65.jpg" class="app_icon" />
+ <span class="app_name">flick golf!</span>
+ <span class="category">games</span>
+ </a>
+ </li>
+ <li>
+ <a href="/app/ios/minecraft-pocket-edition/" title="minecraft – pocket edition">
+ <img src="http://a1.mzstatic.com/us/r1000/070/purple2/v4/3f/56/07/3f56074b-af27-8ba3-7ef8-c97314c13ee7/mzl.rfhcaysw.75x75-65.jpg" class="app_icon" />
+ <span class="app_name">minecraft – pocket edition</span>
+ <span class="category">games</span>
+ </a>
+ </li>
+ </ul>
+ </div>
+ </div>'
> doc <- xmlParse(str)
> getNodeSet(doc, "//li/a/@title")
[[1]]
title
"flick golf!"
attr(,"class")
[1] "XMLAttributeValue"

[[2]]
title
"minecraft – pocket edition"
attr(,"class")
[1] "XMLAttributeValue"

attr(,"class")
[1] "XMLNodeSet"
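The question asks for the slugs ("flick-golf", "minecraft-pocket-edition"), which live in the href attribute rather than in title. The following is only a sketch of one way to get them into a data frame, assuming the XML package is installed. The markup is a trimmed copy of the snippet from the question, and the names html, links_xpath and apps are made up for illustration. It uses htmlParse instead of xmlParse because the HTML parser tolerates unclosed tags such as the img elements in the original page.

```r
library(XML)

# Trimmed copy of the markup from the question (img and category spans dropped).
html <- '<div class="more-related-box"><div id="app_related">
<h3>customers bought</h3>
<ul>
<li><a href="/app/ios/flick-golf/" title="flick golf!"><span class="app_name">flick golf!</span></a></li>
<li><a href="/app/ios/minecraft-pocket-edition/" title="minecraft – pocket edition"><span class="app_name">minecraft – pocket edition</span></a></li>
</ul>
</div></div>'

# Parse as HTML from a string; the encoding is an assumption about the input.
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Restrict the search to links inside the "app_related" box.
links_xpath <- "//div[@id='app_related']//li/a"

# Pull one attribute per matched <a> node.
hrefs  <- xpathSApply(doc, links_xpath, xmlGetAttr, "href")
titles <- xpathSApply(doc, links_xpath, xmlGetAttr, "title")

# basename() ignores the trailing slash and keeps the last path component.
slugs <- basename(hrefs)

apps <- data.frame(slug = slugs, title = titles, stringsAsFactors = FALSE)
print(apps$slug)
```

With a document parsed from the real page (url.df_1 in the question), the same two xpathSApply calls should apply unchanged, provided the page still uses the app_related id.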

xml r xml-parsing
