R - Simple XML parse
Let's say I ran the code below:
url.df_1 = htmlTreeParse(url_1, useInternalNodes = TRUE)
and got the HTML tree below:
<!-- ******************* related ******************* --> <div class="more-related-box"> <div id="app_related"> <h3>customers bought</h3> <ul> <li><a href="/app/ios/flick-golf/" title="flick golf!"><img src="http://a2.mzstatic.com/us/r1000/067/purple/v4/25/a8/91/25a891df-fed4-9dc4-0d86-1c8f5acf893f/mzl.fcctkywr.75x75-65.jpg" class="app_icon"><span class="app_name">flick golf!</span><span class="category">games</span></a></li> <li><a href="/app/ios/minecraft-pocket-edition/" title="minecraft – pocket edition"><img src="http://a1.mzstatic.com/us/r1000/070/purple2/v4/3f/56/07/3f56074b-af27-8ba3-7ef8-c97314c13ee7/mzl.rfhcaysw.75x75-65.jpg" class="app_icon"><span class="app_name">minecraft – pocket edition</span><span class="category">games</span></a></li>
What I want to grab from the above is "flick-golf" and "minecraft-pocket-edition". (So from the above part of the HTML tree, I want to grab these names and eventually put them in a list or data frame.)
So far I have tried this (along with a bunch of other things):
getNodeSet(url.df_1, "//div[@id = 'app_related']//h3")
but I ended up getting
[[1]] <h3>customers bought</h3> attr(,"class")
Any advice? Thank you!
First you need to make sure the XML is well formed; I'm assuming you can take care of that. After that you need the right XPath argument, in this case //li/a/@title:
> library(XML)
> str <- '<div class="more-related-box">
+ <div id="app_related">
+ <h3>customers bought</h3>
+ <ul>
+ <li>
+ <a href="/app/ios/flick-golf/" title="flick golf!">
+ <img src="http://a2.mzstatic.com/us/r1000/067/purple/v4/25/a8/91/25a891df-fed4-9dc4-0d86-1c8f5acf893f/mzl.fcctkywr.75x75-65.jpg" class="app_icon" />
+ <span class="app_name">flick golf!</span>
+ <span class="category">games</span>
+ </a>
+ </li>
+ <li>
+ <a href="/app/ios/minecraft-pocket-edition/" title="minecraft – pocket edition">
+ <img src="http://a1.mzstatic.com/us/r1000/070/purple2/v4/3f/56/07/3f56074b-af27-8ba3-7ef8-c97314c13ee7/mzl.rfhcaysw.75x75-65.jpg" class="app_icon" />
+ <span class="app_name">minecraft – pocket edition</span>
+ <span class="category">games</span>
+ </a>
+ </li>
+ </ul>
+ </div>
+ </div>'
> doc <- xmlParse(str)
> getNodeSet(doc, "//li/a/@title")
[[1]]
title
"flick golf!"
attr(,"class")
[1] "XMLAttributeValue"

[[2]]
title
"minecraft – pocket edition"
attr(,"class")
[1] "XMLAttributeValue"

attr(,"class")
[1] "XMLNodeSet"
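The question asks for the hyphenated names ("flick-golf", "minecraft-pocket-edition"), which appear in the href attribute rather than in title. The following is only a sketch of one way to get them into a data frame, assuming the XML package is installed. The variable names html, links and related are made up for illustration, and the snippet is a trimmed copy of the one in the question with the image URLs left out. It uses htmlParse(), which is lenient about unclosed tags such as img, so the markup does not have to be cleaned up into well-formed XML first.

```r
library(XML)

# Trimmed copy of the snippet from the question (image URLs omitted for brevity).
html <- '<div id="app_related">
  <h3>customers bought</h3>
  <ul>
    <li><a href="/app/ios/flick-golf/" title="flick golf!"><img class="app_icon"><span class="app_name">flick golf!</span></a></li>
    <li><a href="/app/ios/minecraft-pocket-edition/" title="minecraft – pocket edition"><img class="app_icon"><span class="app_name">minecraft – pocket edition</span></a></li>
  </ul>
</div>'

# htmlParse() uses the HTML parser, which tolerates the unclosed <img> tags.
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Select the <a> nodes inside the related-apps box only.
links <- getNodeSet(doc, "//div[@id='app_related']//li/a")

# Pull one attribute per node; each call returns a character vector.
hrefs  <- sapply(links, xmlGetAttr, "href")
titles <- sapply(links, xmlGetAttr, "title")

# basename() ignores the trailing slash and keeps the last path component.
related <- data.frame(slug = basename(hrefs),
                      title = titles,
                      href = hrefs,
                      stringsAsFactors = FALSE)
print(related)
```

If only the names are needed, related$slug gives them as a plain character vector. Restricting the XPath to the app_related div is a design choice that avoids picking up links from other lists on the same page.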
xml r xml-parsing