Monday, 15 July 2013

ruby - How to prevent deletion of the <html> tag in Nokogiri?

I have code like this:

doc = Nokogiri::HTML.fragment(html)
doc.to_html

and this is the HTML fragment being parsed:

<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
<html>
<p>
qwerty
</p>
</html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>

Nokogiri deletes the <html> and </html> tags inside the <code> block. How can I prevent this behavior?

Update:

The Tin Man proposed a solution: pre-parse the HTML fragment and escape the HTML inside the code block.

Here is the code. It's not beautiful, so if you want to suggest another solution, please post a comment.

require 'cgi'

# Escape everything between <code> and </code> so the parser sees text, not tags.
html.gsub!(/<code\b[^>]*>(.*?)<\/code>/m) do
  "<code>#{CGI.escapeHTML($1)}</code>"
end
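Note that the gsub! above rewrites every opening tag as a bare <code>, so any attributes on it are lost. A minimal sketch of a variant that keeps the opening tag as written might look like this. The helper name escape_code_blocks and the sample string are made up for illustration, and it assumes <code> elements are not nested and their content is not already escaped:

```ruby
require 'cgi'

# Hypothetical helper: escapes the body of every <code> element while
# leaving the opening tag (including its attributes) untouched.
def escape_code_blocks(html)
  html.gsub(%r{(<code\b[^>]*>)(.*?)(</code>)}m) do
    open_tag, body, close_tag = $1, $2, $3
    "#{open_tag}#{CGI.escapeHTML(body)}#{close_tag}"
  end
end

sample  = '<p>intro</p><code class="html"><html><p>qwerty</p></html></code>'
escaped = escape_code_blocks(sample)
puts escaped
```

Because it returns a new string instead of mutating its argument, the result can be passed straight to the parser.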

Thanks, the Tin Man.

The problem is that the HTML is invalid. I used this to test it:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
<html>
<p>
qwerty
</p>
</html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT

# Print whatever the parser complained about while building the fragment.
puts doc.errors

After parsing a document, Nokogiri populates the errors array with a list of the errors it found during parsing. In the case of your HTML, doc.errors contains:

htmlParseStartTag: misplaced <html> tag

The reason is that, within the <code> block, the tags are not HTML-encoded as they should be.

Convert them to HTML entities:

&lt;html&gt; &lt;p&gt; qwerty &lt;/p&gt; &lt;/html&gt;

and it will work.

Nokogiri is an XML/HTML parser, and it attempts to repair errors in the markup to give you, the programmer, a chance of using the document. In this case, because the <html> block is in the wrong place, it removes the tags. Nokogiri wouldn't care if the tags were encoded because, at that point, they're just text, not tags.

Edit:

I'll try a pre-parse with gsub to convert the HTML in the code block:

require 'nokogiri'

html = <<EOT
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
<html>
<p>
qwerty
</p>
</html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT

# Encode the <html> and </html> tags before the parser ever sees them.
doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(%r[<(/?)html>], '&lt;\1html&gt;'))
puts doc.to_html

Which outputs:

<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
&lt;html&gt;
<p>
qwerty
</p>
&lt;/html&gt;
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>

Edit:

This will process the <html> tags prior to parsing, so Nokogiri can load the <code> block unscathed. It then finds the <code> block, unescapes the encoded <html> start and end tags, and inserts the resulting text into the <code> block as its content. Because it is inserted as content, when Nokogiri renders the DOM as HTML, the text is re-encoded as entities where necessary:

require 'cgi'
require 'nokogiri'

html = <<EOT
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
<html>
<p>
qwerty
</p>
</html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT

# Encode the <html> and </html> tags so the <code> block parses cleanly.
doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(%r[<(/?)html>], '&lt;\1html&gt;'))

# Replace the block's markup with plain text; Nokogiri re-encodes it on output.
code = doc.at('code')
code.content = CGI::unescapeHTML(code.inner_html)
puts doc.to_html

Which outputs:

<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
&lt;html&gt;
&lt;p&gt;
qwerty
&lt;/p&gt;
&lt;/html&gt;
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>

html ruby parsing nokogiri
