ruby - How to prevent deletion of the <html> tag in Nokogiri?
I have this code:
doc = Nokogiri::HTML.fragment(html)
doc.to_html
and this is the HTML fragment being parsed:
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
<html>
<p>
qwerty
</p>
</html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
Nokogiri deletes the <html> and </html> tags in the <code> block. How can I prevent this behavior?
Update:
The Tin Man proposed a solution: pre-parse the HTML fragment and escape the HTML inside the code block.
Here is the code. It's not beautiful, so if you want to suggest another solution, please post a comment.
require 'cgi'

# Escape everything between <code> and </code> so the parser treats it as text.
html.gsub!(/<code\b[^>]*>(.*?)<\/code>/m) do |x|
  "<code>#{CGI.escapeHTML($1)}</code>"
end
Thanks, the Tin Man.
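The pre-parse step above can be wrapped in a small method. The following is only a sketch: the method name `escape_code_blocks` and the sample string are made up for illustration, and unlike the snippet above it keeps the attributes of the opening <code> tag instead of replacing it with a bare <code>.

```ruby
require 'cgi'

# Hypothetical helper (not part of Nokogiri): entity-encodes the markup
# inside every <code>...</code> block and leaves the rest of the string alone.
def escape_code_blocks(html)
  html.gsub(%r{(<code\b[^>]*>)(.*?)(</code>)}m) do
    open_tag, body, close_tag = Regexp.last_match.captures
    "#{open_tag}#{CGI.escapeHTML(body)}#{close_tag}"
  end
end

sample  = '<p>intro</p><code class="x"><html><p>qwerty</p></html></code>'
escaped = escape_code_blocks(sample)
puts escaped
# => <p>intro</p><code class="x">&lt;html&gt;&lt;p&gt;qwerty&lt;/p&gt;&lt;/html&gt;</code>
```

Note that content which is already entity-encoded would be encoded a second time (`&lt;` becomes `&amp;lt;`), so this is probably only safe on input known to contain raw markup in its code blocks.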
The problem is that the HTML is invalid. I used this to test it:
require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
<html>
<p>
qwerty
</p>
</html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT

puts doc.errors
After parsing a document, Nokogiri populates the errors array with a list of the errors it found during parsing. In the case of your HTML, doc.errors contains:
htmlParseStartTag: misplaced <html> tag
The reason is that, within the <code> block, the tags are not HTML-encoded as they should be.
Convert them using HTML entities to:
&lt;html&gt; &lt;p&gt; qwerty &lt;/p&gt; &lt;/html&gt;
and it will work.
Nokogiri is an XML/HTML parser, and it attempts to repair errors in the markup to give you, the programmer, a chance of using the document. In this case, because the <html> block is in the wrong place, it removes the tags. Nokogiri wouldn't care if the tags were encoded because, at that point, they're just text, not tags.
Edit:
"I'll try to pre-parse with gsub and convert the HTML in the code block"
require 'nokogiri'

html = <<EOT
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
<html>
<p>
qwerty
</p>
</html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT

# Encode only the <html> and </html> tags before handing the string to the parser.
doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(%r[<(/?)html>], '&lt;\1html&gt;'))
puts doc.to_html
which outputs:
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
&lt;html&gt;
<p>
qwerty
</p>
&lt;/html&gt;
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
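The substitution can be checked on its own, without the parser. The sketch below runs the same kind of gsub on a short made-up string, then tries a looser pattern that is an assumption of mine rather than part of the answer: it also matches upper-case tags and tags with attributes, which the exact `<html>` pattern would miss.

```ruby
snippet = '<code> <html> <p> qwerty </p> </html> </code>'

# Exact-match pattern: \1 re-inserts the optional "/" of a closing tag.
strict = snippet.gsub(%r[<(/?)html>], '&lt;\1html&gt;')
puts strict
# => <code> &lt;html&gt; <p> qwerty </p> &lt;/html&gt; </code>

# Hypothetical looser pattern: case-insensitive, tolerates attributes.
LOOSE_HTML_TAG = %r{<(/?)(html\b[^>]*)>}i
loose = '<HTML lang="en"> x </html>'.gsub(LOOSE_HTML_TAG, '&lt;\1\2&gt;')
puts loose
# => &lt;HTML lang="en"&gt; x &lt;/html&gt;
```

Only the <html> tags are encoded here; the <p> inside the code block is still a real tag as far as the parser is concerned.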
Edit:
This will process the <html> tag prior to parsing, so Nokogiri can load the <code> block unscathed. It then finds the <code> block, unescapes the encoded <html> start and end tags, and inserts the resulting text into the <code> block as its content. Because it is inserted as content, when Nokogiri renders the DOM as HTML the text is re-encoded as entities where necessary:
require 'cgi'
require 'nokogiri'

html = <<EOT
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
<html>
<p>
qwerty
</p>
</html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT

doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(%r[<(/?)html>], '&lt;\1html&gt;'))

# Replace the children of <code> with a single text node holding the raw markup.
code = doc.at('code')
code.content = CGI::unescapeHTML(code.inner_html)
puts doc.to_html
which outputs:
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
&lt;html&gt;
&lt;p&gt;
qwerty
&lt;/p&gt;
&lt;/html&gt;
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
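The unescape-then-re-encode round trip can be approximated with the standard library alone. This is a rough sketch, not what Nokogiri does internally: the `inner` string is an assumed value for the <code> node's inner HTML after the gsub pre-pass, and `CGI.escapeHTML` stands in for the serializer's handling of a text node.

```ruby
require 'cgi'

# Assumed inner HTML of <code> after the pre-pass: only the <html> tags
# are entity-encoded, the <p> is still a real tag.
inner = ' &lt;html&gt; <p> qwerty </p> &lt;/html&gt; '

# Step 1: decode the entities to recover the literal text.
plain = CGI.unescapeHTML(inner)

# Step 2: stand-in for serializing that text as a text node, where every
# "<" and ">" has to be written as an entity.
reencoded = CGI.escapeHTML(plain)

puts plain
# =>  <html> <p> qwerty </p> </html> 
puts reencoded
# =>  &lt;html&gt; &lt;p&gt; qwerty &lt;/p&gt; &lt;/html&gt; 
```

After the round trip the <p> tags are encoded too, which matches the difference between the two outputs shown in the answer.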
html ruby parsing nokogiri