Channel: htmlagilitypack Forum Rss Feed

New Post: Problems with HTML Character References (e.g. '&#49;') with proposed fix.

I downloaded the source code and wrote a unit test (appended below) that fails on HAP 1.4.6. The problem is that HTML character references (e.g. '&#49;') have their ampersands encoded, so they come out looking like this: '&amp;#49;'.

The code that does this is HtmlDocument.HtmlEncode. The Regex in this method does not skip HTML character references. Modified source code for the method is shown below. Does this look correct, or am I misunderstanding something?
    public static string HtmlEncode(string html)
    {
        if (html == null)
        {
            throw new ArgumentNullException("html");
        }
        // replace & by &amp; but only once!
        // Bugfix: add '(#)' to the regex so that HTML character references are not corrupted.
        //  Example: '&#49;' should NOT be converted to '&amp;#49;'
        //Regex rx = new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;))", RegexOptions.IgnoreCase);
        Regex rx = new Regex("&(?!(amp;)|(#)|(lt;)|(gt;)|(quot;))", RegexOptions.IgnoreCase);
        return rx.Replace(html, "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;");
    }
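
To show the effect in isolation, here is a minimal standalone sketch that uses only System.Text.RegularExpressions rather than HAP itself. The class name EncodeSketch, the helper EncodeAmpersands, and the sample string are made up for illustration; the two patterns are the ones from the method above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of Html Agility Pack.
public static class EncodeSketch
{
    // Pattern as it appears in HAP 1.4.6 (commented out in the method above).
    public static readonly Regex Original =
        new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;))", RegexOptions.IgnoreCase);

    // Proposed pattern: the extra "(#)" alternative makes the negative
    // lookahead skip any ampersand that is immediately followed by '#'.
    public static readonly Regex Patched =
        new Regex("&(?!(amp;)|(#)|(lt;)|(gt;)|(quot;))", RegexOptions.IgnoreCase);

    // Applies only the ampersand step of HtmlEncode, so the difference
    // between the two patterns is easy to see.
    public static string EncodeAmpersands(Regex rx, string text)
    {
        return rx.Replace(text, "&amp;");
    }

    public static void Main()
    {
        string input = "Tom & Jerry &#49; &amp; &lt;";

        // prints: Tom &amp; Jerry &amp;#49; &amp; &lt;
        Console.WriteLine(EncodeAmpersands(Original, input));

        // prints: Tom &amp; Jerry &#49; &amp; &lt;
        Console.WriteLine(EncodeAmpersands(Patched, input));
    }
}
```

With the original pattern the '&#49;' in the sample comes out as '&amp;#49;'; with the added '(#)' alternative it is left alone, while the bare '&' is still encoded. Note that the lookahead only checks for '#', so any '&#' sequence is skipped whether or not it is a complete character reference.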

// THE UNIT TEST
    [Test]
    public void HtmlCharacterEntities()
    {
        string html = "<html><body>"
            + "<h1>&#65298;&#65296;&#65297;&#65299;&#12469;&#12510;&#12540;&#12461;&#12515;&#12531;&#12506;&#12540;&#12531;&#38283;&#20652;&#20013;&#65281;</h1>"
            + "<p>My first paragraph.</p>"
            + "</body></html>";

        HtmlDocument hdoc = new HtmlDocument();
        hdoc.LoadHtml(html);
        hdoc.OptionOutputAsXml = true;
        hdoc.OptionCheckSyntax = true;
        hdoc.OptionFixNestedTags = true;

        HtmlAgilityPack.HtmlNode htmlNode = hdoc.DocumentNode.SelectSingleNode("html");

        string main = htmlNode.OuterHtml;
        Assert.AreEqual(html, main);
    }
