I have built an HTML sanitizer that uses a whitelist, but I have run into a problem: text such as "< 1" is parsed as a node with HtmlNodeType.Element. As a result, the whitelist processing removes it, even though it is valid text entered by a user and not an actual node of any kind.
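To make the problem easy to reproduce, here is a minimal sketch that loads only the problem string and prints what the parser produced. It assumes the HtmlAgilityPack package is referenced; the class name StrayBracketRepro is made up for this illustration and is not part of my sanitizer.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical repro class, not part of the real sanitizer.
public static class StrayBracketRepro
{
    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("< 1");

        // Print the node type and name of every top-level node the parser produced.
        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            Console.WriteLine("{0}: '{1}'", node.NodeType, node.Name);
        }
    }
}
```

In my case this is where the text shows up as an Element rather than as Text.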
The following is an excerpt of my sanitizer, starting at the entry point for processing:
    public string RetainWhiteListedItems(string HTMLToScrub) {
        if (string.IsNullOrWhiteSpace(HTMLToScrub)) return HTMLToScrub;

        HtmlDocument HTMLDoc = new HtmlDocument();
        HTMLDoc.OptionWriteEmptyNodes = true;
        HTMLDoc.LoadHtml(HttpUtility.HtmlDecode(HTMLToScrub));

        /* THIS CHECK LETS A DOCUMENT THAT IS JUST "< 1" CONTINUE AS IT SEES IT AS
           A VALID ELEMENT WITH A NAME OF "1" */
        if (HTMLDoc.DocumentNode.ChildNodes.Any(node => node.NodeType == HtmlNodeType.Element)) {
            IList<HtmlNode> hnc = HTMLDoc.DocumentNode.Descendants().ToList();
            if (hnc.Count == 0) {
                return HTMLToScrub;
            }

            // remove non-whitelisted nodes, walking the list backwards
            for (int i = hnc.Count - 1; i >= 0; i--) {
                HtmlNode htmlNode = hnc[i];
                // if the htmlNode is not in the whitelist, turf it
                // ...all other processing for attributes, scripting etc....
            }
        }
    }
Does HtmlNodeType.Element (or HtmlNode itself) not validate that the element is in fact an HTML element? Is there something else I need to do to get this behaviour? If a user enters text containing less-than/greater-than symbols that are not tags or HTML elements, I would like that text to simply bypass the sanitizer's processing. Thanks.
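
To show the behaviour I am after, here is a rough sketch of a pre-pass I am considering, not something the parser offers as far as I know. It escapes any "<" that cannot start a tag, meaning one that is not followed by an ASCII letter, "/", "!" or "?", which I believe is roughly how browsers decide. The class and method names (StrayBracketEscaper, EscapeStrayLessThan) are hypothetical, and it uses only System.Text.RegularExpressions. It does nothing about a stray ">".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack or of my current sanitizer.
public static class StrayBracketEscaper
{
    // Matches a '<' that is NOT followed by an ASCII letter, '/', '!' or '?',
    // i.e. a '<' that cannot begin a tag, closing tag, comment or declaration.
    private static readonly Regex StrayLessThan =
        new Regex(@"<(?![A-Za-z/!?])", RegexOptions.Compiled);

    // Replaces every stray '<' with the entity "&lt;" and leaves real tags untouched.
    public static string EscapeStrayLessThan(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;
        return StrayLessThan.Replace(html, "&lt;");
    }
}
```

With this, "< 1" becomes "&lt; 1" and "a <b>bold</b> claim: 2 < 3" becomes "a <b>bold</b> claim: 2 &lt; 3". In my code it would have to run on the result of HttpUtility.HtmlDecode, just before LoadHtml, since decoding afterwards would undo the escaping.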