If you ever accept user-written HTML in your web applications, such as the markup generated by a rich-text ‘WYSIWYG’ editor, it is vital that you sanitize it before displaying it back anywhere. Sanitization is the process of removing potentially malicious code, primarily to prevent XSS (cross-site scripting) attacks. It is generally achieved by allowing only a subset of tags and attributes in the submitted code and removing or encoding the rest.
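
To make the allow-list idea concrete, here is a minimal sketch of the "encode the rest" approach. It is not the patapage code or the C# port; the class name `AllowListSketch`, the method `Sanitize`, and the particular set of allowed tags are all made up for illustration. It encodes the whole input first and then restores only a handful of bare formatting tags, so it deliberately allows no attributes at all. A real sanitizer has far more cases to handle.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Illustrative only: a hypothetical allow-list sanitizer, not the patapage port.
public static class AllowListSketch
{
    // Matches an already-encoded opening or closing tag from the allow-list.
    // Only bare tags match: anything with attributes stays encoded.
    private static readonly Regex AllowedTag = new Regex(
        @"&lt;(/?)(b|i|em|strong|p)&gt;",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static string Sanitize(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Step 1: encode everything, so no submitted markup survives as markup.
        var encoded = WebUtility.HtmlEncode(html);

        // Step 2: turn only the allow-listed tags back into real tags.
        return AllowedTag.Replace(encoded, "<$1$2>");
    }
}
```

With this sketch, `AllowListSketch.Sanitize("<b>Hi</b><script>alert(1)</script>")` returns `<b>Hi</b>&lt;script&gt;alert(1)&lt;/script&gt;`: the bold tag survives and the script tag is rendered harmlessly as text. Encoding first and restoring afterwards means anything the pattern does not recognise stays encoded by default.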

I recently needed to do this, and a quick Google search turned up a project, patapage, which does just that. Although it is a Java solution, there is a C# port written by Beyers Cronje; unfortunately it’s some seriously ugly code, being more or less a straight rip of the Java version, changed just enough to be valid C#.

I realise that some people don’t think that is necessarily a bad thing, but I can’t stand to see eyesores such as lowercase method names and explicit type names where ‘var’ would do, so I had to clean it up. I take no credit for any of the code; all I did was capitalize property and method names, replace some if/elses with ternary operators for terseness (where appropriate), replace type names with ‘var’, and change some arrays to IEnumerables (I hope this gives a bit of a performance gain, but I didn’t bother to check, so don’t quote me on that).

HtmlSanitizer.cs