HTML Sanitising vs HTML Escaping
12 Feb 2015
Content, both HTML and text, that originates from external sources must be integrated into the primary content stream in a way that cannot subvert the output programmatically.
A user enters some data and expects it to be reflected in the page structure; creative usernames are a typical example.
Sanitising HTML with an explicit “whitelist” policy of allowed elements “cuts” every unspecified element out of the stream, much as a movie is “cut” to match a target rating classification (e.g. a rating that forbids strong profanity). It could be argued that film ratings are applied inconsistently because they work more like an exclusion list than a “whitelist”.
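import java.util.List;
import org.owasp.html.ElementPolicy;
import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;

// Allow <p> unchanged; rewrite <h1>..<h6> to <div class="header-hN">.
PolicyFactory policy = new HtmlPolicyBuilder()
    .allowElements("p")
    .allowElements(
        new ElementPolicy() {
          public String apply(String elem, List<String> attrs) {
            // attrs is a flat list of alternating attribute names and values.
            attrs.add("class");
            attrs.add("header-" + elem);
            // Return the element name to emit, or null to drop the element.
            return "div";
          }
        }, "h1", "h2", "h3", "h4", "h5", "h6")
    .toFactory();
// untrustedHTML is the externally supplied markup.
String safeHTML = policy.sanitize(untrustedHTML);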
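The difference between a “whitelist” and an exclusion list can be sketched without any HTML parsing at all. The example below is a minimal illustration only: the class name `ElementPolicies`, the two sets, and the sample element names are hypothetical and are not part of the sanitiser library used above.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ElementPolicies {
    // Hypothetical allowlist: only these element names survive.
    static final Set<String> ALLOWED = Set.of("p", "h1");
    // Hypothetical exclusion list: only these element names are dropped.
    static final Set<String> EXCLUDED = Set.of("script");

    // Keep an element only if it was explicitly allowed.
    static List<String> keepAllowed(List<String> elements) {
        return elements.stream()
                .filter(ALLOWED::contains)
                .collect(Collectors.toList());
    }

    // Keep every element that was not explicitly excluded.
    static List<String> dropExcluded(List<String> elements) {
        return elements.stream()
                .filter(e -> !EXCLUDED.contains(e))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> seen = List.of("p", "script", "h1", "iframe");
        System.out.println(keepAllowed(seen));   // [p, h1]
        System.out.println(dropExcluded(seen));  // [p, h1, iframe]
    }
}
```

The `iframe` element was never anticipated by the exclusion list, so it slips through; the allowlist drops it by default.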
Escaping non-compliant content makes it structurally compatible with HTML and is similar to “pixelating” or otherwise obscuring the offending elements so that they appear harmless (e.g. Ghostbusters had its profanity over-dubbed for its TV release).
HTML escaping replaces only the following five ASCII characters (“&apos;” is not defined in HTML 4.01, so it is not used as a replacement):
Input | Output |
---|---|
“"” | “&quot;” |
“'” | “&#39;” |
“&” | “&amp;” |
“<” | “&lt;” |
“>” | “&gt;” |
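The five replacements described above are simple enough to sketch by hand. The following is a minimal illustration using only the standard library; the class and method names (`SimpleHtmlEscaper`, `escapeHtml`) are hypothetical, and this is not the implementation used by any particular library. In practice a maintained escaper is the safer choice.

```java
public class SimpleHtmlEscaper {
    // Replaces the five HTML metacharacters; every other character passes through.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break; // numeric form, since &apos; is not in HTML 4.01
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<script>alert('Boo!');</script>"));
        // prints: &lt;script&gt;alert(&#39;Boo!&#39;);&lt;/script&gt;
    }
}
```

Because “&” is itself escaped, applying the function twice double-escapes the text, so escaping should happen exactly once, at the point of output.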
Utilities
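import com.google.common.html.HtmlEscapers;
// Yields: &lt;script&gt;alert(&#39;Boo!&#39;);&lt;/script&gt;;
String escaped = HtmlEscapers.htmlEscaper().escape("<script>alert('Boo!');</script>;");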
CSS & JavaScript
Other languages that can be subverted in the output stream, such as CSS and JavaScript, each require their own context-specific sanitising and encoding.
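To illustrate why a separate encoder is needed, consider untrusted text placed inside a JavaScript string literal: HTML escaping leaves a backslash or a line break untouched, and either can break out of the literal. The sketch below takes a deliberately conservative approach, assumed here for illustration only, of replacing everything except ASCII letters and digits with a `\uXXXX` escape. The names `JsStringEscaper` and `escapeJsString` are hypothetical, and a vetted encoder library is preferable in production.

```java
public class JsStringEscaper {
    // Encodes every character that is not an ASCII letter or digit
    // as a backslash-u escape with four hex digits.
    static String escapeJsString(String input) {
        StringBuilder out = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean alnum = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alnum) {
                out.append(c);
            } else {
                // Each UTF-16 code unit is written out separately, so
                // surrogate pairs become two consecutive escapes.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A value that would close the string literal and inject a call if left raw.
        System.out.println(escapeJsString("';alert(1)//"));
    }
}
```

The escaped value contains no quote, backslash, angle bracket, or line terminator, so it can no longer end the string literal or the surrounding script element.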