Bleach is awesome. Thank you for it @willkg! It's a Python library for sanitizing text as well as "linkifying" text for HTML use. For example, consider this:
>>> import bleach
>>> bleach.linkify("Here is some text with a url.com.")
'Here is some text with a <a href="http://url.com" rel="nofollow">url.com</a>.'
Note that sanitizing is a separate thing, but if you're curious, consider this example:
>>> bleach.linkify(bleach.clean("Here is <script> some text with a url.com."))
'Here is &lt;script&gt; some text with a <a href="http://url.com" rel="nofollow">url.com</a>.'
With that output you can confidently interpolate that string straight into your HTML template.
Getting fancy
That's a great start but I wanted more. For one, I don't always want the rel="nofollow" attribute on all links. In particular, not for links that are within the site. Secondly, a lot of things look like a domain but aren't. For example, "This is a text.at the start" would naively become...:
>>> bleach.linkify("This is a text.at the start") 'This is a <a href="http://text.at" rel="nofollow">text.at</a> the start'
...because text.at looks like a domain.
So here is how I use it here on www.peterbe.com to linkify blog comments:
from urllib.parse import urlparse

import bleach
import requests
from django.conf import settings  # settings.NOFOLLOW_EXCEPTIONS: domains that should not get nofollow
from requests.exceptions import ConnectionError


def custom_nofollow_maker(attrs, new=False):
    href_key = (None, u"href")

    if href_key not in attrs:
        return attrs

    if attrs[href_key].startswith(u"mailto:"):
        return attrs

    p = urlparse(attrs[href_key])
    if p.netloc not in settings.NOFOLLOW_EXCEPTIONS:
        # Before we add the `rel="nofollow"` let's first check that this is a
        # valid domain at all.
        root_url = p.scheme + "://" + p.netloc
        try:
            response = requests.head(root_url)
            if response.status_code == 301:
                redirect_p = urlparse(response.headers["location"])
                # If the only difference is that it redirects to https instead
                # of http, then amend the href.
                if (
                    redirect_p.scheme == "https"
                    and p.scheme == "http"
                    and p.netloc == redirect_p.netloc
                ):
                    attrs[href_key] = attrs[href_key].replace("http://", "https://")
        except ConnectionError:
            return None

        rel_key = (None, u"rel")
        rel_values = [val for val in attrs.get(rel_key, "").split(" ") if val]
        if "nofollow" not in [rel_val.lower() for rel_val in rel_values]:
            rel_values.append("nofollow")
        attrs[rel_key] = " ".join(rel_values)

    return attrs
html = bleach.linkify(text, callbacks=[custom_nofollow_maker])
This is basically taking the default nofollow callback and extending it a bit.
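For comparison, the out-of-the-box behavior is just bleach's bundled nofollow callback (bleach.callbacks.nofollow), which unconditionally adds the attribute:

import bleach
import bleach.callbacks

# The default: every linkified URL gets rel="nofollow", no exceptions.
html = bleach.linkify(text, callbacks=[bleach.callbacks.nofollow])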
By the way, here is the complete code I use for sanitizing and linkifying blog comments here on this site: render_comment_text.
Caveats
This is slow because it requires network IO every time a piece of text needs to be linkified (if it has domain-looking things in it), but that's best alleviated by only doing it once and either caching it or persistently storing the cleaned and rendered output.
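That could look something like this; a minimal sketch assuming a Django cache backend (cached_linkify is a made-up name, not the actual code on this site):

import hashlib

from django.core.cache import cache

def cached_linkify(text):
    # Hash the raw text so the same comment only triggers the (potentially
    # slow, network-bound) linkification once, then reuse the rendered HTML.
    key = "linkified:" + hashlib.md5(text.encode("utf-8")).hexdigest()
    html = cache.get(key)
    if html is None:
        html = bleach.linkify(text, callbacks=[custom_nofollow_maker])
        cache.set(key, html, 60 * 60 * 24)  # keep for 24 hours
    return html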
Also, the check uses try: requests.head() except requests.exceptions.ConnectionError: as the method to see if the domain works. I considered doing a whois lookup or something, but that felt a little wrong because just because a domain exists doesn't mean there's a website there. Either way, it could be that the domain/URL is perfectly fine but, in that very unlucky instant you checked, your own server's internet connection or some DNS lookup is busted. Perhaps wrapping it in a retry and doing try: requests.head() except requests.exceptions.RetryError: instead would be better.
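A sketch of what that retry could look like, using a requests Session with urllib3's Retry mounted on it (domain_works is a hypothetical helper; note that exhausted connection retries can still surface as ConnectionError, so it's probably safest to catch both):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def domain_works(root_url, retries=3):
    # HEAD the root URL, retrying a few times with a small backoff before
    # giving up and declaring the domain unreachable.
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=Retry(total=retries, backoff_factor=0.5))
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    try:
        session.head(root_url)
        return True
    except (requests.exceptions.ConnectionError, requests.exceptions.RetryError):
        return False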
Lastly, the business logic I chose was to rewrite all http:// to https:// only if the URL http://domain does a 301 redirect to https://domain. So if the original link was http://bit.ly/redirect-slug it leaves it as is. Perhaps a fancier version would be to look at the domain name ending. For example, HEAD http://google.com 301 redirects to https://www.google.com so you could use the fact that "www.google.com".endswith("google.com").
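A sketch of that fancier version as a standalone helper (upgrade_href_if_https_redirect is made up; this is a guess at the logic, not what's deployed here):

from urllib.parse import urlparse

def upgrade_href_if_https_redirect(href, redirect_location):
    # Rewrite http:// to https:// if the 301 target is the same domain or a
    # subdomain of it, e.g. http://google.com -> https://www.google.com.
    p = urlparse(href)
    redirect_p = urlparse(redirect_location)
    if (
        p.scheme == "http"
        and redirect_p.scheme == "https"
        and redirect_p.netloc.endswith(p.netloc)
    ):
        return href.replace("http://" + p.netloc, "https://" + redirect_p.netloc, 1)
    return href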
UPDATE Oct 10 2018
Moments after publishing this, I discovered a bug where it would fail badly if the text contained a URL with an ampersand in it. Turns out, it was a known bug in Bleach. It only happens when you try to pass a filter to the bleach.Cleaner() class.
So I simplified my code and now things work. Apparently, using bleach.Cleaner(filters=[...]) is faster so I'm losing that. But, for now, that's OK in my context.
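In other words, the simplified approach is just to run the two steps back to back instead of composing them inside bleach.Cleaner(filters=[...]); roughly like this (the real render_comment_text linked above does more than this):

def render_comment_text_sketch(text):
    # Escape/strip disallowed HTML first, then linkify the result with the
    # custom callback: two passes instead of one Cleaner with a filter.
    cleaned = bleach.clean(text)
    return bleach.linkify(cleaned, callbacks=[custom_nofollow_maker])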
Also, in another later fix, I improved the function some more by avoiding non-HTTP links (with the exception of mailto: and tel:). Otherwise it would attempt to run requests.head('ssh://server.example.com') which doesn't make sense.
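That guard could look roughly like this at the top of the callback (a sketch of the idea; the exact check in the real render_comment_text may differ):

def custom_nofollow_maker(attrs, new=False):
    href_key = (None, u"href")
    if href_key not in attrs:
        return attrs
    p = urlparse(attrs[href_key])
    # Leave mailto: and tel: links completely alone.
    if p.scheme in ("mailto", "tel"):
        return attrs
    # Bail out on any other non-HTTP scheme (ssh://, ftp://, ...) so we never
    # try to requests.head() something that isn't a website.
    if p.scheme not in ("http", "https", ""):
        return attrs
    ...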
Comments
Very nice.
I've been working on something like this recently, and experiencing similar problems.
Have you considered checking the URL for spaminess and malware somehow? Especially after manually reviewing lots of submitted URLs, and when old URLs get their domains taken over by other dodgy websites. There are a few services around for this, but I'm not sure which is good. The obvious one is the Google Safe Browsing service, which Firefox uses (or used to?): https://developers.google.com/safe-browsing/
Speaking of domains getting taken over, or going down... having a link to archive.org is kind of nice. I haven't implemented this, but an easy way might be to put an archive icon link next to every link. Not sure how to check if 'the url is sort of what it should be', because content and site design can change.
Expansion of shortened urls, and https checking would be good (there's still lots of sites with broken https). The link shorteners are often trackers. But at least now many of the mappings are tracked by 301works: https://archive.org/details/301works&tab=about So now when link shorteners stop working, or are changed, it should be possible to find out where things went to before it changed.
For comments, I've trained a spam classifier with a bunch of blog comments collected over the years. Additionally, putting limits on how many comments can be posted per hour, to stop bots from hammering the commenting endpoints, has helped a lot. Finally, having moderation tools for site admins to mark comments as spam/not spam has helped.
In short, the world is your oyster when you have a tool like this. But one thing you can definitely do is train a spam classifier separately just on the URLs. That way, if the spammers use "good words" around spamming URLs you can catch them based on the URLs.
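For example, a tiny sketch of that idea (hypothetical, using scikit-learn on made-up labeled URLs):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

urls = ["https://example.com/some-post", "http://cheap-pills.example.biz/buy"]
labels = [0, 1]  # 0 = ham, 1 = spam

# Character n-grams work well on URLs since there are no real "words".
classifier = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    MultinomialNB(),
)
classifier.fit(urls, labels)
print(classifier.predict(["http://discount-pills.example.biz/deal"]))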
On this, my own blog, I manually moderate all comments. It sucks but it's relatively quick. The blue links stand out and alert me to take a closer look.