PT Spam Blocker

Salvador is making some noise on ARN about a comment of his being rejected by our spam filter. This post is to clarify things.

Spammers target blogs with comments. These attacks can be harsh. At times spammers will go through every single post in the blog and post three comments containing scores of links advertising everything from child rape to internet bingo.

To counter such horrid spam, we employ a blacklist plugin that searches every comment for certain patterns and rejects any that match. Unfortunately, legitimate comments sometimes get blocked too. Users are sent a message identifying the offending content so they can change the post. (Robotic spammers ignore such messages.)
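As a rough illustration of that kind of check, here is a minimal Python sketch. It is not the plugin's actual code: the pattern list and the function name `find_blocked_pattern` are made up for the example, and the real blacklist.txt entries are not reproduced.

```python
import re

# Hypothetical blacklist entries for illustration only; these are NOT
# taken from the site's real blacklist.txt.
BLACKLIST = [r"ruky\.net", r"cheap-pills\.example"]


def find_blocked_pattern(comment, patterns=BLACKLIST):
    """Return the first blacklist pattern found in the comment, or None."""
    for pattern in patterns:
        if re.search(pattern, comment, re.IGNORECASE):
            return pattern
    return None


offending = find_blocked_pattern("Look at http://foo.ruky.net today")
if offending is not None:
    # Report the matching pattern so a human commenter can edit the post.
    print("Comment rejected; it matched:", offending)
```

Returning the matching pattern, rather than just a yes or no, is what lets the rejection message tell a commenter which part of the text to change.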

Our typical cycle of spam control went like this:

  1. Spam gets through the filter.
  2. We recognize the spam.
  3. We add the URLs from the spam to the blacklist.
  4. We delete the messages that got through.

Of course, the time between 1 and 4 can be hours or days, which can lead to a lot of naughty messages sitting around the blog for a while.

I finally got fed up with this reactive technique a few months ago, and decided to see if there was a better option. I tried the explanatory filter but it was unable to detect links designed by spammers. So I had to fall back on the old methodology: I took links from our blacklist, which I already knew had been designed by spammers, and tried to deduce some megarules from them. I ended up deciding to block urls that contained multiple hyphens, since about 75% of the spammers’ urls went something like “hot-chicks-want-to-hottub-with-you.ruky.net.” (I also blocked all .info addresses since we were only getting spam from them.)
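A minimal Python sketch of the two rules, for readers who want to see how they behave. The real rules live in the blacklist as patterns, not as code like this; the function name and the threshold (two or more hyphens counts as “multiple”) are assumptions made for the example.

```python
# Assumption: "multiple hyphens" means two or more anywhere in the URL.
MAX_HYPHENS = 1


def is_suspicious_url(url):
    """Apply the multi-hyphen rule and the .info rule to a single URL."""
    # Drop the scheme ("http://") so only host and path are examined.
    rest = url.split("://", 1)[-1]
    # The host is everything before the first slash, minus any port.
    host = rest.split("/", 1)[0].split(":", 1)[0].lower()
    if host.endswith(".info"):
        return True
    # Hyphens are counted across the whole URL, host and path alike.
    return rest.count("-") > MAX_HYPHENS


print(is_suspicious_url("http://hot-chicks-want-to-hottub-with-you.ruky.net"))
print(is_suspicious_url("http://www.drupal.org/"))
```

Because the count covers the path as well as the host, a legitimate link with several hyphens in its file name trips the rule too, which is the kind of false positive this post is about.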

The multi-hyphen megarule has worked very well. However, it is still experimental and has been modified more than once. If you have a problem with getting a url past the spam blocker, you can simply use tinyurl.com to create a replacement url. That is what Wesley did in this comment to link to ISCID. (Contrary to some claims we did not change the blacklist for Wesley.)

We do make our blacklist publicly available, blacklist.txt, so anyone can check if we are banning sites critical of us. Sorry, would-be martyrs, we do not censor your favorite sites from comments, unless you’re into mature mamas or something. Besides, if we wanted to censor you, we’d ban your IP, not add you to our spam blocker.

17 Comments

Thanks very much for the explanation. I have had problems a couple of times with posts being rejected, but had no idea why. Since the error message indicates the text that caused the problem, I was able to make changes and post successfully. As you indicate, perfectly innocuous text may get pegged as an error. But it was easy to fix, so I’m not complaining.

“I tried the explanatory filter but it was unable to detect links designed by spammers.”

That’s because you should be looking for irreducible complexity! Haven’t the IDers taught you anything?

I am sure I lost many neurons looking in on ARN. I have heard that ARN is in the midst of a large scale purge of scientists, and the IDiots’ “blog” won’t allow any comments at all, nor does it even link to sites that oppose creationist stupidity, and Sal is moaning that he is oppressed.

Good grief!

So you have a blacklist. Have you considered adding a whitelist? For example, here’s a site that I might legitimately want to link, but that contains multiple hyphens: http:// www. don-lindsay-archive .org/ creation/god_of_gaps.html

I think that since creationists can’t build external credibility with their methods or their results, they do it by climbing on a cross and martyring themselves. When no one is willing to put the nails in, though, they just wind up looking silly. Not to imply, of course, that the talented scientific minds at ARN are somehow not the focus of every posting, comment, and defarious plot at PT. Because we all know that they are.

Meanwhile, the ARN brain trust is busy discussing “The Integrity Difference Between God and Allah.” Can a Nobel Prize in physics be far behind?

Pardon me. I meant to refer to the proprietors’ *nefarious* plots. I have no idea what their defarious plots are, but I’m sure they’re related to their datheistic dagenda.

Thank you Reed for clarifying. I withdraw my complaint regarding that threads at ISCID were being singled out by the auto-blocking features.

I accept your explanation that URLs to ISCID threads were by coincidence sharing characteristics with URLs like the one you mentioned, such as

“hot chicks want to hottub with you ruky net” (hyphens omitted)

and that URLs to ISCID threads were only inadvertently censored because they shared characteristics with URLs from porn sites.

I extend my thanks for your hospitality here at PandasThumb. You need not worry about any spam threats from me here at PandasThumb.

Further, though I have vigorously assailed some of the writings of Wesley Elsberry, I salute him as a gentleman. He’s far more statesmanlike than I ever will be. Same can be said of many of the contributors at PandasThumb including yourself, Steve Reuland, Jason Rosenhouse, Richard Hoppe, Jack Krebs, Matt Young, Mark Perakh, etc…

regards, Salvador

Salvador, what about me? That really hurts, man.

Colin wrote

Meanwhile, the ARN brain trust is busy discussing “The Integrity Difference Between God and Allah.”

Colin, is that a joke? It must be a joke. It’s a joke, right?

No, it is not a joke. At least, not the humorous kind.

“The Integrity Difference Between God and Allah.”

To be fair, it is in the “Off Topic” forum, so it may be exempt from ARN’s normally rigorous scientific methodology.

(That one was a joke.)

The white list is not a bad idea. Put the anti-evolutionist sites on it just to make the point.

Note that in this post ‘*’ means hyphen.

I think I might have a possible modification to the no-multiple-hyphens rule. As everyone here probably realizes a URL has three parts:

method: usually http://

domain name: whatever.org

path: /documents/speech1.html

Trash/spam URLs like the hypothetical “hot*chicks*want*to*hot*tub*with*you.ruky.net” have their hyphens in the domain name. The “ubb-get_topic*f*6*t*000532.html” was the path for the ISCID link Salvador tried to link to.

A regular expression to do this should be fairly easy to construct.

– Anti-spam: replace “user” with “harlequin2”
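The domain-only rule suggested in the comment above could be sketched roughly as follows. This is standalone Python using the standard library's URL parser, not an MT-Blacklist pattern; the function name, the hyphen limit, and the host `forum.example.org` are assumptions made for the example, since only the path of the ISCID link is quoted in the thread.

```python
from urllib.parse import urlsplit

# Assumption: allow up to two hyphens in a host name before flagging it.
MAX_HOST_HYPHENS = 2


def host_has_too_many_hyphens(url, limit=MAX_HOST_HYPHENS):
    """Count hyphens in the host name only, ignoring the path."""
    # hostname is None when the URL has no scheme, so fall back to "".
    host = urlsplit(url).hostname or ""
    return host.count("-") > limit


# Hyphens in the domain name: flagged.
print(host_has_too_many_hyphens(
    "http://hot-chicks-want-to-hottub-with-you.ruky.net"))
# Hyphens only in the path (hypothetical host): not flagged.
print(host_has_too_many_hyphens(
    "http://forum.example.org/ubb-get_topic-f-6-t-000532.html"))
```

With a limit of two, a host such as www.don-lindsay-archive.org would also pass, though the right limit is a judgment call that depends on what the spam actually looks like.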

I notice that “yahoo dot com” does not appear in the blacklist as such, but it is still prohibited by the spam filter.  In e-mail addresses (not content), no less!

Is there a second filter for that?

Can you fix it and allow me the luxury of not munging my address?  Thanks.

I actually deleted yahoo.com from the blacklist yesterday. Some spammers use yahoo.com in their email, which can cause it to get added to the blacklist.

“are spam filter”?

Mike Hopkins Wrote:

A regular expression to do this should be fairly easy to construct.

It’s actually harder than you might think, given the constraints of MT Blacklist.

Salvador: “I withdraw my complaint regarding that threads at ISCID were being singled out by the auto-blocking features.”

Help, help! I’m being repressed! Come see the violence inherent in the system!

Yap, yap, yap…

An alternative to tweaking the hell out of your blacklist is to switch to MT 3.x, which supports (and actually comes with) MT-Blacklist 2. I had to switch my personal site to MT 3 for exactly this reason, and haven’t experienced a type 1 or type 2 error since.

Depending on how heavily you’ve modified your installation of MT, making the switch might be easy or might be a total pain in the ass. Of course, there’s always drupal, which might prove to be far superior to MT for a site like this. Check it out: http://www.drupal.org/

Switching to MT 3 and Blacklist 2 is in the pipeline after our planned server OS upgrade.

About this Entry

This page contains a single entry by Reed A. Cartwright published on February 7, 2005 8:55 PM.
