Pastebin fights the spam!

A few people have emailed me recently disappointed by the level of spam postings on pastebin.com. I’ve never really understood why spammers bother, but as they are bothering in increasing numbers it was time to take some action.

Last night I built in some spam filtering which has caught hundreds of posts since going live. I also added a “report spam” link which has flagged over 500 posts in the past 20 hours. By iteratively tweaking the spam filter to identify the legitimately flagged posts, I’ve been able to quickly delete a lot of older spam posts.

Hopefully this will make pastebin look like a well-tended garden rather than a run-down wasteland! Comments welcome…

56 thoughts on “Pastebin fights the spam!”

  1. Slepp

    Though we all love the spam… Oh come on.. you know you do.. :>

    Anyhow, Paul, check out http://www.projecthoneypot.org.. It is quite effective, and you can redistribute it with your code as well (though I think an end user has to do a bit to get it going again, like getting their own account). It could complement the new methods you implemented yourself.

  2. Anonymous

    Easy way to kill most spam on the spot:

    – rename the name/link/etc. fields to something weird
    – put this in your form:

    (Leave these fields blank!)

    – put this in the site stylesheet:
    div.trap { display: none; visibility: hidden }

    Then when processing the form, if there’s anything in the ‘name’ or ‘link’ fields, drop the post. This method alone has cut my spam by about 90-95%. I have no captcha, and no wonky heuristics that are eventually bound to flag legitimate data as spam at some point or another. Just a couple of innocent-looking fields that spambots fill in because they look important.
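
A minimal sketch of the server-side half of this trick, assuming the decoy inputs are called `name` and `link` as in the comment above. The real field names (`x7_author`, `x7_body`) and the function name `is_probable_bot` are hypothetical, chosen only for illustration:

```python
# Decoy inputs that live inside the hidden div.trap block.
TRAP_FIELDS = ("name", "link")


def is_probable_bot(form: dict) -> bool:
    """Return True if any decoy field arrived with content in it."""
    return any(form.get(field, "").strip() for field in TRAP_FIELDS)


# A human never sees the decoys, so they are submitted empty.
human_post = {"x7_author": "alice", "x7_body": "print('hi')", "name": "", "link": ""}
# A naive bot fills in everything that looks important.
bot_post = {"x7_author": "", "x7_body": "buy pills", "name": "Bob", "link": "http://spam.example"}

print(is_probable_bot(human_post))  # False
print(is_probable_bot(bot_post))    # True
```

The check only inspects the decoys, so legitimate content is never scored or guessed at, which is the property the commenter values.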

  3. Anonymous

    … oh, ffs, it stripped out the html. pretend these parens are angle brackets. 😛

    (div class=”trap”)
    (Leave these fields blank!)

  4. lordelph Post author

    Hah, it’s OK, I understood. I should try it, though I fear it would only work for a short period 🙁

  5. Slepp

    In response to lordelph & Anonymous, it actually works for a very long period.. Still working for me. It even works with (input type=”hidden”) for some of the absolutely stone stupid bots.

  6. lordelph Post author

    Thanks for all the comments, I’ve continued to tweak it and the amount of spam (and spam reports) has fallen dramatically. Will keep an eye on it!

  7. msg

    How about pictures? Maybe you should add some kind of engine that generates images with words or totally random letters to re-type? 90% of the spam bots would be gone.

  8. Rick

    msg: I’m strongly against pictures, I really dislike filling in those things and it would at least drive me away from this site. The hidden form fields work with most bots and if that’s not enough then a bit of javascript (perhaps in combination with ajax) is sufficient to kill most spambots.

  9. lordelph Post author

    @msg: The anti-spam measures are working very well, but I’ve had one report of a legitimate post getting flagged as spam, so I will either relax things a little or add a CAPTCHA only for those posts with a spam smell to them!

    @Noccy: interesting, but would anyone really use it? I’ve got an idea for making the existing line highlighting features easier to use though…

  10. Vinyanov

    (Not reading the previous comments, sorry, it’s just a quick note). A few moments ago I boldly clicked on a Spam report link and seemingly reported valid code. Let me explain:

    You know, as a webmaster I have a slightly daring attitude when speaking of web forms. I often click on things just to see if they offer any confirmation or how they deal with invalid input. And, to my regret, your markup did not offer me any confirmation prompt with which I could verify my decision.

    Could you possibly add an … for your anti-spam links? Hope that helps the site. 🙂

  11. Vinyanov

    Oops, an overactive code parser. Let’s try again: Could you possibly add an a onclick=confirm(“Sure?”) href=http:// … for your anti-spam links? Hope that helps the site. 😉

  12. lordelph Post author

    The “report spam” link is immediate by design, to encourage its use. False positives are relatively rare and are ignored.

  13. Selig

    I agree with the hidden fields system. Other people have used it in their comment systems, and it got rid of most of their spam. The spam bots don’t think to check the CSS or sometimes even the field type. I strongly suggest this over CAPTCHA, as a CAPTCHA can irritate users more than the benefit of the spam bots being deterred.

  14. Gargantua

    I don’t know if I’m the only person saying this, but what exactly DEFINES spam? It could be people just posting things to transfer them elsewhere…

  15. Anonymous

    Before I put the spam filter on, there were hundreds of posts which were just lists of links and keywords for typical spam enterprises, submitted multiple times.

    If you didn’t see it before, trust me, it is pretty obvious when a post is spam!

    For those posts flagged with the “Report spam” feature, I read them and think “could someone conceivably want to send this to another individual for comment or review?”. If I see a new pattern emerging, I tweak the automated filter appropriately.

    I’ve only had one report of a false positive so far, but I’m happy to hear of such incidents…
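
The actual filter rules are not given in this thread, so the following is only a guess at the kind of heuristic that would catch “lists of links and keywords”. The keyword list, the weights and the function name `spam_score` are all hypothetical:

```python
import re

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)
# Hypothetical keywords; a real list would be tuned from reported posts.
SPAM_KEYWORDS = ("viagra", "casino", "cheap pills")


def spam_score(text: str) -> float:
    """Score a post from 0.0 (looks clean) to 1.0 (looks like link spam)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    # Fraction of non-empty lines that contain at least one URL.
    link_lines = sum(1 for line in lines if URL_PATTERN.search(line))
    link_ratio = link_lines / len(lines)
    # Each distinct keyword found adds a fixed penalty.
    lowered = text.lower()
    keyword_hits = sum(1 for word in SPAM_KEYWORDS if word in lowered)
    return min(1.0, link_ratio * 0.7 + keyword_hits * 0.3)
```

Scoring by the share of lines that carry a link, rather than by the mere presence of a URL, is meant to leave room for code or log output that happens to mention one address. Whether that is enough to avoid false positives on real logs would need testing against real posts.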

  16. sysprv

    Hey hey hey 🙂
    I use pastebin with cURL scripts… To upload settings etc. from different computers. Please don’t restrict it too much 🙂 at least for private domains… The really simple (and therefore scriptable) page structure (form) is…delicious.

    By the way, is there a way to retrieve posts that have fallen off the “Recent Posts” box?

    Thank you for this service 8)

  17. Nicolas

    Report spam link is totally wrong for a simple reason: it’s a link. A GET request shouldn’t cause an action. If the request has a side effect, use POST (so same goes for the delete link). I can easily see a crawling bot marking all posts as spam.

    Hmm actually no… Because it’s even worse: Why The Hell Javascript? What good reason do you have to block people from reporting spam when using lynx from a headless machine?
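
To illustrate the objection (not how pastebin.com actually behaves), here is a minimal sketch of a report handler that only changes state on POST. The function name `handle_report` and the in-memory `flagged` set are hypothetical stand-ins for a real request handler and database:

```python
def handle_report(method: str, post_id: str, flagged: set) -> int:
    """Return an HTTP status code; only a POST is allowed to flag a post."""
    if method != "POST":
        return 405  # Method Not Allowed: following a link must not flag anything
    flagged.add(post_id)
    return 204      # No Content: the flag was recorded


flagged = set()
# A crawler following the link issues a GET and changes nothing.
print(handle_report("GET", "f3a9", flagged), flagged)   # 405 set()
# A form submission issues a POST and records the report.
print(handle_report("POST", "f3a9", flagged), flagged)  # 204 {'f3a9'}
```

A plain HTML form with a submit button sends a POST without any JavaScript, so this approach would also work from a text browser.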

  18. lordelph Post author

    Thanks again. But I really did want it to work the way it does. Sorry it annoys you so much though.

  19. kato

    any chance I could get a download of the new code so I don’t have to write it myself? My pastebin is overrun with spam : (

  20. Internet Expert

    Everybody hates CAPTCHAS. I think an agglomeration of inconspicuous methods (to the normal user) would be best.

    For instance, setting a cookie and ensuring that the user sends it back will ensure that at least a somewhat functional browser is being used. (Not many bots will use cookies). If no cookie is sent back along with the paste form, you can display a kind red message to the user to enable cookies. (Cookies are necessary on an ever-increasingly dynamic Web!)

    Not forgetting the hidden input methods too, and slightly medieval methods such as blacklisting on multiple detections and heuristic confirmations.

    As pastebin software becomes more popular, more spammers will tailor their bots to specifically target it and bypass your specific anti-spam techniques. This is where CAPTCHA must simply come in, until better methods of differentiating a human brain and electronic processor are developed.
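
The cookie round trip suggested in this comment could be sketched roughly as follows. The cookie name, the secret and the choice to tie the token to the client address are assumptions made for the example; tying it to the address may be too strict for users behind changing proxies:

```python
import hashlib
import hmac
from http.cookies import SimpleCookie

SECRET = b"change-me"      # hypothetical server-side secret
COOKIE_NAME = "pb_seen"    # hypothetical cookie name


def _token(client_ip: str) -> str:
    """Derive a token that only the server can produce for this address."""
    return hmac.new(SECRET, client_ip.encode(), hashlib.sha256).hexdigest()


def issue_cookie(client_ip: str) -> str:
    """Build the Set-Cookie header value sent along with the paste form."""
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = _token(client_ip)
    return cookie[COOKIE_NAME].OutputString()


def cookie_came_back(cookie_header: str, client_ip: str) -> bool:
    """Check the Cookie header of a submission for a valid token."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    if COOKIE_NAME not in cookie:
        return False
    received = cookie[COOKIE_NAME].value.encode()
    return hmac.compare_digest(received, _token(client_ip).encode())
```

When `cookie_came_back` returns False, the site can show the friendly “please enable cookies” message described above instead of silently dropping the post.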

  21. Eero

    “I cant post my logs any more, always flagged.”

    Same here, I can’t use my pastebin anymore. It says only: “Sorry, your post tripped our spam filter – let us know if you think this could be improved”

  22. Johnny

    Have a check box that is “check this box if this is a spam post”. In the code make the checkbox look important and as though it must be clicked to post and bots will check that field and get flagged (or even better, banned).

  23. lordelph Post author

    Something like that already occurs. Most of the “spam” filtering is more about filtering posts I don’t want to be hosting.

  24. Seal

    I’ve found that when you put your comments on a separate page it seems to dramatically reduce the amount of comment spam, maybe because the comment page has a lower or zero page rank. It takes a bit away from the whole ‘flow’ of the blog but you gotta weigh up the pros and cons

  25. Adam Higerd

    Someone mentioned a honeypot earlier, as well as the problems with making the “report spam” and “delete” links GET requests instead of POST requests.

    A simple solution is to make a “honeypot” anchor link in your invisible DIV. If the honeypot link gets followed by a “user” (of course, the text of the link should indicate that you shouldn’t click it, but this would only be visible to non-CSS browsers) then that’s an indication that you’re looking at a crawler that should be temporarily banned/ignored.

    Meanwhile, to make sure that search engines work, make sure that the honeypot script (as well as the scripts that manage marking spam and deleting posts!) is listed in your robots.txt so that well-behaved crawlers know to ignore it.
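
As a rough illustration of the robots.txt part, the snippet below holds an example file and uses Python’s standard `urllib.robotparser` to confirm what a well-behaved crawler would conclude from it. The script paths (`/trap.php`, `/report.php`, `/delete.php`, `/view.php`) are hypothetical placeholders for the real script names:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt contents; the paths are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap.php
Disallow: /report.php
Disallow: /delete.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler must skip the honeypot and the action scripts...
print(parser.can_fetch("ExampleBot", "/trap.php"))  # False
# ...but remains free to index ordinary pages.
print(parser.can_fetch("ExampleBot", "/view.php"))  # True
```

Any client that requests the honeypot path anyway has either ignored robots.txt or followed a link no human could see, which is the signal the comment proposes acting on.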

  26. Xrvel

    I like your pastebin. I use it often, but I hate the spam. Why don’t you use captcha? It’s easy to implement.

  27. Russ

    I like the pastebin the way it is. Mostly I deal with people asking for help in IRC while I’m at work. If I have a minute I help people out.

    I wouldn’t take the time to deal with a captcha if it was implemented. If anything I would perhaps change the code so the captcha appears iff the first spam method triggers.
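
The “captcha only if the first check triggers” idea could look something like this sketch. It assumes some earlier step has already produced a spam score between 0.0 and 1.0; the threshold and the names `Verdict` and `triage` are hypothetical:

```python
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    CHALLENGE = "challenge"  # show a CAPTCHA before accepting the post


# Hypothetical threshold on a 0.0-1.0 spam score.
CHALLENGE_AT = 0.4


def triage(score: float, captcha_passed: bool = False) -> Verdict:
    """Accept clean posts outright; challenge only the suspicious ones."""
    if score < CHALLENGE_AT:
        return Verdict.ACCEPT
    return Verdict.ACCEPT if captcha_passed else Verdict.CHALLENGE


print(triage(0.1))                       # Verdict.ACCEPT
print(triage(0.7))                       # Verdict.CHALLENGE
print(triage(0.7, captcha_passed=True))  # Verdict.ACCEPT
```

With this arrangement most users never see a CAPTCHA, and a false positive costs the poster one extra step rather than a rejected paste.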

  28. Ralf

    Hi there,

    am I missing anything? I just downloaded and installed pastebin from http://pastebin.com/pastebin.tar.gz just to realize that it hardly looks different from my older version.

    I can’t see any means of spam detection or how spam could be avoided in pastebin-0.60.

    Which version of pastebin are you talking about and if it’s not 0.60, where can I get the source?

    Regards
    Ralf

  29. Axel Werner

    I don’t know why, but your crappy spam filter really is annoying sometimes. I wanted to post the contents of a Linux file and some console output and your darn spam filter denied my post for being spam.

Comments are closed.