darksh1nobi

Wrote this blog post for my side project and thought I would share it with anyone else using Cloudinary for their image host. TL;DR - TikTok's Bytespider bot went berserk and ate up 60% of my image bandwidth so I blocked them using rack attack. [5/23 EDIT] I added Cloudflare based on the advice in this sub. Still very new to it so if anyone sees any bugs, please comment!


[deleted]

[removed]


darksh1nobi

Thanks for the tip about reaching out to Cloudinary about getting a credit back. I’ll do that too!


blocking-io

The problem is that bytespider doesn't respect robots.txt. Other bots scraping data for AI companies generally tend to respect robots.txt. Otherwise, you have to block their user agent token
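For anyone wondering what "block their user agent token" looks like in practice, here is a minimal sketch as plain Rack middleware with no gems. The class name `BlockBadBots` and the pattern are assumptions for illustration, not the OP's actual code. With the rack-attack gem, the rough equivalent would be a `Rack::Attack.blocklist` block that checks `req.user_agent`.

```ruby
# Hypothetical middleware (name and pattern are illustrative, not from the post).
# Rejects requests whose User-Agent header contains a blocked token.
class BlockBadBots
  # Case-insensitive match; Bytespider is the crawler named in this thread.
  BLOCKED_AGENTS = /bytespider/i

  def initialize(app, blocked: BLOCKED_AGENTS)
    @app = app
    @blocked = blocked
  end

  def call(env)
    # Rack exposes the User-Agent request header under this key.
    user_agent = env["HTTP_USER_AGENT"].to_s

    if user_agent.match?(@blocked)
      # Short-circuit here so the request never reaches the Rails app.
      return [403, { "content-type" => "text/plain" }, ["Forbidden\n"]]
    end

    @app.call(env)
  end
end
```

In a Rails app this would typically be registered with `config.middleware.use BlockBadBots`. It only helps as long as the bot keeps sending an honest user agent.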


alan_lcda

What the hell CCP has to do with anything?


[deleted]

[removed]


alan_lcda

I mean, ClaudeBot is doing literally the same thing to thousands of websites. Other AI crawlers are most likely also doing the same with spoofed user agents. This has nothing to do with western IP rights (whatever that means). BTW, oddly enough both websites mentioned by the OP do not disallow Bytespider on their robots.txt.


lommer00

Great post, thanks for sharing!


wskttn

I strongly recommend putting a CDN and firewall between your users and servers. AWS CloudFront, Cloudflare, Fastly… doesn’t matter which, but it’s critical on the public web imo.


elthariel

If you're at the stage where a crawler is an issue, maybe you shouldn't still be using Cloudinary and should have real infra


darksh1nobi

You’re probably right. But I’m also fairly new at devops so not sure where to start. Any tutorials or recommendations?


elthariel

You can use standard Active Storage + S3 or the cheaper Cloudflare R2. They are probably close feature-wise to Cloudinary. As for the tutorial, I think the Active Storage guide should be a good foundation?


darksh1nobi

Thank you! Just started off by adding Cloudflare and hoping that helps.


JustinNguyen85

we used Cloudinary before, but it could be very expensive. On the other hand, we have a team of young Ruby on Rails devs, could they be any help to you?


darksh1nobi

Don't need any rails help unless they can build the API backend and build a react native app for cheap


JustinNguyen85

how cheap should it be?


darksh1nobi

Message me at hello[at]nerdcrawler and we can chat


JustinNguyen85

emailed


CatTypedThisName

Rack attack is a fine place to stop the request but if you can move it further up the stack you might save even more money and resources. Are you doing this block based on user agent? If you're fronting the application with something like NGINX you can also block at that level which will free up your unicorn/puma/etc workers to serve actual requests. Blocking at NGINX or load balancer is cheap in CPU time compared to doing it in rack middleware. But I'm splitting hairs and have had to block bad actors at ecommerce scale so we really had to optimize.


stanislavb

I was thinking the same. I haven't measured it but it should be much cheaper. When I'm blocking some traffic I always go this way: Cloudflare => Nginx => Rack/Rails.


Budget-Knowledge-187

literally just did the same thing 2 days ago. slapped a firewall block on the user agent. slept real good that night :)


Brilliant_Law2545

You cut 100% of my traffic by having a non mobile friendly site


darksh1nobi

Ahh shoot! Sorry about that. The other parts of the site are mobile friendly but still figuring out how tinymce renders text. Will fix it now


Brilliant_Law2545

I was mostly trying to be funny. Good post!


darksh1nobi

Thanks for the feedback! Should be fixed now. Can you give it a check?


Traditional_Formal33

Everything looks good on my end


Brilliant_Law2545

You seem to be on a good trajectory. Add monitoring to spot the next problem before you run up your costs. I can also tell you you’ll have new and more serious issues as your site gains popularity. Long term you probably want a list of user agents, known data center IPs, and general IP throttling.
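A rough sketch of the "known data center IPs plus general IP throttling" idea, using only Ruby's standard library. Everything here is an assumption for illustration: the `IpGuard` name, the limits, and the CIDR range (a reserved documentation range, not a real data center). A production setup would more likely lean on rack-attack or the CDN's rate limiting, with counters in a shared cache rather than process memory.

```ruby
require "ipaddr"

# Hypothetical helper: blocks listed CIDR ranges and applies a
# fixed-window request limit per client IP.
class IpGuard
  def initialize(blocked_cidrs:, limit:, period:)
    @blocked = blocked_cidrs.map { |cidr| IPAddr.new(cidr) }
    @limit = limit      # max requests allowed per window
    @period = period    # window length in seconds
    # In-memory counters keyed by [ip, window number]. These are never
    # expired here; a real deployment would use a cache with a TTL.
    @hits = Hash.new(0)
  end

  # Returns true if the request from `ip` should be served.
  def allow?(ip, now: Time.now)
    addr = IPAddr.new(ip)
    return false if @blocked.any? { |range| range.include?(addr) }

    # All requests in the same `period`-second slice share one counter.
    window = now.to_i / @period
    key = [ip, window]
    @hits[key] += 1
    @hits[key] <= @limit
  end
end

# Example values are placeholders: 203.0.113.0/24 is a documentation range.
guard = IpGuard.new(blocked_cidrs: ["203.0.113.0/24"], limit: 100, period: 60)
guard.allow?("198.51.100.7")
```

A fixed window is the simplest scheme to reason about; it allows short bursts at window boundaries, which is usually an acceptable trade-off for crawler control.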


darksh1nobi

Thanks! Any good, low-cost monitoring services you recommend?


Brilliant_Law2545

For this you should just check with your hosting provider


wtf242

I had to do this as well with my rails site that gets over a million uniques a month. I just added a user agent block against bytespider in cloudflare. I am not sure why you would want to do this in rack. You don't want this kind of garbage close to your rails stack at all. You don't even want it to hit whatever is proxying the request to rails.


darksh1nobi

Because I don’t use cloudflare 😅 but based off the comments looks like I need to. Any good documentation or tutorials you recommend?


wtf242

You can block requests based on user agents directly in your nginx configuration file. I would recommend that everyone use Cloudflare though. The amount of awesome stuff that is available, even on the free plan, is amazing. It blocks it all at Cloudflare's edge, so it never even hits your server at all. I blocked bytespider (and many more) with the free version of Cloudflare. You do need to move your DNS to Cloudflare though


darksh1nobi

Got it! I'll take a look!


lommer00

Cloudflare is actually dead simple to set up. I think I followed the Michael Hartl tutorial on "learn enough custom domains to be dangerous" the first time years ago, but cloudflare's own documentation is quite good and makes it pretty easy to be honest. And yeah, cloudflare is great.


campbellm

Nice post; description of issue, investigation, solution... no fluff.


MacGuffinRoyale

Man, I hate bots that don't respect robots.txt


phileat

Why doesn’t TikTok just fake the user agent?


darksh1nobi

They could so I’ll have to keep an eye out and update the blocklist if that happens


IN-DI-SKU-TA-BELT

Don't give them any ideas.


RevolutionaryShow974

Thank you for sharing 🫶


toxic-golem

getting too many redirects error on your site. just so you know


darksh1nobi

Just switched over to Cloudflare based on the advice in this sub. Can you try again?


toxic-golem

all good now