Bot control
Bot traffic is unavoidable on the modern internet, and bots fall into two categories:
- official/honest bots - HTTP(S) clients that identify themselves as bots (crawlers and scanners from search engines, LLMs, social media platforms, and so on)
- malicious bots - HTTP(S) clients that mimic user browsers, usually harvesting information for various purposes or attacking the application to slow it down, destabilize it, take it offline, or significantly increase cloud costs.
Because of that difference, the countermeasures are also different.
Managing honest bots
Honest bots usually respect the industry-standard robots.txt file. This simple, must-have file defines the rules for bots: which paths they may crawl, which are off limits, how often they may crawl, and so on.
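As a minimal illustration, a robots.txt file served from the site root might look like the following. The paths and the crawl delay are examples only, and not every crawler honors the Crawl-delay directive:

```text
# Rules for all compliant crawlers
User-agent: *
Disallow: /checkout/
Disallow: /customer/
Crawl-delay: 10

# Block one specific crawler entirely (the name is a placeholder)
User-agent: ExampleBot
Disallow: /
```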
For more information, see Crawler Control.
Managing malicious bots
Malicious bots, on the other hand, usually ignore such recommendations. For that reason, the only effective way of limiting their negative impact is to filter out their requests.
There are a few options here:
Basic HTTP auth
Place the application, or part of it, behind authentication using the nginx configuration.
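A minimal sketch of such a configuration, assuming the application is proxied by nginx; the upstream address, realm name, and htpasswd file path are illustrative:

```nginx
# Assumed application address; adjust to the real backend.
upstream application_backend {
    server 127.0.0.1:8080;
}

server {
    listen      80;
    server_name example.com;

    location / {
        # Ask for credentials before proxying any request.
        # Create the file with: htpasswd -c /etc/nginx/.htpasswd <user>
        auth_basic           "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;

        proxy_pass http://application_backend;
    }
}
```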
IP filtering
Filter out requests by IP range (for example, based on GeoIP databases), user agent, URL, or any combination of those.
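A sketch of this approach in nginx, using the geo and map directives inside the http block; the networks and user-agent patterns are illustrative only:

```nginx
# Flag requests coming from specific networks (documentation-only ranges).
geo $blocked_network {
    default         0;
    203.0.113.0/24  1;
    198.51.100.0/24 1;
}

# Flag common non-browser user agents (patterns are examples).
map $http_user_agent $blocked_agent {
    default                            0;
    "~*(scrapy|python-requests|curl)"  1;
}

server {
    listen      80;
    server_name example.com;

    if ($blocked_network) { return 403; }
    if ($blocked_agent)   { return 403; }

    location / {
        proxy_pass http://application_backend;
    }
}
```

With the ngx_http_geoip2 module, the same pattern can key on country codes instead of hard-coded CIDR ranges.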
AWS WAF
AWS WAF is a web application firewall that provides managed rule sets for blocking suspicious and unwanted traffic.
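As a rough sketch, a rule in a web ACL that attaches one of the AWS managed rule groups can be expressed in the WAF JSON rule format roughly as follows; the rule name, priority, and metric name are placeholders:

```json
{
  "Name": "managed-common-rules",
  "Priority": 0,
  "Statement": {
    "ManagedRuleGroupStatement": {
      "VendorName": "AWS",
      "Name": "AWSManagedRulesCommonRuleSet"
    }
  },
  "OverrideAction": { "None": {} },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "managed-common-rules"
  }
}
```

AWS also offers a dedicated bot control managed rule group (AWSManagedRulesBotControlRuleSet) that can be attached in the same way.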
Advanced bot management services
The most efficient, dynamic, and feature-rich option is a dedicated bot management service such as Akamai or Cloudflare. These services let you granularly control not only official bots (for example, social networks and commercial LLMs) but also unofficial ones: crawlers, scanners, and malicious bots.
The problem with nginx-based solutions
nginx-based approaches require a custom nginx configuration, which means creating a custom docker/sdk branch and using it in the cloud environment. This moves the responsibility for maintaining that branch (updates, upgrades, and so on) from Spryker to the partner or customer.