
Robot Wars: A Brief History of Bad Bots

Ben Davey

8 February 2022


Bots have existed since the dawn of the internet. They vary from simple scripts to complex AI representations of real people. In many ways, they are the ideal employees; they can execute fast and repetitive tasks without ever complaining or asking for a pay rise.

It’s for this reason that technology companies employ bots in abundance for a wide variety of repetitive tasks. They can be scripts that look for unauthorised use of copyrighted content, provide up-to-date news or weather information, or crawl the internet for search engines. Googlebot, for example, crawls and indexes the web so that Google can produce categorised search results and targeted advertisements.

As useful as these bots are, the focus here will be on their more nefarious cousins, bad bots. The Imperva 2021 Bad Bot Report* estimates that a whopping 25.6% of internet traffic can be attributed to bad bots, while 15.2% comes from good bots. These figures have likely already grown since the time of writing and will continue to grow, perhaps sharply, as fraudsters look for new ways to exploit the power of automation at scale. Human-initiated internet traffic will become a smaller and smaller percentage.

Failing to understand and control bad bots therefore has a huge impact on business continuity. It degrades customer experience by slowing things down, burdens good customers with additional authentication steps, and lets fraudsters automate the testing and commission of fraud against consumers.

By delving into how bad bots have evolved over time, we can better assess where they are now, where they are heading, and how companies can mitigate their impact.

1st Generation - cURL Days. Keep It Simple, Stupid

  • Bots were in-house scripts using simple cURL HTTP requests to scrape websites, spam forms and test stolen card credentials.
  • They operated from a fixed set of locations, often servers rather than desktops.
  • They were generally easy to identify via user-agent mismatches and IP velocities: traffic came from data centres via a limited number of proxy IPs making thousands of requests (see the detection sketch after this list).
  • They could not maintain cookies or execute JavaScript, so they failed basic website challenges.
  • With their limited spread of host locations and low sophistication, simple WAF and CDN mitigation strategies based on IP blocking and user-agent discrepancies could easily identify these bots.
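To make those first-generation signals concrete, here is a minimal, hypothetical sketch in Python of the kind of rule a WAF or CDN layer could apply: a crude user-agent sanity check combined with a per-IP request-velocity threshold. The thresholds, token lists and function names are illustrative assumptions, not any particular vendor’s implementation.

```python
# Hypothetical 1st-generation bot check: user-agent sanity + IP velocity.
from collections import defaultdict, deque
import time

VELOCITY_WINDOW_SECS = 60          # look-back window (illustrative)
MAX_REQUESTS_PER_WINDOW = 300      # per-IP ceiling (illustrative)
BROWSER_TOKENS = ("Mozilla/", "Chrome/", "Safari/", "Firefox/")

recent_requests = defaultdict(deque)   # ip -> timestamps of recent requests


def is_suspicious(ip, user_agent, now=None):
    """Return True if a request looks like a simple scripted bot."""
    now = now if now is not None else time.time()

    # Signal 1: user agent is missing, a known CLI tool, or not browser-like.
    ua = (user_agent or "").strip()
    if not ua or ua.lower().startswith(("curl", "python-requests", "wget")):
        return True
    if not any(token in ua for token in BROWSER_TOKENS):
        return True

    # Signal 2: too many requests from one IP inside the velocity window.
    window = recent_requests[ip]
    window.append(now)
    while window and now - window[0] > VELOCITY_WINDOW_SECS:
        window.popleft()
    return len(window) > MAX_REQUESTS_PER_WINDOW


print(is_suspicious("203.0.113.7", "curl/7.85.0"))                   # True
print(is_suspicious("203.0.113.8", "Mozilla/5.0 (X11) Chrome/96.0")) # False
```

Rules of this sort were enough against first-generation bots precisely because the traffic came from a handful of data-centre IPs with non-browser user agents.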

2nd Generation – Spiders & Web Creepy-Crawlers

  • Bots evolved into dedicated website-crawling frameworks, often called ‘web crawlers’, built to extract website data.
  • Apache Nutch and Scrapy are two common examples: Nutch is built on the Hadoop distributed framework, while Scrapy is an open-source framework written in Python (see the sketch after this list).
  • These web crawlers remain relatively easy to detect due to their inability to execute JavaScript, their failure of iframe-based challenges, and high request velocities with no purchases.
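As an illustration of how little effort a second-generation crawler requires, below is a minimal sketch of a Scrapy spider (Scrapy is named above; the target URL and CSS selectors are placeholders). Note that it parses raw HTML only: its inability to execute JavaScript is exactly what keeps this generation detectable.

```python
# Minimal Scrapy spider sketch; domain and selectors are placeholders.
import scrapy


class PriceSpider(scrapy.Spider):
    name = "price_spider"
    start_urls = ["https://example.com/products"]   # placeholder target

    def parse(self, response):
        # Extract structured data from the raw HTML only; anything rendered
        # client-side by JavaScript is invisible to this spider.
        for product in response.css("div.product"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # Follow pagination links and keep crawling.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```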

3rd Generation – Low and Slow. The Headless Horseman

  • These bots are full-blown browsers that are ‘headless’, lacking a graphical interface. Early examples such as PhantomJS and CasperJS have since been surpassed by mainstream browsers, with later versions of Chrome and Firefox (driven via Selenium) offering a headless mode (see the sketch after this list). Although these are legitimate technologies, in this instance they can be used fraudulently to support automated bot functions. Puppeteer, a Node.js library providing API control over Chrome, is very popular with bot providers.
  • Unlike previous generations, these bots can maintain cookies and execute JavaScript, a direct evolutionary consequence of the growing use of JavaScript challenges on websites.
  • They operate in a low and slow fashion using a multitude of IP addresses, making the basic velocity rules used for previous generations virtually obsolete.
  • These bots are typically used for denial-of-service attacks, scraping, form spam and ad fraud.
  • They can be identified with device and browser data, by identifying specific JavaScript conditions, iframe manipulation, sessions, virtual machines, and cookies. Behavioral analysis such as mouse and keyboard interactions can also be useful to detect these bots.
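The sketch below, assuming Selenium driving Chrome in headless mode (both mentioned above), shows why this generation defeats basic JavaScript challenges: the bot is a real browser that runs scripts and keeps cookies. The final check also illustrates one browser-level signal defenders probe for, the navigator.webdriver flag exposed by automation-controlled browsers. The target URL is a placeholder.

```python
# 3rd-generation sketch: a real Chrome browser without a graphical interface.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")   # run without a graphical interface

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/login")   # placeholder target

    # JavaScript executes and cookies persist, unlike 1st/2nd-gen scripts.
    page_title = driver.execute_script("return document.title;")
    session_cookies = driver.get_cookies()

    # One of the JavaScript conditions defenders can probe for:
    is_automated = driver.execute_script("return navigator.webdriver;")
    print(page_title, len(session_cookies), is_automated)
finally:
    driver.quit()
```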

4th Generation - Bots Evolved. Clever HALs (2001: A Space Odyssey)

  • The latest generation of bots is very hard to distinguish from human interactions. These bots can move the mouse in a random way rather than in the traditional straight lines of a typical bot, and can change user-agent strings while rotating through tens of thousands of IP addresses. Mobile emulators mean that this is not restricted to browser-based events.
  • These bots are also capable of hijacking behavior from genuine users, recording swipe and mouse patterns, how long people hover over an icon and how hard they press it, essentially mimicking genuine user behavior in full.
  • From an audio and visual perspective, deepfakes represent an emerging fraud trend; witness a recent example where a CEO was duped into a $243,000 transfer via a voice deepfake.** We will see a boom in digital fingerprinting as a solution, analysing metadata and hash values for any sign of third-party tampering.
  • In terms of mitigation, using behavior alone will generate too many false positives. API protection, deep packet inspection in the cloud, real-time behavior analysis and an edge control policy are all key. Analysis needs to span the entire journey of the user.
  • From a machine learning perspective, understanding and modelling intent becomes key to distinguishing these advanced bots, and advanced supervised, unsupervised and semi-supervised techniques are required (see the sketch after this list).
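As a hedged illustration of the unsupervised side of that modelling, the sketch below scores session-level behavioural features with scikit-learn’s IsolationForest. Every feature name and value here is an illustrative assumption; in practice such a model would sit alongside supervised and semi-supervised techniques and feed a broader intent model rather than decide on its own.

```python
# Illustrative anomaly scoring of behavioural session features.
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row is one session: [mean mouse speed (px/s), mouse-path curvature,
# keystroke-interval variance (ms), dwell time on page (s), requests/minute]
training_sessions = np.array([
    [420.0, 0.31, 120.0, 45.0, 6.0],
    [380.0, 0.27, 140.0, 52.0, 5.0],
    [450.0, 0.35,  95.0, 38.0, 7.0],
    [400.0, 0.29, 110.0, 60.0, 4.0],
])  # stand-in for a large corpus of known-good sessions

model = IsolationForest(contamination=0.01, random_state=42)
model.fit(training_sessions)

# A replayed/emulated session: human-like on the surface, but too regular
# and too fast to be plausible.
candidate = np.array([[410.0, 0.02, 5.0, 44.0, 40.0]])
print(model.predict(candidate))            # -1 => anomalous, 1 => inlier
print(model.decision_function(candidate))  # lower => more anomalous
```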

Evolving Bot Defences – Enter the Next Generation of Detection, Prevention and Control

Despite the neat generational history above, bots have not evolved in a simple linear fashion. It’s not effective to tackle only the latest breed of bots, because fraudsters deploy different techniques, in different ways, for different purposes; all generations need to be combatted at the same time. New and advanced security and fraud solutions should therefore consider the following capabilities:

  1. Holistic AI deployed across all networks, feeding behavioral and decision models. All data should contribute to configurable and dynamic decisions. Feature generation as code (see the sketch after this list) and adaptive analytics should be key components. The ability to constantly risk-assess events and compare them to past events, whether for individuals or peer groups, should be built into the product.
  2. Hybrid deployment options, whether via a vendor-controlled cloud, a customer-controlled private cloud or on-premise infrastructure, are prerequisites for a fast and evolving deployment.
  3. A reverse proxy sitting beside the CDN provides the ability to assess all traffic in real time, rather than just point-in-time transactions. The reverse proxy also keeps the bots on client/vendor-controlled turf, allowing constant assessment and dynamic treatment strategies with no impact on legitimate customers.
  4. Deep packet-to-person inspection, sliced and compartmentalised: each slice acts as its own virtual network, passing packets back and forth. The system needs to look at the data inside these packets in real time, allowing only legitimate traffic to cross the network layers. Simply inspecting packet headers is no longer sufficient.
  5. Risk-assessing one attack at a time is no longer enough. The system needs real-time decisions and behavioral signals to allow assessment of the whole session, providing contextual analysis from previous sessions.
  6. It needs to operate at the network edge, moving resources closer to the point of access. Having a single secured perimeter is not adequate; every point in the network must become a control point.
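As a hypothetical illustration of ‘feature generation as code’ (capability 1 above), the sketch below declares risk features as plain Python functions over the current event and an individual’s history, so new signals can be versioned, reviewed and deployed like any other code. The field names, features and example values are assumptions for illustration only.

```python
# Illustrative "feature generation as code": features are declared functions.
from statistics import mean

FEATURES = {}


def feature(name):
    """Register a feature function under a readable name."""
    def register(fn):
        FEATURES[name] = fn
        return fn
    return register


@feature("amount_vs_user_average")
def amount_vs_user_average(event, history):
    # Ratio of this payment amount to the individual's historical average.
    past = [e["amount"] for e in history] or [event["amount"]]
    return event["amount"] / mean(past)


@feature("new_device_for_user")
def new_device_for_user(event, history):
    # 1.0 if this device fingerprint has never been seen for this user.
    return float(event["device_id"] not in {e["device_id"] for e in history})


def generate_features(event, history):
    """Evaluate every registered feature for one event in the session."""
    return {name: fn(event, history) for name, fn in FEATURES.items()}


history = [{"amount": 20.0, "device_id": "dev-1"},
           {"amount": 25.0, "device_id": "dev-1"}]
event = {"amount": 480.0, "device_id": "dev-9"}
print(generate_features(event, history))
# e.g. {'amount_vs_user_average': 21.33..., 'new_device_for_user': 1.0}
```

Because the features are ordinary code, they can feed the behavioural and decision models described above and be adapted as quickly as the bots themselves evolve.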

Assessing the Impact of the New Robot Wars Era

We will see further developments in bot evolution: bots-as-a-service will facilitate the continued de-skilling of fraudsters and hackers, removing even more barriers to entry, and social media will continue to entice the young. Like fraud vendors, bot developers now price their subscriptions around successful events only. The bot arms race will accelerate over the next decade, and there will be a paradoxical shift from roughly 1 in 100 transactions being bad to only 1 in 1,000 being legitimate. Bad bots will impact businesses by lowering SEO rankings, destroying customer trust, skewing analytics, disrupting revenue flows and leaving behind a ruined reputation. Only companies with the strongest and most secure defences will survive the bad bot onslaught.

About Darwinium

Darwinium is a Digital Risk AI solution built to be more agile than the adversaries attacking it. Future-proofed to protect against tomorrow’s risks, today. In short, Darwinium is digital risk transformed, web security enhanced and customer experience optimized, giving business control over every type of bot.

* https://www.imperva.com/blog/bad-bot-report-2021-the-pandemic-of-the-internet/

** https://www.forbes.com/sites/jessedamiani/2019/09/03/a-voice-deepfake-was-used-to-scam-a-ceo-out-of-243000/?sh=4da6e2dd2241
