SpikeStrip is a module for Apache 2.x that implements a new form of content access control for online social networks. It enables identification, tracking, and rate limiting of rogue crawlers. SpikeStrip encrypts the hyperlinks in each page that Apache serves, using the current browser's session key and a server-side secret key. When a request arrives for an encrypted URL, the URL is decrypted and the request is served as normal. This link encryption procedure creates a unique "view" of the protected website for each browser.
Because each "view" is unique to a specific browsing session, SpikeStrip can track each client's requests with 100% accuracy. If crawlers attempt to evade detection by changing their session keys, their enqueued URLs are invalidated because they no longer decrypt properly. This forces crawlers to restart their traversal from scratch, effectively defeating them. Furthermore, SpikeStrip significantly hinders stealthy, distributed crawlers. Binding URLs to individual sessions prevents crawling machines from effectively sharing URLs and state, which are key coordination operations for distributed crawlers.
SpikeStrip uses its precise client-tracking capabilities to rate-limit HTTP requests on a per-session basis. This approach is superior to IP-based tracking because it can distinguish between users who are behind NATs and proxies. SpikeStrip's rate limiting prevents rogue crawlers from indexing the protected website's content in a timely fashion.
The protection offered by SpikeStrip is fully configurable via the Apache configuration file. The IP addresses and hostnames of known good crawlers, such as Googlebot, can be whitelisted so that they will not be rate-limited. This allows websites to still be indexed by search engines. The URLs that SpikeStrip encrypts are also configurable through regular expressions. This lets website administrators determine precisely which portions of their site receive SpikeStrip protection.
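The two configuration decisions described above, whether a client is exempt from rate limiting and whether a URL should be encrypted, can be sketched in Python. This does not reproduce SpikeStrip's configuration syntax, which is not given here; the names WHITELISTED_NETWORKS, PROTECTED_PATTERNS, is_whitelisted, and should_encrypt are hypothetical, and the addresses and patterns are placeholder values. Hostname whitelisting is omitted for brevity.

```python
import ipaddress
import re

# Placeholder whitelist: a reserved documentation range, not a real crawler's network.
WHITELISTED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

# Placeholder patterns selecting which URL paths receive link encryption.
PROTECTED_PATTERNS = [
    re.compile(r"^/profile/"),
    re.compile(r"^/photos/\d+$"),
]


def is_whitelisted(client_ip: str) -> bool:
    """Return True if the client address falls inside a whitelisted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in WHITELISTED_NETWORKS)


def should_encrypt(url_path: str) -> bool:
    """Return True if the path matches any protected-URL pattern."""
    return any(pattern.search(url_path) for pattern in PROTECTED_PATTERNS)
```

Under these placeholder settings, a request from 192.0.2.10 would bypass rate limiting, and links to /profile/alice would be encrypted while links to /static/site.css would be left as they are.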
SpikeStrip is designed for high efficiency. In most cases, SpikeStrip imposes only a 7% CPU performance penalty on protected web servers and uses less than 30 megabytes of RAM.