There has been a significant uptick in the online privacy discussion of late, with industry heavyweights in the crosshairs of users as well as lawmakers. A very important and often under-discussed element of that conversation is anonymity – and it’s telling that many of those driving the discussion focus on the methods by which users abdicate their anonymity rather than the protection of said anonymity.
Consider your private data as a treasure trove, and the current conversation as plans to construct thick walls around it. Consent laws and the like fit this mold: they constitute efforts to gatekeep your data. However, the issue with constructing walls and gates is that they can be breached, easily. Case in point: Equifax.
The real solution to online privacy is clear: your treasure trove should be empty. Your identifying data should not be at the center of any corporate entity’s data castle. While this sounds radical, there is a very real technological compromise here – anonymity must be part of the data science of every tech company. The broad claim is simple: uniqueness, not identity, represents the bulk of your data’s value for the purposes of marketing.
As such, I am not proposing an end to corporate entities tailoring marketing towards your footprint, but rather an end to corporate entities tailoring marketing towards your identity.
When people learn that I work in ad tech, I like to present them with a simple scenario to explain my interest in anonymous methods. I ask them: “You’re not bothered by an ad tech company using user data to learn that people in general buy shoes on Tuesday nights, but you’re not a fan of them knowing that you like to buy shoes on Tuesday nights, correct?” Nearly everyone I have asked this of has said yes, some quite emphatically.
Marketing as traditionally defined (demographics and statistics about the market) is understandable to everyone as a necessary component of doing business. Intrusive, personal-level data collection is not. This underscores the difference between user data for marketing and user data for stalking: both can represent marketing value, but only one crosses the line into what individuals consider invasive.
Can this be done? Can we get market data of importance to a business and yet allow the users of our services to remain anonymous?
Absolutely, yes. I will illustrate this with a very common statistic of importance in marketing: the number of unique users. When we deal with user visits to a website or users that viewed an ad online, we are talking about millions of actions per hour, if not more. This kind of data stream is the hallmark of the Big Data age: lots of data streaming through your production system. Calculating the exact number of unique items (cookies, IPs, etc.) seen in a day or week means remembering every one of them, which can require more computer memory than is available. Enter a data streaming algorithm called HyperLogLog.
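To make the cost of exact counting concrete, here is a minimal Python sketch of the naive approach. The function name and the synthetic `cookie-N` identifiers are assumptions made up for illustration; they do not come from any real system.

```python
def exact_unique_count(stream):
    """Count uniques exactly by remembering every identifier ever seen."""
    seen = set()
    for user_id in stream:
        seen.add(user_id)  # memory grows with the number of distinct users
    return len(seen), seen


# Synthetic traffic: 300,000 visits from 100,000 distinct (hypothetical) cookies.
visits = (f"cookie-{i % 100_000}" for i in range(300_000))
count, seen = exact_unique_count(visits)

print(count)
# The set still holds every raw identifier, so anyone with access to it
# can check whether a specific cookie was present.
print("cookie-42" in seen)
```

The exact count needs memory proportional to the number of distinct users, and the stored set is itself a list of who was seen.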
Data streaming algorithms like HyperLogLog handle the fire hose of data in today’s world in a clever way. They refuse to memorize all of the data, which lets them calculate a statistic of the data efficiently. So instead of keeping track of all the unique things it sees in a data stream, HyperLogLog keeps a small, fixed-size summary of what it has seen. This small summary is the reason that HyperLogLog can do two seemingly impossible things at the same time. It estimates the number of uniques seen to within a small, tunable margin of error, but it cannot tell you who it has seen!
This is accomplished by hashing your identity into a random-looking number and throwing it on a metaphorical dartboard. The algorithm does not keep the individual identity; the “identity” survives only as a possible high-water mark within the algorithm. If the algorithm has already seen a higher example than you, it does not even use your hashed data.
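The dartboard idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the implementation of any particular library: the class name, the choice of SHA-256 truncated to 64 bits, and the default of 2^12 registers are all picked here for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                # number of registers ("dartboard" cells)
        self.registers = [0] * self.m  # each cell keeps only its high-water mark

    def add(self, item):
        # Deterministic 64-bit hash of the identifier (assumed hash choice).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        bits = 64 - self.p
        idx = h >> bits                # first p bits pick the register
        rest = h & ((1 << bits) - 1)   # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = bits - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank  # otherwise the item leaves no trace

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# 200,000 visits from 50,000 distinct (hypothetical) cookies.
hll = HyperLogLog()
for visit in range(200_000):
    hll.add(f"cookie-{visit % 50_000}")
print(round(hll.estimate()))
```

The sketch stores 4,096 small integers no matter how many visits it processes. With m registers the standard error is roughly 1.04/√m, so about 1.6% at this size; more registers buy more accuracy.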
HyperLogLog is extremely useful for data science and marketing. For example, you can count the number of unique users who showed up at a particular website across a particular time frame. You will not, however, know who they were – no identities, simply anonymous, unique data points. You can then compare that set with the unique users who showed up at the same website on a different time frame, say a week after the original.
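One way to compare two time frames is to merge their sketches and apply inclusion-exclusion. The following is a minimal sketch of that idea, assuming a toy register layout (2^12 registers, SHA-256 truncated to 64 bits) and made-up `user-N` identifiers; none of these names come from a real library.

```python
import hashlib
import math

P = 12        # assumed precision: 2**12 registers
M = 1 << P


def sketch(items):
    """Build HyperLogLog registers for an iterable of identifiers."""
    regs = [0] * M
    bits = 64 - P
    for item in items:
        h = int.from_bytes(hashlib.sha256(str(item).encode("utf-8")).digest()[:8], "big")
        idx, rest = h >> bits, h & ((1 << bits) - 1)
        regs[idx] = max(regs[idx], bits - rest.bit_length() + 1)
    return regs


def estimate(regs):
    """Estimate the number of distinct identifiers behind the registers."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M ** 2 / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)  # small-range correction
    return raw


def merge(a, b):
    """Register-wise maximum: the sketch of the union of both streams."""
    return [max(x, y) for x, y in zip(a, b)]


# Two synthetic weeks that share 20,000 of their 60,000 users each.
week1 = sketch(f"user-{i}" for i in range(0, 60_000))
week2 = sketch(f"user-{i}" for i in range(40_000, 100_000))

union = estimate(merge(week1, week2))
# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
returning = estimate(week1) + estimate(week2) - union

print(round(union), round(returning))
```

The overlap figure inherits the error of three separate estimates, so it is least reliable when the overlap is small relative to the two sets.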
Despite enforcing full user anonymity, you as a website owner get the information that is necessary for the function and marketing of your site. Anonymity makes demographics and marketing possible while preserving uniqueness; HyperLogLog represents a uniqueness, not identity, algorithm.
Ad tech has always seen one big hole as an inevitable and negative part of the industry: the unknown identity. We often think of anonymity as representing holes in our knowledge. However, a hole is not necessarily a weakness. Just as other industries have figured out how to use (literal rather than figurative) holes to their advantage (think colanders, or flyswatters, or even performance car air intakes), so too can and should the ad tech space.
Consider how Facebook famously builds identities around the “hole” represented by an individual who does not use their platform. This still requires the use of identity to fill in the assumptions around a hole, but represents, to an extent, a taste of the power of anonymity in web data. It just fails to protect privacy, as it still relies on Facebook’s walled castle full of private data in order to create and market to identities.
This use of HyperLogLog, comparing unique match rates between two data streams, is not unknown, but it is hardly used. That is an enormous wasted opportunity to use anonymity to drive marketing insights. My colleague, Peter Martel, has released an open source HyperLogLog library that makes this easier to do. Besides the functionality and algorithms available in other HyperLogLog packages, this library provides a function not commonly implemented in these libraries: it allows you to save the “sketch” of the data stream. Comparing sketches, rather than doing a single-shot calculation as is commonly done, is the greater value of HyperLogLog.
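The library's own API is not shown here, so the following is only a hypothetical illustration in plain Python of what saving a sketch can amount to. The function names (`sketch`, `save`, `load`, `merge`), the one-byte-per-register format, and the synthetic identifiers are all assumptions made for this example.

```python
import hashlib

P = 12  # assumed precision: 2**12 registers


def sketch(items):
    """Build toy HyperLogLog registers for an iterable of identifiers."""
    bits = 64 - P
    regs = [0] * (1 << P)
    for item in items:
        h = int.from_bytes(hashlib.sha256(str(item).encode("utf-8")).digest()[:8], "big")
        idx, rest = h >> bits, h & ((1 << bits) - 1)
        regs[idx] = max(regs[idx], bits - rest.bit_length() + 1)
    return regs


def save(regs):
    """Serialize a sketch; every register value fits in one byte."""
    return bytes(regs)


def load(blob):
    """Restore a sketch from its serialized form."""
    return list(blob)


def merge(a, b):
    """Register-wise maximum: the sketch of both streams combined."""
    return [max(x, y) for x, y in zip(a, b)]


# Each day's stream is reduced to a 4 KB blob; the raw identifiers are discarded.
monday = save(sketch(f"user-{i}" for i in range(0, 30_000)))
tuesday = save(sketch(f"user-{i}" for i in range(20_000, 50_000)))

# Later, stored sketches can be reloaded and combined without the original data.
combined = merge(load(monday), load(tuesday))
print(len(monday), len(combined))
```

Because merging is a register-wise maximum, the combined sketch is identical to one built from both days' raw streams together, which is what makes stored sketches comparable after the fact.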
There are two strategic moves that the industry should make in order to make space for anonymity in ad tech. First: a communication channel to audiences. Marketing science is already mostly a demographics play anyway; this was the DNA of advertising before the Internet and identifiable individuals, and it continues to be at its core.
The second move relates to the kind of data science that generates new inventions. You should agree at this point that we can find new algorithms and technologies to increase the use of anonymity in ad tech. And this sort of technology is not a future project or hope. HyperLogLog exists today. Swoop has started to use it to provide deeper insights into the number of unique users that are seen during an ad campaign. We have open sourced our software for doing so. Our HyperLogLog library is our small contribution to expanding the use of anonymity in ad tech, but we believe it is a big example of our dedication to this expansion.