There’s an interesting paradox that arises when you try to protect your
anonymity online. Let’s say that you want to avoid online tracking, so you set the
“do not track” bit and off you go. Anyone who respects Do Not Track will leave you alone, but anyone who does not can suddenly track you much more easily. Paradoxically, cutting out tracking by the ethical advertisers who respect DNT has made your targeted-advertising situation worse overall.
In order to understand why that is, we need to lay a bit of foundation. Our
assumed goal is to minimize the harm from targeted advertising. The more accurately advertisers can target you, the worse off you are, because you are more likely to be influenced by the ads. Being vaguely identifiable to a lot of advertisers means that an advertiser occasionally gets lucky and shows you something which really resonates with you. Being specifically identifiable to a few advertisers means being exposed, on a fairly regular basis, to advertisements tailored to be persuasive to you in particular.
The problem lies in the way that you are tracked online. Advertisers can keep track of your browsing in several ways. Perhaps the simplest is to ask you to identify yourself. This is done by sending your browser a cookie (a small piece of arbitrary data) for their website. Then, when they encounter a new browser, they ask whether it already holds one of their cookies, and that cookie can then be used to identify you. If you don’t want to be identified, it’s as simple as not sending the cookie back. There’s an incentive to track people even when they do not want to be tracked, though, so if you don’t play nice with the cookie they move on to more sophisticated methods.
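To make the cookie mechanism concrete, here is a minimal sketch of what an embedded tracking script might do, written from the tracker’s point of view. The cookie name and one-year lifetime are made up for illustration.

```typescript
// Hypothetical sketch of cookie-based identification by an embedded script.
// The cookie name "track_id" and its lifetime are illustrative.
function getTrackingId(): string {
  // If the browser already holds our cookie, it has identified itself.
  const match = document.cookie.match(/(?:^|;\s*)track_id=([^;]+)/);
  if (match) {
    return match[1];
  }
  // Otherwise mint a fresh ID and ask the browser to store it for next time.
  const id = crypto.randomUUID();
  document.cookie = `track_id=${id}; max-age=31536000; path=/`;
  return id;
}
```

Declining to send the cookie back just means the script mints a fresh ID on every visit, which is exactly why trackers move on to the signals described next.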
Every time you connect to a website, you send all sorts of information. Just to
initiate the connection, you need to provide your IP address. Otherwise, the web
server wouldn’t even be able to send you back the page you wanted to see. You
provide something called a user agent string, which identifies your web browser and operating system (among other things) so that the website can do things like offer you download links to software versions that will work on your computer. Once you’re on the page, JavaScript needs access to your screen resolution and the fonts you have installed so that the content on the page can be displayed properly. The page can check whether you have a webcam installed and what features it supports, and whether you have Flash or HTML5 support.
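As a rough sketch of the kinds of signals involved, the following shows a handful of values a page can read through standard browser APIs; real fingerprinting scripts collect far more (installed fonts, canvas rendering quirks, media devices, and so on).

```typescript
// A few of the signals a page can read via standard browser APIs.
// Real fingerprinting scripts gather many more.
interface BrowserSignals {
  userAgent: string;          // browser and operating system
  screenSize: string;         // display resolution
  colorDepth: number;
  timezone: string;
  language: string;
  doNotTrack: string | null;  // the Do Not Track setting itself
}

function collectSignals(): BrowserSignals {
  return {
    userAgent: navigator.userAgent,
    screenSize: `${screen.width}x${screen.height}`,
    colorDepth: screen.colorDepth,
    timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
    language: navigator.language,
    doNotTrack: navigator.doNotTrack,
  };
}
```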
Sometimes information is leaked unintentionally as well. For example, web browsers have a feature where they color hyperlinks differently depending on whether or not you’ve visited the page they link to. This is extremely helpful if you are trying to navigate a page with a lot of links and keep track of what you have seen and what you haven’t. It turned out that this coloring information was discoverable by the page itself, which led to websites essentially being able to query your browser history by rendering a bunch of links to popular websites and checking which ones picked up the visited color.
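The attack looked roughly like the sketch below: render a link, then ask the browser what color it drew. Modern browsers deliberately lie about the computed style of visited links, so this no longer works; the code is only meant to illustrate the leak, and the color value shown is the traditional default visited purple rather than anything universal.

```typescript
// Sketch of the classic :visited history-sniffing trick (now mitigated:
// browsers report the unvisited style regardless of history).
function probablyVisited(url: string): boolean {
  const link = document.createElement("a");
  link.href = url;
  document.body.appendChild(link);
  const color = getComputedStyle(link).color;
  document.body.removeChild(link);
  // rgb(85, 26, 139) is the traditional default color for visited links.
  return color === "rgb(85, 26, 139)";
}

// Probe the history against a list of popular sites.
const visitedGuesses = ["https://example.com/", "https://en.wikipedia.org/"]
  .filter(probablyVisited);
```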
All of these things individually are features that are shared by a large
number of people. Pretty much everyone has approximately the same set
of fonts installed, and uses one of a fairly modest set of browser versions.
Taken in aggregate, we call all of these signals a “browser fingerprint,” and
in total they can be used to pretty effectively and uniquely identify you.
Where Do Not Track comes in is that it is one more signal that can be used as part of a fingerprint. Since most people don’t set the Do Not Track bit, setting it makes you more trackable, because the tracker can rule out any of their profiles that don’t have that bit set while trying to identify you.
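One plausible way to turn those individual signals into a single identifier is simply to serialize and hash them, as in the sketch below. The hashing scheme is illustrative rather than any particular tracker’s algorithm; it could take the output of the collectSignals sketch above, Do Not Track bit included.

```typescript
// Fold a bag of signals into one stable identifier by hashing their
// serialized form. Illustrative only; real trackers use fuzzier matching.
async function fingerprint(signals: Record<string, unknown>): Promise<string> {
  const bytes = new TextEncoder().encode(JSON.stringify(signals));
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}
```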
We can do some more formal analysis of how much more trackable it makes us
by thinking about our anonymity as a proportion of the total population.
There are approximately 7 billion people on the planet. Around half of them are online, but those that are often use multiple devices, so we can call it a wash and say there are 7 billion browsing devices to track. The smallest unit of information is a single bit, which can take a value of either 0 or 1. If we provide each person on the planet with their own ID number, we would need ⌈log₂ 7,000,000,000⌉ = 33 bits to
do so. As a standard measure of how much anonymity a particular piece of
information strips away, we can think about how many bits of that ID number it
could determine. If half the planet uses Firefox, and the other half uses
Chrome, then knowing which browser a person is using strips away 50%
of their anonymity, or 1 bit. Doing something really outside of the norm, like installing a bunch of rare, curated fonts or using a nonstandard-size browser window, will blow away essentially all of your bits of anonymity, since the pool of people who share your exact setup is so small. Using a version of a browser that is too old, or too up to date, might only cost a fraction of a bit.
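The arithmetic is simple enough to write down directly; the sketch below just restates the rule that a trait shared by a fraction p of the population reveals log₂(1/p) bits of identifying information.

```typescript
// Bits of anonymity revealed by a trait shared by a fraction p of the population.
function bitsRevealed(p: number): number {
  return Math.log2(1 / p);
}

const idBitsNeeded = Math.ceil(Math.log2(7_000_000_000)); // 33 bits for a unique ID
const browserSplitBits = bitsRevealed(0.5);               // 1 bit if half use Firefox
```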
Why not just compare percentages directly, instead of converting into bits first?
The benefit we get from the bits of anonymity metric is that we can compare the
privacy implications of different actions across different spaces. If you are in a small community, you have far fewer bits of anonymity to spend, so something that reduces you by a couple of bits really hurts. Being the one person on a niche community forum who uses an obscure browser will uniquely identify you there, but you can still do the analysis of how many bits it costs by using usage figures pulled from the global population.
We can use this metric to compare decisions that we might make related to our
online anonymity as well. If 90% of users don’t use Do Not Track, then setting it costs log₂(1∕0.1) ≊ 3.3 bits of anonymity. If you’re using desktop Chrome, that costs ≈ 0.73 bits, since Chrome has around 60% desktop market share. Safari has 17%, so that’s ≈ 2.5 bits.
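Plugging the figures quoted above into the bitsRevealed sketch from earlier reproduces those numbers.

```typescript
bitsRevealed(0.10); // setting Do Not Track (~10% of users): ~3.32 bits
bitsRevealed(0.60); // desktop Chrome (~60% share):          ~0.74 bits
bitsRevealed(0.17); // Safari (~17% share):                  ~2.56 bits
```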
If you truly want to be anonymous online, then you have two options. One is an entirely default configuration: the most standard browser size and version, the most general IP address you can use (which probably means a popular VPN), no extra fonts, a clean, unmodified browser with no history or cookies, and so on. The other is to hide yourself so well, by some other means, that you cannot be fingerprinted at all. The risk is that if you don’t get it exactly right, whatever unusual configuration choices you’ve made will uniquely identify you, stripping away every last bit of anonymity you have.
Unless you’re very confident in the success of your scheme for hiding from
advertisers, think about how many bits of anonymity it strips away if it goes
wrong, and whether that’s more bits than you’d lose just by using a default
setup.