I lie, you lie, everybody lies: WiFi tracking in the era of MAC randomization

Over the last few years, many services have appeared that use the WiFi signals emitted by smartphones to monitor the presence of people. The leading application of such services has been in the retail analytics sector, where they are widely used to monitor how many people pass in front of a store, how many enter it, how long they stay, how often they come back, and so on.

Figure 1: Retail customer model (courtesy of RetailerIN)

To better understand their use in retail, we need to take a small step back. Traditionally, retailers have measured the number of people entering their store. First they used passive infrared technology, then they moved to cameras (stereo, 3D, thermal... you name it). Typically such cameras hang close to the entrance door and can tell how many people enter the store and how many exit it. Yet they can’t measure how long people stay, and they can’t measure how often people come back to the store. These metrics are highly relevant for retail stores, in that they act as proxies of engagement and loyalty, respectively. And engagement and loyalty correlate strongly with sales.

This is where WiFi kicks in. It is still kind of a secret (at least for laypeople), but if you have WiFi turned on, your presence can be tracked. And no, I didn’t say you are connected to a WiFi network: you are just walking to the office with the phone in your pocket and the WiFi turned on. There is indeed one part of the WiFi standard that allows WiFi-enabled devices to advertise their presence. Technically, this procedure is termed active scanning, and the messages used are called probe request frames. Think of it as your phone shouting “Hi, I am here!”. This was introduced in the standard to enable phones to connect faster to a known WiFi network (one you have connected to in the past and that is in your list of favourite networks). It works like this (more or less):

  • Your phone: “Hi, I am here!”
  • A known WiFi network’s router: “Hey dude, welcome back, do you wanna connect to the Internet?”
  • Your phone: “Yeah, why not?”
  • And then you are basically online…

The point is that when sending out hello messages your phone is signing them with a ‘name’ that uniquely identifies it (formally: the MAC address of the WiFi interface). Think of it as kind of your fiscal code or (if you are based in the US) your social security number. And as no encryption is used here, anybody can see it. Which means that anybody with a tracker device can track your presence without you noticing it. No NSA-like gear is needed: a standard laptop running the right software will do, and there are plenty of options.
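To make this concrete, here is a minimal sketch of how a tracker could turn raw sightings into the retail metrics mentioned earlier (dwell time and return visits). Everything in it is an assumption for illustration: the sample MAC addresses and timestamps are made up, and the 30-minute gap used to separate two visits is an arbitrary choice, not a value taken from any real product.

```python
from collections import defaultdict

# Hypothetical sightings: (MAC address, Unix timestamp in seconds).
SIGHTINGS = [
    ("a4:5e:60:11:22:33", 1000),
    ("a4:5e:60:11:22:33", 1240),
    ("a4:5e:60:11:22:33", 1600),
    ("f0:18:98:aa:bb:cc", 1100),
    ("a4:5e:60:11:22:33", 90000),
    ("a4:5e:60:11:22:33", 90300),
]

# Assumption: a silence longer than this starts a new visit.
VISIT_GAP_SECONDS = 30 * 60


def split_into_visits(sightings, gap=VISIT_GAP_SECONDS):
    """Group sightings by MAC and split each device's timeline into visits."""
    by_mac = defaultdict(list)
    for mac, timestamp in sightings:
        by_mac[mac].append(timestamp)

    visits = {}
    for mac, stamps in by_mac.items():
        stamps.sort()
        device_visits = [[stamps[0], stamps[0]]]  # [start, end] of each visit
        for timestamp in stamps[1:]:
            if timestamp - device_visits[-1][1] > gap:
                device_visits.append([timestamp, timestamp])
            else:
                device_visits[-1][1] = timestamp
        visits[mac] = [(start, end) for start, end in device_visits]
    return visits


visits = split_into_visits(SIGHTINGS)
for mac, device_visits in visits.items():
    dwell_seconds = [end - start for start, end in device_visits]
    print(mac, "visits:", len(device_visits), "dwell seconds:", dwell_seconds)
```

Note that the whole thing hinges on the MAC address staying the same from one sighting to the next: that is the only key used to stitch sightings together.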

Depending on how paranoid you are, you may feel scared that somebody can actually track where you go. Which is probably happening right now, actually. To be more precise, somebody can track the presence of your phone, without knowing that it is your phone. Still, people can track the location where your phone happens to be. Bye bye privacy?

You may wonder whether this is legal. The reality is that it’s in a kind of grey area. In legal terms, the question is whether the MAC address of a WiFi interface is “personally identifiable information”, which is the kind of data that falls under personal data regulation and privacy directives. Most companies that do WiFi tracking for a living state it is not:

  • Walkbase: “All data that we gather is anonymous and we do not store any personal information”
  • Euclid Analytics: “Collects and aggregates non-personal data that is broadcast by Wi-Fi-enabled phones.”
  • RetailNext: “ RetailNext logs the MAC address of these broadcasts along with the time, the signal strength, and the location of the observing sensor. RetailNext does not store the payload of wireless network packets. Unless you explicitly opt into a related service from one of our customers, it is our policy never to use this data to identify a person by name.”

Despite what companies claim, regulators in both the US and the EU have a different opinion, and clearly hint that the MAC address is actually personal information:

  • FTC: “This persistent identifiers (MAC address) often can be linked to individuals by name. For example, when you sign into a commercial WiFi hotspot, your MAC address is tied to the information you use to sign up for the service. Additionally, automatic WiFi probes also broadcast the names of last networks a device has connected to, which potentially reveal additional information about the individual, such as the name of their home or work network”
  • EC (Article 29 Data Protection Working Party, Opinion 01/2017): “WiFi-tracking, depending on the circumstances and purposes of the data collection, such tracking under the GDPR is likely either to be subject to consent, or may only be performed if the personal data collected is anonymised.”

Refraining from taking a position here (I am not a lawyer…), the point is that tracking WiFi is getting rather common. And it does raise legitimate privacy concerns.

To address privacy issues, Apple added a WiFi MAC randomization feature starting from v.8 of its iOS operating system. iPhones, when sending out hello messages, use a randomized address instead of their real MAC address. Think of it as a fake identity. So you can still see the presence of a device, but that device may change its identity at any time (even multiple times during the same day), so that tracking becomes ineffective.
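One practical detail: randomized addresses are normally marked by setting the “locally administered” bit in the first byte of the MAC address (the 0x02 bit), whereas a factory-assigned address has it cleared. Below is a minimal sketch of how a tracker might tell the two apart, assuming colon-separated addresses. The function name and the sample addresses are made up for illustration.

```python
def is_locally_administered(mac: str) -> bool:
    """Return True if the locally administered bit (0x02) of the first byte is set."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)


# Hypothetical sample addresses.
for mac in ["a4:5e:60:11:22:33", "da:a1:19:44:55:66"]:
    kind = "locally administered" if is_locally_administered(mac) else "globally unique"
    print(mac, "->", kind)
```

A set bit only says the address was not assigned by the manufacturer. It is a good hint that the address is randomized, not a proof, and it says nothing about which device is hiding behind it.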

And Android? Well, Android has also added support for WiFi MAC randomization, starting from v.6, but this feature is not used consistently by all phone manufacturers (see here for a detailed description of what various vendors have done).

The question is then: given that many devices (we measured ~50%) now use WiFi MAC randomization, what is the statistical significance of the analytics provided by WiFi tracking? In other words, are these services (RetailNext, WalkBase, Euclid or even ourselves at RetailerIN) lying?

Some of them actually claim they use techniques able to ‘defeat’ MAC randomization. While it is in principle possible to do it (see this great article on various techniques and workarounds), it is not feasible in practice (trust us, we tried!). It would take a while to explain the various techniques and why they fail (I might actually do it in a separate, more technical, post), but the reality is that, in practice, randomized MAC addresses cannot be derandomized. Good for privacy! But what about analytics?

Most companies in the arena actually end up using only real MAC addresses and, assuming that they represent a statistically significant sample, use such data as the basis for their analytics.

But this is just wrong from a methodological standpoint. Devices using real MAC addresses are often old devices, which introduces a big bias in the population sampled (this is known in statistics as sampling bias). And it raises serious doubts about the reliability of the metrics computed.
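A toy calculation shows how large the effect can be. The numbers below are entirely made up (two groups of phones, with invented market shares, invented shares of devices still using their real MAC, and invented dwell times). They are only there to illustrate the mechanism, not to describe any real store.

```python
# Hypothetical population, per group:
# (name, share of all visitors, share of the group using its real MAC, mean dwell in minutes)
GROUPS = [
    ("older phones", 0.40, 0.90, 22.0),
    ("newer phones", 0.60, 0.20, 12.0),
]

# True mean dwell time over all visitors.
true_mean = sum(share * dwell for _, share, _, dwell in GROUPS)

# What you get if you only look at devices with a real MAC address:
# each group is weighted by how many of its devices are visible.
visible = [(share * real_share, dwell) for _, share, real_share, dwell in GROUPS]
visible_total = sum(weight for weight, _ in visible)
observed_mean = sum(weight * dwell for weight, dwell in visible) / visible_total

print(f"true mean dwell:     {true_mean:.1f} min")
print(f"real-MAC-only mean:  {observed_mean:.1f} min")
```

With these invented figures the true average is 16 minutes, while the real-MAC-only sample says 19.5, simply because older phones are over-represented among the devices you can still follow.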

So, is everybody lying? Yes and no. I can’t speak for competitors, but what our fellows at RetailerIN do is some rather complex statistical processing to compute good estimates based on both real and random MAC addresses. This builds on a continuously calibrated mathematical model, based on the popularity of various smartphones (and on how such phones use randomization features), to avoid (or, better: minimise) sampling bias.
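To give a flavour of the general idea (and only that: this is not the RetailerIN model, which is not described in this post), here is a deliberately naive sketch. It assumes you can assign each real MAC to a device group and that you know, for each group, what share of its devices still uses the real MAC. The groups, counts and shares are all hypothetical.

```python
# Hypothetical inputs, per device group:
# (name, devices seen with a real MAC, assumed share of the group using its real MAC)
OBSERVED = [
    ("older phones", 180, 0.90),
    ("newer phones", 60, 0.20),
]


def estimate_total_devices(observed):
    """Scale each group's real-MAC count by the inverse of its real-MAC share."""
    total = 0.0
    for _, real_mac_count, real_mac_share in observed:
        if real_mac_share <= 0:
            raise ValueError("a group with no real MACs cannot be scaled up")
        total += real_mac_count / real_mac_share
    return total


naive_total = sum(count for _, count, _ in OBSERVED)
estimated_total = estimate_total_devices(OBSERVED)
print("devices seen with a real MAC:", naive_total)
print("estimated devices overall:   ", round(estimated_total))
```

The estimate is only as good as the assumed shares, which is why such a model needs to be calibrated continuously against ground truth.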

Does it work? Well, our tests with ground truth (based on cameras) suggest more than 85% accuracy. So yes, it works fairly well after all 🙂