MAC address anonymization performs a one-way function on a MAC address so that the result may be used in tracking systems for reporting and the general public, while making it nearly impossible to obtain the original MAC address from the result. The idea is that this process allows companies like Google,[1] Apple[2] and CrowdVision[3] - which track users' movements via their computer hardware - to simultaneously preserve the identities of the people they are tracking, while tracking the hardware itself.
An example of MAC address anonymization would be to use a simple hash algorithm. Given an address of 11:22:33:44:55:66
, the MD5 hash algorithm produces eb341820cd3a3485461a61b1e97d31b1
(32 hexadecimal digits).[4] An address only one character different (11:22:33:44:55:67
) produces 391907146439938c9821856fa181052e
,[5] an entirely different hash due to the avalanche effect.
The problem lies in the fact that there are only 248 (281,474,976,710,656) possible MAC addresses. Given the encoding algorithm, an index can easily be created for each possible address. By using rainbow table compression, the index can be made small enough to be portable. Building the index is an embarrassingly parallel problem, and so the work can be accelerated greatly e.g. by renting a large amount of cloud computing resources temporarily.
For example, if a single CPU can compute 1,000,000 encrypted MACs per second, then generating the full table takes 8.9 CPU-years. With a fleet of 1,000 CPUs, this would only take around 78 hours. Using a rainbow table with a "depth" of 1,000,000 hashes per entry, the resulting table would only contain a few hundred million entries (a few GB) and require 0.5 seconds (on average, ignoring I/O time) to reverse any encrypted MAC into its original form.
In 2018, academics found that with modern computing equipment with the ability to calculate 6 billion MD5 hashes and 844 million SHA-256 hashes per second the authors are able to recover 100% of 1 million hashes in:[6]
Another approach that has been tested is truncating the MAC Address by removing the Organizationally unique identifier (the first 24 bits of the 48 bit MAC Address).[7] However, as only 0.1% of the total Organizationally unique identifier space has been allocated and not all manufacturers fully utilise their allocated MAC Address space, this fails to offer any meaningful privacy benefit.[8] Furthermore, manufacturers will frequently assign contiguous address blocks to specific devices allows for fine-grained mapping of the devices in use - allowing the device type to be identified with only a small part of the MAC Address.[9]
Due to the pitfalls of existing approaches, more robust anonymization approaches have been developed by academics.[10] In particular, Junade Ali and Vladimir Dyo developed an approach which works by:[11]
The degree to which a resulting hash is truncated is a balancing act between the privacy offered and the desired collision rate (the probability that one anonymised MAC Address will overlap with another). Previous work has suggested that it is therefore difficult to control the anonymity set size when using approximations of the Birthday Paradox.[12] Instead, Ali and Dyo use the overall rate of collision in the dataset and provide that the probability of there being a collision p can be calculated by where there are m MAC Addresses and n possible hash digests. Therefore "for digests of 24 bits it is possible to store up to 168,617 MAC addresses with the rate of collisions less than 1%".