Thought about this too a while back. If I remember correctly the "issue" (actually I consider it a feature) with most modern smartphones is that they will randomly change their MAC address in order to prevent exactly this kind of tracking.
Do you have any idea how often they change? Like if you were only looking at counts of distinct MAC addresses, could you shorten the listening window so that an individual phone would have the same address?
On probe requests this can be very often (under a second, or within 50 frames), especially in the case of iOS. On Android, MAC randomisation is implemented only on a handful of specific devices, and some vendors don't implement it at all.
MAC randomisation is flawed at the least and can still be used to identify an individual in some cases.
The paper below, published in 2017 by US Navy researchers, lists techniques for doing that.
https://arxiv.org/pdf/1703.02874.pdf
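On the "shorten the listening window" idea, here is a rough sketch of what counting distinct addresses per window could look like. It assumes you already have (timestamp, MAC) pairs from some capture tool; the function name and the sample addresses are made up for illustration, and it does nothing to link addresses that changed between windows.

```python
from collections import defaultdict

def distinct_macs_per_window(observations, window_seconds):
    """Count distinct source MACs seen in each fixed-length time window.

    observations: iterable of (timestamp_in_seconds, mac_string) pairs.
    Returns {window_index: distinct_mac_count}, ordered by window index.
    """
    buckets = defaultdict(set)
    for ts, mac in observations:
        # Normalise case so "DA:..." and "da:..." count as one address.
        buckets[int(ts // window_seconds)].add(mac.lower())
    return {w: len(macs) for w, macs in sorted(buckets.items())}

# Made-up sample data, not from a real capture.
sample = [
    (0.2, "da:a1:19:00:00:01"),
    (0.9, "da:a1:19:00:00:01"),
    (1.4, "5e:22:b0:00:00:02"),
    (1.6, "DA:A1:19:00:00:01"),
    (3.1, "5e:22:b0:00:00:02"),
]
print(distinct_macs_per_window(sample, 1.0))
```

A shorter window only helps if it is shorter than the interval at which the phone rotates its address, which per the comment above can be well under a second for probe requests.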
It's 48 bits, so a space of around 281 trillion (2^48 ≈ 2.8 × 10^14) possibilities, so you'd need well over a hundred trillion devices to have a 50% chance of a collision with any one given address. _If_ everything was statistically randomly distributed across the entire address space. But it isn't. There are a few kinds of structure in MAC address formats that reduce that by potentially a _lot_...
Theoretical worst case, a MAC address has 24 bits of organisation identifier and 24 bits of device identifier. So if an organisation/manufacturer only makes one model of device, they'd "only" need to build ~16.7 million (2^24) of them before they repeated a device identifier (if they chose not to use up any of their organisation bits to reflect that rollover). Again, maybe half that if they just randomly choose a device ID each time instead of enumerating the space.
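For anyone who wants to poke at that 24/24 split, a minimal sketch follows. The helper name `split_mac` and the example address are made up; it just separates the first three octets (organisation identifier) from the last three (device identifier).

```python
def split_mac(mac):
    """Split a MAC like '00:1a:2b:3c:4d:5e' into (oui, device_id) integers."""
    octets = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    oui = int.from_bytes(octets[:3], "big")        # top 24 bits
    device_id = int.from_bytes(octets[3:], "big")  # bottom 24 bits
    return oui, device_id

oui, device_id = split_mac("00:1a:2b:3c:4d:5e")  # made-up address
print(f"{oui:06x} {device_id:06x}")
print(2 ** 24)  # size of the device-identifier space under one OUI
```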
(Also, many Wi-Fi adaptors have easily changeable MAC addresses. Back in the day when cafes used to charge for Wi-Fi access, it wasn't uncommon to sniff the network for a "paid up" MAC address, and either wait until they left and use it, or de-auth them and do a hostile takeover of their paid-for internet access. Apologies to anyone who used to pay for "unreliable" Wi-Fi at Atlas Cafe on Alabama St back in the late 90s/early 2000s...)
Yep - the chance of any collisions at all, vs the chance of a collision with _your_ specific device/MAC.
That does, though, drop the number of devices needed for a likely collision anywhere in the set (aka the birthday paradox), when devices are discriminated solely by the 24-bit device identifier, down to around sqrt(2^24), which is only 4096. A significantly smaller number than I expected...
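To put rough numbers on "any collision at all" versus "a collision with _your_ MAC", here is a small sketch. It assumes addresses are drawn uniformly at random, which (as noted above) real devices don't do, and it uses the usual exponential approximation for the birthday problem. The function names are mine.

```python
import math

def p_any_collision(n, space):
    """Approximate chance that at least two of n uniform draws coincide."""
    return -math.expm1(-n * (n - 1) / (2 * space))

def p_collision_with_specific(n, space):
    """Chance that at least one of n uniform draws hits one given value."""
    return -math.expm1(n * math.log1p(-1.0 / space))

def draws_for_half_any(space):
    """Approximate number of draws for a 50% chance of any collision."""
    return math.sqrt(2 * math.log(2) * space)

print(p_any_collision(4096, 2 ** 24))        # chance with sqrt(2^24) devices
print(draws_for_half_any(2 ** 24))           # devices for ~50%, 24-bit space
print(draws_for_half_any(2 ** 48))           # devices for ~50%, 48-bit space
print(p_collision_with_specific(int(math.log(2) * 2 ** 48), 2 ** 48))
```

Under those assumptions, sqrt(2^24) = 4096 devices gives a bit under a 40% chance of some collision, and 50% arrives a little under 5,000 devices. Matching one specific address in the full 48-bit space is a very different story, needing on the order of 10^14 devices.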
I think another question to ask is: do they randomize across all possible MAC addresses, or just within the block of addresses assigned to the type of device. My experience suggests it’s the latter.
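One cheap check on that question is the locally administered bit (the 0x02 bit of the first octet), which is set on addresses that are not globally assigned by a manufacturer. As far as I know randomised addresses normally set it, but treat that as an assumption and verify against your own captures. The helper name and example addresses below are made up.

```python
def is_locally_administered(mac):
    """True if the locally administered bit (0x02 of the first octet) is set."""
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    return bool(first_octet & 0x02)

# Made-up addresses for illustration.
print(is_locally_administered("da:a1:19:12:34:56"))  # 0xda has the bit set
print(is_locally_administered("00:1a:2b:3c:4d:5e"))  # 0x00 does not
```

If the bit is set, the first three octets are not a normal manufacturer OUI, so the address is not simply drawn from the vendor's assigned block.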
They do retain the MAC for scanning networks that they have previously connected to, AFAIK. This is what allows correlation on MAC-approved networks (i.e. paid or MAC-bound networks).
However, if your Wi-Fi is on and not connected to an AP, then you will broadcast management-frame probes with the correct MAC for all the networks you've been connected to.
An enterprising hacker could submit those to WiGLE and figure out not only unique devices, but also tell what geographical part of the world you're from.
Nicer hackers share this for public knowledge on HN :-D
(edit: really? -1'ed? How is this wrong? Would love to hear from detractors, as this technique is how malls and supermarkets track individual users.)
iOS has sadly made it more difficult to disable WiFi in Control Center (it turns back on after a fixed time). I wonder if iOS 12 Shortcuts could perform geofencing of WiFi.