Wednesday, May 19, 2010

Technical details of the Street View WiFi payload controversy

The latest privacy controversy with Google is that while scanning for WiFi access-points in their Street View cars, they may have inadvertently captured data payloads containing private information (URLs, fragments of e-mails, and so on).

Although some people are suspicious of their explanation, Google is almost certainly telling the truth when it claims it was an accident. The technology for WiFi scanning means it's easy to inadvertently capture too much information, and be unaware of it.

This article discusses technically how such scanning works.



BACKGROUND

There have been many controversies surrounding Street View. The first is about the images the cars take. They often contain private information, such as the license plates of cars parked on the side of the road. Google keeps improving algorithms to fix this, such as automatically covering license plates within images.

Street View cars also record nearby WiFi access-points. The purpose of this is to provide an alternative to GPS. A computer without GPS can scan for nearby access-points, look up their location in Google's database, and figure out its own location. This is similar to the service provided by the company Skyhook Wireless, or by the collaborative effort WIGLE. Even though this is a useful tool, it is still a bit controversial, because it's yet one more piece of data (the location of everyone's access-points) that Google knows about us.

The current controversy is that while scanning for access-points, the cars may have captured private data.

PACKET SNIFFERS

Many people expect that Google would use a standard tool for mapping access-points, such as "NetStumbler". This is a popular program on Windows that makes it easy both to find access-points and to record their GPS location.

The problem with NetStumbler is that while it's easy to use, it isn't comprehensive. It doesn't capture the raw signals from access-points, but instead relies upon the underlying operating system (Windows) to do the work for it. A lot of information is lost in the process. In order to comprehensively map access-points, you need to capture the raw wifi signals and packets, such as through a "packet-sniffer".

The WiFi radio in your laptop receives all packets on the current channel, including packets sent by other people's laptops near you. However, the WiFi device checks the incoming packets to see if they have the proper "MAC address" (the unique serial number assigned to your WiFi device). If they have the wrong MAC address, then the WiFi device will drop the packets. Only packets destined to your MAC address will continue to be processed by your computer.

The way a packet-sniffer works is to turn off the MAC address check. All packets received by the WiFi radio are kept in the system, then saved to disk.
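As a rough sketch of that check (this is not any real driver's code; the function name and the sample addresses are made up for illustration, and the broadcast rule is a simplification I've added), the difference between a normal WiFi device and a sniffer comes down to one test:

```python
BROADCAST = b"\xff" * 6  # packets addressed to everyone are always accepted

def keep_packet(dest_mac: bytes, my_mac: bytes, sniffing: bool = False) -> bool:
    """Simplified model of the receive filter in a WiFi device."""
    if sniffing:
        # A packet-sniffer skips the address check entirely.
        return True
    return dest_mac == my_mac or dest_mac == BROADCAST

my_mac = bytes.fromhex("001122334455")    # hypothetical: my laptop
neighbor = bytes.fromhex("66778899aabb")  # hypothetical: somebody else's laptop

print(keep_packet(neighbor, my_mac))                 # normal mode
print(keep_packet(neighbor, my_mac, sniffing=True))  # sniffer mode
```

In normal mode the neighbor's packet is dropped; with the check turned off, it is kept along with everything else on the channel.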

There are two parts to a packet: the header information (the envelope) and the payload/contents. Google is only interested in the headers.

THE BEACON PACKET

There are many types of packets. The most interesting packet is the "Beacon". The average access-point sends out this Beacon many times per second, advertising its existence, its name (the "SSID"), and a list of features (like whether a password is required).

When your laptop gives you a list of nearby access-points, it's simply listing the access-points from which it has received a Beacon. A program like NetStumbler builds its list from the same information.

The following picture shows a typical Beacon packet. I'm sitting in a Panera café with free WiFi; this is the Beacon from the local access-point.

The raw packet is shown in a "hex dump" at the bottom, with the decode explanation at the top. I've selected the "SSID" field to show how the decoded information corresponds to the selected hex data.
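To make the decode less magical, here is a minimal sketch of pulling the SSID out of a Beacon by hand. It assumes the bytes start at the 802.11 frame-control field (no radio metadata in front, no checksum at the end), and the sample packet, including the SSID "PANERA" and the MAC addresses, is invented for illustration rather than copied from the capture above.

```python
MGMT_HEADER_LEN = 24   # frame control, duration, three addresses, sequence control
BEACON_FIXED_LEN = 12  # timestamp (8), beacon interval (2), capabilities (2)

def parse_beacon(frame: bytes):
    """Return (bssid, ssid) for a Beacon frame, or None for anything else."""
    if len(frame) < MGMT_HEADER_LEN + BEACON_FIXED_LEN:
        return None
    frame_type = (frame[0] >> 2) & 0x3
    subtype = (frame[0] >> 4) & 0xF
    if frame_type != 0 or subtype != 8:  # management frame, subtype Beacon
        return None
    bssid = frame[16:22].hex(":")  # address 3 of a management frame
    # The rest is a list of tagged fields: 1-byte tag, 1-byte length, value.
    pos = MGMT_HEADER_LEN + BEACON_FIXED_LEN
    while pos + 2 <= len(frame):
        tag, length = frame[pos], frame[pos + 1]
        value = frame[pos + 2 : pos + 2 + length]
        if tag == 0:  # tag 0 carries the SSID
            return bssid, value.decode("utf-8", errors="replace")
        pos += 2 + length
    return bssid, ""

sample = (
    bytes([0x80, 0x00, 0x00, 0x00])   # frame control (Beacon), duration
    + b"\xff" * 6                     # destination: broadcast
    + bytes.fromhex("001122334455")   # source: the access-point
    + bytes.fromhex("001122334455")   # BSSID
    + b"\x00\x00"                     # sequence control
    + b"\x00" * 12                    # timestamp, interval, capabilities
    + b"\x00\x06PANERA"               # tag 0 (SSID), length 6
)
print(parse_beacon(sample))
```

Everything here comes from the header and the advertised fields of the packet; a Beacon has no user payload to leak.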

THE DATA PACKET

While the Beacon packet is the most useful packet, other packets can be useful too.

Let's say that there is an access-point within a building, but the Beacon packets are blocked or too weak to reach the street. The access-point will exist, but Street View won't be able to see it.

However, somebody could be using a laptop halfway between the access-point and the Street View car. The laptop's packets can reach both the car and the access-point. Thus, even though Street View cannot see the access-point itself, it can still infer its existence by looking at the DATA packets.

A data packet example is shown below (in typical decode+hexdump format). This packet was sent by my laptop to the local access-point. I've highlighted the "BSS ID" field, which is the MAC address of the access-point (the same one shown in the Beacon above).
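Where the BSS ID sits in a data packet depends on which direction the packet is travelling, which is signalled by two flag bits in the header. The following is a minimal sketch of that lookup, again assuming the bytes start at the 802.11 frame-control field; the function name and sample addresses are hypothetical.

```python
def data_packet_bssid(frame: bytes):
    """Return the access-point's MAC address from a data packet header, or None."""
    if len(frame) < 24:
        return None
    if (frame[0] >> 2) & 0x3 != 2:  # type 2 means "data"
        return None
    to_ap = bool(frame[1] & 0x01)    # "To DS" flag: laptop -> access-point
    from_ap = bool(frame[1] & 0x02)  # "From DS" flag: access-point -> laptop
    addr1, addr2, addr3 = frame[4:10], frame[10:16], frame[16:22]
    if to_ap and not from_ap:
        bssid = addr1
    elif from_ap and not to_ap:
        bssid = addr2
    elif not to_ap and not from_ap:
        bssid = addr3  # ad-hoc traffic
    else:
        return None    # bridge-to-bridge traffic has no single BSS ID
    return bssid.hex(":")

# Hypothetical packet from a laptop to its access-point.
laptop_to_ap = (
    bytes([0x08, 0x01, 0x00, 0x00])   # frame control (data, To DS), duration
    + bytes.fromhex("001122334455")   # address 1: access-point (BSS ID)
    + bytes.fromhex("66778899aabb")   # address 2: the laptop
    + bytes.fromhex("ccddeeff0011")   # address 3: final destination
    + b"\x00\x00"                     # sequence control
)
print(data_packet_bssid(laptop_to_ap))
```

This is how a sniffer can learn about an access-point it never heard directly: the laptop's packet names it.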


In addition, you'll notice the signal strength in the decode. Google can use this to triangulate the location of the device that sent the packet. Street View knows the precise GPS location of the car as it rolls down the street. If it can get three beacons (or data packets) from the access-point at different points along the route, it can triangulate the position of the access-point. Moreover, if it stores the raw packets from one day as the car takes one route, it can correlate the packets with another day's packets on a different route.

Triangulation is a lot harder than you'd think. This is because many things will block or reflect the signal. Therefore, as the car drives by, it wants to get every single packet transmitted by the access-point in order to figure out its location. Curiously, with all that data, Google can probably also figure out the structure of the building, by finding things like support columns that obstruct the signal.
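I don't know what algorithm Google actually uses. As a toy illustration of how GPS positions and signal strengths combine, here is a weighted centroid: average the car's positions, giving more weight to the places where the signal was stronger. This is a deliberately crude stand-in for real triangulation, and the function name and sample readings are invented.

```python
def estimate_position(sightings):
    """Weighted centroid of the car's positions, weighted by received power.

    sightings: list of (latitude, longitude, signal_dbm), one per packet heard.
    """
    if not sightings:
        raise ValueError("need at least one sighting")
    # Convert dBm to linear power so that stronger signals count for more.
    weights = [10 ** (dbm / 10) for _, _, dbm in sightings]
    total = sum(weights)
    lat = sum(w * s[0] for w, s in zip(weights, sightings)) / total
    lon = sum(w * s[1] for w, s in zip(weights, sightings)) / total
    return lat, lon

# Hypothetical drive-by: weak signal, then strong, then weak again.
sightings = [
    (40.0000, -75.0000, -80),
    (40.0001, -75.0000, -60),
    (40.0002, -75.0000, -80),
]
print(estimate_position(sightings))
```

The estimate lands close to the middle position, where the signal was strongest. Walls and reflections distort the readings, which is why more packets give a better answer.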

What's important about this packet is that Google only cares about the MAC addresses found in the header, and the signal strength, but doesn't care about the payload. If you look further down in the payload, you'll notice that it's inadvertently captured a URL.

Take a look again. Even though the access-point MAC address is highlighted, there's extra data in the packet. These extra data will include URLs, fragments of data returned from websites (like images), the occasional password, cookies, fragments of e-mails, and so on. However, the quantity of this information will be low compared to the total number of packets sniffed by Google.

That's the core of this problem. Google sniffed packets, only caring about MAC addresses and SSIDs, but when somebody did an audit, they found that the captured packets occasionally contained more data, such as URLs and e-mail fragments.


NEIGHBORHOOD SNIFFING

Google captured very little as it drove through neighborhoods. The primary reason is that most people encrypt their connections (by putting a password on their access-point). The second reason is that the car is only near an access-point for a few seconds - during which time it's unlikely that any data is being transferred.

You can verify this yourself (assuming it's legal in your area). You can download a version of Linux full of security tools, like "BackTrack 4". You don't have to install it on your laptop; instead, you can put it on a USB flash drive and boot from that. You can then run a tool like "ferret" that will sniff the wifi and show you interesting private information, like URLs. (Only half the wifi devices support such raw sniffing, so you may have to buy a separate USB wifi stick for $10.)

If you drive down the street running 'ferret', you'll see that it almost never shows you any information other than wifi control traffic (like Beacons).

The real reason Google might have data payloads isn't from neighborhoods at all, but from cyber cafes and hotspots. If a Google Street View car came down the street near this Panera, it would be flooded with data packets from people within the café. This is the slowest period of the day, but I count 7 people using their laptops and one person using their iPhone to surf the web. Moreover, whereas people may encrypt their traffic at home, the hot spot here at Panera is unencrypted.


PROTECT YOURSELF, DON'T PUNISH GOOGLE

It's really easy to protect your data: simply turn on WPA. This completely stops Google (or anybody else) from spying on your private data (assuming you haven't done something stupid like choosing an easily guessed password, or choosing WEP instead of WPA). If you don't encrypt your traffic, then by implication, you don't care if people eavesdrop on it.

Laws against this won't stop the bad guys (hackers). They will only unfairly punish good guys (like Google) whenever they make a mistake.

HOW TO FIX IT

Google can easily get rid of the payloads by "slicing" data packets to the first 24 bytes. This preserves the MAC address and signal strength information Google wants, but gets rid of any private information inadvertently gathered from people who do not encrypt their connection.
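Slicing is about as simple as a fix gets. Here is a minimal sketch, assuming each captured packet starts at the 802.11 header and that the signal strength is recorded separately by the capture tool; the sample packet and its "payload" are invented, and a real payload would have a few more protocol layers in front of the URL.

```python
HEADER_LEN = 24  # the slice length suggested above

def slice_packet(packet: bytes, keep: int = HEADER_LEN) -> bytes:
    """Keep only the first `keep` bytes of a captured packet."""
    return packet[:keep]

header = (
    bytes([0x08, 0x01, 0x00, 0x00])   # frame control (data, to the access-point), duration
    + bytes.fromhex("001122334455")   # address 1: access-point (BSS ID)
    + bytes.fromhex("66778899aabb")   # address 2: the laptop
    + bytes.fromhex("ccddeeff0011")   # address 3: final destination
    + b"\x00\x00"                     # sequence control
)
payload = b"GET /private/page HTTP/1.1\r\n"  # hypothetical unencrypted web request

sliced = slice_packet(header + payload)
print(len(sliced), b"GET" in sliced)
```

The MAC addresses survive and the URL is gone before anything is written to disk.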

EDITORIAL: THE NEED FOR TRANSPARENCY

This situation was only found because somebody audited Google's data. Just because they have no evil intentions doesn't mean they haven't made an evil mistake. The more Google becomes our overlord, the more we should demand that they be open and transparent about what data they are keeping about people.

CONCLUSION

What I've focused on here is the question of Google's collection of "data" payloads, not the other privacy issues. Some people have accused Google of lying, and of having some nefarious purpose for gathering these packets. However, anybody who has experience in WiFi mapping would believe Google. Data packets help Google find more access-points and triangulate them, yet the payload of the packets do nothing useful for Google because they are only fragments.

25 comments:

Digital said...

Mistake or not - and we can't know for certain - it shouldn't excuse this sort of behavior and a large company like Google has to be responsible for its own mistakes. I agree that users should try to protect themselves, but just because a user doesn't know how to protect themselves doesn't give Google the right to snoop on them. Furthermore, we really don't have an alternative while traveling because nearly all hotspot operators refuse to enable encryption.

Dan Mobile said...

This is the coolest most useful security blog on the entire internet. Thank you.

Matt Weir said...

I completely agree with you. Who here has run a kismet scan without only limiting it to the AP you were looking at? Let's be honest, we've all grabbed other people's web traffic either by accident, or just because it's too much trouble not to. The important thing is what you do with it.

It amazes me that people are worried about Google obtaining a 10 second peek at their wireless traffic when those same people are typing their most personal/private questions into Google's search box, using gmail, and commenting on blogger ;)

I'm not saying people shouldn't be concerned with all of the data Google is collecting, or how they might use it. I'm just saying the wireless sniffing they did probably is not the issue that we need to be focusing on.

Deny Phorm said...

Unfortunately you don't pay a lot of attention to the news do you?

Firstly, there are many nefarious uses of this packet data not least in typing geo-locations to Google cookies etc.

If you bothered to read the news you would see that France just released a statement yesterday saying they have examing the data and it contains emails, email passwords and countless other pieces of sensitive data.

So how about the Google apologists stop talking nonsense and actually do some research on the issue?

Skepticratic said...

Any criticism of Google in this event is irrelevant seeing as they only got data from unencrypted network broadcasts. Key word: broadcast.

warspite said...

...thank you. This was clear, concise, and well-written. And how often does That occur on the internets?

¬¬davekov.com

Robert Graham said...

Firstly, there are many nefarious uses of this packet data not least in typing geo-locations to Google cookies etc.

I don't deny there are "nefarious" uses. I only point out that Google almost certainly didn't intend "nefarious" uses -- only that if you scan for access-points for geolocation purposes, you'll accidentally get additional data as well.

Moreover, if you wanted to be nefarious, StreetView cars wouldn't be the way to do it. You would instead want to park outside a location that gives you a large data set.


If you bothered to read the news you would see that France just released a statement yesterday saying they have examing the data and it contains emails, email passwords and countless other pieces of sensitive data.

I read that article, and it's not in conflict with what I wrote.

Occasional passwords, e-mails, and other sensitive data will show up -- but only rarely.

So how about the Google apologists stop talking nonsense and actually do some research on the issue

Research the issue? I'm one of the worlds top experts on the issue. I've researched it over 20 years.

Rob said...

@Digital. Because it's a big company it's bad? What part of "fragments" and "not useful" did you have trouble understanding? Claiming that you don't have any alternative while traveling is silly. I VPN to my home or my ISP when I travel - and my data is nicely encrypted.

Mike said...

I apologize, but basically, to me, this boils down to "Google wasn't aware of how the technology was working." I'm not sure how you can try and argue that as a defence. As a technology-based company, more specifically an internet-based tech company, Google should be intimately familiar with this kind of technology. So either they're guilty of gross incompetence in their field of speciality, or they're lying. A company like Google has no excuse not to know about the difference between beacon and data packets, and that data packets can contain information beyond the identification of the target router.

Further, I've never heard any acceptable justification for capturing SSIDs and wi-fi locations in a residential or business area in the first place. The desire to post a complete map of all WLAN locations isn't justification. Just because a wi-fi location isn't secured doesn't make it an invitation for the public to use, and if it's located on private property, it's certainly not Google's responsibility to report its existence.

While this explanation might excuse an individual's actions in something like this, I find it abhorrent that anyone would think this excuses Google's actions.

Deny Phorm said...

The facts are:

Google paid someone to write this code, which purposefully retained the unencrypted data, which in turn contained sensitive data protected from interception by law. Whether you agree with those laws is utterly irrelevant - the law was broken - for 3 years - across 30 countries - and hundreds of millions of addresses. You can make excuses for them as much as you like, that fact will not change.

Google filed a patent for this technology in 2008 yet then claim it was the work of a "rogue" coder. Claiming ignorance is not just implausible but utterly untenable. Again, despite your efforts to vindicate Google - THAT fact will not change.

The data collected can be leveraged for significant commercial benefit and no matter how much you try to dismiss that FACT as mere fragments with no utility - it does not change.

When you come up with a valid counter argument to all these facts instead of just dismissing them because they are inconvenient - give me a shout. As an objective researcher, I would be happy to read them.

Deny Phorm said...

Whether you have "researched" for 20 years or not is completely irrelevant - given the Google WiFi issue has only been ongoing for about a month.

"I don't deny there are "nefarious" uses. I only point out that Google almost certainly didn't intend "nefarious" uses -- only that if you scan for access-points for geolocation purposes, you'll accidentally get additional data as well."

So why not discard that data the same way as they discarded the "useless" encrypted data? The retention of the unencrypted data was not arbitrary therefore it cannot be passed off as accidental.

"I read that article, and it's not in conflict with what I wrote."

Here is what you wrote, right there at the bottom of your conclusion:

"yet the payload of the packets do nothing useful for Google because they are only fragments."

Now, you are wrong on this, completely wrong and it is not an opinion it is a fact based on the data audit that was carried out by the French - so please explain to me how that is not in conflict with what you wrote?

I guess it depends on how you define a fragment? At a Universal level the earth can be seen as a fragment of the "stuff" which was expelled by the "Big Bang" yet it is pretty clear to see that that fragment contains many stories and a great deal of information.

The same can be said for the "fragments" Google collected. Google's cookies are pervasive across the Internet and at any 1/5th of a second on a global scale there are probably millions if not billions of Google's cookies traversing networks. Chances are a reasonably high number of networks Google eavesdropped on had the potential to have Google cookies in them.

Are you denying that geo-validating the location of a "Google User" (user being the unique ID in said cookies) has any commercial value to Google? Take your time it is not a trick question.

"Occasional passwords, e-mails, and other sensitive data will show up -- but only rarely."

And that makes it OK then? And how do you define rare - 1 in 30? Because so far the -first- country that has analysed that data (out of 30) has discovered it contains sensitive data including passwords and email content - and that is just for the data relevant to France.

We don't yet know how many instances of said sensitive data appeared in that single country's data, but it is still there in the very first country. I suppose you might argue that is coincidence? I don't think many other people would support that argument.

But aside from all that - how dare you pass off the retention of even "rare" instances of sensitive data captured, in such a way? You have absolutely no authority and no right to say it's ok because it only happens rarely.

"Research the issue? I'm one of the worlds top experts on the issue. I've researched it over 20 years."

Could your ego actually get any bigger? As I said above - the Google issue has only been known about for around a month, so I couldn't give a rats ass how much of an "expert" you think you are - you have only had as much time as every other "expert" to research this issue and frankly your findings are laughable.

Instead of actually finding anything contrary to the facts presented you merely pass them off as irrelevant because they don't fit into your evaluation of the situation.

Instead of actually presenting any facts which refute the evidence against Google - you merely dismiss those facts with self righteous rhetoric.

With "experts" like you fighting in Google's corner, Google doesn't need enemies.

abc said...

Slightly OT, but any concerns about the as-yet unpublished Google patent filing that discusses the use of a 'mobile device data collection module' to 'collect data on a set of mobile devices which are using [a] wireless base station', including GPS location information, time information, and 'application specific data, such as, map requests, etc.'?

http://docs.google.com/fileview?id=0B-VQYa94fZpfNmNjZTBmNTQtNTllMS00YTE5LTk3MmMtMDM0N2RlODhiZmE0&hl=en

Mark said...

"Just because they have no evil intentions doesn't mean they haven't made an evil mistake"

Isn't "evil mistake" a bit of an oxymoron?

LarrySDonald said...

I would say it's a combination of mistake and "who knew someone would care?". Implementing something like this, I would almost certainly simply snarf down everything and analyze it later. That might be naive, because I assume if you don't crypto you're figuring someone might be reading it. If I wrote my email and password on a sign and stuck it on my lawn, I wouldn't come back later and go "ZOMGWTF Someone read it?!?". I highly doubt Google had any intention of abusing it, they're already pretty much winning this war and there's no reason to stir trouble. It's easy to see where people may not be as aware though, hence why they're saying "Oops, sorry, didn't know you were so picky about that. We'll start slicing the packets".

Mike said...

@ Deny Phorm:

"And how do you define rare - 1 in 30? Because so far the -first- country that has analysed that data (out of 30) has discovered it contains sensitive data including passwords and email content - and that is just for the data relevant to France."

I would think that "rare" in this case means rare with respect to the amount of sensitive data collected compared to total data collected, not the number of countries that have found sensitive data. If 1 piece of sensitive information is found by each of the 30 countries out of millions of packets, it is certainly not common. Surely even you can agree with that.

jason said...

I'd just like to point out that in your article defending Google's practices of packet sniffing, you tell users to 'try it themselves'.....you begin the following paragraph with this sentence "You can verify this yourself (assuming it's legal in your area). "

IF ITS POSSIBLE THAT IT IS NOT LEGAL FOR ME TO DO THIS, THEN HOW THE HELL IS IT LEGAL FOR GOOGLE TO DO IT.

pay attention to what you're writing and quit contradicting yourself on a venue the entire world can see

jason said...

WHAT A BUNCH OF SHIT, YOU HAVE TO APPROVE MY COMMENT......SOME SOLID CENSORSHIP (IMO), I BET MY FIRST COMMENT WONT EVEN MAKE YOUR BLOG....PS, CAPS LOCK IS CRUISE CONTROL FOR AWESOME

Robert Graham said...

@jason
Your comment is so awesome I wouldn't dream of censoring it. Of course, I never said it was legal for Google, nor that it's illegal for you to do it, but other than that, your point is taken.

Ryan Mercado said...

Deny, your lack of grasp on the facts is saddening. You simply haven't understood any of what has been written here, because you're attempting to rebut it with arguments based on "facts" that it clearly has already refuted. You are approaching this with some underlying assumptions that just aren't true, and refuse to accept any data which contradicts them. I will try to point out some of your fallacies:

"So why not discard that data the same way as they discarded the "useless" encrypted data? The retention of the unencrypted data was not arbitrary therefore it cannot be passed off as accidental."

This is a fairly odd standard to apply. How do you figure that "not arbitrary" means it cannot be "accidental"? Because I remember to lock my door means I must also have remembered to turn off the stove -- and thus when my house burns down I'm an arsonist? Your logic does not follow. Just because you don't screw up on one thing doesn't mean you have to be malicious if you screwed up somewhere else.

"We don't yet know how many instances of said sensitive data appeared in that single country's data, but it is still there in the very first country. I suppose you might argue that is coincidence? I don't think many other people would support that argument."

So what you're saying is that in a country with a population of 67 million, at *least* 1 person had a password exposed. That's what the French Government has said. No, that's not a "coincidence", that's an inevitability. It's a coincidence if you win the lottery, it's not a coincidence if someone, somewhere wins it.

In a country of 67 million people, it's bound to have happened at least a couple of times -- but probably not very many.

"But aside from all that - how dare you pass off the retention of even "rare" instances of sensitive data captured, in such a way?"

It's not "OK" because it's "rare". It's OK because it was a total accident and the data was never looked at by Google because they were, until recently, unaware it was even there. The amount of data is around 600 gigabytes. That might sound like a lot, but keep in mind this was over the course of years -- and that amount of data would fit on a single hard drive.

"Google paid someone to write this code, which purposefully retained the unencrypted data, which in turn contained sensitive data protected from interception by law. "

No matter how many times you claim this was a purposeful act it won't magically become one. There's no evidence to support that, and to the contrary, all the evidence indicates the opposite.

Google paid someone to do something, and they made a tiny mistake with vast unintended consequences. It's more like a typo, and less like a grand conspiracy of evil.

"Google filed a patent for this technology in 2008 yet then claim it was the work of a "rogue" coder. Claiming ignorance is not just implausible but utterly untenable. Again, despite your efforts to vindicate Google - THAT fact will not change."

What Google filed a patent on is what this blog describes, had it been properly implemented (truncating the packets so as to not include payload data). The patent describes something that would be perfectly legal, normal, and non-controversial. Whereas what actually happened is as if somebody tried to implement that patent, but accidentally screwed up some code. Google's ignorance is the only explanation that makes sense given the legal/publicity risks in gathering this data, the lack of value of the data, and the far better and easier ways to get this data.

Robert Graham said...

So why not discard that data the same way as they discarded the "useless" encrypted data

They didn't discard the useless encrypted data. They retained that as well. It's just that nobody cares they retained it, because it's encrypted and useless.

Robert Graham said...

Whether you agree with those laws is utterly irrelevant - the law was broken - for 3 years - across 30 countries - and hundreds of millions of addresses.

I don't know about other countries, but there is no law against this in the United States.

That doesn't mean it's legal. Prosecutors claimed that lying about your name to MySpace when creating an account was the same as "hacking" into MySpace servers. If they can stretch the Computer Fraud and Abuse Act that far, they can stretch the law to cover this as well.

Robert Graham said...

"yet the payload of the packets do nothing useful for Google because they are only fragments."

Now, you are wrong on this, completely wrong and it is not an opinion it is a fact based on the data audit that was carried out by the French - so please explain to me how that is not in conflict with what you wrote?


The amount of data in the packets is not useful. So the French found that 10 packets out of all the packets contained passwords. So what? Millions of passwords are sent via e-mail through Gmail every day.

What Google might want is personally identifiable information for each house it passed by. They got that for less than 1% of the houses. That's not useful for Google.

Robert Graham said...

And that makes it OK then? And how do you define rare - 1 in 30? Because so far the -first- country that has analysed that data (out of 30) has discovered it contains sensitive data including passwords and email content - and that is just for the data relevant to France.

100% of the countries will find passwords.

Just for less than 1% of the households Google's car drove by.

Grab a copy of BackTrack 4. It's a CD-ROM/USB drive that boots into a copy of Linux with a bunch of hacking tools. Run "ferret --wifi scan" from the command-line. Then drive around your neighborhood. Chances are that you will find little that is useful.

Robert Graham said...

The data collected can be leveraged for significant commercial benefit and no matter how much you try to dismiss that FACT as mere fragments with no utility - it does not change.

Google cannot leverage that data for any commercial benefit.

Jordan said...

Listen people, I just read this entire posting of comments, and I must say something... If an individual or business is ignorant enough to send out their data over the air waves without first encrypting it, then THEY ARE AT FAULT. If I give out my credit card info over a CB radio, I would IN NO WAY SHAPE OR FORM expect it to not be heard by someone else. This is the EXACT same thing, except the data Google collected is digital and on a different frequency. I mean, how can some of you people be defending people who are so careless about their personal data? They are knowingly sharing their data with anyone who cares to listen in (same as someone listening to others talk over a CB radio), and it is ignorant to believe no one is going to capture this data for nefarious or legitimate purposes. When it's all said and done, individuals are responsible for protecting their property. Imagine if I left my bike sitting on a sidewalk in a busy metropolitan area and said "oh screw it, I don't need to put a lock on it, people will infer that that is my bike and know I WANT it protected." That's crazy! And so is broadcasting non-encrypted data over wifi! Oh, and one more thing to you people saying it's not "your fault" that hotspots/coffeeshops don't turn on encryption... Let's return to the bike example: I ride my bike to the nearest shopping mall and when I get there I ask,

"hey security guard, where are the bike locks? Oh, you don't have any? It's ok, everyone knows that's not their bike, so they won't take it. I mean it's not like the mall should expect me to bring my own lock or something... THE NERVE OF SOME BUSINESSES!!"

See how silly this sounds?! Bring your own lock (VPN)!!