Street View cars, they may have inadvertently captured data payloads containing private information (URLs, fragments of e-mails, and so on).
Although some people are suspicious of their explanation, Google is almost certainly telling the truth when it claims it was an accident. The technology for WiFi scanning means it's easy to inadvertently capture too much information, and be unaware of it.
This article discusses the technical details of how such scanning works.
There have been many controversies surrounding Street View. The first is about the images the cars take. They often contain private information, such as the license plates of cars parked on the side of the road. Google keeps improving algorithms to fix this, such as automatically covering license plates within images.
Street View cars also record nearby WiFi access-points. The purpose of this is to provide an alternative to GPS. A computer without GPS can scan for nearby access-points, look up their location in Google's database, and figure out its own location. This is similar to the service provided by the company Skyhook Wireless, or by the collaborative effort WIGLE. Even though this is a useful tool, it is still a bit controversial, because it's yet one more piece of data (the location of everyone's access-points) that Google knows about us.
The current controversy is that while scanning for access-points, Google may have captured private data.
NetStumbler". This is a popular program on Windows that makes it easy both to find access-points and to record their GPS location.
The WiFi radio in your laptop receives all packets on the current channel, including packets sent by other people's laptops near you. However, the WiFi device checks the incoming packets to see if they have the proper "MAC address" (the unique serial number assigned to your WiFi device). If they have the wrong MAC address, then the WiFi device will drop the packets. Only packets destined to your MAC address will continue to be processed by your computer.
The way a packet-sniffer works is to turn off the MAC address check. All packets received by the WiFi radio are kept in the system, then saved to disk.
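The MAC address check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not real driver code: the function names and the `sniffing` flag are made up for this example, and it assumes the frame starts directly with the 802.11 header, where the receiver address sits at bytes 4 through 9. Accepting broadcast frames is an addition of this sketch.

```python
def parse_mac(text: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in text.split(":"))


BROADCAST = b"\xff" * 6  # frames addressed to everyone (e.g. Beacons)


def accepts(frame: bytes, my_mac: bytes, sniffing: bool = False) -> bool:
    """Return True if the WiFi device would hand this frame to the computer.

    Assumption: `frame` begins with the 802.11 header, so the receiver
    address is at bytes 4..9 (after frame control and duration).
    """
    if sniffing:
        # A packet-sniffer turns the check off: everything is kept.
        return True
    receiver = frame[4:10]
    return receiver == my_mac or receiver == BROADCAST
```

With `sniffing=False`, a frame addressed to somebody else's laptop is dropped; with `sniffing=True`, the same frame is kept and can be saved to disk.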
There are two parts to a packet: the header information (the envelope) and the payload/contents. Google is only interested in the headers.
THE BEACON PACKET
There are many types of packets. The most interesting packet is the "Beacon". The average access-point sends out this Beacon many times per second, advertising its existence, its name (the "SSID"), and a list of features (like whether a password is required).
When your laptop gives you a list of nearby access-points, it's simply listing the access-points from which it has received a Beacon. A program like NetStumbler builds its list from the same information.
The following picture shows a typical Beacon packet. I'm sitting in a Panera café with free WiFi; this is the Beacon from the local access-point.
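To make the Beacon less abstract, here is a minimal sketch of how a scanner might pull the access-point's MAC address, its SSID, and the "password required" flag out of a raw Beacon. The function name `parse_beacon` is hypothetical, and the sketch assumes the bytes start at the 802.11 header (no capture-tool header in front). It is not what NetStumbler or Google actually runs.

```python
import struct


def parse_beacon(frame: bytes):
    """Return (bssid, ssid, privacy) for a Beacon frame, else None.

    Assumed layout: 24-byte 802.11 header, then 12 bytes of fixed
    fields (timestamp, beacon interval, capabilities), then a list
    of tagged fields, each encoded as id, length, value.
    """
    if len(frame) < 36:
        return None
    frame_type = (frame[0] >> 2) & 0x3
    subtype = (frame[0] >> 4) & 0xF
    if frame_type != 0 or subtype != 8:  # management frame, Beacon subtype
        return None

    # Third address field of a Beacon is the access-point's MAC (BSSID).
    bssid = ":".join(f"{b:02x}" for b in frame[16:22])

    # Capabilities bit 0x0010 ("privacy") is set when encryption is required.
    capabilities = struct.unpack_from("<H", frame, 34)[0]
    privacy = bool(capabilities & 0x0010)

    ssid = None
    pos = 36
    while pos + 2 <= len(frame):
        tag, length = frame[pos], frame[pos + 1]
        value = frame[pos + 2 : pos + 2 + length]
        if tag == 0:  # tag 0 carries the SSID
            ssid = value.decode("utf-8", errors="replace")
            break
        pos += 2 + length
    return bssid, ssid, privacy
```

Everything the function reads is in the header and the Beacon's own advertisement fields; a Beacon carries no user data.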
The raw packet is shown in a "hex dump" at the bottom, with the decode explanation at the top. I've selected the "SSID" field to show how the decoded information corresponds to the selected hex data.
THE DATA PACKET
While the Beacon packet is the most useful packet, other packets can be useful too.
Let's say that there is an access-point within a building, but the Beacon packets are blocked or too weak to reach the street. The access-point will exist, but Street View won't be able to see it.
However, somebody could be using a laptop halfway between the access-point and the Street View car. The laptop's packets can reach both the car and the access-point. Thus, even though Street View cannot see the access-point itself, it can still infer its existence by looking at the DATA packets.
A data packet example is shown below (in typical decode+hexdump format). This packet was sent by my laptop to the local access-point. I've highlighted the "BSS ID" field, which is the MAC address of the access-point (the same one shown in the Beacon above).
In addition, you'll notice the signal strength in the decode. Google can use this to triangulate the location of the device that sent the packet. Street View knows the precise GPS location of the car as it rolls down the street. If it can get three beacons (or other data packets) from the access-point, it can triangulate the position of the access-point. Moreover, if it stores the raw packets from one day as the car takes one route, it can correlate the packets with another day's packets on a different route.
Triangulation is a lot harder than you'd think. This is because many things will block or reflect the signal. Therefore, as the car drives by, it wants to get every single packet transmitted by the access-point in order to figure out its location. Curiously, with all that data, Google can probably also figure out the structure of the building, by finding things like support columns that obstruct the signal.
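Inferring an access-point from a data packet comes down to reading the right address field in the header. The sketch below shows one way to do it; the function name `data_frame_bssid` is made up for illustration, and it assumes the bytes start at the 802.11 header. Which of the three address fields holds the BSS ID depends on two direction flags ("To DS" and "From DS") in the second header byte.

```python
def mac_str(raw: bytes) -> str:
    """Format 6 raw bytes as 'aa:bb:cc:dd:ee:ff'."""
    return ":".join(f"{b:02x}" for b in raw)


def data_frame_bssid(frame: bytes):
    """Return the access-point MAC (BSS ID) seen in a data frame, else None.

    Only the first 24 bytes (the header) are examined; the payload
    that follows is never touched.
    """
    if len(frame) < 24:
        return None
    frame_type = (frame[0] >> 2) & 0x3
    if frame_type != 2:  # 2 = data frame
        return None

    to_ds = bool(frame[1] & 0x01)
    from_ds = bool(frame[1] & 0x02)
    addr1, addr2, addr3 = frame[4:10], frame[10:16], frame[16:22]

    if to_ds and not from_ds:
        return mac_str(addr1)  # laptop -> access-point
    if from_ds and not to_ds:
        return mac_str(addr2)  # access-point -> laptop
    if not to_ds and not from_ds:
        return mac_str(addr3)  # ad-hoc, no access-point relaying
    return None  # both flags set: bridge-to-bridge traffic, skipped here
```

A packet from a laptop to its access-point has "To DS" set, so the first address field names the access-point even if the car never heard a Beacon from it.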
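As a rough illustration of the idea, here is one simple way to turn several sightings into a position estimate: a signal-strength-weighted average of the car's GPS positions. This is a technique swapped in for the sake of example; nothing here is known to be Google's method, and the function name `estimate_position` and the input format are assumptions. It also ignores the blocking and reflection problems described above.

```python
def estimate_position(observations):
    """Estimate an access-point's location from drive-by sightings.

    observations: list of (latitude, longitude, rssi_dbm) tuples, one per
    packet heard, where the coordinates are the car's GPS position at the
    time and rssi_dbm is the signal strength (e.g. -60).

    Stronger signals pull the estimate toward the spot where they were
    heard. Averaging latitude/longitude directly is only reasonable over
    short distances, such as a single street.
    """
    if not observations:
        raise ValueError("need at least one observation")

    # Convert dBm to linear power so that -50 dBm counts 10x more than -60 dBm.
    weights = [10 ** (rssi / 10.0) for _, _, rssi in observations]
    total = sum(weights)
    lat = sum(w * obs[0] for w, obs in zip(weights, observations)) / total
    lon = sum(w * obs[1] for w, obs in zip(weights, observations)) / total
    return lat, lon
```

More packets give more points to average over, which is one reason a mapper would want to keep every packet it hears from an access-point rather than just the first Beacon.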
What's important about this packet is that Google only cares about the MAC addresses found in the header, and the signal strength, but doesn't care about the payload. If you look further down in the payload, you'll notice that it's inadvertently captured a URL.
Take a look again. Even though the access-point MAC address is highlighted, there's extra data in the packet. These extra data will include URLs, fragments of data returned from websites (like images), the occasional password, cookies, fragments of e-mails, and so on. However, the quantity of this information will be low compared to the total number of packets sniffed by Google.
That's the core of this problem. Google sniffed packets, only caring about MAC addresses and SSIDs, but when somebody did an audit, they found that the captured packets occasionally contained more data, such as URLs and e-mail fragments.
Google captured very little as it drove through neighborhoods. The primary reason is that most people encrypt their connections (by putting a password on their access-point). The second reason is that the car is only near an access-point for a few seconds - during which time it's unlikely that any data is being transferred.
You can verify this yourself (assuming it's legal in your area). You can download a version of Linux full of security tools like "BackTrack 4". You don't have to install it on your laptop, but instead, can put it on a USB flash drive, and boot from the flash drive. You can then run a tool like "ferret" that will sniff the wifi and show you interesting private information, like URLs. (Only half the wifi devices support such raw sniffing, so you may have to buy a separate USB wifi stick for $10.)
If you drive down the street running 'ferret', you'll see that it almost never shows you any information other than wifi control traffic (like Beacons).
The real reason Google might have data payload isn't from neighborhoods at all, but from cyber cafes and hotspots. If a Google Street View car came down the street near this Panera, it would be flooded with data packets from people within the café. This is the slowest period of the day, but I count 7 people using their laptops and one person using their iPhone to surf the web. Moreover, whereas people may encrypt their traffic at home, the hot spot here at Panera is unencrypted.
PROTECT YOURSELF, DON'T PUNISH GOOGLE
It's really easy to protect your data: simply turn on WPA. This completely stops Google (or anybody else) from spying on your private data (assuming you haven't done something stupid like chosen an easily guessed password, or chosen WEP instead of WPA). If you don't encrypt your traffic, then by implication, you don't care if people eavesdrop on it.
Laws against this won't stop the bad guys (hackers). They will only unfairly punish good guys (like Google) whenever they make a mistake.
HOW TO FIX IT
Google can easily get rid of the payloads by "slicing" data packets to the first 24 bytes. This preserves the MAC address and signal strength information Google wants, but gets rid of any private information inadvertently gathered from people who do not encrypt their connection.
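Slicing is trivial to implement, which is part of the point. The sketch below is a minimal illustration, assuming each captured frame starts at the 802.11 header; the names `HEADER_LEN` and `slice_frame` are made up for this example. It also assumes the signal strength is recorded by the capture tool alongside the frame rather than inside these 24 bytes.

```python
HEADER_LEN = 24  # frame control, duration, three MAC addresses, sequence number


def slice_frame(frame: bytes, keep: int = HEADER_LEN) -> bytes:
    """Keep only the first `keep` bytes of a captured frame.

    The three address fields (bytes 4..21) survive; everything after
    the header, including any URL or e-mail fragment, is discarded
    before the frame is ever written to disk.
    """
    return frame[:keep]


# Hypothetical captured data frame: header followed by an unencrypted payload.
header = (
    b"\x08\x01\x00\x00"
    + bytes.fromhex("001a2b3c4d5e")  # access-point
    + bytes.fromhex("665544332211")  # laptop
    + bytes.fromhex("0a0b0c0d0e0f")  # final destination
    + b"\x00\x00"
)
captured = header + b"GET /private/page.html HTTP/1.1"
stored = slice_frame(captured)
```

After slicing, `stored` still identifies the access-point and the laptop, but the URL in the payload is gone.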
EDITORIAL: THE NEED FOR TRANSPARENCY
This situation was only found because somebody audited Google's data. Just because they have no evil intentions doesn't mean they haven't made an evil mistake. The more Google becomes our overlord, the more we should demand that they be open and transparent about what data they are keeping about people.
What I've focused on here is the question of Google's collection of "data" payloads, not the other privacy issues. Some people have accused Google of lying, and of having some nefarious purpose for gathering these packets. However, anybody who has experience in WiFi mapping would believe Google. Data packets help Google find more access-points and triangulate them, yet the payload of the packets does nothing useful for Google because it consists only of fragments.