How secure is your Social Security Number? If your answer is "Very: I only ever give it to organizations who are entitled to know it", that may not be as safe as it sounds. Of course, there are a couple of fairly generic issues:

  • some legitimate organizations may ask for it, perhaps for their own convenience, even though they are not actually "entitled" to it
  • the difference between legitimate and illicit organizations (web pages, URLs etc.) is not always as pronounced as you might think - otherwise, we wouldn't have to worry about phishing

There are other issues, though. A paper for the Proceedings of the National Academy of Sciences called "Predicting Social Security numbers from public data" by Alessandro Acquisti and Ralph Gross claims that an SSN is relatively quick and easy to guess if you have other information relating to the individual's place and date of birth.

This shouldn't actually surprise you unless you think of your SSN as being like a password, which isn't how it works. But what's the difference? The point of a password, though we don't necessarily think of it like this, is to achieve a balance between randomness and convenience. In fact, many sites use a password generator which simply puts together random characters in a string. If randomness (or entropy, if you want to sound as if you know what you're talking about in conversation with information theorists) was the only consideration, then that would be a pretty good mechanism.
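As a rough illustration of the "random characters in a string" approach, here is a minimal sketch using Python's standard secrets module. The function name generate_password, the ALPHABET constant and the default length of 16 are my own choices for the example, not a description of how any particular site does it.

```python
import secrets
import string

# Illustrative alphabet: letters, digits and punctuation (an assumption,
# real sites often restrict which characters are allowed).
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def generate_password(length: int = 16) -> str:
    """Return `length` characters, each drawn independently from ALPHABET."""
    # secrets.choice uses a cryptographically strong random source.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Calling generate_password(20) returns a different 20-character string each time, which is excellent for entropy and rather less so for memorability.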

In practice, there are a number of other considerations: for instance, whether it's straightforward enough to remember, the maximum length of key, and the variety of characters you can use. For example, a four-digit PIN (Personal Identification Number) can be broken very quickly indeed, all things being equal. All you need to do is compile a list of every possible combination of digits between 0000 and 9999 and try each one till you get to the one that works. (Of course, this assumes that you can just keep trying till you find the right combination: if access is locked after three tries, as is often the case with ATMs - Automatic Teller Machines - that's rather more secure: the attacker needs to force or trick the target into giving up his PIN, or intercept an authentication transaction somehow, to stand much chance of breaking in.)
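To make the arithmetic concrete, here is a toy sketch of that exhaustive search, run against a local check function rather than any real system. The names all_pins and brute_force and the max_attempts parameter (standing in for a lockout policy) are hypothetical, introduced only for this example.

```python
from itertools import product


def all_pins():
    """Yield every four-digit PIN from '0000' to '9999', in order."""
    for digits in product("0123456789", repeat=4):
        yield "".join(digits)


def brute_force(check, max_attempts=None):
    """Try PINs in order against `check`.

    Returns (pin, attempts) on success, or (None, attempts) if the
    lockout limit `max_attempts` is reached first.
    """
    attempts = 0
    for pin in all_pins():
        if max_attempts is not None and attempts >= max_attempts:
            return None, attempts  # locked out before finding the PIN
        attempts += 1
        if check(pin):
            return pin, attempts
    return None, attempts
```

With no lockout, brute_force(lambda p: p == "4821") finds the PIN on attempt 4,822 out of a possible 10,000. With max_attempts=3, the same search gives up after three tries, which is why the lockout matters far more than the PIN length here.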

Moving away from ATMs, though, if you can use a wider variety of characters (alphanumerics, punctuation and whitespace) and a longer passphrase, that makes a considerable difference. In previous blogs on this issue (and a white paper we have in preparation on the subject), Randy and I have mentioned a number of strategies for generating stronger passwords. Most of them involve introducing an element of (pseudo-)randomization because the more "random" the passphrase, the harder it is to guess (for humans or for computers).
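The difference that alphabet size and length make can be estimated with a one-line formula: a string of L characters, each chosen uniformly and independently from N possibilities, has L x log2(N) bits of entropy. The sketch below assumes exactly that idealized random choice (human-chosen passphrases will score lower), and the function name entropy_bits is mine.

```python
import math


def entropy_bits(alphabet_size: int, length: int) -> float:
    """Entropy in bits of `length` symbols chosen uniformly and
    independently from an alphabet of `alphabet_size` symbols."""
    return length * math.log2(alphabet_size)
```

On that assumption, a four-digit PIN (entropy_bits(10, 4)) comes to about 13.3 bits, while a 16-character passphrase drawn from the 95 printable ASCII characters (entropy_bits(95, 16)) comes to about 105 bits.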

However, a Social Security Number is not like a password. In fact, it's essentially a database primary key, an identifier that is unique to you, and the most practical way of generating such a key is often to enhance predictability, not to reduce it. A primary key is often just a numeric value incremented automatically: for example, if the primary key for the first record is 1, the key for the second is 2, the key for the tenth is 10, and so on - think autonumbering in Microsoft Access. For an attacker trying to guess the key for a victim's record, that might still be fairly random: if you're aiming to exploit a database known to have millions of entries, you may not be able to start to guess where in the sequence your victim's identifier is. (And in fact, limitations in an autonumbering mechanism could introduce additional complications in this context.)
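For readers who haven't met autonumbering, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database rather than Microsoft Access. The table name person and the helper insert_people are hypothetical, chosen just for the illustration.

```python
import sqlite3


def insert_people(names):
    """Insert each name into a fresh in-memory table and return the
    automatically assigned primary keys, in insertion order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
    )
    ids = []
    for name in names:
        cur = conn.execute("INSERT INTO person (name) VALUES (?)", (name,))
        ids.append(cur.lastrowid)  # the key the database just assigned
    conn.close()
    return ids
```

insert_people(["Ann", "Bob", "Cy"]) returns [1, 2, 3]: the keys are unique, but there is nothing remotely secret about them.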

However, for really big databases, it's often convenient for the administrator to include information in the key that's specifically intended to reduce entropy, by giving more information than simple sequential numbering. Acquisti and Gross claim that there is a correlation between an SSN and the birthdate of its "owner" that makes it feasible to infer the SSN, given knowledge of that birthdate. They were assisted by public access to the Social Security Administration's Death Master File, which allowed them to determine SSN allocation patterns based on the zip code of the holder's birthplace and the date of issue.
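The effect of building meaning into a key can be shown with some simple arithmetic. The sketch below uses an entirely hypothetical nine-digit key split into "region", "batch" and "serial" fields; the field names and the candidate counts are illustrative assumptions, not figures from the paper.

```python
import math


def search_space(candidates_per_field):
    """Number of keys left to try: the product of the number of
    candidate values remaining for each field."""
    return math.prod(candidates_per_field.values())


# Hypothetical layout: 3-digit region + 2-digit batch + 4-digit serial.
# An attacker who knows nothing must consider every value of each field.
unknown = {"region": 1000, "batch": 100, "serial": 10_000}

# Assumed scenario: the region is known and the batch has been narrowed
# to two candidates from outside information; the serial is still unknown.
informed = {"region": 1, "batch": 2, "serial": 10_000}
```

Here search_space(unknown) is 1,000,000,000 (about 30 bits), while search_space(informed) is 20,000 (about 14 bits). Every field that can be inferred from outside information removes its share of the guesswork.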

Unfortunately, as we may have mentioned before, that sort of information is all too easy to find online nowadays, when so many people make it available on social networking sites like Facebook. Acquisti and Gross say that "Our results highlight the unexpected privacy consequences of the complex interactions among multiple data sources in modern information economies and quantify privacy risks associated with information revelation in public forums." What they're really talking about here is an example of what we sometimes call data aggregation: while each datum may be harmless in isolation, the aggregation of data from different sources gives away more information than an unwary target would expect. The whole is greater than the sum of the parts.

The sky is not falling. Knowing your SSN is not necessarily enough to let an attacker take control of your finances or steal your identity: it depends on the context and on what other information he has. And as the Social Security Administration has rightly, if evasively, pointed out, Acquisti and Gross have not "cracked a code for predicting an SSN". What they have done is make it feasible to predict the SSN for some people, given a sufficiency of resources. According to the LA Times, they were able "to identify all nine digits for 8.5% of people born after 1988 in fewer than 1,000 attempts. For people born recently in smaller states, researchers sometimes needed just 10 or fewer attempts to predict all nine digits."

While most sites that use SSNs to verify a user's identity will not allow unlimited attempts, a botherder can use multiple machines to access multiple resources, such as online credit approval services, to test numbers. Botnets are most often associated in the media with spam distribution and distributed denial of service (DDoS) attacks, but they can be (and are) used for all sorts of distributed computing tasks like this, if the attack surface is big enough or vulnerable enough.

The Social Security Administration has said that it is moving over to a more randomized allocation system. (A pseudonymized approach might be one possibility, but I don't suppose they're going to tell us exactly what they're going to do.) That isn't necessarily going to fix the whole problem, though.

  1. It may not directly affect the fact that SSNs are often used where they're not really appropriate (i.e. for authentication rather than identification) or are not properly protected: that's probably not going to change.
  2. It might not help all the people whose SSNs are vulnerable right now: generally, government departments are not eager to change identifiers that are supposed to be "for life" for millions of people.
  3. It's certainly not going to fix the fact that so many people of all generations are increasingly likely to publish data that probably shouldn't be so widely known, like birthdates, on social networking sites.

I came across the following resources while I was researching this topic:

http://blogs.discovermagazine.com/80beats/2009/07/07/researchers-guess-social-security-numbers-from-public-data/
http://www.informationweek.com/news/security/privacy/showArticle.jhtml?articleID=218400854
http://www.pittsburghlive.com/x/pittsburghtrib/news/pittsburgh/s_632528.html

I should also mention the Securing Our eCity community initiative, which tries to address issues like this that affect the public: http://www.securingourecity.org/

David Harley
Director of Malware Intelligence