Wednesday, January 30, 2008

Collective analysis of AOL Search Data

Prof. Chen has brought up the issue of privacy as it pertains to contributed content in social web applications. Most social web applications we have discussed involve users who willingly provide content and thus have little expectation of privacy. Other Internet applications and transactions lead users to believe their contributions are largely private, or meaningless when considered out of context. I believe it is important for users to realize that their data and transactions, when analyzed in aggregate, can be much more telling than, e.g., a single blog post. As an example of a privacy breach, Prof. Chen mentioned the AOL search database that was released in 2006 (a mistake, in hindsight). This data set provides a dramatic example of how much Internet companies can learn about users if they wish to, in this case by looking at an individual user's sequence of search terms.
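To make the "sequence of search terms" point concrete, here is a minimal sketch of how little work it takes to turn a flat search log into per-user histories. It assumes a tab-separated log with a header row containing `AnonID`, `Query`, and `QueryTime` columns; the function name `queries_by_user` and the sample rows are hypothetical and made up for illustration, not taken from the released data.

```python
import csv
import io
from collections import defaultdict


def queries_by_user(log_file):
    """Group queries by anonymized user ID, ordered by timestamp.

    Assumes a tab-separated log with AnonID, Query and QueryTime columns
    (an assumed layout), and timestamps that sort correctly as strings.
    """
    reader = csv.DictReader(log_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    history = defaultdict(list)
    for row in reader:
        history[row["AnonID"]].append((row["QueryTime"], row["Query"]))
    # Sort each user's entries by time and keep only the query text.
    return {user: [query for _, query in sorted(entries)]
            for user, entries in history.items()}


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "AnonID\tQuery\tQueryTime\n"
    "1001\tdog that barks all night\t2006-03-02 21:40:11\n"
    "2002\tcheap flights\t2006-03-03 08:05:00\n"
    "1001\tlandscapers in springfield\t2006-03-01 09:15:42\n"
    "1001\thomes for sale springfield\t2006-03-05 14:02:30\n"
)

profiles = queries_by_user(sample)
for user, queries in profiles.items():
    print(user, queries)
```

No single query here says much, but the ordered list for one ID already hints at where that person lives and what they are dealing with, which is the aggregate effect described above.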

A secondary point of interest is the power of collective analysis of, and commentary on, large data collections like this one. I recall that a number of socially interactive web sites sprang up to share this data set and to direct people to the most interesting examples. I've not found the exact site I recall visiting in 2006, but appears to be the best example now (despite being affected by some level of spam). [Be forewarned: there is definitely information of an adult nature in this dataset.] The site allows the public to tag individual searchers and to provide "psychoanalytic" commentary. Other sites are out there as well. These sites are not that great, but they at least demonstrate the power of collaborative data analysis, which in this case allowed many individual searchers to be identified rapidly, raising everyone's awareness of the seriousness of the breach.

