posted by Marian-Andrei Rizoiu

The digital traces that users leave behind online reveal more about them than they might like. As our recent WSDM’16 paper shows, machine learning algorithms can uncover hidden links between a user’s past activity and her private traits (such as gender, education level or religious views), even for retired users.

The problem

The cumulative effect of collective online participation has an important and adverse impact on individual privacy. As an online system evolves over time, new digital traces of individual behavior may uncover previously hidden statistical links between an individual’s past actions and her private traits. To quantify this effect, we analyze the evolution of individual privacy loss by studying the edit history of Wikipedia over 13 years, including more than 117,523 different users performing 188,805,088 edits. We describe each Wikipedia contributor using apparently harmless features, such as the number of edits performed on predefined broad categories in a given time period (e.g. Mathematics, Culture or Nature). We show that even at this coarse level of behavior description, it is possible to use off-the-shelf machine learning algorithms to uncover usually undisclosed personal traits, such as gender, religion or education. We provide empirical evidence that the prediction accuracy for almost all private traits consistently improves over time. Surprisingly, the prediction performance still improves for users who stopped editing after a given time. The activities performed by new users seem to have contributed more to this effect than additional activities from existing (but still active) users. Insights from this work should help users, system designers, and policy makers understand and make long-term design choices in online content creation systems.
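To make the feature description above concrete, here is a minimal sketch of counting a user's edits per broad category within a time window. It is an illustration only, not the paper's pipeline: the function name `edit_count_features`, the `(user, category, year)` record layout, the three-item category list and the toy edit log are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical category list; the post names these three only as examples.
CATEGORIES = ["Mathematics", "Culture", "Nature"]

def edit_count_features(edits, start_year, end_year):
    """Count each user's edits per category within [start_year, end_year]."""
    counts = defaultdict(Counter)
    for user, category, year in edits:
        if start_year <= year <= end_year and category in CATEGORIES:
            counts[user][category] += 1
    # One fixed-length vector per user, ordered like CATEGORIES.
    return {user: [c[cat] for cat in CATEGORIES] for user, c in counts.items()}

# Made-up toy edit log: (user, category, year).
toy_edits = [
    ("alice", "Mathematics", 2005),
    ("alice", "Mathematics", 2006),
    ("alice", "Nature", 2009),
    ("bob", "Culture", 2006),
    ("bob", "Culture", 2007),
]

# Features for the window 2005-2008; alice's 2009 edit falls outside it.
features = edit_count_features(toy_edits, 2005, 2008)
print(features)
```

Vectors like these carry no explicit personal information, which is what makes them look harmless; they become revealing only once a classifier is trained on many users whose traits are known.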

Sample results

Privacy Loss is evaluated as the ability to predict hidden personal traits from simple records of past activity (i.e. the number of page edits within a given interval). Increasing prediction accuracy implies a loss of privacy.
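As a rough illustration of this evaluation idea, the sketch below measures how well a trait can be predicted from edit-count vectors at two points in time. Everything in it is an assumption made for the example: the nearest-centroid classifier is a simple stand-in for the off-the-shelf algorithms used in the paper, leave-one-out accuracy is just one possible protocol, and the two snapshots with trait values "A" and "B" are made-up toy data.

```python
def nearest_centroid_predict(train, query):
    """Return the label whose mean feature vector is closest to `query`.

    `train` is a list of (vector, label) pairs.
    """
    sums, sizes = {}, {}
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        sizes[label] = sizes.get(label, 0) + 1

    def squared_distance(label):
        return sum((s / sizes[label] - q) ** 2
                   for s, q in zip(sums[label], query))

    # Sorting the labels makes tie-breaking deterministic.
    return min(sorted(sums), key=squared_distance)

def leave_one_out_accuracy(samples):
    """Fraction of users whose trait is predicted correctly from the others."""
    hits = 0
    for i, (vec, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        hits += nearest_centroid_predict(rest, vec) == label
    return hits / len(samples)

# Made-up snapshots of the same four users: few edits early, more edits later.
early = [([1, 1], "A"), ([2, 1], "A"), ([1, 1], "B"), ([1, 2], "B")]
late = [([5, 1], "A"), ([6, 2], "A"), ([1, 5], "B"), ([2, 6], "B")]

early_acc = leave_one_out_accuracy(early)
late_acc = leave_one_out_accuracy(late)
print(early_acc, late_acc)  # 0.5 1.0
```

In this toy setting, the rise in accuracy between the two snapshots plays the role of the privacy loss: the same users become easier to classify as more activity is recorded.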

Static behavior analysis correlates with gender: males tend to edit the content of Wikipedia articles more, while females seem to concentrate more on social interaction.

Privacy Loss over time due to the “online breadcrumbs” left behind by users (red line) compared to the Privacy Loss due to information learned from other users (blue line).

Privacy Loss occurs even for retired editors, who were active prior to 2008 (blue period) but stopped contributing afterwards.


Marian-Andrei Rizoiu, Lexing Xie, Tiberio Caetano and Manuel Cebrian. Evolution of Privacy Loss on Wikipedia, in Proceedings of the International Conference on Web Search and Data Mining (WSDM ’16), San Francisco, USA, 2016.

Download: Paper PDF + SI, Talk slides, Poster
Data: User edit behavior (82MB), Wikisample (1%) (495MB), Wikicomplete (3.6GB)
    address = {San Francisco, CA, USA},
    author = {Rizoiu, Marian-Andrei and Xie, Lexing and Caetano, Tiberio and Cebrian, Manuel},
    booktitle = {International Conference on Web Search and Data Mining},
    doi = {10.1145/2835776.2835798},
    keywords = {de-anonymization, online privacy, temporal loss of privacy},
    title = {{Evolution of Privacy Loss on Wikipedia}},
    year = {2016}

January 8, 2016

social media privacy online
