November 07, 2022 | Press Release

Is Facebook’s Advertising Data Accurate Enough for Use in Social Science?

Researchers of the Digital and Computational Demography Lab at the Max Planck Institute for Demographic Research (MPIDR) and colleagues recently published a paper in the “Journal of the Royal Statistical Society: Series A” which systematically assesses the quality of Facebook’s ads data for use in social science research. They assessed the accuracy of this data by comparing self-reported and Facebook-classified demographic information on sex, age and region of residence of more than 133,000 users recruited via an online survey. Their results suggest that Facebook’s ads data can be used when additional steps are taken to validate the accuracy of the information under consideration.

Social scientists increasingly use Facebook's advertising platform, accessed through the Facebook Ads Manager (FAM), for research, either by collecting information on users and their digital traces or by recruiting participants for surveys.

The FAM provides aggregated demographic information of Facebook users such as the place of residence. For example, social scientists leverage information on the number of Facebook users who live abroad to estimate the number of certain immigrant groups in a given country. Information available via the FAM is particularly useful in times of crisis, as it is more readily and immediately available than traditional censuses and registers.

To recruit survey participants from a specific group, e.g. international migrants, researchers use the FAM to target users who live abroad. This makes it easy and cost-effective to recruit members of small and traditionally hard-to-reach subpopulations.

Both approaches – using digital traces and conducting surveys – depend on the accuracy of the data that Facebook provides about its users. But little is known about how accurate this data is.

How to indirectly assess the accuracy of Facebook’s user classification

“Assessing its accuracy remains difficult as researchers usually have no access to the algorithms that Facebook uses to classify its users as belonging to certain demographic groups or having certain interests”, says MPIDR Director Emilio Zagheni.

Researchers therefore need to find ways to indirectly assess the accuracy of Facebook’s user classification. “We address this gap by leveraging the COVID-19 Health Behavior Survey (CHBS) that we conducted via targeted Facebook ads. More than 133,000 respondents from eight countries participated in this large-scale, anonymous, cross-national online survey that we conducted to study behaviors and attitudes during the early months of the COVID-19 pandemic, from March to August 2020”, says André Grow, a former MPIDR Researcher.

The participants were recruited via the FAM, which split the users into a large number of non-overlapping demographic subgroups, to maximize representativeness of results. This aspect made the survey well suited to assess the accuracy of Facebook’s user data in a cross-national, comparative way.

Facebook’s classification was largely reliable

“We were able to indirectly assess how often self-reported young or female survey participants were correctly classified by the FAM as being young or female, and how the level of correct classification differed across countries”, says Daniela Perrotta. The MPIDR Researcher and her colleagues quantified the number of matches between users’ survey answers and Facebook’s classification in three central demographic characteristics: sex, age, and region of residence within a country.

They found that Facebook’s classification was largely reliable, though accuracy differed across the three characteristics and across countries. Depending on the country, between 86% and 93% of respondents were correctly classified on all three characteristics. The share of completely correct classifications was lowest in Belgium and France, and highest in the Netherlands. Misclassifications were most likely to occur for region of residence and least likely to occur for sex.
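
The match rates described above can be illustrated with a minimal sketch. The records, field names and the `match_rates` helper below are hypothetical and invented for illustration; they are not the study's data or code, and the published analysis may differ in its details. The sketch simply compares each respondent's self-reported value with the category assigned by Facebook and reports the share of matches per characteristic and on all three at once.

```python
# Hypothetical toy records: "self" is what a respondent reported in the survey,
# "fb" is the category Facebook's ad targeting assigned. All values are invented.
records = [
    {"self": {"sex": "f", "age": "18-24", "region": "north"},
     "fb":   {"sex": "f", "age": "18-24", "region": "north"}},
    {"self": {"sex": "m", "age": "25-44", "region": "south"},
     "fb":   {"sex": "m", "age": "25-44", "region": "north"}},  # region mismatch
    {"self": {"sex": "f", "age": "45-64", "region": "south"},
     "fb":   {"sex": "f", "age": "45-64", "region": "south"}},
    {"self": {"sex": "m", "age": "18-24", "region": "north"},
     "fb":   {"sex": "m", "age": "25-44", "region": "north"}},  # age mismatch
]

TRAITS = ("sex", "age", "region")


def match_rates(rows):
    """Share of rows where self-report and Facebook category agree,
    per characteristic and on all characteristics jointly."""
    n = len(rows)
    rates = {
        trait: sum(r["self"][trait] == r["fb"][trait] for r in rows) / n
        for trait in TRAITS
    }
    rates["all"] = sum(
        all(r["self"][t] == r["fb"][t] for t in TRAITS) for r in rows
    ) / n
    return rates


rates = match_rates(records)
for name, value in rates.items():
    print(f"{name}: {value:.0%}")
```

Running the same calculation separately for each country would give the kind of cross-national comparison the researchers describe.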

Why was the error rate for sub-national region of residence higher than for sex and age? One possible explanation is that Facebook’s gender and age classifications are largely based on self-reported information that is unlikely to change, or changes predictably, over time. By contrast, users’ region of residence is inferred by Facebook through looking at data such as mobile phone location, and may change frequently (e.g. people commuting daily), thereby increasing the chance of incorrect classification. In fact, most of the incorrect region classifications involved people who reported living in regions that were adjacent to those to which they were incorrectly assigned by Facebook.

“Our key suggestion to other researchers is to assess the accuracy of Facebook’s user classification for any characteristic they are interested in before launching a survey”, says Daniela Perrotta. She adds, “In this way, they can safeguard against bias in population estimates, and devise appropriate strategies to reduce the bias. They can also avoid excessive costs from recruiting ineligible participants to surveys.”

Original Publication

Grow, A., Perrotta, D., Del Fava, E., Cimentada, J., Rampazzo, F., Gil-Clavel, S., Zagheni, E., Flores, R.D., Ventura, I., Weber, I.: Is Facebook’s advertising data accurate enough for use in social science research? Insights from a cross-national online survey. Journal of the Royal Statistical Society: Series A (2022). DOI: 10.1111/rssa.12948