July 21, 2013
In a new paper, recently published in the open-access journal PLOS ONE, Benjamin Mako Hill and I build on recent research in survey methodology to describe a method for estimating bias in opt-in surveys of contributors to online communities. We use the technique to re-evaluate the most widely cited estimate of the gender gap in Wikipedia.
A series of studies has shown that Wikipedia’s editor base is overwhelmingly male. This extreme gender imbalance threatens to undermine Wikipedia’s capacity to produce high-quality information from a full range of perspectives. For example, many articles on topics of particular interest to women tend to be under-produced or of poor quality.
Given the open and often anonymous nature of online communities, measuring contributor demographics is a challenge. Most demographic data on Wikipedia editors come from “opt-in” surveys where people respond to open, public invitations. Unfortunately, very few people answer these invitations. Results from opt-in surveys are unreliable because respondents are rarely representative of the community as a whole. The most widely cited estimate, from a large 2008 survey by the Wikimedia Foundation (WMF) and UN University in Maastricht (UNU-MERIT), suggested that only 13% of contributors were female. However, the very same survey suggested that fewer than 40% of Wikipedia’s readers were female. We know, from several reliable sources, that Wikipedia’s readership is evenly split by gender, so this discrepancy is a sign of bias in the WMF/UNU-MERIT survey.
In our paper, we combine data from a nationally representative survey of the US by the Pew Internet and American Life Project with the opt-in data from the 2008 WMF/UNU-MERIT survey to come up with revised estimates of the Wikipedia gender gap. The details of the estimation technique are in the paper, but the core steps are:
- We use the Pew dataset to provide baseline information about Wikipedia readers.
- We apply a statistical technique called “propensity scoring” to estimate the likelihood that a US adult Wikipedia reader would have volunteered to participate in the WMF/UNU-MERIT survey.
- We follow a process originally developed by Valliant and Dever to weight the WMF/UNU-MERIT survey to “correct” for estimated bias.
- We extend this weighting technique to Wikipedia editors in the WMF/UNU-MERIT data to produce adjusted estimates of the demographics of their sample.
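To give a feel for the steps above, here is a minimal sketch in Python. It is not the code from the paper, and it is a deliberate simplification: instead of a propensity model fitted on many covariates, it estimates a propensity within each demographic cell from the pooled samples and then weights opt-in respondents by the inverse odds. The function names, the single "F"/"M" cell variable, and all of the counts are made up for illustration, and the sketch ignores the survey weights that a reference sample would normally carry.

```python
from collections import Counter


def propensity_weights(reference_cells, optin_cells):
    """Return one inverse-odds weight per opt-in respondent.

    Within each cell, the propensity to appear in the opt-in sample is
    estimated as n_optin / (n_optin + n_reference) in the pooled data.
    The weight (1 - p) / p then up-weights cells that the opt-in survey
    under-represents relative to the reference sample.
    """
    ref_counts = Counter(reference_cells)
    opt_counts = Counter(optin_cells)
    weights = []
    for cell in optin_cells:
        n_opt = opt_counts[cell]
        n_ref = ref_counts.get(cell, 0)
        propensity = n_opt / (n_opt + n_ref)
        weights.append((1 - propensity) / propensity)
    return weights


def weighted_share(flags, weights):
    """Weighted proportion of respondents whose flag is True."""
    total = sum(weights)
    return sum(w for flag, w in zip(flags, weights) if flag) / total


# Toy data (invented): a reference sample of readers that is evenly split,
# and an opt-in sample of readers that skews male.
reference = ["F"] * 50 + ["M"] * 50
optin = ["F"] * 30 + ["M"] * 70

# Invented editor flags for the opt-in respondents:
# 3 of the 30 women and 20 of the 70 men report editing.
is_editor = [True] * 3 + [False] * 27 + [True] * 20 + [False] * 50
is_female = [cell == "F" for cell in optin]

weights = propensity_weights(reference, optin)

# After weighting, opt-in readers match the reference gender split.
adjusted_reader_share = weighted_share(is_female, weights)

# Apply the same weights to the editors only.
editor_flags = [f for f, e in zip(is_female, is_editor) if e]
editor_weights = [w for w, e in zip(weights, is_editor) if e]
raw_editor_share = sum(editor_flags) / len(editor_flags)
adjusted_editor_share = weighted_share(editor_flags, editor_weights)

print(f"female readers, adjusted: {adjusted_reader_share:.3f}")
print(f"female editors, raw: {raw_editor_share:.3f}")
print(f"female editors, adjusted: {adjusted_editor_share:.3f}")
```

With these invented numbers the adjusted share of female editors comes out higher than the raw share, which is the same direction as the correction reported in the paper. The size of the shift here means nothing, since the data are fabricated for the example.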
Using this method, we estimate that the proportion of female US adult editors was 27.5% higher than the original study reported (22.7%, versus 17.8%), and that the total proportion of female editors was 26.8% higher (16.1%, versus 12.7%). These findings are consistent with other work showing that opt-in surveys tend to undercount women.
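Note that the "higher" figures above are relative increases, not percentage-point differences. A quick check of the arithmetic, using only the four percentages quoted above (the helper name is my own):

```python
def relative_increase(adjusted, original):
    """Percent by which `adjusted` exceeds `original`."""
    return (adjusted - original) / original * 100


us_adults = relative_increase(22.7, 17.8)    # female US adult editors
all_editors = relative_increase(16.1, 12.7)  # female editors overall

print(f"US adults: {us_adults:.1f}% higher")      # 27.5% higher
print(f"All editors: {all_editors:.1f}% higher")  # 26.8% higher
```

In percentage points, the adjustments are 4.9 points for US adult editors and 3.4 points for all editors.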
Overall, these results reinforce the basic substantive finding that women are vastly under-represented among Wikipedia editors.
Beyond Wikipedia, our paper describes a method that online communities can adopt to estimate contributor demographics from opt-in surveys in a way that is more credible than relying entirely on opt-in data. Advertising-intelligence firms like ComScore and Quantcast provide demographic data on the readership of an enormous proportion of websites. With these sources, almost any community can use our method (and source code) to replicate a similar analysis by: (1) surveying a community’s readers (or a random subset) with the same instrument used to survey contributors; (2) combining results for readers with reliable demographic data about the readership population from a credible source; (3) reweighting survey results using the method we describe.
Although our new estimates will not help us close the gender gap in Wikipedia or address its troubling implications, they give us a better picture of the problem. Additionally, our method offers an improved tool to build a clearer demographic picture of other online communities in general.
January 21, 2013
- Kuler (via Liz Gerber) – Adobe-owned site for user-uploaded color combinations. Elegant color combinations can be wonderful tools for data visualization, graphic design, and more. In this case, the names provided for the combinations are more colorful than the combinations.
- Metro Chicago Data – A handy repository of public datasets from various government agencies and other public sources.
- Chicago Crime Map – Crime statistics with a friendly interface. Provided by the Chicago Tribune.
- Between the Bars – Brilliant prison blogging project initiated (as I understand it) by Charlie de Tar at the MIT Center for Civic Media. Best of all, you can help transcribe posts!
- Bikenapped (via Mako) – A site mapping bicycle theft data around the Boston area. This should exist for every city.
January 14, 2013
Aaron Swartz’s suicide over the weekend is a tragedy. His death has affected many people very deeply, including many of my friends who were very close with Aaron.
Personally, I did not know Aaron well, but I regard him as an inspiration – as much for his quiet thoughtfulness and kindness as for his amazing achievements, intellect, projects, and democratic (small “d”) ideals.
I don’t have much to add to some of the heartfelt responses many people (including Cory Doctorow, Larry Lessig, and Matt Stoller) have posted elsewhere; however, as I have thought and read about Aaron over the past couple of days, I have decided that I want to commemorate his life and work through some concrete actions. Specifically, I have made some vows to myself about how I want to live, work, and relate to people in the future. Most of these vows are fundamentally democratic in spirit, which is part of what I find so inspiring about so much of Aaron’s work. Not all of my commitments are coherent enough or sensible enough to list here, but I will put one out there as a public tribute to Aaron:
I will promote access to knowledge by ensuring that as much of my work as possible is always available at no cost and under minimally restrictive licenses that ensure ongoing access for as many people in as many forms as possible. I will also work to convince my colleagues, students, publishers, and elected or appointed representatives that they should embrace and promote a similar position.
This is a very small and inadequate act given the circumstances.
August 12, 2012
The news that during the creation of Bain Capital Mitt Romney sought large investments from some of El Salvador’s notorious “oligarchs” should not be particularly surprising given the extent of the support the US government and private sector provided to the Salvadoran government and political elites during the country’s civil war from 1980-92.
In the grand narrative of the 2012 presidential campaign, I suspect this story will figure as (at most) a very minor footnote. Nevertheless, I wanted to draw attention to it because of my personal connections to El Salvador and the persistence of that country’s civil war in the lives of so many of its residents.
During and after my first year of college, I visited El Salvador for almost 9 months. At the time, I worked with a small community organization called Grupo Tamarindo (that would later be known as the Tamarindo Foundation – warning: link is to their Facebook page).
The Tamarindo Foundation online presence doesn’t really do the group justice. By the standards of the nonprofit and NGO sector, it’s a tiny organization with almost no budget and even less in the way of public relations materials to document what its members have accomplished in the group’s nearly 20 years of existence.
During that time, the group and its leader John Guiliano have sought to rebuild the town of Guarjila, El Salvador and, in particular, to create opportunities for the town’s young people to pursue education, arts, athletics, and entrepreneurship.
The Tamarindo has made progress towards all of these goals and more while working on a shoestring budget, microscopic scale, and generational time-horizon. That said, any progress has been slow, and the underlying cycles of poverty and violence that make life in El Salvador and in Guarjila so precarious persist.
In this way, over twenty years since the Salvadoran civil war ended, its effects can still be felt, whether in the gangs that were formed in U.S. prisons by war refugees or in the lack of adequate educational and career opportunities that lead so many of El Salvador’s young people to seek employment and security through illegal migration.
The current goal of the Tamarindo is to build a community center where the organization can continue to do its work. In order to support this goal, John has started a bicycle ride across the United States. After handing in my dissertation draft last Sunday, I joined the first three days of the ride from Boston to New Haven. You can follow the ride through John’s Facebook updates and donate (via their ImAthlete page). The itinerary includes stops across the country and, if you want to learn more about the Tamarindo, the ride, or El Salvador, I encourage you to contact John and attend one of the tour events.
July 29, 2012
Crowdsourcing, outsourcing, and other sorts of distributed work have long since made inroads into professional journalism, but a recent scandal involving a few major metropolitan newspapers outsourcing their local reporting to a company named Journatic reveals the scope and extent of those inroads.
Since This American Life first broke the story a couple of weeks ago, the details of the Journatic story have made their way all over the Internet (See, e.g., coverage from Poynter, Romenesko, and Gigaom for some of the more thoughtful examples).
The basics are straightforward: Journatic is a company that specializes in generating content for a variety of purposes, among them local news stories (they also have a sister company called Blockshopper that provides a similar service for real estate listings). It seems that typically a client – say, a major U.S. newspaper like the Chicago Tribune, for example – contracts with Journatic, which then hires dozens of independent subcontractors (mainly in the Philippines and the U.S.) who construct and edit hyperlocal news items in a distributed, piecemeal fashion before passing the finished product back to the client for publication.
You can get a much better feel for the process by listening to the TAL interviews with Journatic editor Ryan Smith, or by reading Smith’s tendentious editorial about his experience (he has subsequently quit working for Journatic, although – interestingly – he was not fired or even reprimanded for his efforts to publicly criticize the company’s practices and products).
The stickiest part of the scandal seems to be that the Trib, along with several other major metropolitan dailies (the San Francisco and Houston Chronicles as well as the Chicago Sun-Times), had been printing these stories under false by-lines (such as Jake Barnes – the name of a famous Hemingway character), which violates the paper’s own ethical standards.
I find the story pretty engaging for several reasons:
First, the fact that Journatic figured out how to crowdsource journalism is actually pretty impressive. Some friends at CMU have been trying for a while now to generate magazine-style writing using workers on Amazon’s Mechanical Turk. Likewise, I’d like to develop and test methods for crowdsourcing peer review of academic papers. Apparently, the folks at Journatic have already solved many of the practical problems involved in performing a complex knowledge-based task like reporting using a globally distributed workforce of highly variable skill.
Second, despite the rhetoric surrounding the story, Journatic is neither the end of journalism as we know it nor its salvation. While I share the concerns voiced by Smith, TAL reporter Sarah Koenig, and others over the wages paid to Journatic’s Filipino contractors as well as the confusion about the Tribune’s apparent willingness to buck its own editorial policies about attribution in this case, these issues need to be distinguished from questions about whether crowdsourcing is “bad” or “good” for the future of media. I believe the emergence of companies specializing in crowdsourced journalism is merely another wrinkle in a complex organizational ecosystem where incumbent firms are struggling to retain some sort of comparative, competitive advantage in the face of declining revenues. When you consider Journatic in the context of other experiments in crowdsourced journalism, such as some of ProPublica’s distributed reporting projects, CNN’s iReports, or even the political blogosphere, paying workers around the world to assemble stories sounds less like a violation of basic journalistic principles and more like the latest in a long line of process innovations that might or might not help to reinvent the field.
Last, but not least, many of us (myself included) may not like the fact that the cost of local news coverage has exceeded the demand in many places, but I think there’s got to be a more effective response than petitioning Sam Zell to stop outsourcing. Instead, I’d like to see a combined effort to improve Journatic’s models of content production in order to (1) address the ethical concerns raised in the Tribune scandal; (2) improve the quality of coverage in order to correct some of the terrible reporting practices documented by Smith in his op-ed; and (3) more effectively integrate teams of remote and on-site local reporters.
Ultimately, you can’t ignore the fact that Journatic smells bad. They paid off contractors not to talk to the media a few months ago, provide SEO and content farm services on the backs of cheap overseas labor, and when faced with complaints about the fact that their real estate listing service, BlockShopper, violated people’s privacy, they responded by issuing a Zuckerbergian declaration against expectations of privacy online and hiding the identities of their writers. Oh, and they also hide their company’s website from Google’s robots (go to http://journatic.com and use the “view source” option in your browser to see their robots.txt policy).
That said, the whole situation offers a chance to think about what a more responsible, ethical, and constructive version of crowdsourced journalism could look like. For that reason alone, I think Journatic deserves even more attention than it has already received.