In a new paper, recently published in the open access journal PLOS ONE, Benjamin Mako Hill and I build on new research in survey methodology to describe a method for estimating bias in opt-in surveys of contributors to online communities. We use the technique to re-evaluate the most widely cited estimate of the gender gap in Wikipedia.

A series of studies have shown that Wikipedia’s editor-base is overwhelmingly male. This extreme gender imbalance threatens to undermine Wikipedia’s capacity to produce high quality information from a full range of perspectives. For example, many articles on topics of particular interest to women tend to be under-produced or of poor quality.

Given the open and often anonymous nature of online communities, measuring contributor demographics is a challenge. Most demographic data on Wikipedia editors come from “opt-in” surveys where people respond to open, public invitations. Unfortunately, very few people answer these invitations. Results from opt-in surveys are unreliable because respondents are rarely representative of the community as a whole. The most widely cited estimate, from a large 2008 survey by the Wikimedia Foundation (WMF) and UN University in Maastricht (UNU-MERIT), suggested that only 13% of contributors were female. However, the very same survey suggested that less than 40% of Wikipedia’s readers were female. We know, from several reliable sources, that Wikipedia’s readership is evenly split by gender, which is a sign of bias in the WMF/UNU-MERIT survey.

In our paper, we combine data from a nationally representative survey of the US by the Pew Internet and American Life Project with the opt-in data from the 2008 WMF/UNU-MERIT survey to come up with revised estimates of the Wikipedia gender gap. The details of the estimation technique are in the paper, but the core steps are:

  1. We use the Pew dataset to provide baseline information about Wikipedia readers.
  2. We apply a statistical technique called “propensity scoring” to estimate the likelihood that a US adult Wikipedia reader would have volunteered to participate in the WMF/UNU-MERIT survey.
  3. We follow a process originally developed by Valliant and Dever to weight the WMF/UNU-MERIT survey to “correct” for estimated bias.
  4. We extend this weighting technique to Wikipedia editors in the WMF/UNU-MERIT data to produce adjusted estimates of the demographics of their sample.
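
To make the four steps above a little more concrete, here is a minimal sketch in Python. It is not the source code from the paper, and every number in it is hypothetical toy data. It also makes two simplifying assumptions: gender is the only covariate (in which case a saturated logistic regression reduces to simple cell proportions), and the weights are inverse odds, (1 - p) / p, which is one common choice and may differ in detail from the Valliant and Dever procedure the paper follows.

```python
from collections import Counter

# Hypothetical toy data, NOT figures from the paper: one gender label per respondent.
reference_readers = ["female"] * 50 + ["male"] * 50   # representative survey of readers
optin_readers = ["female"] * 30 + ["male"] * 70       # readers who answered the opt-in survey
optin_editors = ["female"] * 10 + ["male"] * 90       # editors who answered the opt-in survey


def propensity_by_cell(reference, optin):
    """Estimate P(respondent is in the opt-in sample | cell) from the pooled samples.

    With a single categorical covariate, a saturated logistic regression
    would return exactly these cell proportions.
    """
    ref_counts, opt_counts = Counter(reference), Counter(optin)
    return {
        cell: opt_counts[cell] / (opt_counts[cell] + ref_counts[cell])
        for cell in opt_counts
    }


def inverse_odds_weights(propensity):
    """Turn each propensity p into a weight (1 - p) / p (an assumed weighting choice)."""
    return {cell: (1 - p) / p for cell, p in propensity.items()}


def weighted_share(sample, weights, cell):
    """Weighted proportion of `sample` that falls in `cell`."""
    total = sum(weights[label] for label in sample)
    in_cell = sum(weights[label] for label in sample if label == cell)
    return in_cell / total


# Steps 1-3: estimate propensities from readers, then derive correction weights.
propensity = propensity_by_cell(reference_readers, optin_readers)
weights = inverse_odds_weights(propensity)

# Step 4: apply the reader-derived weights to the opt-in editors.
raw_share = optin_editors.count("female") / len(optin_editors)
adjusted_share = weighted_share(optin_editors, weights, "female")
print(f"raw: {raw_share:.3f}, adjusted: {adjusted_share:.3f}")
```

In this toy setup the weights pull the opt-in readers back to the 50/50 split of the reference sample, and the same weights raise the female share among editors from 0.100 to roughly 0.206. A real analysis would use many covariates and a fitted model rather than cell counts.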

Using this method, we estimate that the proportion of female US adult editors was 27.5% higher than the original study reported (22.7%, versus 17.8%), and that the total proportion of female editors was 26.8% higher (16.1%, versus 12.7%). These findings are consistent with other work showing that opt-in surveys tend to undercount women.
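
Note that the "percent higher" figures above are relative increases, not percentage-point differences. A quick check of the arithmetic, using only the four proportions quoted above (the function name is just an illustrative helper):

```python
def relative_increase(adjusted, original):
    """Percent by which `adjusted` exceeds `original`."""
    return (adjusted - original) / original * 100


us_adults = relative_increase(22.7, 17.8)    # US adult editors
all_editors = relative_increase(16.1, 12.7)  # all editors
print(f"{us_adults:.1f}% and {all_editors:.1f}%")  # 27.5% and 26.8%
```

In percentage-point terms, the adjustments are 4.9 points for US adult editors and 3.4 points for all editors.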

Overall, these results reinforce the basic substantive finding that women are vastly under-represented among Wikipedia editors.

Beyond Wikipedia, our paper describes a method that online communities can adopt to estimate contributor demographics from opt-in surveys in a way that is more credible than relying entirely on opt-in data. Advertising-intelligence firms like ComScore and Quantcast provide demographic data on the readership of an enormous proportion of websites. With these sources, almost any community can use our method (and source code) to replicate a similar analysis by: (1) surveying a community’s readers (or a random subset) with the same instrument used to survey contributors; (2) combining results for readers with reliable demographic data about the readership population from a credible source; (3) reweighting survey results using the method we describe.

Although our new estimates will not help us close the gender gap in Wikipedia or address its troubling implications, they give us a better picture of the problem. Additionally, our method offers an improved tool to build a clearer demographic picture of other online communities in general.

Convention Crowd, Chicago. 1912. Library of Congress via Flickr Commons

  1. Kuler (via Liz Gerber) – Adobe-owned site for user-uploaded color combinations. Elegant color combinations can be wonderful tools for data visualization, graphic design, and more. In this case, the names provided for the combinations are more colorful than the combinations.
  2. Metro Chicago Data – A handy repository of public datasets from various government agencies and other public sources.
  3. Chicago Crime Map – Crime statistics with a friendly interface. Provided by the Chicago Tribune.
  4. Between the Bars – Brilliant prison blogging project initiated (as I understand it) by Charlie de Tar at the MIT Center for Civic Media. Best of all, you can help transcribe posts!
  5. Bikenapped (via Mako) – A site mapping bicycle theft data around the Boston area. This should exist for every city.

Aaron Swartz’s suicide over the weekend is a tragedy. His death has affected many people very deeply, including many of my friends who were very close with Aaron.

Personally, I did not know Aaron well, but I regard him as an inspiration – as much for his quiet thoughtfulness and kindness as for his amazing achievements, intellect, projects, and democratic (small “d”) ideals.

I don’t have much to add to some of the heartfelt responses many people (including Cory Doctorow, Larry Lessig, and Matt Stoller) have posted elsewhere; however, as I have thought and read about Aaron over the past couple of days, I have decided that I want to commemorate his life and work through some concrete actions. Specifically, I have made some vows to myself about how I want to live, work, and relate to people in the future. Most of these vows are fundamentally democratic in spirit, which is part of what I find so inspiring about so much of Aaron’s work. Not all of my commitments are coherent enough or sensible enough to list here, but I will put one out there as a public tribute to Aaron:

I will promote access to knowledge by ensuring that as much of my work as possible is always available at no cost and under minimally restrictive licenses that ensure ongoing access for as many people in as many forms as possible. I will also work to convince my colleagues, students, publishers, and elected or appointed representatives that they should embrace and promote a similar position.

This is a very small and inadequate act given the circumstances.

Penny-Matthias_Shapiro-cc_by

The question of whether paid crowd work violates U.S. employment and minimum wage laws may finally make it into court thanks to Christopher Otey, an Oregon resident who is suing CrowdFlower Inc. for wages he claims the company owes him as an “employee.”

You can (and should) read the full text of Otey’s complaint or coverage of the story on Crowdsourcing.org or MissionLocal.

I have a few preliminary, and mostly mixed, feelings about this. However, I should preface everything by saying that (1) I have known one of the defendants named in the suit, CrowdFlower CEO Lukas Biewald, for many years through mutual acquaintances at Stanford, where we were both enrolled at the same time; and (2) I worked as a paid, independent consultant with CrowdFlower on several projects between 2008 and 2011. That said, I have never held, nor do I hold at this time, any material interest, financial or otherwise, in the company.

My initial reaction is that I can’t believe it’s taken this long for someone, somewhere in the United States to sue one of the companies engaged in distributing paid crowdsourcing work for violation of the Fair Labor Standards Act (FLSA). Smart lawyers like Alek Felstiner and Jonathan Zittrain have been making some form of the argument that this is a major issue for crowdsourcing for at least three years now. Felstiner even made the case in a series of posts on CrowdFlower’s blog here, here, and here in 2010. I am hardly the only person to regard as remarkable the fact that a whole venture-funded industry has sprung up around a set of activities that, on the surface, seem to resemble a massive minimum wage violation scheme.

At the same time, there are a lot of reasons to believe that crowdsourcing represents a fundamentally different sort of phenomenon than the varieties of “work” and workplace abuses the US Congress sought to regulate with the FLSA back in 1938. For starters, crowd work is radically flexible – in terms of time and location – as well as minimal in terms of the commitment, skill, and obligations required of workers. As a result, it’s not clear that the relationships established between requesters and providers of work in this context are really anything like the relational contracts that exist between traditional employers and employees. Crowd workers do what they do for a variety of reasons, in a variety of ways, and under a variety of conditions, making it pretty hard to determine whether they ought to be considered employees of the organizations that may play a role in compensating them for their efforts (and this is potentially an important point since CrowdFlower plays something of a middle-man role between the individuals and companies that post tasks to its site and those who complete the tasks and receive compensation in exchange for their labor).

One particular challenge posed by the suit is the fact that Otey and his attorneys have chosen to seek compensation under US minimum wage laws ($7.25 per hour at the federal level). Depending on the outcome, a ruling against CrowdFlower could therefore make paid crowd work as it exists today financially impractical within the United States. While such a ruling might represent a crucial step in enforcing legal, ethical, and financial standards of fairness in online environments, it might also undermine the growth of a valuable source of future innovation, employment, research, and creativity. Crowd-based systems (whether paid or unpaid) of distributed information creation, processing, and distribution have accounted for some of the most incredible accomplishments in the short history of the Internet, including Wikipedia, ReCaptcha, Flickr, Threadless, Innocentive, Kiva, Kickstarter, YouTube, Twitter, and the Google search engine.

As some colleagues and I have argued in a forthcoming paper, The Future of Crowd Work, there are many ways in which paid crowd work as it exists today does not look like the kind of job you would necessarily want your child to take on as a career. And yet, while crowd work is very, very far from ideal by almost any standard, I would be disappointed if the impact of this case somehow resulted in the destruction of the industry and the stifling of the innovative research and applications that have developed around it. The outcome will boil down to the ways in which paid labor – even flexible, remote, and relatively straightforward tasks that are paid only $0.01 – is regulated as compared with volunteer labor.

The news that during the creation of Bain Capital Mitt Romney sought large investments from some of El Salvador’s notorious “oligarchs” should not be particularly surprising given the extent of the support the US government and private sector provided to the Salvadoran government and political elites during the country’s civil war from 1980-92.

In the grand narrative of the 2012 presidential campaign, I suspect this story will figure as (at most) a very minor footnote. Nevertheless, I wanted to draw attention to it because of my personal connections to El Salvador and the persistence of that country’s civil war in the lives of so many of its residents.

During and after my first year of college, I visited El Salvador for almost 9 months. At the time, I worked with a small community organization called Grupo Tamarindo (that would later be known as the Tamarindo Foundation – warning: link is to their Facebook page).

The Tamarindo Foundation’s online presence doesn’t really do the group justice. By the standards of the nonprofit and NGO sector, it’s a tiny organization with almost no budget and even less in the way of public relations materials to document what its members have accomplished in the group’s nearly 20 years of existence.

During that time, the group and its leader, John Guiliano, have sought to rebuild the town of Guarjila, El Salvador and, in particular, to create opportunities for the town’s young people to pursue education, arts, athletics, and entrepreneurship.

The Tamarindo has made progress towards all of these goals and more while working on a shoestring budget, microscopic scale, and generational time-horizon. That said, any progress has been slow, and the underlying cycles of poverty and violence that make life in El Salvador and in Guarjila so precarious persist.

In this way, over twenty years since the Salvadoran civil war ended, its effects can still be felt, whether in the gangs that were formed in U.S. prisons by war refugees or in the lack of adequate educational and career opportunities that lead so many of El Salvador’s young people to seek employment and security through illegal migration.

The current goal of the Tamarindo is to build a community center where the organization can continue to do its work. In order to support this goal, John has started a bicycle ride across the United States. After handing in my dissertation draft last Sunday, I joined the first three days of the ride from Boston to New Haven. You can follow the ride through John’s Facebook updates and donate (via their ImAthlete page). The itinerary includes stops across the country and, if you want to learn more about the Tamarindo, the ride, or El Salvador, I encourage you to contact John and attend one of the tour events.

Members of the Grupo Tamarindo training for the ride.

newsroom panorama by David Sim (http://www.flickr.com/photos/victoriapeckham/)

Crowdsourcing, outsourcing, and other sorts of distributed work have long since made inroads into professional journalism, but a recent scandal involving a few major metropolitan newspapers outsourcing their local reporting to a company named Journatic reveals the scope and extent of those inroads.

Since This American Life first broke the story a couple of weeks ago, the details of the Journatic story have made their way all over the Internet (See, e.g., coverage from Poynter, Romenesko, and Gigaom for some of the more thoughtful examples).

The basics are straightforward: Journatic is a company that specializes in generating content for a variety of purposes, among them local news stories (they also have a sister company called Blockshopper that provides a similar service for real estate listings). It seems that typically a client – say, a major U.S. newspaper like the Chicago Tribune, for example – contracts with Journatic, which then hires dozens of independent subcontractors (mainly in the Philippines and the U.S.) who construct and edit hyperlocal news items in a distributed, piecemeal fashion before passing the finished product back to the client for publication.

You can get a much better feel for the process by listening to the TAL interviews with Journatic editor Ryan Smith, or by reading Smith’s tendentious editorial about his experience (he has subsequently quit working for Journatic, although – interestingly – he was not fired or even reprimanded for his efforts to publicly criticize the company’s practices and products).

The stickiest part of the scandal seems to be that the Trib, along with several other major metropolitan dailies (the San Francisco and Houston Chronicles as well as the Chicago Sun-Times), had been printing these stories under false by-lines (such as Jake Barnes – the name of a famous Hemingway character), which violates the papers’ own ethical standards.

I find the story pretty engaging for several reasons:

First, the fact that Journatic figured out how to crowdsource journalism is actually pretty impressive. Some friends at CMU have been trying for a while now to generate magazine-style writing using workers on Amazon’s Mechanical Turk. Likewise, I’d like to develop and test methods for crowdsourcing peer review of academic papers. Apparently, the folks at Journatic have already solved many of the practical problems involved in performing a complex knowledge-based task like reporting using a globally distributed workforce of highly variable skill.

Second, despite the rhetoric surrounding the story, Journatic is neither the end of journalism as we know it nor its salvation. While I share the concerns voiced by Smith, TAL reporter Sarah Koenig, and others over the wages paid to Journatic’s Filipino contractors as well as the confusion about the Tribune’s apparent willingness to buck its own editorial policies about attribution in this case, these issues need to be distinguished from questions about whether crowdsourcing is “bad” or “good” for the future of media. I believe the emergence of companies specializing in crowdsourced journalism is merely another wrinkle in a complex organizational ecosystem where incumbent firms are struggling to retain some sort of comparative, competitive advantage in the face of declining revenues. When you consider Journatic in the context of other experiments in crowdsourced journalism, such as some of ProPublica’s distributed reporting projects, CNN’s iReports, or even the political blogosphere, paying workers around the world to assemble stories sounds less like a violation of basic journalistic principles and more like the latest in a long line of process innovations that might or might not help to reinvent the field.

Last, but not least, many of us (myself included) may not like the fact that the cost of local news coverage has exceeded the demand in many places, but I think there’s got to be a more effective response than petitioning Sam Zell to stop outsourcing. Instead, I’d like to see a combined effort to improve Journatic’s models of content production in order to (1) address the ethical concerns raised in the Tribune scandal; (2) improve the quality of coverage in order to correct some of the terrible reporting practices documented by Smith in his op-ed; and (3) more effectively integrate teams of remote and on-site local reporters.

Ultimately, you can’t ignore the fact that Journatic smells bad. They paid off contractors not to talk to the media a few months ago, provide SEO and content farm services on the backs of cheap overseas labor, and when faced with complaints about the fact that their real estate listing service, BlockShopper, violated people’s privacy, they responded by issuing a Zuckerbergian declaration against expectations of privacy online and hiding the identities of their writers. Oh, and they also hide their company’s website from Google’s robots (go to http://journatic.com and use the “view source” option in your browser to see their robots.txt policy).

That said, the whole situation offers a chance to think about what a more responsible, ethical, and constructive version of crowdsourced journalism could look like. For that reason alone, I think Journatic deserves even more attention than it has already received.

Image from BoingBoing (cc-by-nc).

In the midst of all the excitement about the Higgs Boson, I’m not the only one who has been fascinated by the metaphors that different people use to try to explain what’s going on to us non-physicists.

Depending on who you ask, the Higgs field might be better imagined as a giraffe or paparazzi or a swimming pool. The best explanation anybody’s encountered was a hand-drawn animation. Meanwhile, the Higgs particle itself has been compared (figuratively) with God. The research project itself has even been set to song.

All of this led me to think a little more about the fact that metaphors are among the best linguistic tools you can imagine when it comes time to explain a complex idea to someone who is (relatively speaking) clueless about the topic in question.

So that (and the more general notion that academics ought to put Malcolm Gladwell out of business) got me thinking about some big sociological ideas that could do with a bit more metaphori-ification (?) in order to make them more intelligible.

First on my list is the notion of social structure – or maybe any of the big, structural social forces that contribute to the reproduction of social inequality (e.g. class, race, gender, etc.).

Even though it’s important to continue to think and argue about exactly how these phenomena operate, it’s critical to communicate that, in general, social forces are often invisible and never equally experienced by everyone, even though they affect everyone to some extent or another.

In other words, socioeconomic structure could be thought of as a sort of Higgs field in its own right — conditions of birth, early childhood, culture, and socialization impart a certain, relative amount of “mass” (poverty? oppression?) to individuals, who are then generally able to move more or less easily through the social world as a result.

Does this make sense? Are there other, better ways to explain complex sociological notions in a manner that does not involve the words habitus, governmentality, hegemony, institutions, etc. but that also does not deviate too far from the way sociologists use them? For someone who has never encountered such theoretical jargon, these terms can be literally meaningless and it’s up to someone else who understands them to provide some sort of conceptual bootstraps so that the rest of us can haul ourselves up.
