Aaron Swartz’s suicide over the weekend is a tragedy. His death has affected many people very deeply, including many of my friends who were very close with Aaron.

Personally, I did not know Aaron well, but I regard him as an inspiration – as much for his quiet thoughtfulness and kindness as for his amazing achievements, intellect, projects, and democratic (small “d”) ideals.

I don’t have much to add to some of the heartfelt responses many people (including Cory Doctorow, Larry Lessig, and Matt Stoller) have posted elsewhere; however, as I have thought and read about Aaron over the past couple of days, I have decided that I want to commemorate his life and work through some concrete actions. Specifically, I have made some vows to myself about how I want to live, work, and relate to people in the future. Most of these vows are fundamentally democratic in spirit, which is part of what I find so inspiring about so much of Aaron’s work. Not all of my commitments are coherent enough or sensible enough to list here, but I will put one out there as a public tribute to Aaron:

I will promote access to knowledge by ensuring that as much of my work as possible is always available at no cost and under minimally restrictive licenses that ensure ongoing access for as many people in as many forms as possible. I will also work to convince my colleagues, students, publishers, and elected or appointed representatives that they should embrace and promote a similar position.

This is a very small and inadequate act given the circumstances.

Image from BoingBoing (cc-by-nc).

In the midst of all the excitement about the Higgs Boson, I’m not the only one who has been fascinated by the metaphors that different people use to try to explain what’s going on to us non-physicists.

Depending on who you ask, the Higgs field might be better imagined as a giraffe or paparazzi or a swimming pool. The best explanation anybody’s encountered was a hand-drawn animation. Meanwhile, the Higgs particle itself has been compared (figuratively) with God. The research project itself has even been set to song.

All of this led me to think a little more about the fact that metaphors are among the best linguistic tools you can imagine when it comes time to explain a complex idea to someone who is (relatively speaking) clueless about the topic in question.

So that (and the more general notion that academics ought to put Malcolm Gladwell out of business) got me thinking about some big sociological ideas that could do with a bit more metaphori-ification (?) in order to make them more intelligible.

First on my list is the notion of social structure – or maybe any of the big, structural social forces that contribute to the reproduction of social inequality (e.g. class, race, gender, etc.).

Even though it’s important to continue to think and argue about exactly how these phenomena operate, it’s critical to communicate that, in general, social forces are often invisible and never equally experienced by everyone, even though they affect everyone to some extent or another.

In other words, socioeconomic structure could be thought of as a sort of Higgs field in its own right — conditions of birth, early childhood, culture, and socialization impart a certain, relative amount of “mass” (poverty? oppression?) to individuals, who are then generally able to move more or less easily through the social world as a result.

Does this make sense? Are there other, better ways to explain complex sociological notions in a manner that does not involve the words habitus, governmentality, hegemony, institutions, etc. but that also does not deviate too far from the way sociologists use them? For someone who has never encountered such theoretical jargon, these terms can be literally meaningless and it’s up to someone else who understands them to provide some sort of conceptual bootstraps so that the rest of us can haul ourselves up.

Truth and conferences

March 11, 2012

Craig Newmark (with an assist from the Colbert-head-on-a-stick puppet) shares his feelings about what he'd like to tell people who use the Internet to spread nefarious lies and misinformation.

It’s been a busy week. I spent two days of it attending the Truthiness and Digital Media symposium co-hosted by the Berkman Center and the MIT Center for Civic Media. As evidenced by the heart-warming picture above, the event featured an all-star crowd of folks engaged in media policy, research, and advocacy. Day 1 was a pretty straight-ahead conference format in a large classroom at Harvard Law School, followed on day 2 by a Hackathon at the MIT Media Lab. To learn more about the event, check out the event website, read the twitter hashtag archive, and follow the blog posts (which, I believe, will continue to be published over the next week or so).

In the course of the festivities, I re-learned an important, personal truth about conferences: I like them more when they involve a concrete task or goal. In this sense, I found the hackathon day much more satisfying than the straight-ahead conference day. It was great to break into a small team with a bunch of smart people and work on achieving something together – in the case of the group I worked with, we wanted to design an experiment to test the effects of digital (mis)information campaigns on advocacy organizations’ abilities to mobilize their membership. I don’t think we’ll ever pursue the project we designed, but it was a fantastic opportunity to tackle a problem I actually want to study and to learn from the experiences and questions of my group-mates (one of whom already had a lot of experience with this kind of research design).

The moral of the story for me is that I want to use more hackathons, sprints, and the like in the context of my future research. It is also an excellent reminder that I want to do some reading about programmers’ workflow strategies more generally. I already use a few programmer tools and tactics in my research workflow (emacs, org-mode, git, gobby, R), but the workflow itself remains a kludge of terrible habits, half-fixes, and half-baked suppositions about the conditions that optimize my putative productivity.

Matt Salganik and Karen Levy (both of the Princeton Sociology Department) recently released a working paper about what they call “Wiki Surveys” that raises several important points regarding the limitations of traditional survey research and the potential of participatory online information aggregation systems to transform the way we think about public opinion research more broadly.

Their core insight stems from the idea that traditional survey research based on probability sampling leaves a ton of potentially valuable information on the table. This graph summarizes that idea in an extraordinarily elegant (I would say brilliant) way:

Figure 1 from Salganik and Levy (2012), which they title: "a schematic rank order plot of contributions to successful information aggregation systems on the Web."

Think of the plot as existing within the space of all possible opinion data on a particular issue (or set of issues). No method exists for collecting all the data from all of the people whose opinions are represented by that space, so the best you – or any researcher – can do is find a way to collect a meaningful subset of that data that will allow you to estimate some characteristics of the space.

The area under the curve thus represents the total amount of information that you could possibly collect with a hypothetical survey instrument distributed to a hypothetical population (or sample) of respondents.

Traditional surveys based on probability sampling techniques restrict their analysis to the subset of data from respondents for whom they can collect complete answers to a pre-defined subset of closed-ended questions (represented here by the small white rectangle in the bottom left corner of the plot). This approach loses at least two kinds of information:

  1. the additional data that some respondents would be happy to provide if researchers asked them additional questions or left questions open-ended (the fat “head” under the upper part of the curve above the white rectangle);
  2. the partial data that some respondents would provide if researchers had a meaningful way of utilizing incomplete responses, which are usually thrown out or, at best, used to make estimates about the characteristics of whether attrition from the study was random or not (this is the long “tail” under the part of the curve to the right of the white rectangle).
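To make the geometry of the plot concrete, here is a minimal sketch of how the area under the curve splits into the rectangle, the head, and the tail. It is my own illustration rather than anything from Salganik and Levy's paper: the function name `partition_contributions`, the parameter `k` (the number of closed-ended questions), and the contribution counts are all hypothetical.

```python
def partition_contributions(contributions, k):
    """Split total contributions into the three regions of a rank-order plot.

    contributions: pieces of information each respondent is willing to give
                   (hypothetical numbers, not data from the paper).
    k: number of questions in the fixed, closed-ended survey (an assumption).
    """
    ranked = sorted(contributions, reverse=True)
    complete = [c for c in ranked if c >= k]   # would answer all k questions
    partial = [c for c in ranked if c < k]     # would drop out before finishing

    rectangle = k * len(complete)              # what a traditional survey keeps
    head = sum(c - k for c in complete)        # extra information never asked for
    tail = sum(partial)                        # incomplete responses thrown out
    return {"rectangle": rectangle, "head": head, "tail": tail}


# Made-up contribution counts for twelve imaginary respondents.
counts = [40, 25, 12, 10, 10, 7, 4, 3, 2, 1, 1, 1]
parts = partition_contributions(counts, k=10)
share_kept = parts["rectangle"] / sum(counts)
print(parts, round(share_kept, 2))
```

The three regions always add up to the total, so the share a traditional survey keeps depends entirely on how skewed the distribution of contributions is.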

Salganik and Levy go on to argue that many wiki-like systems and other sorts of “open” online aggregation platforms that do not filter contributions before incorporating them into some larger information pool illustrate ways in which researchers could capture a larger proportion of the data under the curve. They then elaborate some statistical techniques for estimating public opinion from the subset of information under the curve and detail their experiences applying these techniques in collaboration with two organizations (the New York City Mayor’s Office and the Organization for Economic Cooperation and Development, or OECD).

If you’re not familiar with matrix algebra and Bayesian inference, the statistical part of the paper probably won’t make much sense, but I encourage anyone interested in collective intelligence, surveys, public opinion, online information systems, or social science research methods to read the paper anyway.

Overall, I think Salganik and Levy have taken an incredibly creative approach to a very deeply entrenched set of analytical problems that most social scientists studying public opinion would simply prefer to ignore! As a result, I hope their work finds a wide and receptive audience.

A Modest Academic Fantasy

January 9, 2012

Image credit: curious zed (flickr)

For today’s post, I offer a hasty sketch of a modest academic fantasy: free syllabi.

As a graduate student, I have often found myself searching for and using syllabi to facilitate various aspects of my work.

Initially, syllabi from faculty in my department and others helped me learn about the discipline I had chosen to enter for my Ph.D. Later, I sought out syllabi to design my qualifying exam reading lists and to better understand the debates that structured the areas of research relevant to my dissertation. More recently, I have turned to syllabi yet again to learn about the curriculum and faculty in departments where I am applying for jobs and where I could potentially teach my own courses. When I design my own syllabi, I anticipate that I will, once again, search for colleagues’ syllabi on related topics in order to guide and advance my thinking.

The syllabi I find are almost always rewarding and useful in some way or another. The problem is that I am only ever able to find a tiny fraction of the syllabi that could be relevant.

This is mainly a problem of norms and partly a problem of infrastructure. On the norms side, there is no standard set of expectations or practices around whether faculty post syllabi in publicly accessible formats or locations.

Many faculty do share copies of recent course syllabi on their personal websites, but others post nothing or only a subset of the courses they currently teach.

I am not aware of any faculty who post all the course syllabi they have ever taught in open, platform independent file formats to well-supported, open archives with support for rich meta-data (this is the infrastructure problem).

Given the advanced state of many open archives and open education resources (OER) projects, I have to believe it is not completely crazy to imagine a world in which a system of free syllabi standards and archives eliminates these problems.

At minimum, a free syllabi project would require faculty to:

  • Distribute syllabi in platform independent, machine-readable formats that adhere to truly open standards.
  • Archive syllabi in public repositories.
  • License syllabi for at least non-commercial reuse (to facilitate aggregation and meta-analysis!).

In a more extreme version, you might also include some standards around citation formats and bibliographic information for the sources and readings listed in the syllabi.
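As a rough illustration of what a platform independent, machine-readable syllabus might look like, here is a minimal sketch using JSON via Python's standard library. No syllabus standard is implied: every field name, the `to_open_format` and `missing_fields` helpers, and the placeholder course and reading are assumptions made up for this example.

```python
import json

# Hypothetical syllabus record; field names and values are placeholders only.
syllabus = {
    "title": "Placeholder Course Title",
    "term": "Spring 2012",
    "license": "CC BY-NC",
    "readings": [
        # Structured bibliographic fields instead of a free-text citation.
        {"author": "Example, A.", "year": 2001, "title": "A Placeholder Reading"},
    ],
}


def to_open_format(record):
    """Serialize the syllabus as plain-text JSON that any platform can read."""
    return json.dumps(record, indent=2, sort_keys=True)


def missing_fields(record, required=("title", "term", "license", "readings")):
    """List the required fields (a made-up minimum) that a record lacks."""
    return [field for field in required if field not in record]


text = to_open_format(syllabus)
restored = json.loads(text)  # an archive or aggregator could parse it back
print(text)
```

A plain-text format like this would also work well with version control, and an archive could use a check like `missing_fields` to reject incomplete submissions.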

In any case, some sort of free syllabi project seems doable; useful; and relatively inexpensive (at least in comparison to some expensive, resource intensive projects that involve streaming full video and audio of classes).

Update: Joseph Reagle, who is – as usual – much better informed on these topics than I am, responded to my post over a Berkman Center email list. Since Joseph’s message points to some really great ideas/references on this topic, I’m re-publishing it in full below (with his permission):

Aaron S’s posting today about “A Modest Academic Fantasy” [1] (free syllabi) reminded me I wanted to share a post of my own [2] in response to Greg Wilson’s question of “would it be possible to create a ‘GitHub for education’”? [3].

While a super-duper syllabus XML format might be great (as I’ve heard David W discuss) – but would have fork-merge-share problems as Wilson notes – I’ve always (since 2006) provided my syllabus online, in HTML, with an accompanying bibtex file for the reading list. I think this is the best way currently to share without waiting for a new standard.

On the course material front, I recently started sharing my class notes and slides. These are written in markdown — which makes them easy to collaborate on — put up at Github, and are used to generate HTML5 slides (e.g., [4]). I’ve also started putting up classroom best practices and exercises (e.g., [5]) on a personal wiki; I’d love to see something like this go collaborative.

For in class collaboration, I understand Sasha C[ostanza-Chock] has successfully used etherpad. The PiratePad variant even permits wiki-style links. I desperately want a light-weight synchronous editor with wiki-style links but none exist. (etherpad-lite is a great improvement on etherpad in terms of memory requirements, but does not have wiki-style links; I’ll probably end up using Google Docs because I don’t have to worry about any back-side maintenance.)

I’d love to hear from other people about what they are doing!?

[1]: https://fringethoughts.wordpress.com/2012/01/09/modest-academic-fantasy/
[2]: http://reagle.org/joseph/blog/career/teaching/fork-merge-share
[3]: http://software-carpentry.org/2011/12/fork-merge-and-share/
[4]: http://reagle.org/joseph/2011/nmc/class-notes.html
[5]: http://reagle.org/joseph/zwiki/teaching/Exercises/Tasks/Mindmap.html

Thanks, Joseph!

When science fails

November 13, 2011

I just read this short piece by Richard Van Noorden in Nature about the rising number of retractions in medical journals over the past five years and it got me thinking about the different ways in which researchers fail to deal with failure (the visualizations that accompany the story are striking).

Esther Vargas 2008 cc-by-nc-sa

The article specifies two potential causes behind the retraction boom: (1) increased access to data and results via the Internet facilitating error discovery; and (2) creation of oversight organizations charged with identifying scientific fraud (Van Noorden points to the US Office of Research Integrity in the DHHS as an example). It occurred to me in reading this that a third, complementary cause could be the political pressure exerted on universities and funding agencies as a result of the growing hostility towards publicly funded research. In the face of such pressure, self-policing would seem more likely.

Apparently, the pattern goes further and deeper than Van Noorden is able to discuss within the confines of such a short piece. This Medill Reports story by Daniel Peake from last year has a graph of retractions that goes all the way back to 1990, showing that the upturn has been quite sudden.

All of these claims about the causes of retractions are empirical and should/could be tested to some extent. The bigger question, of course, remains: what to do about the reality of failure in scientific research? As numerous people have already pointed out, in an environment where publication serves as the principal metric of production, the institutions, organizations & individuals that create research – universities, funding agencies, peer reviewed journals, academics & publishers – have few (if any) reasons to identify and eliminate flawed work. The big money at stake in medical research probably compounds these issues, but that doesn’t mean the social sciences are immune. In fields like Sociology or Communication where the stakes are sufficiently low (how many lives were lost in FDA trials because of the conclusions drawn by that recent AJS article on structural inequality?), the social cost of falsification, plagiarism, and fraud remains insufficient to spur either public outrage or formal oversight. Most flawed social scientific research probably remains undiscovered simply because, in the grand scheme of policy and social welfare, this research does not have a clear impact.

Presumably, stronger norms around transparency can continue to provide enhanced opportunities for error discovery in quantitative work (and I should have underscored earlier that these debates are pretty much exclusively about quantitative work). In addition, however, I wonder if it might be worth coming up with some other early-detection and response mechanisms. Here are some ideas I started playing with after reading the article:

Adopt standardized practices for data collection on research failure and retractions. I understand that many researchers, editors, funders, and universities don’t want the word to get out that they produced/published/supported anything less than the highest quality work, but it really doesn’t seem like too much to ask that *somebody* collect some additional data about this stuff and that such data adhere to a set of standards. For example, it would be great to know if my loose allegations about the social sciences having higher rates of research failure and lower rates of error discovery are actually true. The only way that could happen would be through data collection and comparison across disciplines.

Warning labels based on automated meta-analyses. Imagine if you read the following in the header of a journal article: “Caution! The findings in this study contradict 75% of published articles on similar topics.” In the case of medical studies in particular, a little bit of meta-data applied to each article could facilitate automated meta-analyses and simulations that could generate population statistics and distributions of results. This is probably only feasible for experimental work, where study designs are repeated with greater frequency than in observational data collection.
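A minimal sketch of how such a label could be generated, assuming each article carries one piece of meta-data: the direction of its main effect (+1 or -1). Counting directions like this is a crude stand-in for a real meta-analysis, and the function names, the 50% threshold, and the example values are all hypothetical.

```python
def contradiction_share(finding_sign, prior_signs):
    """Fraction of prior studies whose effect direction disagrees with the new one."""
    if not prior_signs:
        return None  # nothing to compare against
    disagree = sum(1 for sign in prior_signs if sign != finding_sign)
    return disagree / len(prior_signs)


def warning_label(finding_sign, prior_signs, threshold=0.5):
    """Return a caution string when disagreement meets the (assumed) threshold."""
    share = contradiction_share(finding_sign, prior_signs)
    if share is None or share < threshold:
        return None
    return (
        f"Caution! The findings in this study contradict {share:.0%} "
        "of published articles on similar topics."
    )


# Made-up example: a positive finding against three negative and one positive prior study.
label = warning_label(+1, [-1, -1, -1, +1])
print(label)
```

A serious version would need to weight studies by sample size and effect size rather than count directions, which is part of why this seems most feasible for repeated experimental designs.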

Create The Journal of Error Discovery (JEDi). If publications are the currency of academic exchange, why not create a sort of bounty for error discovery and meta-analyses by dedicating whole journals to them? At the moment, blogs like Retraction Watch are filling this gap, but there’s no reason the authors of the site shouldn’t get more formal recognition and credit for their work. Plus, the first discipline to have a journal that goes by the abbreviation JEDi clearly deserves some serious geek street cred. Existing journals could also treat error discoveries and meta-analyses as a separate category of submission and establish clear guidelines around the standards of evidence and evaluation that apply to such work. Maybe these sorts of practices already happen in the medical sciences, but they haven’t made it into my neighborhood of the social sciences yet.

British Library by Steve Cadman (2007) CC-BY-SA

As in just about all the coverage I’ve seen of the Google Books deal with the Authors Guild, Friday’s NY Times story raises the familiar specter of Google-as-monopolist. This continues the longer-term trend of tarring the Mountain View, CA based firm with the same brush as its older, bigger, and more widely-distrusted rival from Redmond, WA. I’d like to point out a problem with this storyline that stems from the nature of the particular terms of the agreement.

In my mind, the nastiest and most inexplicable aspect of this agreement is not the bare fact that Google is about to buy the rights to a massive proportion of the world’s books. That has been a long time coming and is not a surprise. As many have pointed out, it could, in principle, lead to price manipulations when Google turns around to sell access back to libraries. However, I suspect we won’t see anything like that. Google’s lawyers aren’t stupid – they know that the Justice Department will be hot on their trail as soon as it gets the faintest whiff of something like this. They will not want to let this happen if they can help it.

No, as I understand it, the truly nefarious part of the agreement preserves Google’s right to pay the same rate for licenses that publishers might offer to a hypothetical Google competitor in the future. This means that Google has effectively cornered the market for buying digital books – putting it in a position to shape the market to its liking in a way that looks much less evil to consumers.

The likely outcome is that buyers of access to Google Books won’t necessarily pay a premium price as they would with a typical monopolist. Instead, it is the organizations that sell rights to the books to Google in the first place who will be paid less than they would in a more competitive retail market.

Back in the 1930s, the industrial economist Joan Robinson termed this kind of market failure a “monopsony.” The ideal typical form of monopsony arises in situations where the market for a particular good only has one buyer. This monopsonistic buyer is able to manipulate prices in much the same way that a monopolistic vendor would. The result is a market where the pricing mechanism fails to reflect supply and demand, perpetuating distortions and the breakdown of all the nice side effects that come along with a working market such as quality control, incentives to innovate, and adequate compensation for the producers of the goods in question.

Monopsony makes for less exciting headlines because it does not threaten consumers. Monopsonistic retailers slowly suck profits from their suppliers, forcing them to accede to their demands through the threat of massive revenue losses. The paradigmatic example of a monopsonist in recent years has been Wal-Mart. The giant firm gets low prices for consumers by manipulating the prices of vendors (and by systematically undermining the labor market, but that’s another story). As soon as Wal-Mart threatens to stop carrying their product, a seller has little bargaining leverage since they cannot afford to lose such a massive customer.

At least one person (who I don’t feel comfortable naming or quoting directly without permission) in the Berkman Center’s cyberlaw clinic assures me that the terms of Google’s agreement with the Authors Guild are not totally clear on this point. Nevertheless, the fact that such an interpretation is not inconsistent with the text of the agreement should be reason enough to worry anyone who reads, writes, buys, or sells books.