2012-05-25

Traditional documentation relies on a process in which a few people write for many potential users, especially in the case of API documentation. More often than not, the resulting documentation just doesn’t cut it: there aren’t enough examples, details, or explanations.

Crowd documentation turns the traditional documentation process on its head: knowledge is created and curated by a mostly uncoordinated collective. The potential is massive, and it is already happening. On stackoverflow.com, users ask and answer questions about programming topics, and blog posts let developers write tutorials and provide solutions to otherwise undocumented issues. StackOverflow already has 3 million questions (with 85% answered in a median of 11 minutes), and countless blog posts have been written.

But a burning question remains: can we trust crowd documentation? Will it be complete, will it be fast, will it be authoritative? What type of content is created by the crowd, and who contributes?

Analyzing API Discussions on StackOverflow

To answer these questions, we obtained a data dump of the StackOverflow database and we measured the amount of discussion of different API elements, such as classes or methods, on StackOverflow.
We wanted to know:

  • Will different API elements be widely covered?

  • If an API element is discussed infrequently, is it also used infrequently in practice?

  • How fast is the crowd at covering an entire API?

To measure discussion of an API element (focusing on classes), we looked for traceability links in question and answer bodies. We then built a model mapping API elements to the questions and answers via their traceability links. The results surprised us in several ways:

You can find a StackOverflow thread for a majority of API elements (classes), but availability grows only linearly, despite exponential growth in the number of contributing users. We also found that API elements that were not frequently discussed on StackOverflow were not frequently used in practice (based on numbers we obtained from Google Code Search).
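To make the linking step described above concrete, here is a minimal sketch in Python. It assumes a traceability link is simply a whole-word mention of a class name in a post body, which is a simplification and not necessarily how our analysis identified links. The function names, the sample posts, and the class list are all hypothetical.

```python
import re
from collections import defaultdict


def link_posts_to_classes(posts, class_names):
    """Map each class name to the set of post ids whose body mentions it.

    Assumption: a "traceability link" is a whole-word match of the class
    name in the post body. This is an illustrative simplification.
    """
    # Longest names first, so the alternation prefers "HashMap" over "Map".
    ordered = sorted(class_names, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")
    links = defaultdict(set)
    for post_id, body in posts.items():
        for match in pattern.finditer(body):
            links[match.group(1)].add(post_id)
    return links


def coverage(links, class_names):
    """Fraction of classes that are linked to at least one post."""
    if not class_names:
        return 0.0
    covered = [n for n in class_names if links.get(n)]
    return len(covered) / len(class_names)


# Hypothetical sample data, for illustration only.
sample_posts = {
    1: "How do I use MessageDigest with a HashMap?",
    2: "Map vs HashMap: which one should I pick?",
    3: "My build fails and I do not know why.",
}
sample_classes = ["MessageDigest", "HashMap", "Map", "DrmManagerClient"]

sample_links = link_posts_to_classes(sample_posts, sample_classes)
sample_coverage = coverage(sample_links, sample_classes)
```

With this sample data, three of the four classes are linked to at least one post. A class that nobody mentions, such as `DrmManagerClient` here, is the kind of gap the coverage measure is meant to surface.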

Visualizing API Discussions on StackOverflow

API designers may want insight into the “hot spots” that are problematic for developers, or into areas of the API that have “gaps” in coverage. For example, we observed that not many developers talked about accessibility or DRM in Android.

We built a tool that visualizes the coverage and usage data of API elements as a treemap.

Play with the Android Treemap. Play with the Java Treemap.

Automatically Generating Documentation

And one more thing. Given all the data about API elements from the crowd, is there something more we can do?

Consider one idea: a format similar to JavaDoc could be generated automatically, including popular questions about a class, recommendations of external resources such as blogs, and popular code snippets. See an example created automatically by our prototype tool:

Click here for the HTML version: MessageDigest auto-crowd doc example
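As a rough illustration of the idea, and not the prototype tool itself, the sketch below renders a plain-text summary for one class from a list of linked questions. The field names (`title`, `score`), the `min_score` filter, and the sample questions are assumptions made up for this example.

```python
def render_crowd_doc(class_name, questions, top_n=3, min_score=1):
    """Render a JavaDoc-like text summary of popular questions for a class.

    Assumption: each question is a dict with "title" and "score" keys.
    Questions scoring below min_score are dropped before ranking.
    """
    kept = [q for q in questions if q["score"] >= min_score]
    # Highest-voted questions first.
    kept.sort(key=lambda q: q["score"], reverse=True)
    lines = [f"Class {class_name}", "", "Popular questions:"]
    for q in kept[:top_n]:
        lines.append(f"  [{q['score']}] {q['title']}")
    return "\n".join(lines)


# Hypothetical questions, for illustration only.
sample_questions = [
    {"title": "How do I hash a string with MessageDigest?", "score": 12},
    {"title": "MessageDigest not working", "score": 0},
    {"title": "Is MessageDigest thread safe?", "score": 5},
]

sample_doc = render_crowd_doc("MessageDigest", sample_questions)
```

A fuller version would add sections for external resources and code snippets in the same way. Filtering by score before ranking is one simple way to keep low-quality questions out of the generated page.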

Read More

Read our preliminary technical report for more detail. It covers traceability links to code examples and the effects of filtering questions and answers by criteria such as vote scores or views.

Check out the Reddit discussion.

_“Crowd Documentation: Exploring the Coverage and the Dynamics of API Discussions on Stack Overflow”._ Chris Parnin, Christoph Treude, Lars Grammel, Margaret-Anne Storey.


