Tuesday, December 21, 2010

nGram viewer

Earlier this week, Google launched an exciting new tool: the Books nGram Viewer, which visualizes how the occurrence of phrases in books has waxed and waned over the years. The viewer sits on top of a dataset of 500 billion words from 5.2 million books (a subset of the 15 million books Google has digitized worldwide) in Chinese, English, French, German, Russian, and Spanish. The dataset covers phrases of up to five words, with a count of how many times each phrase appears each year.

Play around with it, and you'll see firsthand how a clean and simple visual can allow you to understand a massive amount of data in seconds and use that data to start to create and tell stories.

Say, for instance, that I want to understand the varying popularity of my personal favorite amusement park ride (the Ferris Wheel) in English literature over the years. For a point of comparison, I'll also plot my least favorite amusement park ride (the rollercoaster). Here is the visual:

Ferris Wheel vs. Rollercoaster popularity over time

Both rides begin to show up in print in the 1930s. The Ferris Wheel has had several relative rises and falls in popularity since then, with (sadly) a continued decline since the mid-1990s. The popularity of rollercoasters, on the other hand, was initially slow to build, but overtook the Ferris Wheel around 1985 and has skyrocketed in comparison since then. Based on this visual, my affection for Ferris Wheels puts me in a dwindling minority, while rollercoasters are rapidly gaining in popularity.

As a reminder of the importance of context, let's add another series. If you enjoy Ferris Wheels like I do, you may know that the first one was built for the World's Columbian Exposition in Chicago in 1893 and that it was intended to rival the Eiffel Tower, which had been built for the Paris Exposition four years earlier. Let's check out what happens if we plot mentions of the Eiffel Tower in English literature on our chart:

Eiffel Tower mentions dwarf Ferris Wheel and Rollercoaster throughout history

As I called out in the chart title, mentions of the Eiffel Tower dwarf our initial two series. Also note that, whereas we see mentions of the Eiffel Tower pop up immediately following its unveiling, the Ferris Wheel took a little longer to make its way from the World Expo to the written word.

Just starting to play around with this sparks more interesting questions: what led to the bumps in the Ferris Wheel's popularity? Which genre of novels mentions the Eiffel Tower most - romance? history? Google also gives us the ability to dig to our heart's content by making the full datasets freely downloadable.
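If you do download the data, here is a minimal Python sketch of how yearly counts for a phrase could be turned into the kind of relative frequency the viewer plots. It assumes tab-separated rows of phrase, year, and match count, plus a separate lookup of total words per year. The column layout, the function name `relative_frequency`, and the sample numbers are all my own assumptions for illustration, not the documented format of Google's files, so check the dataset's documentation before relying on it.

```python
import csv
import io
from collections import defaultdict


def relative_frequency(rows, totals_by_year, phrase):
    """Return {year: share of all words that year} for one phrase.

    rows: iterable of text lines, ASSUMED to look like
          "phrase<TAB>year<TAB>match_count" (extra columns are ignored).
    totals_by_year: ASSUMED mapping of year -> total words counted that year.
    """
    counts = defaultdict(int)
    for record in csv.reader(rows, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Skip malformed rows and rows for other phrases (exact match).
        if len(record) < 3 or record[0] != phrase:
            continue
        counts[int(record[1])] += int(record[2])
    # Normalize by the yearly total; skip years with no (or zero) total.
    return {
        year: count / totals_by_year[year]
        for year, count in sorted(counts.items())
        if totals_by_year.get(year)
    }


# Made-up sample data, purely for illustration.
sample = io.StringIO(
    "Ferris wheel\t1950\t120\n"
    "Ferris wheel\t1990\t300\n"
    "roller coaster\t1990\t450\n"
)
totals = {1950: 1_000_000, 1990: 2_000_000}
ferris = relative_frequency(sample, totals, "Ferris wheel")
print(ferris)
```

Dividing by the yearly total matters because far more books were published in recent decades; raw counts alone would make almost every phrase look like it is on the rise.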

What stories might you use nGram to tell?

Tuesday, December 14, 2010

label your axes

The following comic from xkcd has been making its way around the data visualization blogs and I couldn't help but repost it here.

The lesson is a good one: every axis should have a label - no exceptions! (OK, one exception: if the values are January, February, March, ..., you probably don't need to label the axis "months", but anything less explicit than that simply must be labeled!)

The lack of a label, even if you think it's obvious from context, leaves space for your audience to question what they are looking at. If you state it explicitly with an axis label, your audience doesn't have to spend brainpower figuring out what the axis represents; they can spend it on actually understanding the information presented in the graph. Wouldn't you rather that be the case?
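If you build charts in code, the rule is simple enough to check automatically. The sketch below is one hypothetical way to do it in Python: the chart is described as a plain dictionary, and the key names (`x_label`, `y_label`, `x_values`, `y_values`) and the function `missing_labels` are names I made up for illustration, not part of any charting library. It flags any axis without a label, allowing only the month-name exception described above.

```python
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}


def missing_labels(chart):
    """List the axis labels a chart description still needs.

    chart: hypothetical dict with optional "x_label" / "y_label" strings
           and "x_values" / "y_values" lists of tick values.
    """
    missing = []
    for axis in ("x", "y"):
        label = (chart.get(f"{axis}_label") or "").strip()
        values = chart.get(f"{axis}_values") or []
        # The one exception: an axis made up entirely of month names.
        self_explanatory = bool(values) and all(
            str(value).lower() in MONTHS for value in values
        )
        if not label and not self_explanatory:
            missing.append(f"{axis}_label")
    return missing


# Example: months on the x-axis are fine unlabeled, bare numbers are not.
chart = {"x_values": ["January", "February", "March"], "y_values": [3, 5, 4]}
missing = missing_labels(chart)
print(missing)
```

Here only the y-axis gets flagged: the month names speak for themselves, but the numbers 3, 5, and 4 could be sales, visitors, or anything else.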

Thursday, December 9, 2010

5 easy tips

Let's start with the basics. Here are 5 straightforward tips to help you communicate effectively with data.

1. Keep your audience in mind. You are creating a data visualization because you want to communicate something to someone; keep that someone top of mind throughout the design process. Use visual cues (size, color, placement on page) to help direct your audience's eye and provide signals on what to pay attention to. Easy test: show your visual to a colleague who has limited context and let them tell you how they process the information (where they focus, what observations they make) - this is a good proxy for your audience, so if they aren't paying attention to the right things, revisit the design.

2. Choose display based on what you want to show. Let the question you are trying to answer determine the appropriate chart type. The correct answer to the question "what is the right chart type?" is always the same: whatever will be the easiest for your audience to interpret. Don't shy away from bar charts because they are common: use them because they are common - this means less of a learning curve for your audience to understand the information that you are providing.

3. Aim for simplicity. A complicated-looking visual can turn off an audience, as it means it will likely take time to get at the information that's being provided. Don't make your audience work to get the information - as the designer, you should take that work upon yourself to make the message clear. Strip out anything that doesn't have informative value - every step in reduction makes what remains stand out more. Don't be afraid of white space. Preserve margins (if you're unable to do this and have already eliminated the nonessential, you should think about breaking the message into multiple pieces so as not to overwhelm). Simple is better than complicated.

4. Support with text. Every chart needs a title, every axis needs a label - no exceptions! As the designer of the visual, you are more familiar with the content than your audience; help them understand the information by explaining the unfamiliar, citing data sources and as-of dates, and outlining methodology as warranted. The best place to put text is as close as possible to what it's describing, so long as it doesn't obscure the information. If you want your audience to draw a specific conclusion, state it explicitly.

5. Use color strategically. The use of color should always be an explicit decision. Use color sparingly and strategically to highlight the important parts of your visual: color is a strong visual cue to help your audience understand where they should focus their attention. In general, aim to use a color palette of shades of grey with pointed use of color. Roughly 8% of men and 0.5% of women are colorblind, which typically means difficulty distinguishing between shades of red and shades of green, so keep this in mind in your design.
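To make tip 5 concrete, here is a small Python sketch of the "grey with pointed use of color" idea: every series gets the same grey except the one you want your audience to focus on. The function name `highlight_palette` and the specific hex values are my own illustrative choices, not a standard; the default accent is a blue so the emphasis doesn't depend on telling red from green.

```python
def highlight_palette(series_names, focus, accent="#1f77b4", grey="#b0b0b0"):
    """Map each series name to a color: accent for the focus, grey otherwise.

    accent and grey are arbitrary illustrative hex colors; swap in your own.
    """
    if focus not in series_names:
        raise ValueError(f"{focus!r} is not one of the series")
    return {
        name: (accent if name == focus else grey)
        for name in series_names
    }


# Example: draw attention to the Ferris Wheel line and mute the rest.
palette = highlight_palette(
    ["Ferris Wheel", "Rollercoaster", "Eiffel Tower"],
    focus="Ferris Wheel",
)
print(palette)
```

The resulting dictionary can be handed to whatever charting tool you use when assigning line or bar colors. Raising an error for an unknown focus is deliberate: a chart that silently comes out all grey has lost its message.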