Tuesday, December 27, 2011

don't fall victim to this

I came across this graph recently when catching up on some reading over the holiday. My question to you is simple: can you read it?

The website where this interactive visual resides is called worldshapin, and it implores you to "compare countries through their shape." It visualizes data from the Human Development Report 2011 as a "star plot" along the six dimensions of education, population, health, workplace equality, carbon footprint, and living standards. As shown above, you can look at this data between countries and as it compares to continents and the world (when the world isn't obscured by the countries and continents you've chosen, as it is above).

Before I get to the don't fall victim portion of this blog post, let me first say that I do think this helps make the data in the report more accessible by making it visual. You can get a quick idea of how one part of the world stacks up to another across these dimensions that you wouldn't get with a table of data, for example. This is fine for information discovery, assuming you are making it available to an audience who will have an appetite to "play" with the data.

This visual is not fine, however, if you have a specific story that you want to tell through data.

To convince you of this, I'm going to take one of my own failed data visualizations from my past and remake it into something that works. First, a bit of history:

I used to make charts like this. I called them "spider graphs." In a prior life, I worked in banking, managing home equity fraud. When it comes to fraud, the ways you can impact it can be classified into 8 categories (where each category is a piece of the fraud management lifecycle): deterrence, prevention, detection, mitigation, analysis, policy, investigation, and prosecution (Wes Wilhelm, The Fraud Management Lifecycle Theory). So if we were to look at our efforts in each of these areas and rate the activities along a scale from 0 (we have nothing in place) to, say, 10 (the unattainable utopia of fraud management - we've solved every problem), we could show how well we're doing on a relative basis in each area, with the goal of maximizing our coverage and balancing activity across the different parts of the lifecycle. The spider graph was perfect for this!

I was able to locate an old annual review on the topic of home equity fraud that I put together that highlighted progress to date and introduced forward-looking plans. I'm going to assume it's ok to share an excerpt here, given that the financial institution I did this work for is now defunct (due to much bigger issues than my poor data viz). Here's what it looked like:

The visual starts off with an explanation, shows an example of how to read the graphs on the right, followed by the real-data-graphs across the bottom (the titles across the very bottom are the 5 different types of home equity fraud that we were tracking).

Lesson 1 (foreshadowing): if you have to have a graph to show how to read your graph, your visual may be too complicated.

When it comes to the visual at the bottom ("FML for Home Equity"), let's try to look past the black background and meaningless colors (while annoying, we have bigger fish to fry here) to the actual data. Same question as I led this post with: can you read it?

Before I answer that question with my current data viz lens on, let's back up the better part of a decade to take a look at what I thought of these visuals when I created them. I thought they looked really cool. Sexy, even. I also thought they clearly showed what I wanted to show: mainly, that we had a lot of work to do - we were failing in a lot of places and needed to make some changes.

But people found them really hard to read. I found myself explaining, repeatedly (to the same people even!) how to read them. At the time, I thought this was an issue with my audience.

When I look at the graphs through today's lens, I recognize that the issue was not with my audience, but rather with me. It was a visual design failure. I stubbornly persisted in showing data in a way that wasn't straightforward for my audience to consume (even when it became obvious through their questions that it wasn't clear!). When information isn't straightforward, it's hard to look at. For an audience, this feels uncomfortable. Most people don't want to spend a lot of time with things that make them feel uncomfortable. Even when you try to convince them to. Can you blame them?

Let's talk about some other ways to visualize this same data. The sort of data we have lends itself easily to a matrix structure, with fraud management lifecycle stage across one axis and fraud type across the other. When I see the data organized this way, I think heatmap. But the main drawback to a heatmap in this scenario is that, while it gives us a decent visual comparison of how we're doing across the different buckets (both by fraud management lifecycle stage and by fraud type), we don't get a visual comparison of where we are vs. where we'd like to be, which I think is the most important piece here.

Instead, I'll leverage one of my best friends: the bar chart. Bar charts are great because people already know how to read them. This means there's no learning curve for your audience to face to get to the information you want to provide. Rather than spending their time deciphering how to read the graph, they can spend it understanding the information it shows. They're also more likely to spend time on a visual that doesn't make them feel uncomfortable. Here's another way to visualize this data:

Note that the actual numbers aren't so important here - they were somewhat subjective to begin with - so I opted not to show a numerical scale at all. What is important is the relative distance from where we consider ourselves to be currently and where we want to be (as close to "we've solved every problem" as possible). I've drawn attention to this gap by showing the opportunity that remains outlined in blue.
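To make the "where we are vs. where we want to be" idea concrete, here is a minimal sketch of the arithmetic behind that kind of bar. The ratings are invented for illustration (they are not the figures from my old review), and the names `ratings`, `SCALE_MAX`, and `remaining_opportunity` are hypothetical. The only things taken from the post are the 8 lifecycle stages and the 0-to-10 scale.

```python
# Hypothetical self-ratings on the 0-to-10 scale described earlier.
# These numbers are made up for illustration only.
SCALE_MAX = 10

ratings = {
    "deterrence": 3,
    "prevention": 6,
    "detection": 7,
    "mitigation": 4,
    "analysis": 5,
    "policy": 6,
    "investigation": 4,
    "prosecution": 2,
}


def remaining_opportunity(ratings, scale_max=SCALE_MAX):
    """Return, per stage, the distance from the current rating to the top of the scale."""
    return {stage: scale_max - score for stage, score in ratings.items()}


gaps = remaining_opportunity(ratings)

# Rough text rendering: '#' is where we are, '.' is the opportunity that remains.
for stage, score in ratings.items():
    print(f"{stage:<14}" + "#" * score + "." * gaps[stage])
```

In a spreadsheet, the same two columns (current rating and remaining gap) could be plotted as a stacked bar, with the gap segment given an outline and no fill.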

The overarching lesson is this: don't fall victim to choosing sexy over utility when it comes to data viz for telling a story. When your audience tells you something is hard to read, or you find yourself explaining the visual more than discussing the information it shows, listen and adjust!

If interested, my Excel file is here. Leave a comment to let me know what you think!

Friday, December 16, 2011

the cost of christmas

Each year around this time, the US financial institution PNC produces the "Christmas Price Index," in which they calculate the cost of Christmas based on the items in the 12 Days of Christmas carol. I guess it's a sort of merrier (at least in theme) version of the Consumer Price Index and is meant to provide some economic insight into how the price of goods changes from year to year.

This year, they've added an interactive layer of glitz: the Christmas Price Index Express. Fast Company describes it as "A game-enhanced site with a handmade feel, the Index Express appears as a magical train that carries visitors through an alpine world to collect each of the 12 gifts. But it's essentially an elaborate interactive infographic, where the data points come to life with animation and sound." (Fast Company article) Whatever it is, it takes forever to load, and I wasn't patient enough to spend time on the Index Express (where there are literally bells and whistles). Rather, I clicked through the site long enough to find what I really wanted to get my hands on: the underlying data.

PNC certainly didn't make the data easy to extract. After painstakingly copying and pasting data from each of the 13 pages (total cost of Christmas plus one page for each day of Christmas) and reformatting to get a dataset I could do something with, I had myself an Excel spreadsheet with 28 years of 12 days of Christmas cost. Next challenge: visualize it and see what gems of wisdom we can acquire.

Often, there is much to be learned by looking at how not to visualize data. So before we get to how I'd visualize the cost of Christmas, let's look at a few less-than-optimal visualizations of this data and discuss their limitations.

First, the stacked bar chart. I often see data like this (multiple series over time) displayed this way. Unfortunately, this usually isn't a great approach. Stacked bar charts are tricky, because once you get past the first series, there is no longer a consistent baseline to compare the other series. Here's what it looks like with this data:

In the above, we can see how the total price of Christmas has changed over time and also see what the major contributors to the total price are. But if I want to understand how the different components have changed over time, that's tough with this visual. Are all goods changing in the same way, or are some getting more expensive while others have become cheaper? It's really difficult to tell with this graph.
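The baseline problem is easy to see in numbers. Below is a minimal sketch, using invented values rather than the PNC data, that computes where each segment of a stacked bar starts. The function name `stacked_baselines` and the sample figures are assumptions for illustration only.

```python
def stacked_baselines(values_by_year):
    """For each year, return the height at which each stacked segment begins.

    values_by_year: list of lists, one inner list per year, holding the
    series values in stacking order (bottom to top).
    """
    result = []
    for values in values_by_year:
        baselines = []
        running = 0
        for value in values:
            baselines.append(running)  # this segment starts where the stack currently ends
            running += value
        result.append(baselines)
    return result


# Made-up data: three series over two years.
# The third series is 30 in both years, i.e. it did not change at all.
years = [
    [10, 20, 30],
    [15, 18, 30],
]

baselines = stacked_baselines(years)
print(baselines)
```

In this sketch the first series always starts at 0, but the third series starts at 30 in one year and 33 in the next even though its value is unchanged. That shifting start point is what makes the upper series hard to compare across years.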

So what if we unstack the bars so that we do have a consistent baseline for each series? Here's what we get:

This clearly doesn't work here - there's way too much going on. But even with fewer series (picture just the first 5, for example), this format is hard to read. It puts a lot of onus on the audience to spend time staring at it and looking for interesting things to pull out. That's too much work, when we can make the interesting things more obvious so our audience doesn't have to search for them.

Let's see what this data looks like in a line chart:

This is getting better, but still may not be optimal. There are a lot of overlapping lines, especially at the bottom where a number of series have similar values. But the biggest drawback is that we don't get a good sense of how the total cost of Christmas has changed over time with this graph, which is kind of the meta point of the data and is probably interesting.

While we're on the topic of non-ideal graphs for this data, I can also picture some sort of horrible visualization with pie charts: one for each year showing the breakdown of Christmas items, perhaps even with the size of the pie scaled by the total cost of Christmas. This would take some time to build, so I'm not going to go through the effort, particularly given that pie charts are my enemy. Rather, I'll simply say: don't do this! Why? Check out this blog post for some background.

We've looked at some less than stellar graphical representations of this data; now let's turn our attention to something that I think might work a little better.

In any visualization exercise, one of the first things to do is determine what question(s) you want to answer. This will drive how you show the data: the goal is to show it in a way that makes it clear what questions you set out to answer and answers them in a straightforward manner. The problem is that this step is often skipped, resulting in graphs like the ones above. When you don't isolate what question(s) you want to answer and try to create a visual that will answer any question, you run the risk of not answering any single question very well.

With this data, I'm going to choose to answer a couple of questions: how has the price of Christmas changed over time (both in aggregate and for the various items), and what proportion does each day contribute to the total cost of Christmas? The trick I'll employ to do this in a way that isn't overwhelming is to create a visual with multiple graphs (and words!) so we can answer these questions one at a time. Said another way, I'm going to use my visual to tell a story with this data. Here's what it looks like:

The top left graph shows how the cost of Christmas has changed over time. The top right graph shows the 2011 cost breakdown per item so we can understand the contributors to the total cost. Finally, the mini-graphs at the bottom help us understand the drivers behind the total changes we see in the top left graph. I've put on my analyst hat and added some words to describe what I believe are the main takeaways that my audience shouldn't miss.

The bottom line: Christmas is getting more expensive. If you have a tight budget for your holiday party, for entertainment you may consider replacing your leaping lords and dancing ladies with milk maids and for decor swap your swans for hens to save a considerable amount of money!

In case you're interested, my full Excel spreadsheet with data and graphs can be found here.

Tuesday, November 29, 2011

the waterfall chart

A few weeks ago, Andy Kriebel did a makeover of one of my visual makeovers. His version was a waterfall chart. I like waterfall charts, so I thought it might be useful to do a post focusing on them: what they are, an example use case, and how to use what I like to consider "brute-force-excel" to create them.

I find waterfall charts to be useful when you are interested in visualizing a starting quantity, positive and negative changes to that quantity, and the resulting ending quantity.

For example, in my day job it's sometimes useful to visualize changes in the number of employees in a given team over a period of time, say, over the course of a year. The starting quantity is the beginning of period (e.g. beginning of year) headcount. In terms of changes, there are some things that increase headcount (new hires, transfers into the given group) and some things that decrease headcount (exits from the company, transfers to other groups). When all of these changes are applied, we are left with the ending (end of year) headcount. The waterfall chart that portrays this, then, could look something like this (note that all numbers are made up):

Some graphing applications (like Tableau, which Andy used) have built-in waterfall chart functionality. However, if you're working in Excel, like me, that's not the case. Fret not, as all it takes is a little brute force to turn a bar chart into a waterfall chart. The secret? An invisible series and a little bit of math.

Let me show you an interim step between where I started in Excel and the final product:

I have a stacked bar chart with two series. The visible series is the data that I want to show: beginning headcount, hires, transfers in, transfers out, exits, and ending headcount. The invisible series acts like a sort of placeholder to help me line up my other data. Each addition begins at the uppermost point of the column that preceded it and builds upward. At the turning point from addition to deduction, the first deduction starts at the top of the prior bar and shows its value downward. Each further deduction begins at the lowermost point of the column preceding it and pulls the value further downward. Note that both the beginning and ending figures are anchored at the baseline, while the interim values float, showing the changes in total, piece by piece.

To get from this interim step to the final waterfall chart, simply right click on the invisible data series and reformat it so that there is no fill and no line. The horizontal lines connecting the bars require a little more brute force: those are lines I've drawn in Excel on my chart.

You can download the Excel file here in case you want to take a closer look (and see the math I used for the invisible columns).
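For those who would rather see the invisible-series math written out than dig through the spreadsheet, here is a minimal sketch of one way to compute it. This is my own illustration and not necessarily the exact formula in the Excel file. The function name `waterfall_series` and the headcount numbers are made up, and the sketch assumes the running total never drops below zero.

```python
def waterfall_series(start, changes):
    """Build the two stacked series behind a waterfall chart.

    start:   beginning quantity (anchored at the baseline)
    changes: list of (label, delta) pairs; delta > 0 adds, delta < 0 deducts
    Returns (labels, invisible, visible), three lists of equal length.
    Assumes the running total stays at or above zero.
    """
    labels = ["Beginning"]
    invisible = [0]        # nothing underneath the beginning bar
    visible = [start]
    running = start

    for label, delta in changes:
        new_total = running + delta
        # The invisible bar props the visible one up to the lower of the
        # totals before and after this change.
        invisible.append(min(running, new_total))
        visible.append(abs(delta))
        labels.append(label)
        running = new_total

    labels.append("Ending")
    invisible.append(0)    # the ending bar is anchored at the baseline too
    visible.append(running)
    return labels, invisible, visible


# Made-up headcount numbers, in the spirit of the example above.
labels, invisible, visible = waterfall_series(
    100,
    [("Hires", 30), ("Transfers in", 8), ("Transfers out", -12), ("Exits", -10)],
)

for row in zip(labels, invisible, visible):
    print(row)
```

Plotting `invisible` and `visible` as a stacked column chart, with the invisible series set to no fill and no line, gives the floating bars. Adding the two values for an increase gives the top of that bar; for a decrease, the invisible value is where the bar bottoms out.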

A couple of notes on my personal preferences when it comes to waterfall charts:
  • Horizontal lines connecting the bars: I like how these draw the reader's eye across the graph from left to right and also think the lines help to make it clear that the starting point for the next change is where the last bar ended. I do recommend keeping the lines thin and light so they don't compete with the data visually.
  • Using multiple colors: I often see waterfall charts with the beginning and ending bars one color, the increases another color, and decreases a third color. I think that if the chart is labeled well and the bars have sufficient space between them, this additional segmentation mechanism is unnecessary. This comes down to personal preference as well as what you want to convey to the audience. If the distinction between the positive and negative changes is really important, you can call more attention to them by varying the color. As always, just make sure your use of color is an explicit decision (not chance or graphing application defaults) and draws your audience's attention to where you want it.
There are a couple of additional considerations to keep in mind when using waterfall charts:
  • Because the bars do not have a consistent baseline, our eyes don't do a great job of accurately comparing segments that are close in size, so I recommend labeling the values explicitly to aid in interpretation. Note that if there is an apparent difference in size and the specifics aren't important, you can omit these data labels (but then you should add a y-axis so the reader can interpret the data).
  • If there isn't an intrinsic order in the categories, order the increases and decreases (separately) by size (smallest to largest or largest to smallest).
What's your view on the waterfall chart? Can you think of other applications? Leave a comment with your thoughts!

Friday, November 18, 2011

visual battle: table vs graph

In a data visualization battle of table against graph, which will win?

The short answer (which may be less than satisfying) is: it depends. Mostly, it depends on who the audience is and how the data will be used. One important thing to know is that people interact very differently with these two types of visuals. Let's take a quick look at how and some use cases for each, then we'll look at a specific example from a recent WSJ article.

Tables, with their rows and columns of data, interact primarily with our verbal system. We read tables. When I have a table in front of me, I typically have my two index fingers out - I scan across rows, down columns, and I compare values. Tables are great when you have an audience who wants to do just that. Or if you have a diverse audience, where each wants to look at their own piece: a table can meet this need. Tables are also handy when you have many different units of measure, which can be difficult to pull off in an easy to read manner in a graph.

Graphs, on the other hand, interact with our visual system. It's a high bandwidth information flow from what our eyes see to the comprehension in our brain, which can be extremely powerful when done well. Graphs can present an immense amount of data quickly and in an easy-to-consume fashion; they are particularly useful when there is a point to be made in the shape of the data, or for showing how different things (variables) relate to each other.

Let's look at an example. There was an article posted recently in the Wall Street Journal online titled, "Young Workers Like Facebook, Apple, and Google" (article). With the article came an "Interactive Graphic": a table listing the 150 companies included in the survey, their relative rank, and the percentage of young worker respondents that voted for each. (Slight tangent: while I suppose the interactive label fits, I was a little surprised to find that the only way I could interact with the data was to sort each column in either ascending or descending order - I guess this would be useful if I were looking for a particular company, so I could alphabetize the list, but utility beyond that is limited.) Here's what the top of the table looked like:

Question: was it right of WSJ to include a table rather than a graph?

In this case, I think the answer is yes. The article spends time discussing Google in the top spot (making the article title seem somewhat incongruous to me...also interesting that they mention Google last out of the three companies called out in the title while it ranked first), but then also points out some other nuances, for example the decrease in financial sector rankings (though the year over year data is not provided to the user). My assumption is that they wanted to include all of the data so that users could look up specific companies of interest, or look at the top or bottom of the list. This hits one of the table criteria that we described above: a diverse audience, each wanting to look up their own piece.

If, however, the primary goal is to make the point that Google is well ahead of the pack (which is the focus of the majority of the article), a graph would help us to visually tell the story more quickly and arguably more effectively than can be done with the table.

Question: what should we graph? Graphing all 150 companies is out of the question: there are too many and the tail will take up more space than the value seeing it will add. So we know we need to graph something less than all, but the question remains: where should we make the cutoff?

We can pick a clean number (this is likely the rationale behind the top 3 that WSJ mentions in the title): top 5, top 10, top 20. But in doing so, we run the risk of including and excluding companies of very similar values (for example, if we were to graph the top 10, we'd include the CIA at 5.04% but exclude Nike, which is only 3 basis points lower, at 5.01%). This isn't to say that a clean cutoff is unacceptable, but to point out that it should be an explicit decision: you should understand the pros and cons of this approach and be accepting of the cons (vs. not recognizing that they exist).

Another option is to graph the data and then look for the natural breaks that occur and have our cutoff reflect this nuance in the data. Here's what it looks like if we graph the top 25 (quick & dirty):

Here, the y-axis is the % of respondents and the x-axis is company rank. I found it hard to see the difference in the length of bars plotting this direction, so also tried the horizontal bar chart:

I find it much easier to see the relative differences in this second iteration of the chart (somewhat due to the compression of the bars, also it just seems easier to scan down vs. across to spot differences in bar length). Based on this, it looks like there are clear differences between 7th and 8th place, between 8th and 9th, between 11th and 12th, between 15th and 16th, and so on. We could make arguments for a number of different cutoffs. In this case, I'm going to decide to take the top 15, both because it's a clean number (I've always liked multiples of 5, not sure why) and because we see a drop between the 15th and 16th positions (it's also the point where we break the 4% mark: 4.04% of respondents vs. 3.80%, which I can note in a footnote). You could make an argument to make the cutoff in another place, but this is what I'm going to go with for the reasons that I've outlined.
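If you'd rather not rely on eyeballing alone, the same hunt for natural breaks can be sketched as a quick calculation: take the drop between each pair of adjacent ranks and look at the biggest ones. The function names `rank_gaps` and `biggest_breaks` are hypothetical, and the longer list below is invented for illustration (it is not the WSJ data). Values are stored as whole basis points (5.04% is 504) to keep the subtraction exact.

```python
def rank_gaps(shares_bp):
    """Return (rank, next_rank, drop) for each adjacent pair.

    shares_bp: response shares in basis points, sorted from highest to lowest.
    """
    gaps = []
    for i in range(len(shares_bp) - 1):
        drop = shares_bp[i] - shares_bp[i + 1]
        gaps.append((i + 1, i + 2, drop))  # ranks are 1-based
    return gaps


def biggest_breaks(shares_bp, top_n=3):
    """Return the top_n largest drops between adjacent ranks, largest first."""
    return sorted(rank_gaps(shares_bp), key=lambda gap: gap[2], reverse=True)[:top_n]


# The two pairs quoted in this post: 5.04% vs. 5.01%, and 4.04% vs. 3.80%.
print(rank_gaps([504, 501]))  # a 3 basis point drop
print(rank_gaps([404, 380]))  # a 24 basis point drop

# Made-up shares for eight companies, highest to lowest.
made_up = [900, 700, 690, 610, 600, 590, 580, 520]
print(biggest_breaks(made_up))
```

The output is only a list of candidate cutoffs. Choosing among them is still a judgment call, which is why I'd want the decision to be explicit either way.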

So if I want a visual to highlight the point in the article that Google is ahead of the pack, here is what it could look like:

Main takeaway: when debating table vs. graph, ask yourself how the data will be used and consider your audience. Let the utility of the visual that is needed drive your decision.

Thursday, November 10, 2011

how to do it in Excel

One common piece of feedback I get after presenting on the topic of data visualization goes something like this: Wow, that was super useful. I'm never going to use pie charts again. But when it comes to the graphs, how do you actually make them look like that? I'm not Excel-savvy...help!

Pretty much everything I do is in Excel. I like to refer to it as "brute-force" Excel, because in many cases the graphing application doesn't make it so straightforward to get from plotting the data to the final product. So I thought I'd take a few minutes to walk step by step through an example to expose those who question their Excel expertise to some of my tricks.

The following example may look familiar; it's from the FlowingData Challenge earlier this year (original blog post here).

The full Excel file can be downloaded here.

What you require most to get from Excel's original graph to the one you actually are proud to present is patience and time. You'll improve your odds of success by leaving ample time for the visualization step: don't rush this important piece, as it's what your audience sees of all your hard work!

Thursday, November 3, 2011

visual makeover: income and expenses

When I present my storytelling with data class, the second half of the session is typically conducted as an interactive workshop. I ask participants to submit graphs that they have created or encountered and would like feedback on and pick a handful that we focus on in small groups. After the groups have dissected the visuals in light of the course learnings, we discuss together and I review my own makeovers of the selected visuals.

The following visual is one that we focused on in a past session. The audience was made up of grant-makers from philanthropic organizations. Here is the original visual that was submitted:

Those who know me are familiar with my opinion on 3D. In short: don't do it! Here, not only are the bars 3D, but each chart also has a different rotation (I think in part due to the different placement of the legend). I guess to spice things up? Hmmm.... (don't do it!)

I believe there are two root issues that lead to all of the problems with these graphs:
  1. Not enough time was spent considering what's most critical to share with the audience. What do they need to know? Is it how income and expenses have changed over time? ...how they break down in a given period? ...how they relate to each other? Because no decision was made on which information is crucial (or at least that decision isn't reflected in what's shown), the visuals don't answer any of these questions very well. In other words, by trying to show too much, the visual isn't showing anything particularly effectively.
  2. Excel makes it easy to do bad things. Some of it is the default settings (gridlines, colors, trailing zeroes on axis labels); some of it was done on purpose (most notably, 3D...don't do it!)

The changes I recommended are as follows:
  • Strip out clutter: gridlines, extraneous axis label digits, 3D, meaningless color
  • I don't think the historical income/expenses are necessary
  • Add a story in words: help the audience understand what they should know
  • Make the title active vs. descriptive (use this prime real estate wisely!)
Here is what it looks like when these changes are made:

What do you think? One piece of feedback I received from the participants was concern that an audience might perceive rows and read across (comparing Program expenses to Grants income, for example), which doesn't make sense. I think this could be solved by drawing a light vertical line between the expenses and income graphs.

Here (as elsewhere), I present my makeover not as the right answer, but as one possible solution to a data visualization challenge by someone who knows a little about and takes care in the visual design of her data graphics. I've made the assumption here that the most recent year's breakdown of income and expenses is the most important. If that is not the case, then this is not the right visual. If income and expenses over time is also important, you could perhaps show something like the following.

If both the breakdown of income and expenses as well as how they've trended over time are important, I'd definitely recommend breaking them into two different visuals, as I've done above, and making the relevant point on each vs. trying to cram it all into one visual.

What is your view? Leave a comment with your thoughts!

Monday, October 31, 2011

visualizing student loan debt

The latest edition of the Economist (Oct 29-Nov 4) includes a short article on US student loans. It describes the increase in student loan debt now vs. ten years ago: aggregate student loan debt is expected to exceed the $1 trillion mark when the next official estimate comes out later this year, surpassing credit card borrowing. The assumption that student loan programs were structured around - that a graduate's future earnings flow will more than cover the costs of a degree - is being called into question given the extended period of unemployment in the current economic environment. The article advocates 1) the changing of bankruptcy laws to forgive student loan debt (they currently do not) and 2) the repricing of student loan debt to either institute mortgage-like repayments (on a fixed schedule) or a movement to income-based payment amounts, with the forgiveness of remaining debt after a given period (e.g. 20 years).

The article includes the following graph. Two questions: 1) Does it fit with the story? 2) What changes would you make? 

It's not a bad graph. It's clean and easy to read. But, like most, there are things about it I'd like to change. If it were my visual, here are the minor modifications I would make:
  • Simplify: remove segmentation. I'm not sure the distinction between public and private student loan debt is interesting or relevant. I'd get rid of the segmentation and just show the overall debt so as not to call undue emphasis to the public vs. private piece. If this is indeed relevant but not high priority, a small footnote could be added to state that "x% of student loan debt is public and the remainder is private" and that the percentage hasn't changed meaningfully over the past 10 years.
  • Simplify: label points directly. The graph is easy to read, but you still have to read it. We could make a couple minor changes to make taking in the information even less work. Rather than have the x-axis across the top, you could remove it and label the bars directly. This would take away the step where you look at the bar and then trace up to the axis to understand the number.
  • Focus attention on the important part. The main point I think the graph is meant to make is how much larger US student loan debt is now vs. ten years ago. Given this, I'd recommend switching the order of the bars so that the 2011 estimate comes first and attracts attention.
  • Cut the clutter. Remove the light blue background (it doesn't add informative value and makes the data stand out a little less) and remove the y-axis line or push it to the background by making it grey.
Here's what the graph looks like when these changes are made (note that I didn't have the underlying data, so estimated the figures visually from the graph provided in the article):

Let's also consider another option. Question: do we need a chart to show this information? One lesson I teach in my class is that when you only have one or a couple of numbers to highlight, often simple text is the best way to do this, because putting the numbers in a graph can cause them to lose some of their oomph. Is that the case here? Let's take a look. Here's one way we could visualize the numbers directly vs. in a graph:

I think arguments can be made for either of the above approaches. I do think you get some value from seeing the magnitude of difference with the bars. What approach would you take?

Sunday, October 30, 2011

happy halloween & google trends

Which Halloween costumes top the list in the US this year?

Google search terms can give us some interesting insight into social phenomena like this: [angry birds costume] had been at the top spot and continues to steadily rise, but was recently usurped by [black swan costume]. Check out the Google blog post for the full story.

Happy Halloween!

Tuesday, October 11, 2011

a Google example: preattentive attributes

The topic of my short preso at the visual.ly meet up last week in Mountain View was preattentive attributes. I started by discussing exactly what preattentive attributes are (those aspects of a visual that our iconic memory picks up, like color, size, orientation, and placement on page) and how they can be used strategically in data visualization (for more on this, check out my last blog post). Next, I talked through a Google before-and-after example applying the lesson, which I'll now share with you here.

First, a little background: In 2010, my colleague Neal Patel undertook research on managers at Google. He set out to understand two primary things: 1) the impact that managers have on work-life and 2) what makes a good manager. To read more about this study and the findings, check out the New York Times article from earlier this year.

When Neal's research was complete and it was time to begin to socialize the study and findings, he and I locked ourselves in a room filled with whiteboards and began to brainstorm. One of the visualization challenges was the first part of the study: as one might expect, managers have varying degrees of influence over the different aspects of work-life, ranging from aspects that they are able to influence heavily to aspects that they influence little or not at all. Our aim was to show this in a way that was easy to understand.

One of the early iterations looked like the following (note that I've generalized the visuals significantly to be able to show them here).


Given that I've generalized most of the labeling, I'll walk you quickly through what you're looking at. At the top of the page, there are three categories: those work-life aspects that are 1) highly influenced by managers, 2) somewhat influenced by managers, and 3) not influenced by managers. The categories within these are the different work-life themes, for example feeling supported in career development or having the ability to innovate, and then each has more detail on what aspects of the given theme are influenced at the given level by managers.

Next comes the graph. The y-axis is a quantitative measure of manager influence. The x-axis shows the different aspects of work-life, grouped by color into the same thematic categories as referenced in the table above the graph. The height of the bars indicates what influence category each work-life aspect falls into (matching the table above it): highly influenced by managers, somewhat influenced, or not influenced.

This is a nice looking visual. But we can use preattentive attributes more effectively to make the point come across more quickly and enable the audience to more easily take in the information.

In fact, it is exactly those two things from my perspective that preattentive attributes can facilitate in a really powerful way when employed effectively: 1) to draw the audience's eye to the most important part of the visual and 2) to provide a visual hierarchy of information that will help make it clear to the audience how they should interact with the information that is being provided. You can think of preattentive attributes as your tools to help your audience get into your (the designer's) head.

Let's inspect the above visual with these two things in mind. One of the first questions I ask myself when I'm looking at a visual is where is my eye drawn? You can do this easily with your own visuals: look away for a moment, then back at the visual and take note of where your eye first focuses (it's generally the preattentive attributes that dictate this). When I do this with the above visual, my eye first sees the title, "Findings," and then is pulled to the color in the graph at the bottom. The color differentiates the various work-life themes, which is probably not the most important thing on the page, and yet the strong draw of the color gives a signal that it should be.

Now, let's look at the visual from a hierarchy-of-information standpoint. Besides the title and the color in the graph, the font is all of similar size and weight. What this means is that the audience must read through everything in order to be able to conclude for themselves what is important and where they should devote their attention. To be frank, most audiences won't take the time to do this. It's also not really fair of us to ask them to, when a few minor changes will make it clear.

The following mock-up is similar to where we ended up with the visual after our brainstorming session. Note that very little change has been made to the content: we already had the right information, it was just a matter of playing with the preattentive attributes to make it more accessible to our audience.

 Some work-life aspects are more influenced by managers
The only content changes were to the titles. One of my rules is to never waste the title line on a descriptor like "findings." Titles are typically at the top of the page, which means they are the first thing people encounter and they are often big and bold (and perhaps even blue!), which makes them even more attention grabbing. Use them to communicate the most important thing about the visual. Maybe it's the main finding. Or perhaps the call to action that the data informs. It's prime real estate, so make it count.

Let's take a look at how preattentive attributes are working for us in this updated visual. First, from the where-is-your-eye-drawn standpoint, it goes like this for me:
  • I can't help but read the main title because of its placement at the top of the page and because it's big and bold and blue.
  • Next, my eye catches the graph title (font is bigger than that which is around it, also the bold is a signal that it's important) and scans it so I know what I'm looking at.
  • Within the graph, my eyes are drawn to the dark blue bars, which are those work-life aspects that are most heavily influenced by managers, arguably the most important thing on the page, since these are the areas that can be most impacted by change.
  • As my eyes continue to move down the page, they are drawn to the dark blue in the table (color coordinating with the same influence category as in the graph so there is a visual tie connecting them that doesn't require reading).
From a visual-hierarchy standpoint, what I've outlined above is highlighted clearly as the highest priority information on the page. Everything else is secondary. It's there to add clarity and additional information, but note how much more scan-able the second version of the visual is compared to the first.

The lesson is this: use preattentive attributes like color, size, and placement on page with intention. Specifically, use them to 1) highlight the most important part(s) of the visual and 2) create a visual hierarchy of information. Your audience will appreciate that you are providing visual cues to help them interact with your data visualization and will be more generous in giving their time to it than a visual that feels like work to consume.
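If you build your charts in code rather than by hand, one way to apply this lesson is to make color an explicit decision instead of a tool default. The following is only a minimal sketch: the palette values, the `assign_colors` helper, and the "Aspect" labels are all hypothetical placeholders, not the colors or categories from the actual study.

```python
# Hypothetical palette: one accent color plus a muted grey for everything else.
HIGHLIGHT = "#1f4e9c"  # dark blue, reserved for what matters most
MUTED = "#bfbfbf"      # light grey, pushes the rest to the background


def assign_colors(labels, emphasized):
    """Return one color per label: accent for emphasized labels, grey otherwise."""
    emphasized = set(emphasized)
    return [HIGHLIGHT if label in emphasized else MUTED for label in labels]


# Placeholder work-life aspects; not the actual study categories.
aspects = ["Aspect A", "Aspect B", "Aspect C", "Aspect D"]
colors = assign_colors(aspects, emphasized=["Aspect A", "Aspect B"])
print(colors)
```

The resulting list could then be handed to whatever per-bar color option your charting tool offers (for example, the `color` argument of matplotlib's `bar`), so that only the bars you want your audience to notice carry the strong color.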

Tuesday, September 27, 2011

garage sale signs and data viz: the power of preattentive attributes

I was jogging the other morning and ran by a woman hanging a sign for a garage sale. Her advertisement was penned on a piece of yellow 8x11 paper, uniformly golfball-sized letters describing the detail. In short: someone would pretty much have to stop their car, get out and walk up to the sign to know what it said. And after doing so, would need to read the entire sign to find out the most relevant parts of the detail: if it was in an area of interest, or at a time that would suit.

This was obviously a poor sign. The only thing it had going for it was that the yellow paper was eye catching. But I imagine that only those in search of garage sales would think of stopping to pay it more attention; the sign was clearly not going to be read by the majority of passersby.

This led me down a thought path: what makes a good garage sale sign? I had a hypothesis. After arriving home, I looked up images of garage sale signs with my favorite search engine. Here's a sample:

It seems to me that one of the things that makes for a good garage sale sign is one of the same things that makes for a good data visualization: strategic use of preattentive attributes.

"Preattentive attributes" in the world of information visualization is a fancy descriptor for aspects of a visual that hit our iconic memory. Iconic memory is what happens in our brain before short term memory kicks in, before we even really know that we're thinking. Iconic memory is tuned to pick up preattentive attributes: things like color, size, added marks, and spatial position [learn more].

In the lessons I teach on data visualization, I discuss using preattentive attributes mainly with two goals in mind: 1) directing the audience's eye and 2) establishing a visual hierarchy of information. In both cases, the point is that if you use preattentive attributes well (especially color), your audience can't help but focus on the important part(s) of the message. By playing on their iconic memory, you're making it so they are seeing what you want them to see before they even know they are seeing it. Which is a crazy powerful thing.

I have a good Google before-and-after example that's been genericized that I'll post later this week. If you're too excited to wait, I'll be discussing it (and more on preattentive attributes) at the Visual.ly meet up on Thursday in Mountain View [see details].

Tuesday, September 6, 2011

visual.ly meet up

If you live in the Bay Area (or have plans of being there in late September), you may be interested in the visual.ly meet up taking place on September 29th (sign up is here; do it soon if interested, as spots are filling up quickly). I will be one of the speakers and will discuss leveraging preattentive attributes to make great data visualizations, highlighting an example from our research on Google's People Analytics team. Hope to see you there!

Thursday, September 1, 2011

visualize this

Nathan Yau writes one of my favorite data visualization blogs, FlowingData. His recently published book has been sitting on my shelf untouched for much too long. Earlier this week, I decided to remedy that.

His book is Visualize This. Subtitle: The FlowingData Guide to Design, Visualization, and Statistics. It's written in the first person and is super accessible, full of examples and anecdotes to make the lessons real. The book includes references to a lot of publicly available data and also has links to each dataset used, so the reader can follow along through the steps that are explained.

After starting with an introduction on telling stories with data (obviously near and dear to my heart), the book jumps into the practical question of how. There are step by step instructions for scraping data from websites, using Python to reformat it, and the strengths and weaknesses of various out of the box applications and programming languages for analyzing and visualizing data.

By his own words, Nathan's book is "example-driven and written to give you the skills to take a graphic from start to finish." It accomplishes this goal. The middle chapters each focus on a different kind of visualization problem: visualizing patterns over time, visualizing proportions, visualizing relationships, spotting differences, and visualizing spatial relationships. Yau follows a thorough, hands-on approach. For example, in the chapter focused on time series, he goes through what to look for, the best types of graphs to use in different scenarios, how to load the data into and plot in R, and how to fine-tune the visual using Illustrator. Relevant statistical methods are incorporated where it makes sense, for example, smoothing and estimation.

While there is some very solid foundational material, the majority of the book is focused on the practical question of how to actually analyze and visualize the data. It seemed to me most tailored to the person who is looking to move beyond Excel and the like and get started using R and Illustrator (with some time devoted to interactive graphics as well).

Throughout, Nathan's graphics are beautiful and accessible - great examples of effective data visualization. He follows the rules he sets forth in every one:

  • explain encodings,
  • label axes,
  • keep your geometry in check,
  • include your sources, and
  • consider your audience.

The final chapter focuses on designing with a purpose. He says he always assumes that people are showing up to his graphics blindly and puts the onus on himself as the designer to prepare the audience with the relevant context and insights. "After you learn what your data is about, explain those details in your data graphic. Highlight the interesting parts so your readers know where to look. A plain graph can be cool for you, but without context, the graph is boring for everyone else."

Well said, Mr. Yau!

Monday, August 29, 2011

crushing on your data viz

Oh, what a month. Those who know me understand my posting hiatus...my mind and energy have been elsewhere. But I'm starting to refocus it back on normal life-stuff, for example getting caught up on some data viz reading. 

I was perusing the Tableau Visual Guidebook and came across a snippet I appreciate; it followed the descriptions of all of the different things you can do to format your data viz using Tableau: 

Do you like your viz? After all of this arduous, tedious and difficult tweaking, you better have a little crush on your viz. If not, it may be time to break up and start over.

I like this idea of crushing on one's data viz. I find myself saying this again and again, but plotting data in a graphing program should be the first step in data visualization, not the last. After doing that, here are the typical steps I find myself going through and questions I routinely ask to get to the final ready-for-consumption visual:

1. Assess the chart type
  • Is it a pie chart? If yes, read this post. If no, move to the next bullet.
  • Is there a more straightforward way to present the data? I often find myself doing an "is-this-better?" comparison, where I'll have my working version of the graph, then try graphing it differently and do a side by side comparison to see which is the easiest to interpret. Going through a few rounds of this can help ensure you've got a chart that someone else will be able to read. 
  • Don't leave the details in question: make your chart legible by giving it a title and labeling all axes.

2. Highlight the important stuff
  • Use preattentive attributes (e.g. color, size) to create a visual hierarchy of information and draw your audience's eye to where they should focus their attention. Use color sparingly and strategically.
  • Here's a fun test: look away from your visual and then back to it. Where is your eye drawn? This is likely where your audience's eye will be drawn as well, so if it isn't in the right place, revisit how you're using your preattentive attributes (especially color).

3. Get rid of the clutter
  • Cut anything superfluous: every bit of reduction in noise makes the signal of your data stand out more. For example, assess whether you need gridlines (here is a post on this).
  • Push things like footnotes, data sources, as of dates to the background by making them grey, small, and positioned in lower attention areas, like the bottom of the page; this way they are there for reference but don't detract from the key parts of your visual.

4. Assess the overall visual
  • Does the data viz facilitate the story I want to tell, or the data discovery I want my audience to make? Here is an example where this is done well.
  • A good test of this is to hand your visual to a friend or colleague who is unfamiliar with it. Give them 10-20 seconds (and no context) and have them tell you what they see. If it isn't what you're hoping, it's time to revisit your design.

      The folks at Tableau are spot on. This takes time and patience. After doing all of this work (irrespective of the specific graphing application), you should think your data visualization is just about the best thing on the planet. Or be so sick of it you never want to see it again. :-)

      Here are some links to previous posts with before-and-afters that walk through different parts of the above process. I totally found myself crushing on each of these after spending so much time with each:

      I think the crush is good evidence that sufficient time has been spent on a very important step of the analytical process: communicating your findings visually to others.

      Monday, August 1, 2011

      gridlines are gratuitous

      How often do you use the gridlines on a chart to read the data?

      Not very often.

      And yet there they are, prominently, when you plot your data with most graphing applications. I've said this before, and I will say it again: plotting data in a graphing application like Excel should be your first step in the data visualization process, not your last!

      Gridlines typically act as nothing more than clutter, unnecessarily competing for attention with your data. Don't let them. In the event that gridlines are important for being able to read the data you are presenting, push them to the background by making them a light shade of grey. In most cases, I'd argue that your audience isn't going to make use of the gridlines at all. If this is the case, remove them completely.

      Let's see what this looks like in practice through the chart progression below.

      The first chart is what I get when I plot my data in Excel (using my mac).

      In the second chart, I stripped out a bit of clutter by eliminating the chart border and reducing the labels and tick marks on the x-axis. I also pushed the axes and gridlines to the background by making them grey and tied the title of the graph visually to the trend line by making the title the same shade of blue. I justified the graph title and y-axis title at the upper left because in Western cultures most people read from left to right, top to bottom; this makes it so the audience encounters how to read the graph before they get to the actual data. This is looking better, right? The data stands out more than in the initial version, where there was no visual hierarchy to help direct our attention.

      In the final graph, I removed the gridlines altogether. Note that the data stands out the most in this version, because it isn't competing visually with the gridlines for your attention.

      The lesson is this: if your audience isn't going to use gridlines to read the data, get rid of them! At the very least, push them to the background. At best, they aren't particularly helpful. At worst, they distract from your data.
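For those who chart in code rather than Excel, the same decision can be written down as a small set of style settings. This is a rough sketch only: I made the charts above in Excel, so the `gridline_style` helper and the specific grey values are my own assumptions. The dictionary keys are matplotlib rcParams names.

```python
def gridline_style(audience_reads_values_off_grid):
    """Build a matplotlib-style settings dict for the gridline advice.

    If the audience needs gridlines, keep them but make them light and thin;
    otherwise turn them off entirely.
    """
    style = {
        "axes.spines.top": False,    # drop the chart border
        "axes.spines.right": False,
        "axes.edgecolor": "grey",    # push the remaining axes to the background
        "xtick.color": "grey",
        "ytick.color": "grey",
    }
    if audience_reads_values_off_grid:
        # Hypothetical light grey; pick whatever recedes on your background.
        style.update({"axes.grid": True, "grid.color": "#d9d9d9", "grid.linewidth": 0.5})
    else:
        style["axes.grid"] = False
    return style


print(gridline_style(False))
```

A dictionary like this could be applied with something like `matplotlib.rcParams.update(gridline_style(False))` before plotting. The point of the single yes/no argument is to force the question "will my audience actually use the gridlines?" before the tool answers it for you.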

      Don't let your visuals fall victim to this unnecessary graphing application clutter!
      Data source: http://seer.cancer.gov/

      Friday, July 29, 2011

      porn & cake

      Get your attention?

      The following data cake pic was posted over at Chart Porn a couple of weeks ago (originally from Epic. graphic). I couldn't help but share.

      While all are important, one might guess that my favorite step is presentation (yes, I like to make things pretty, cakes included). But that's not the case. My favorite is the final step to knowledge: information is worthless if we don't learn something and act on (eat!) it.

      Tuesday, July 26, 2011

      lessons in innovation

      Earlier this week, Google published Think Quarterly, an online magazine of sorts that provides "a snapshot of what Google and other industry leaders are thinking about and inspired by today." The topic of the current issue is innovation.

      While the focus isn't data visualization, many of the lessons shared can be applied in this space. For example:
      • In The 8 Pillars of Innovation, SVP of advertising Susan Wojcicki discusses iteration as the way to strive for consistent innovation, not instant perfection, and looking for ideas everywhere. I appreciate the concept she introduces of "sparking with imagination, fueling with data."
      • Head of Americas Sales, Dennis Woodside, talks about how audiences today want and expect "something more sophisticated, more considerate" than they have in the past. In Route to 2015, he's talking about marketing and advertising, but I would argue the same trend is happening when it comes to information visualization. His 4 B's are also applicable: be found, be engaging, be relevant, be accountable.
      • "The most original innovations come from mucking about, not from thinking hard" (Russell Davies, Practical Magic). It's often that sort of mucking about with a dataset that leads to new insights you wouldn't have found with a hypothesis-driven approach.
      • In Next Gen Innovators, Sarah Ohrvall calls out data aggregation as the trend driving the most exciting innovations in digital media in her opinion. She says: "information can be used to improve your daily life and improve the world around you" and calls out that the more people know about the impact their behavior has, the more they will change their behavior based on this knowledge. 
      These are just some highlights intended to pique your interest; I highly recommend checking out the full publication. See where you can apply the innovation lessons presented.

      I'll wrap this up with some words from Susan Wojcicki: never fail to fail. In data visualization (as in life), learn from the things that don't work, adjust accordingly, and try again.

      Friday, July 22, 2011

      vacation stress: visualized

      Love this. Reminds me of the clever Facebook breakup visualization that David McCandless did last year. Though my personal sample size (1) is small, this is an accurate reflection based on my empirical evidence...

      Wednesday, July 20, 2011

      death to pie charts

      I hate pie charts. 

      I mean, really hate them.

      Those who have heard me speak on data visualization will have learned that the only thing I hate more than a pie chart is a 3D, exploding pie chart - they are the absolute worst - but the plain vanilla pie charts are pretty bad, too. Here's a recent one from TechCrunch, which is intended to show how much they cover start-ups versus big companies (full article):

      I'll start with the lesser evil of the above visual: meaningless color. The pie above is what happens if you put the data in Excel and say "chart data". I've said this before and I'll say it again: graphing your data with a tool like Excel should be the first step in your design process, not your last! In TechCrunch's pie, the color itself doesn't represent anything, it's simply used as a categorical differentiator. One unintended side effect is the optical illusion you get with a darker colored slice appearing larger than a same-size slice of a lighter color.

      My strong opinion is that color should always be an explicit choice and should be used strategically to draw the audience's eye. This preattentive power is being wasted here. If you must use a pie chart, at least make the slices the same color and highlight only the one or two you want to draw attention to. Or if you don't want to highlight a particular slice, but rather are intending the visual to aid in information discovery, you may consider something like the following:

      Hopefully you can see that this still isn't a very good visualization. The labels are messy. Only a few things are immediately apparent: General Consumer Web is the biggest piece, and there are a lot of small slices.

      My main beef with pie charts like the one above (and in general) is this: our eyes aren't good at attributing quantitative value to two-dimensional spaces. In English: pie charts are really hard for people to read! When segments are close in size, it's difficult (if not impossible) to tell which is bigger. When they aren't close in size, the best you can do is determine that one is bigger than the other, but you can't judge by how much. To get over this, you can add data labels, as they've done in the TechCrunch version. But I'd still argue the visual isn't worth the space it takes up.

      What should you do instead? My typical advice would be to replace a pie chart with a horizontal bar chart, organized from greatest to least or vice versa (unless there is some intrinsic value in the categories, in which case that should be followed). With bar charts, our eyes compare the end points. Because they are aligned at a common baseline, it’s very easy to assess relative size. This makes it easy to see not only which segment is the largest (for example), but also how incrementally larger it is than the other segments. Here's what this looks like with the TechCrunch data:

      One might argue that you lose something in the transition from pie to bar. The unique thing you get with a pie chart that is absent in a bar chart is the concept of there being a whole, and thus, parts of a whole. But if the visual is difficult to read, is it worth it? Ultimately, it's up to the designer of the visual. My advice is as follows:
      1. Don't use pie charts.
      2. If you find yourself unable to follow #1, keep in mind the challenges with pie charts: if relative sizes are important, you'll need to include data labels. Also be aware of the impact of color on 2D space (darker looks larger); don't let your tool decide your color scheme. 
      Personally, I will continue to avoid pie charts.
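To make the bar chart alternative concrete, here is a small text-only sketch of the idea: sort the categories from greatest to least and draw every bar from the same left edge. The `text_bar_chart` helper is hypothetical, and the percentages are placeholders invented for illustration (they are not the TechCrunch figures).

```python
def text_bar_chart(shares, width=40):
    """Render categories as horizontal bars, sorted from greatest to least.

    All bars start at the same left edge (a common baseline), so their lengths
    can be compared directly. `shares` maps category name -> positive value.
    """
    if not shares:
        return ""
    ordered = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    largest = ordered[0][1]
    label_width = max(len(name) for name, _ in ordered)
    lines = []
    for name, value in ordered:
        bar = "#" * round(width * value / largest)  # scale to the largest bar
        lines.append(f"{name:<{label_width}} | {bar} {value}%")
    return "\n".join(lines)


# Placeholder percentages, invented for illustration (not the TechCrunch data).
example = {"Category C": 10, "Category A": 50, "Category B": 25, "Category D": 15}
print(text_bar_chart(example))
```

Because the bars share a baseline and are already in order, you can see at a glance both which category is largest and roughly how much larger it is than the next one, which is exactly what the pie makes hard.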

      Sunday, July 17, 2011

      what makes good data visualization?

      Here is David McCandless' take: a balance of interestingness, function, form, and integrity.

      My personal view is similar, but I articulate it differently (and I've found that exactly how I articulate it changes over time as I continue to learn and iterate). Lately, I've been reading up on general principles of design to expand how I think about data visualization. In design language, I would say that effective data visualization should leverage the following:

      • Affordances: In the field of design, experts speak of things having affordances - characteristics that reveal how they're to be used. A teapot has a handle. A door that you push has a push plate. The design of an object should, in and of itself, suggest how the object should be used. The same is true of your graphs, tables, and slides. Lead your audience through your visual – make it easy on them! Provide a visual hierarchy of information; it gives your audience visual cues so they know where to direct their attention.
      • Accessibility: Designs should be usable by people of diverse abilities. Examples of good design by this measure are Apple products: my mother can barely send an email, but put her iPhone or iPad in her hand and it's so intuitive that she doesn't feel overwhelmed by the technology. Work to make your data visualizations similarly straightforward and easy to use. Don't overcomplicate. Use text to label, introduce, explain, reinforce, highlight, recommend, and tell a story.
      • Aesthetics: People perceive more aesthetic designs as easier to use than less aesthetic designs whether they are or not. Specifically, studies have shown that more aesthetic designs are perceived as easier to use, more readily accepted and used over time, promote creative thinking and problem solving, and foster positive relationships, making people more tolerant of problems with design (this is crazy, right? leverage it!). Use a pleasant color palette (personally, I tend to do everything in shades of grey with strategic, explicit use of bright blue to draw my audience's eye). Bring a sense of visual organization to your design (preserve margins, align things visually), showing attention to detail and a general respect for your work and for your audience.

      What do you think of these descriptions of effective information design? What makes good data visualization from your perspective? Leave a comment with your thoughts.

      Friday, July 15, 2011


      I just came across this graphic over at Chart Porn. What story would you tell with this data?

      Wednesday, July 13, 2011

      visual.ly is live

      A few months ago, I came across the visual.ly site, which at that point was a temporary landing page with a lot of sexy looking graphics where you could input your email to be notified when the full site launched. I received that notification this morning, and it's certainly creating a lot of buzz: I've had a number of friends and colleagues forward me the announcement and ask for a review. 

      Visual.ly says it is the world's largest community for exploring, sharing, creating, and promoting data visualizations. I have mixed feelings so far based on the detail I've perused. It seems like describing the graphics there as "data visualizations" might be somewhat of a misnomer; perhaps "information graphics" would be a better description? A number of the visuals I've looked at contained no data at all (example).

      One thing the images do seem to mostly have in common is their visual bling - they look exciting at first glance due in many cases to color and complexity. I worry about this, as sexy can be good for grabbing an audience's attention, but to maintain it, the visual needs to be clear and straightforward: I'm not sure all of the content there meets the mark on this latter piece. If it works as it appears is planned, this should self-correct over time, with popular visuals rising to the top and vice versa through the wisdom of crowds. I just hope the crowd is wise enough to value utility over sexy.

      There are some stellar graphics there for sure. I've included a few of my most and least favorites from what I've looked at so far at the bottom of this post.

      There seem to be some technical difficulties (I've had a lot of instances of pages timing out, visuals not loading, and buttons not following through on what they claim they will do for me), but I expect that these are pain points that the crew at visual.ly is actively working to fix.

      I'm interested to see whether this site will take off. Take a look. Leave a comment with your thoughts!

      cole's faves (based on what I've looked at so far):
      view original

      view original

      going to give cole nightmares (notice a theme?):
      view original
      view original

      view original

      view original