Tuesday, November 29, 2011

the waterfall chart

A few weeks ago, Andy Kriebel did a makeover of one of my visual makeovers. His version was a waterfall chart. I like waterfall charts, so I thought it might be useful to do a post focusing on them: what they are, an example use case, and how to use what I like to consider "brute-force-excel" to create them.

I find waterfall charts to be useful when you are interested in visualizing a starting quantity, positive and negative changes to that quantity, and the resulting ending quantity.

For example, in my day job it's sometimes useful to visualize changes in the number of employees in a given team over a period of time, say, over the course of a year. The starting quantity is the beginning of period (e.g. beginning of year) headcount. In terms of changes, there are some things that increase headcount (new hires, transfers into the given group) and some things that decrease headcount (exits from the company, transfers to other groups). When all of these changes are applied, we are left with the ending (end of year) headcount. The waterfall chart that portrays this, then, could look something like this (note that all numbers are made up):

Some graphing applications (like Tableau, which Andy used) have built-in waterfall chart functionality. However if you're working in Excel, like me, that's not the case. Fret not, as all it takes is a little brute force to turn a bar chart into a waterfall chart. The secret? An invisible series and a little bit of math.

Let me show you an interim step between where I started in Excel and the final product:

I have a stacked bar chart with two series: the visible series is the data that I want to show: beginning headcount, hires, transfers in, transfers out, exits, and ending headcount. The invisible series acts like a sort of placeholder to help me line up my other data. Each addition begins at the uppermost point of the column that preceded it and builds upward. At the turning point from addition to deduction, the first deduction starts at the top of the prior bar and shows its value downward. Further deductions begin at the lowermost point of the column preceding it and pull the values further downward. Note that both the beginning and ending figures are anchored at the baseline, while the interim values float, showing the changes in total, piece by piece.

To get from this interim step to the final waterfall chart, simply right click on the invisible data series and reformat it so that there is no fill and no line. The horizontal lines connecting the bars require a little more brute force: those are lines I've drawn in Excel on my chart.

You can download the Excel file here in case you want to take a closer look (and see the math I used for the invisible columns).

A couple of notes on my personal preferences when it comes to waterfall charts:
  • Horizontal lines connecting the bars: I like how these draw the reader's eye across the graph from left to right and also think the lines help to make it clear that the starting point for the next change is where the last bar ended. I do recommend keeping the lines thin and light so they don't compete with the data visually.
  • Using multiple colors: I often see waterfall charts with the beginning and ending bars one color, the increases another color, and decreases a third color. I think that if the chart is labeled well and the bars have sufficient space between them, this additional segmentation mechanism is unnecessary. This comes down to personal preference as well as what you want to convey to the audience. If the distinction between the positive and negative changes is really important, you can call more attention to them by varying the color. As always, just make sure your use of color is an explicit decision (not chance or graphing application defaults) and draws your audience's attention to where you want it.
There are a couple of additional considerations to keep in mind when using waterfall charts:
  • Because the bars do not have a consistent baseline, our eyes don't do a great job of accurately comparing segments that are close in size, so I recommend labeling the values explicitly to aid in interpretation. Note that if there is an apparent difference in size and the specifics aren't important, you can omit these data labels (but then you should add a y-axis so the reader can interpret the data).
  • If there isn't an intrinsic order in the categories, order the increases and decreases (separately) by size (smallest to largest or largest to smallest).
What's your view on the waterfall chart? Can you think of other applications? Leave a comment with your thoughts!

Friday, November 18, 2011

visual battle: table vs graph

In a data visualization battle of table against graph, which will win?

The short answer (which may be less than satisfying) is: it depends. Mostly, it depends on who the audience is and how the data will be used. One important thing to know is that people interact very differently with these two types of visuals. Let's take a quick look at how and some use cases for each, then we'll look at a specific example from a recent WSJ article.

Tables, with their rows and columns of data, interact primarily with our verbal system. We read tables. When I have a table in front of me, I typically have my two index fingers out - I scan across rows, down columns, and I compare values. Tables are great when you have an audience who wants to do just that. Or if you have a diverse audience, where each wants to look at their own piece: a table can meet this need. Tables are also handy when you have many different units of measure, which can be difficult to pull off in an easy to read manner in a graph.

Graphs, on the other hand, interact with our visual system. It's a high bandwidth information flow from what our eyes see to the comprehension in our brain, which can be extremely powerful when done well. Graphs can present an immense amount of data quickly and in an easy-to-consume fashion; they are particularly useful when there is a point to be made in the shape of the data, or for showing how different things (variables) relate to each other.

Let's look at an example. There was an article posted recently in the Wall Street Journal online titled, "Young Workers Like Facebook, Apple, and Google" (article). With the article, came an "Interactive Graphic," a table listing the 150 companies included in the survey, relative rank, and the percentage of young worker respondents that voted for each. (Slight tangent: while I suppose the interactive label fits, I was a little surprised to find that the only way I could interact with the data was to sort each column in either ascending or descending order - I guess this would be useful if I were looking for a particular company, so I could alphabetize the list, but utility beyond that is limited.) Here's what the top of the table looked like:

Question: was it right of WSJ to include a table rather than a graph?

In this case, I think the answer is yes. The article spends time discussing Google in the top spot (making the article title seem somewhat incongruous to me...also interesting that they mention Google last out of the three companies called out in the title while it ranked first), but then also points out some other nuances, for example the decrease in financial sector rankings (though the year over year data is not provided to the user). My assumption is that they wanted to include all of the data so that users could look up specific companies of interest, or look at the top or bottom of the list. This hits the one of the table criteria that we described above: a diverse audience, each wanting to look up their own piece.

If, however, the primary goal is to make the point that Google is well ahead of the pack (which is the focus of the majority of the article), a graph would help us to visually tell the story more quickly and arguably more effectively than can be done with the table.

Question: what should we graph? Graphing all 150 companies is out of the question: there are too many and the tail will take up more space than the value seeing it will add. So we know we need to graph something less than all, but the question remains: where should we make the cutoff?

We can pick a clean number (this is likely the rationale behind the top 3 that WSJ mentions in title): top 5, top 10, top 20. But in doing so, we run the risk of including and excluding companies of very similar values (for example, if we were to graph the top 10, we'd include the CIA at 5.04% but exclude Nike, which is only 3 basis points lower, at 5.01%). This isn't to say this isn't acceptable, but to point out that it should be an explicit decision: you should understand the pros and cons of this approach and be accepting of the cons (vs. not recognizing that they exist).

Another option is to graph the data and then look for the natural breaks that occur and have our cutoff reflect this nuance in the data. Here's what it looks like if we graph the top 25 (quick & dirty):

Here, the y-axis is the % of respondents and the x-axis is company rank. I found it hard to see the difference in the length of bars plotting this direction, so also tried the horizontal bar chart:

I find it much easier to see the relative differences in this second iteration of the chart (somehwhat due to the compression of the bars, also it just seems easier to scan down vs. across to spot differences in bar length). Based on this, it looks like there are clear differences between 7th and 8th place, between 8th and 9th, between 11th and 12th, between 15th and 16th, and so on. We could make arguments for a number of different cutoffs. In this case, I'm going to decide to take the top 15, both because it's a clean number (I've always liked multiples of 5, not sure why) and because we see a drop between the 15th and 16th positions (it's also the point where we break the 4% mark: 4.04% respondents vs. 3.80%, which I can note in a footnote).  You could make an argument to make the cutoff in another place, but this is what I'm going to go with for the reasons that I've outlined.

So if I want a visual to highlight the point in the article that Google is ahead of the pack, here is what it could look like:

Main takeaway: when debating table vs. graph, ask yourself how the data will be used and consider your audience. Let the utility of the visual that is needed drive your decision.

Thursday, November 10, 2011

how to do it in Excel

One common piece of feedback I get after presenting on the topic of data visualization goes something like this: Wow, that was super useful. I'm never going to use pie charts again. But when it comes to the graphs, how do you actually make them look like that? I'm not Excel-savvy...help!

Pretty much everything I do is in Excel. I like to refer to it as "brute-force" Excel, because in many cases the graphing application doesn't make it so straightforward to get from plotting the data to the final product. So I thought I'd take a few minutes to walk step by step through an example to expose those who question their Excel expertise to some of my tricks.

The following example may look familiar; it's from the FlowingData Challenge earlier this year (original blog post here).

The full Excel file can be downloaded here.

What you require most to get from Excel's original graph to the one you actually are proud to present is patience and time. You'll improve your odds of success by leaving ample time for the visualization step: don't rush this important piece, as it's what your audience sees of all your hard work!

Thursday, November 3, 2011

visual makeover: income and expenses

When I present my storytelling with data class, the second half of the session is typically conducted as an interactive workshop. I ask participants to submit graphs that they have created or encountered and would like feedback on and pick a handful that we focus on in small groups. After the groups have dissected the visuals in light of the course learnings, we discuss together and I review my own makeovers of the selected visuals.

The following visual is one that we focused on in a past session. The audience was comprised of grant-makers from philanthropic organizations. Here is the original visual that was submitted:

Those who know me are familiar with my opinion on 3D. In short: don't do it! Here, not only are the bars 3D, but with different rotation set on each of the charts (I think in part due to the different placement of the legend). I guess to spice things up? Hmmm.... (don't do it!)

I believe there are two root issues that lead to all of the problems with these graphs:
  1. Not enough time was spent considering what's most critical to share with the audience. What do they need to know? Is it how income and expenses have changed over time? ...how they breakdown in a given period? ...how they relate to each other? Because no decision was made on which information is crucial (or at least that decision isn't reflected in what's shown), the visuals don't answer any of these questions very well. In other words, by trying to show too much, the visual isn't showing anything particularly effectively.
  2. Excel makes it easy to do bad things. Some of it is the default settings (gridlines, colors, trailing zeroes on axis labels); some of it was done on purpose (most notably, 3D...don't do it!)

The changes I recommended are as follows:
  • Strip out clutter: gridlines, extraneous axis label digits, 3D, meaningless color
  • I don't think the historical income/expenses are necessary
  • Add a story in words: help the audience understand what they should know
  • Make the title active vs. descriptive (use this prime real estate wisely!)
Here is what it looks like when these changes are made:

What do you think? One piece of feedback I received from the participants was concern that an audience might perceive rows and read across (comparing Program expenses to Grants income, for example), which doesn't make sense. I think this could be solved by drawing a light vertical line between the expenses and income graphs.

Here (as elsewhere), I present my makeover not as the right answer, but as one possible solution to a data visualization challenge by someone who knows a little about and takes care in the visual design of her data graphics. I've made the assumption here that the most recent year's breakdown of income and expenses is the most important. If that is not the case, then this is not the right visual. If income and expenses over time is also important, you could perhaps show something like the following.

If both the breakdown of income and expenses as well as how they've trended over time are important, I'd definitely recommend breaking them into two different visuals, as I've done above, and making the relevant point on each vs. trying to cram it all into one visual.

What is your view? Leave a comment with your thoughts!