Thursday, February 20, 2014

a little math on non-zero baselines

I had a friendly exchange with a blog reader over the past week related to my recent post highlighting some Pew Research makeovers. In that post, I made some comments regarding the use of a non-zero baseline: specifically, that it's not ok to have a non-zero baseline with a bar chart (see related blog post), but that you can get away with it in a line graph. The reader's question was how shifting to a non-zero baseline impacts the slopes of the lines in a line graph.

That's a very good question.

One I hadn't given any thought to before.

But now that I was thinking about it, I worried I'd been recommending something incorrect.

This must have really been weighing on me, because the night after I received the initial email asking about the impact on slopes, I had a dream where I was doing the math to show that it's actually ok. Next challenge: to see whether I could replicate my proof in reality. The short answer is yes.

Rescaling from a y-axis that begins at zero to one that does not begin at zero actually doesn't impact the slope of the lines. To demonstrate this, I sketched out an example, leveraging some lessons learned back in 7th grade algebra:


On the left-hand side, I've plotted a line that connects the data points 6, 7, and 19 (to which I've given x-coordinates of 1, 2, and 3, respectively). In this initial version, the y-axis ranges from a minimum of zero to a max of twenty. The calculations for the slopes of the two line segments that connect these points are shown below the graph.

On the right-hand side, I've plotted the same points on a scale from 5 to 20, reducing each y-coordinate by 5 to reflect this change in scale. Below the graph, we see the math for the slopes of the two line segments connecting these points. The slope of each segment is the same as it was initially. In other words, using a non-zero baseline does not impact the slope of the lines in a line graph.
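For anyone who would rather check the arithmetic in code than by hand, here is a minimal Python sketch of the same comparison. The helper name `segment_slopes` is made up for illustration, and subtracting the baseline from each y-value is done only to compare the slopes, mirroring the sketch above.

```python
def segment_slopes(points):
    """Return the slope (rise over run) of each segment joining consecutive points."""
    return [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    ]

# Points from the example: values 6, 7, 19 at x = 1, 2, 3.
original = [(1, 6), (2, 7), (3, 19)]

# The same points measured from a baseline of 5 instead of 0.
baseline = 5
shifted = [(x, y - baseline) for x, y in original]

original_slopes = segment_slopes(original)
shifted_slopes = segment_slopes(shifted)

print(original_slopes)  # [1.0, 12.0]
print(shifted_slopes)   # [1.0, 12.0]
```

Both lists come out identical because the baseline is subtracted from every point, so the differences between neighboring points never change.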

Just to be sure I didn't inadvertently pick lucky numbers in this example, I did a second example with points (5, 12), (10, 17), and (20, 42) on a full y-axis scale from 0 to 50, and then one from 5 to 50 (again, reducing the y-coordinates appropriately to reflect this rescaling). I found the same thing: the slopes of the lines remain the same between the graph with the zero baseline and the one that's been rescaled. When I thought about this some more later, it seemed obvious. Of course the slope of the lines doesn't change: I'm not changing the points relative to each other, I'm only changing their location relative to the x-axis.
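The general argument behind that intuition fits in one line of algebra. The symbols are my own labels rather than anything from the sketch: $b$ is the new baseline, and $(x_1, y_1)$ and $(x_2, y_2)$ are any two neighboring points.

```latex
\[
\frac{(y_2 - b) - (y_1 - b)}{x_2 - x_1}
  = \frac{y_2 - y_1 - b + b}{x_2 - x_1}
  = \frac{y_2 - y_1}{x_2 - x_1}
\]
```

The baseline cancels out of the numerator, so the result holds for any $b$ and any pair of points. In the second example, both versions give slopes of $(17 - 12)/(10 - 5) = 1$ and $(42 - 17)/(20 - 10) = 2.5$.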

But the conversation didn't end there. When I shared this with the blog reader, Roberto, he responded with a couple of graphs to help illustrate his points. The first shows the original line (blue) on the primary y-axis (ranging from 0 to 20) and the line rescaled onto a secondary axis (red; with axis ranging from 5 to 20).


The next graph shows the same initial line on the primary axis (blue) and the line rescaled onto a secondary axis (red) that ranges from 5 to 105.


It's true that the absolute perception of steepness changes with the changing axis range. You see this when comparing either line that's plotted on the secondary axis to the original. But I'm not convinced that the relative slope between the two segments of the line is impacted. Rather, the two segments appear to move together as the axis range changes.
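One way to put rough numbers on this is to compute the slope as it would be drawn on the page for each of the three axis ranges above. This is only a sketch under stated assumptions: the plot area size (400 by 300 pixels), the x-axis range of 1 to 3, and the function name `screen_slopes` are all hypothetical choices of mine, not taken from Roberto's graphs.

```python
def screen_slopes(data_slopes, y_range, x_range=(1, 3), width=400, height=300):
    """Convert data-space slopes to on-page slopes (pixels of rise per pixel of run).

    width and height are an assumed plot-area size in pixels.
    """
    y_min, y_max = y_range
    x_min, x_max = x_range
    px_per_y = height / (y_max - y_min)  # pixels per unit on the y-axis
    px_per_x = width / (x_max - x_min)   # pixels per unit on the x-axis
    return [m * px_per_y / px_per_x for m in data_slopes]

# Data-space slopes of the two segments in the example (6 -> 7 and 7 -> 19).
data_slopes = [1, 12]

results = {}
for y_range in [(0, 20), (5, 20), (5, 105)]:
    first, second = screen_slopes(data_slopes, y_range)
    results[y_range] = (first, second, second / first)
    print(y_range, round(first, 3), round(second, 3), round(second / first, 3))
```

Under these assumptions, each drawn slope gets steeper or flatter as the axis range narrows or widens, while the ratio between the two segments stays at 12 in every case. Note that this is a ratio of slopes (rise over run), not of angles; the angles themselves do not scale proportionally.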

To make sure I'm not promoting anything inappropriate, I consulted a couple of other sources/experts. Alberto Cairo said that, when possible and to avoid confusion, you should retain the zero baseline. He suggested that when this isn't feasible, you can create two line graphs rather than one: the graph with the zero baseline can be a small inset without the scale (just the baseline) in one corner of the larger graph, where you've zoomed in. This is an interesting solution, and one I plan to try out when the next opportunity presents itself.

I also consulted Stephen Few's Show Me the Numbers, where his description of zero-based scales reads as follows:
When you set the bottom of your quantitative scale to a value greater than zero, differences in values will be exaggerated visually in the graph. You should generally avoid starting your graph with a value greater than zero, but when you need to provide a close look at small differences between large values, it is appropriate to do so. Make sure you alert your readers that the graph does not give an accurate visual representation of the values so that your readers can adjust their interpretation of the data accordingly.
He follows this up with an example of a zoomed-in line graph with the following warning: "Attention: The dollar scale along the vertical axis is narrow to reveal the subtle, yet steady rise in sales since July."

So the bottom line is: you can have a non-zero baseline in line graphs (which can be useful when the numbers you want to show are some distance away from zero), but I (and other experts) advise care when doing so. You want to take context into account and make sure you aren't zooming in a way that visually overemphasizes minor differences. Also, make it clear to your reader that you aren't showing the full scale. Agree/disagree? Have other ideas for addressing this challenge? Leave a comment with your thoughts.

Big thanks to Roberto for his thought-provoking comments (please feel free to jump in if I've mischaracterized anything; also, for those who might be interested, Roberto's Excel gallery can be found here). Thanks also to Alberto for taking the time to read my draft post and lend his thoughts.

15 comments:

  1. It depends in part on the audience's knowledge of a metric's normal variability. If you're talking annual precipitation trends with climatologists, you're probably pretty safe. If it's a metric that's completely new to the audience, however, I would use more caution.

    Not that this is a rule, but when presenting trends without a zero baseline I would be tempted to get rid of the x axis line so as not to encourage that vertical comparison.

    Another factor is one that I heard Tufte talk about when I attended his roadshow several years ago: the amount of data you're graphing. With or without a zero baseline, if you're following a metric over a long period, your audience will have a much better context in which to judge if the recent rise or dip is significant for that metric.

    Jeff Harrison

  2. Fully agree that line charts are different from bars with regard to the zero baseline for most types of data ($, %, etc.). Curious about your thoughts (and if you could go back to Alberto on this one, his as well) on how this applies to Likert-scaled questions. I've gotten pushback from some and know others who support my thinking that it's equally OK to do this with survey data (i.e., not show the full range of response options for a 1-5, 1-7, 1-# scale on the y-axis) in order to better illustrate trends across time and amongst the various lines displayed. Would love to add your perspective to our conversation. Thanks!

    Replies
    1. Hi Anand, thanks for your comment! I'm not sure this exactly answers your question, but if I were visualizing Likert-scale data over time I would probably graph a summary metric rather than an individual line for each response option (for example, % favorable, which could be the % agreeing + % strongly agreeing), which I think would skirt your debate altogether. Or if this doesn't make sense for what you're thinking about, maybe you can share an example?

    2. I like using what's in the left chart rather than the one on the right in this example: https://www.dropbox.com/s/xfcs6r575o083li/Likert%20Learning%20Levels%20DataViz%20Example-%20Baseline%20Question.jpg

  3. You aren't changing the slope of the line because, mathematically, you haven't changed the data. I think it is entirely appropriate to adjust the scale, including starting at a non-zero value, to properly tell the story (while also avoiding overemphasis on minor changes, as you mentioned). If you don't change the scale, sometimes you can get some ugly-looking charts with a lot of white space at the bottom.

  4. I always tell my students that when you break the axis on a line chart it's like using a magnifying glass, so be careful how much you magnify. If you over-magnify, you will distort the picture. I also tell them, as you point out, to never break the axis on a bar chart. When I see that, I assume the author is lying to me. The other one that I see people do frequently is break the axis on an area chart. This is the same fundamental problem as breaking the axis on a bar chart. Maybe people think that because it's a line it's ok, but the purpose of the shading underneath is to compare height. So any chart type that uses length/height as a quantitative measure needs to have a zero baseline (bar, area, lollipop, histogram, etc.). Also in the same category as the line chart is the dot plot (depending on how it's used).

    As for the math, it's really very simple. The slope is calculated based on the x and y values and those values will remain the same regardless of what the axis looks like. I've heard people describe the difference as the "physical slope" (i.e. the math equation that you drew) vs. the geometric slope (the appearance of the line). Great discussion topic!

    Replies
    1. All makes sense. Thanks for sharing your perspective, Jeff!

  5. While I agree with your conclusions, I do not entirely agree with the proof. In the re-scaled figure (to the right) of your analysis, you have changed the x and y co-ordinates from their original values of (1,6), (2,7) and (3,19) to (1,1), (2,2) and (3,14) respectively. While the arithmetic gives you the same answers, the reality is that if you were to present the two sets of co-ordinates as a data table, you will be inadvertently changing the absolute values in the data. So indeed, you may end up distorting the actual data even if your slopes are not distorted. My take is that if you are going to re-scale the y-axis, go ahead and do so (with all the caveats and covering notes suggested above) but leave the co-ordinates as they are. Again, re-scaling the y-axis in your figure above does not, for example, change the co-ordinates of the first point from (1,6) to (1,1) as doing so could mean that whereas $1 buys 6 eggs in the initial instance, the same dollar will buy only 1 egg after re-scaling.

    Replies
    1. I absolutely agree. I reduced the y-axis coordinates in this example solely for the purpose of calculating and comparing the slopes; you wouldn't actually change the coordinates when you rescale the axis. Thanks for your comment!

  6. I generally set my axes to start at zero, but the one circumstance when I don't is when I employ sparklines in a dashboard. Given that they are intended to be very small, if I started them at zero then the trend line they're intended to highlight would often be unnoticeable. I also tend to place value labels at the start and end of the line so the viewers of the chart understand the scale.

  7. Great topic of conversation! One suggestion would be to display the percentage variance in the data label. This would be the variance from the prior data point (period-over-period). This would basically be the same as displaying the mathematical slope of the line, and you would not have to be as concerned with the visual slope. As you have concluded, the visual slope can be distorted by the baseline OR by the size of the chart.

    The only consistency will be the mathematical slope, but it is difficult to display that in a chart. The x and y axes would have to be the same relative length, and this is not always possible when you have two different measures on the axes. For example, months on the x-axis and dollars on the y-axis.

    You would then have to actually measure the angle of the lines and adjust the height and width of the plot area to properly display the mathematical slope.

    Therefore, displaying the percentage variance between points in the data labels would compensate for these issues. This is much easier to do in Excel 2013 with the new "Value From Cells" label option. This allows you to select a range of cells to use as the data labels. So you can calculate the percentage variance in a row/column in your data set and then display them as labels in the chart. One thing to note, the "Value From Cells" feature only appears in the options menu when you are using a 2007+ file format (.xlsx/m). It does not appear if you are using a .xls file type.

  8. Cole,

    Thanks for doing this. It is important to remember that both the accuracy and the perception of what we present matter. It is very easy to manipulate a graph to prove a point.

  9. Of course you're not changing the slope of the line within its native data space when you change the baseline, but when you map that space to the printed page (or computer monitor or whatever) you most certainly will change the distance between points on the printed page -- and that is what defines the slope we perceive.

    William S. Cleveland refers to this as "banking," and he claims that perceptual experiments have shown that adjusting the aspect ratio of your graph -- regardless of your baseline -- such that your curve banks to a 45-degree angle leads to your data being interpreted in the optimal way. Cleveland's "Visualizing Data" is well worth a read if you haven't read it already.

  10. Cole's blog brings back memories: when I started in a mathematical high school, the very first week our algebra teacher made us draw a polynomial graph (y=x^3, I think) and then about fifty variations of it: y=f(x-1), y=f(x+1), y=f(x)-1, etc. - all possible combinations of sign, placement, value, and arithmetic operation. After that, moving the graph up-down-left-right and compressing/expanding it became automatic.
