Distribution charts

Review of a typical chart

For many applications in drawing charts, we choose a field to group the data by, and at least one metric, which is usually an aggregation function, common ones being sum(), avg(), avg(), max(), count() or min().. For example sales by product line, we group the sum of sales amounts (different invoice lines) by the dimension, product line.  Typical dimensions are product line, month, customer, site. This is opposed to typical fields we use in expressions in aggregations functions such is costs, sales, quantity, or measurements such as voltage.

In the chart about the length of the bar is determined by the expression sum(Sales)

Enter Distribution charts

The need for a distribution chart arises when we look at quality control, age buckets in demographics, or debt ageing. A lecturer showing the students after a test how many students scored which marks, is a typical (bell curve) distribution chart. We use a field that we usually add up (an amount, quantity, or the value of some measurement) as a dimension, and then count the number of occurences.

A distribution chart with equal bucket range (except for the last one). To allow correct sorting, an extra legend was added. 

The need for bucketing or rounding

Very often, depending on the precision used, we find that there are as many values as there records – not very helpful. If we want to draw a distribution chart of the lengths of students in a classroom, and we use the precision as millimetres, it may be that no two students will have the same length. What we want to do to get an idea, is to group the ones that have more or less the same length together.

One option is to simply lower the precision by rounding the values to a level that makes sense. At other times, we may wish to retain the original high precision value, and create a new rounded field, useful in distribution charts. When we have values such as 1.334, 1.456, 1.765 we may wish to round to the nearest 1/10th (0.1). At other times we have values like 3,400 ; 2,800 etc. Then we round to the closest 500, or closest 1,000. At this point the word “bucketing” rather than “rounding” may be more approrpiate. The nice thing about bucketing is that all buckets need not cover equal ranges. A very long tail distribution will be impractical to visualise if we insist on keeping bucket ranges consistent.

We achieve this in the code by using a nested if(). Our if statement will look like this:

..,

if(value>1000,1000,if(value>500, 500, if(value>400, 400, if(value>300,300,0)))) as [Value Bucket],

Note there is no need to say ..if(value < 1000 and value >500   because that part won’t be evaluated in the first place, if the value was greater than 1,000.

Note that I exploited the power of not having to have buckets with equal ranges. I also lumped the potentially long tail (everything over 1,000) into one bucket.

This almost immediatly raises another question. People expect to see distribution charts sorted by the value that each point on the x-axis represents.  One way is to train or signify to the intended audience, that each label on the horizontal is meant to be read as the boundary of that bucket. But that will be awkward. It would be nice (not to mention clear) to call the first bucket “1,000+” or something like that.

Another way is to create two values, using the same if() bucketing logic. One is the sort order, and the other is the text that will appear on the axis. QlikView allows a developer to specify a sort expression, that may be different or refer to field(s) other than the dimension field. But a more elegant version of this, is to give each field a display value, as well as a behind-the-scenes numerical value. This is achieved by the dual function.

 

A typical long tail distribution.

Leave a Reply

Your email address will not be published. Required fields are marked *