HISTOGRAM: The Breathless Horror and Disgust -- by JD Smith

There is something at work in my soul which I do not understand . . . there is a love for the marvellous, a belief in the marvellous, intertwined in all my projects, which hurries me out of the common pathways of men, even to the wild sea and unvisited regions I am about to explore.     --Mary Shelley, FRANKENSTEIN

[Editor's Note: It would not be putting too fine a point on the subject to say that the following tutorial, written by JD Smith, is famous. And justifiably so. It is the only coherent and complete explanation of the Histogram command extant. Read it several times. You will learn more each time you do. And, maybe, if you read it enough, the secrets of the command will be revealed to you. Stranger things have happened.]

You've seen it... scattered about old tomes of code like some forgotten incantation drawn from an ancient tongue, summoning unspoken forces. Perhaps you felt a deep, visceral uneasiness in the presence of its occult construction, arcane invocation, and dark power. Though it may turn your stomach, HISTOGRAM is a monster which, when tamed and brought into your service, can be a powerful ally in your quest to avoid the dreaded and oft-inefficient IDL for loop. With it you can sort and select, trim and smooth, index and decimate: almost the full range of array manipulations, including many you probably didn't think could be done efficiently in IDL.

Although a full account of HISTOGRAM's many uses would occupy a large volume, here I'll try to build a foundation on which many novel variations can be built.

Remedial Review

A histogram, as we all know (at least those of us who didn't spend Algebra II passing notes and perfecting spitball ranging techniques) represents nothing more than a fancy way to count: a series of numbers in an input vector is divvied up into bins according to their value, and the (fixed) size of the histogram bin. For each input value which falls into a given bin, a single "drop in the bucket" is added. Let's start with a simple, familiar example:

  IDL>  print,histogram(indgen(10),BINSIZE=5)
             5           5

Indeed, 5 values each fell in the two bins of size 5. What about:

  IDL>  print,histogram(indgen(10))
             1           1           1           1           1           1
             1           1           1           1

Here you see the default is a binsize of 1 -- one drop fell into each bin. You can of course also take histograms of floating point values:

  IDL> print,histogram(randomu(sd,100),BINSIZE=.2,MIN=0.)
            20          20          20          18          22

Looks about right. If this were all there was to HISTOGRAM, there wouldn't be much use in a tutorial. You'll soon see this is far from the case.

We Can Build Him Faster

A surprising point worth mentioning about HISTOGRAM is that it's very fast: much faster than direct for loops, and usually noticeably faster than equivalent solutions using, e.g. WHERE (where they exist). Sometimes these savings result from a more efficient algorithm, e.g. avoiding unnecessary repeated linear searches through an input vector. But even when the number of operations (e.g. indexing) is equivalent, HISTOGRAM can outperform other IDL built-ins -- it's very well optimized.

Another thing to remember: although the for loop is quite slow in IDL, if the number of iterations is kept small, and the work done per iteration is large, you won't feel the looping penalty. HISTOGRAM is often very helpful in this respect. A typical rule of thumb: if you're looping over each data element individually, there's (probably) a faster way to do it.

Bins and Buckets

A slight detour brings us through the littered landscape strewn with various inputs to HISTOGRAM which work together to specify that all important ingredient: the size of the histogram bin.

MAX
The maximum value of the array to consider. Can change from the input -- see NBINS.
MIN
The minimum value of the array to consider.
BINSIZE
The size of the bin, which defaults to 1. It's worth noting that the binsize is constant throughout the histogram.
NBINS
A relative of BINSIZE, NBINS is something of a misnomer. The relation HISTOGRAM uses to compute the bin size if passed NBINS is:

    BINSIZE=(MAX-MIN)/(NBINS-1)

and if NBINS is specified, MAX is changed to be (independent of any value passed as MAX):

    MAX=NBINS*BINSIZE+MIN

As such, it's probably better to avoid NBINS, if you care about the MAX value staying put. A better relation which would leave MAX as is and give you exactly NBINS bins between MIN and MAX:

    BINSIZE=(MAX-MIN)/NBINS
OMIN|OMAX
These output keywords give you the min and max value used to construct the histogram. Useful if you don't specify them directly.

Remember that MIN and MAX will default to the minimum and maximum of the array (except for byte data, which defaults to MIN=0). It's very helpful to use OMIN if you don't specify the minimum, so you know what the first bin corresponds to.

When Good HISTOGRAM's Go Bad

An important point to keep in mind is that HISTOGRAM creates one bin for each unit bin size between the minimum and maximum, even if no such data are present in the input! This means that histograms of sparse arrays can be very wasteful:

   IDL> h=histogram(MIN=0,[100000000L])
   IDL> print,n_elements(h) 
      100000001

That's a lot of zeroes, just to tell us we have one data point in the 100,000,000th bin! In some cases, you can partially mitigate this problem by not specifying MIN directly, using the OMIN output instead:

   IDL> h=histogram(OMIN=om,[100000000L])
   IDL> print,n_elements(h) 
              1
   IDL> print,om
      100000000

But more often for sparsely populated data this won't help:

   IDL> h=histogram(OMIN=om,[0L,100000000L])
   IDL> print,n_elements(h) 
      100000001

You must always balance the speed gains of HISTOGRAM against its potentially large memory usage in the case of very sparse arrays. The sparseness of course depends on the size of your bin (e.g. if I had made a bin size of 100,000,001 above, there would have only been one bin), but for the typical case of integers with binsize=1, it's easy to understand the data sparseness as "what fraction of integers, on average, are present in the data between its min and max?" Below you'll see how to compute this.

Reverse Indices

This little HISTOGRAM jewel, obtained by setting the REVERSE_INDICES keyword, is actually its most useful and least understood aspect. It tells you, for each bin in the histogram, the actual indices of the data which fell into that bin. The format of the reverse indices vector is sadly quite obtuse, turning many away from its use, but in fact it's really quite simple. In a call like:

   IDL> data=fix(randomu(101,25)*12)
   IDL> h=histogram(data,OMIN=om,REVERSE_INDICES=ri)

The ri vector is actually (and quite disturbingly) two vectors in one, concatenated together. I call them the i-vector and the o-vector:

   =======================================================
           The `i' vector          The `o' vector
   ri = iiiiiiiiiiiiiiiiiiioooooooooooooooooooooooooooooo
        |-----------------||----------------------------|
        |0              nh||nh+1             nh+total(h)|
   =======================================================

The last index of the i-vector, nh, is the number of elements in the histogram. The i-vector is actually an index into the o-vector portion of the reverse indices list itself: a list which indexes itself! Scary, yes, but that's just unusual bookkeeping. It's much simpler than all that. The easiest way to think of it is as follows: each bin of the histogram has a short list of zero or more indices associated with it -- picture each drop in the bucket painted with the index of the data to which it corresponds:

   |   |   |   |   |6  |   |   |   |   |   |   |   |
   |   |5  |   |   |10 |   |   |   |   |   |   |3  |
   |7  |12 |   |   |15 |   |   |   |   |   |   |9  |
   |11 |20 |   |16 |19 |   |   |   |   |   |8  |17 |
   |14 |21 |1  |24 |23 |0  |2  |4  |   |18 |13 |22 |
   +---+---+---+---+---+---+---+---+---+---+---+---+    
     0   1   2   3   4   5   6   7   8   9  10  11    

The o-vector contains these indices, in order, and the i-vector just shows us where to go to get them. E.g., to get the indices from the first bin of the histogram, we use:

   IDL> print, ri[ri[0]:ri[1]-1]
              7          11          14
   IDL> print, data[ri[ri[0]:ri[1]-1]]
          0       0       0       
   IDL> print,ri[0],ri[1]-1
             13          15

and from the 5th bin:

   IDL> print, ri[ri[4]:ri[5]-1]
              6          10          15          19          23
   IDL> print, data[ri[ri[4]:ri[5]-1]]
          4       4       4       4       4

That is, adjacent values in the i-vector part of ri specify the range of elements in the o-vector part of ri which contain the indices of data present in that bin. In the first example, there were 3 data elements which fell in the first bin (all, as you'd anticipate, with a value of 0). What if no data fall in a given bin?

   IDL> print,where(h eq 0)
              8
   IDL> print,ri[8],ri[9]
             31          31

In this case, you see that the two adjacent elements of the i-vector are the same: they don't span any elements of the o-vector. Typically, you need to test for this to avoid a null index range. Something like this.

   IDL> if ri[4] ne ri[5] then print, data[ri[ri[4]:ri[5]-1]] else print, 'No data in bin 4'
   IDL> if ri[8] ne ri[9] then print, data[ri[ri[8]:ri[8]-1]] else print, 'No data in bin 8'
          4       4       4       4       4
        No data in bin 8

Note that the smallest value in the i-vector is nh+1, i.e. the first index of the o-vector, one more than the number of elements in the histogram. This is just because i and o are glued together like that -- had the i-vector and o-vector been kept separate, the former would have started at 0.

The main point to remember: HISTOGRAM actually has three very useful outputs, not one: the histogram itself, h, the reverse index self-index vector i, and the original index vector o. Sometimes they're useful together, and sometimes only one of the three holds the key to solving a particular problem.

Flexing your HISTOGRAM Muscle

The best way to learn HISTOGRAM techniques is to dive right into a set of working examples. Don't forget to come up for air!

Using the h-vector

Perhaps the most common set of operations in IDL is building, reformatting, dissecting, and applying lists of indices to arrays of various dimensionality (see, for instance the dimensional juggling tutorial). HISTOGRAM is no stranger to these index slinging operations. A simple example taken from the manual:

Problem: Increment each element of a vector listed in a index vector by 1.

   IDL> inds=fix(randomu(sd,8)*3)
   IDL> print,inds
          2       2       0       1       2       0       1       2
   IDL> vec=intarr(3)
   IDL> vec=histogram(inds,INPUT=vec,MIN=0)
   IDL> print,vec
              2           2           4

This works even if individual indices are mentioned several times (in which case the explicit vec[inds]=vec[inds]+1 does not work). INPUT is a convenience, equivalent to vec=vec+histogram(...). You simply can't do this with normal indexing, without a for loop. You needn't just increment by one either:

Problem: Increment all elements of vec with indices mentioned at least twice by PI.

   IDL> vec=vec+(histogram(inds,MIN=0,MAX=n_elements(vec)-1) ge 2)*!PI

Problem: Find the value intersection of two vectors, a.k.a. which values are present in both a and b?

   IDL> a=fix(randomu(sd,8)*20)
   IDL> b=fix(randomu(sd,8)*20)
   IDL> print,a,b
          6       7      16      13      15      13      19       7
         16       5       8      17       7       7       4       7
   IDL> print,where(histogram(a,OMIN=om) gt 0 AND histogram(b,MIN=om) gt 0)+om
              7          16

I.e., we look for which buckets contain at least one drop for both vectors, using OMIN to save some work (since there can be no overlap below the minimum of either array). You could also easily use this method to find values repeated exactly twice, or present in one but not another (the set difference).

Problem: Remove some elements, listed in random order, from a vector.

   IDL> vec=randomu(sd,10)
   IDL> remove=[3,7,2,8]
   IDL> keep=where(histogram(remove,MIN=0,MAX=n_elements(vec)-1) eq 0,cnt)
   IDL> if cnt ne 0 then vec=vec[keep]
   IDL> print,keep
              0           1           4           5           6           9

We've used HISTOGRAM and WHERE to simply generate a list of kept indices.

Problem: Find a multi-dimensional histogram of a set of input coordinate tuples (e.g. (x,y,z)).

This general class of problems (solved in 2D by RSI's HIST_2D, and in up to eight dimensions by HIST_ND) also has other applications. The key trick is to scale all data into integer bins yourself, e.g.:

    IDL> h=(data-min(data))/binsize

and to convert these multi-dimensional scaled "indices" into a single index (exactly analogous to converting a (column, row) index of an array into a single running index). In 3D this looks like:

  IDL> h=h0+nbins[0]*(h1+nbins[1]*h2)
  IDL> ret=histogram(h,MIN=0,MAX=total_bins-1)

where h0,h1, and h2 are the scaled data "indices" from the first, second and third dimensions, and total_bins is the total number of bins in your grid (just the product of the bin sizes). You can see how to generalize to higher dimensions, and how to apply the same technique to do other work with data sets of more than one dimension.

Problem: Determine how sparse a set of input data is.

   IDL> data=fix(randomu(sd,1000)*10000)
   IDL> h=histogram(data)
   IDL> print,total(h ne 0)/n_elements(h) 
       0.0953624

I.e., very close to 1 in 10 integers are present. This works even if your data are very clustered and have a tendency towards repeated values.

Using the o-vector

The o-vector portion of the reverse indices vector is essential for its intended use: gathering together those elements of the input data vector which fall in an individual bin. With it, you can easily perform operations on data which have been divided up into bins of any given size. And unlike linear search methods, you can precompute the histogram with reverse indices, and subsequently perform many different operations on different data sub-groups using the same histogram, for a substantial speedup.

Problem: Find the individual median in each quartile of a data set.

   IDL> data=randomu(sd,100)*100
   IDL> h=histogram(data,BINSIZE=25, REVERSE_INDICES=ri)
   IDL> med=fltarr(4)
   IDL> for j=0L,3L do if ri[j+1] gt ri[j] then $
          med[j]=median(data[ri[ri[j]:ri[j+1]-1]])
   IDL> print,med
         13.3847      39.0525      68.0970      87.5475

This pattern forms the standard usage of the o-vector component of the reverse indices vector: loop over histogram bins, and, if non-empty, do something to the data which fell into that bin.

The reverse indices don't have to be used to index the original array, much as an index list returned by SORT doesn't have to be used to sort the vector it was passed:

Editor's Note: Please note that quartiles can only be calculated like this if you data is uniformly distributed. The proper definition of quartile would suggest that you are going to have four bins and the same number of points would be in each bin. That is, the median would separate the data set into two bins of equal number of points. Then, taking the median of those two sub-bins would result in the two quartiles (25% and 75%). In practice, say you wanted to draw a box and whisker plot, you might do something like this if you had a non-uniform distribution of points.

   data=randomu(sd,100)*100
   minVal = min(data)
   maxVal = max(data)
   medianVal = median(data,/even)

   ; Find the quartiles.
   qtr_25th = Median(data[Where(data LE medianVal, countlowerhalf)])
   qtr_75th = Median(data[Where(data GT medianVal, countupperhalf)])

Problem: Total data together according to a separate list of indices, e.g.:

   inds=[3 , 1, 0, 4, 2, 3, 3, 2, 4, 0]
   data=[.1,.8,.6,.4,.2,.9,.7,.3,.5,.2]
   vec[0]=data[2]+data[9]
   vec[1]=data[1]
   ....

This is easy, using the o-vector -- simply use the reverse index of the inds vector as an index to another vector -- data:

   IDL> h=histogram(inds,reverse_indices=ri,OMIN=om)
   IDL> for j=0L,n_elements(h)-1 do if ri[j+1] gt ri[j] then $
          vec[j+om]=total(data[ri[ri[j]:ri[j+1]-1]])

For large histograms, there are even more efficient ways to do this with very short or no loops (e.g. using a histogram of the histogram). Additonal information on this topic can be found in the article on array decimation.

Problem: Threshold a vector, setting all elements between 20 and 60, inclusive, to zero.

   IDL> data=fix(randomu(sd,100)*100)
   IDL> h=histogram(data,OMIN=om,REVERSE_INDICES=ri)
   IDL> data[ri[ri[20-om>0]:ri[61-om>0]-1]]=0.

What's this? We've clearly violated the time honored r[i]:r[i+1] convention, and skipped right through 41 bins worth of o-vector indices in one go. Since the o-vector contains adjacent groups of indices one right after another, this is perfectly legal. You'll get into trouble if there are no elements between the limits (i.e. if ri[20-om>0] eq ri[61-om>0]), but you can test for that.

Problem: Throw away data with more than twice the median repeat count rate.

   data=fix(randomn(sd,100)*5)
   h=histogram(data,REVERSE_INDICES=ri)
   keep=where(h le 2*median(h))
   num=h[keep]
   for n=1,max(num) do begin 
      wh=where(num ge n,cnt)
      if cnt eq 0 then continue
      if n_elements(k_inds) eq 0 then k_inds=ri[ri[keep[wh]]+n-1] else $
         k_inds=[k_inds,ri[ri[keep[wh]]+n-1]]
   endfor
   data=data[k_inds]    

Here we've chosen to loop over bin count and not directly over the reverse indices, to reduce the size of the loop (with its costly concatenation). The simpler form using the standard pattern is:

   for i=0,n_elements(keep)-1 do begin
      k=keep[i]
      if ri[k+1] gt ri[k] then $
         if n_elements(k_inds) eq 0 then k_inds=ri[ri[k]:ri[k+1]-1] $
         else k_inds=[k_inds,ri[ri[k]:ri[k+1]-1]]
   endfor
   data=data[k_inds]

If you need to preserve the ordering of the kept elements, accumulate a throw-away index list similarly (e.g. bad=where(h ge 2*median(h)), and then use the method from the previous section to generate an ordered kept-index list from it.

Using the i-vector

The i-vector is the least used portion of HISTOGRAM's output (aside from the trivial case when it's used to index the o-vector), but it can do some very neat things.

Problem: Construct a vector with values from one vector and a repeat count from another (so called chunk indexing). E.g., turn:

  IDL> d=[4,5,6,8]
  IDL> n=[2,1,3,4]

into

   new_d=[4,4,5,6,6,6,8,8,8,8]

This is easy with the i-vector:

   IDL> h=histogram(total(n,/CUMULATIVE)-1,/BINSIZE,MIN=0,REVERSE_INDICES=ri)
   IDL> i=ri[0:n_elements(h)-1]-ri[0]
   IDL> print,d[i]
          4       4       5       6       6       6       8       8       8
          8

What have we done here? Well, we've taken the cumulative sum of n:

   IDL> print,total(n,/CUMULATIVE)-1
         1.00000      2.00000      5.00000      9.00000

created a histogram from it, and used the fact that the i-vector contains multiple repeated indices for all the empty bins:

   IDL> print,i
              0           0           1           2           2           2
              3           3           3           3

Note that this subject is discussed in considerably more depth in a related article on array decimation.

Applying the Current

Don't worry if everything doesn't come together at once. No monster was built in a day. Just keep tinkering with the examples, examining the by-products along the way, and soon you'll be terrorizing unsuspecting villagers the world-round with your creations.

Google
 
Web Coyote's Guide to IDL Programming