Whatever it is called, it is usually

1. When doing hypothesis tests, the loss of information when dividing continuous variables into categories typically translates into losing power.

2. The loss of information involved in choosing bins to make a histogram can result in a misleading histogram.

Example: The following three graphs are all histograms of the same data (the times between successive eruptions of the Old Faithful geyser in Yellowstone National Park). The first has five bins, the second seven bins, and the third 14 bins.
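The effect of bin choice can be explored with a short sketch like the one below. It is only an illustration: the sample is synthetic (two made-up clusters drawn from normal distributions), not the Old Faithful data, and the helper name bin_counts is hypothetical. The same sample is tallied with 5, 7, and 14 equal-width bins.

```python
import random


def bin_counts(data, n_bins):
    """Count observations in n_bins equal-width bins spanning the data range.

    Assumes the data are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum value would fall at index n_bins, so clamp it into the last bin.
        i = min(int((x - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts


# Synthetic bimodal sample (an assumption for illustration, NOT the geyser data).
rng = random.Random(0)
data = [rng.gauss(55, 6) for _ in range(100)] + [rng.gauss(80, 6) for _ in range(170)]

for n_bins in (5, 7, 14):
    print(f"{n_bins} bins")
    for count in bin_counts(data, n_bins):
        # One '#' per two observations keeps the text bars short.
        print("  " + "#" * (count // 2))
```

The shape suggested by the counts can change with the number of bins, even though the underlying sample is identical in all three tallies.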

3. Collecting continuous data by categories can also cause headaches later on. Good and Hardin

4. Wainer, Gessaroli, and Verdi

5. There are times when continuous data must be dichotomized, for example in deciding a cut-off for diagnostic criteria. When this is the case, it is important to choose the cut-off carefully, and to consider the sensitivity, specificity, and positive predictive value.
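As a rough illustration of weighing a cut-off, the sketch below computes sensitivity, specificity, and positive predictive value for two candidate cut-offs. The measurements and true statuses are invented for the example, the function name diagnostic_summary is hypothetical, and it assumes that higher values indicate the condition (for a measure where lower values indicate the condition, the comparison would be reversed).

```python
def diagnostic_summary(values, has_condition, cutoff):
    """Treat value >= cutoff as a positive test and compare with true status.

    Returns (sensitivity, specificity, ppv); an entry is None when its
    denominator is zero.
    """
    tp = fp = tn = fn = 0
    for value, sick in zip(values, has_condition):
        positive = value >= cutoff
        if positive and sick:
            tp += 1
        elif positive and not sick:
            fp += 1
        elif not positive and sick:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None  # true positives among those with the condition
    specificity = tn / (tn + fp) if (tn + fp) else None  # true negatives among those without it
    ppv = tp / (tp + fp) if (tp + fp) else None          # those with the condition among test positives
    return sensitivity, specificity, ppv


# Made-up measurements and true statuses, for illustration only.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
has_condition = [False, False, False, True, False, True, False, True, True, True]

for cutoff in (5, 8):
    sens, spec, ppv = diagnostic_summary(values, has_condition, cutoff)
    print(f"cutoff {cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}, PPV={ppv:.2f}")
```

In this toy data set, raising the cut-off from 5 to 8 lowers sensitivity while raising specificity and positive predictive value, which is the kind of trade-off that has to be weighed when a cut-off is chosen.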

Notes:

1. "Binning" is also used to refer to processes used in data mining and analytics. In those fields, which usually deal with large data sets and aim to discover patterns, carefully developed algorithms and validation with holdout subsamples can create a more rigorous process than the types of discretizing discussed on this web page.

2. One situation in which categorizing may be necessary is when comparing new data with existing data where only the categories are known, not the values of the continuous variable. Categorizing may also sometimes be appropriate for explaining an idea to an audience that lacks the sophistication for the full analysis. However, this should only be done when the full analysis has been done and justifies the result that is illustrated by the simpler technique using categorizing. For an example, see Gelman and Park (2008), Splitting a predictor at the upper quarter or third, American Statistician 62, No. 4, pp. 1-8. See also footnote 1 above.

3. See http://psych.colorado.edu/~mcclella/MedianSplit/ for a demo illustrating this in the case when a continuous predictor in regression is dichotomized using a median split. Also see Van Belle (2008) Statistical Rules of Thumb, pp. 139-140 for more discussion and references.

4. Some software has a "kernel density" feature that can give an estimate of the distribution of data. This is usually better than a histogram. The problem with bins in a histogram is the reason why histograms are not good for checking model assumptions.

5. Good and Hardin (2006) Common Errors in Statistics, pp. 28-29.

6. Wainer, Gessaroli, and Verdi (2006). Finding What Is Not There through the Unfortunate Binning of Results: The Mendel Effect, Chance Magazine, Vol. 19, No. 1, pp. 49-52. Essentially the same article appears as Chapter 14 in Wainer (2009) Picturing the Uncertain World, Princeton University Press.

7. In addition to the references listed at the end of the linked page, see also Susan Ott's Bone Density page for a graphical discussion of the cut-offs for osteoporosis and osteopenia.
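The loss of power from a median split (item 1 above and note 3) can also be checked with a small simulation. The sketch below is one possible setup, not the demo linked in note 3: the effect size, sample size, number of replications, and function names are all assumptions chosen for illustration. Each simulated data set is tested twice, once with the continuous predictor and once with the predictor replaced by a high/low indicator. With a two-level predictor, the regression t-test is the same as a pooled two-sample t-test, so one function serves both cases.

```python
import math
import random
import statistics

# Two-sided 5% critical value of Student's t with 48 degrees of freedom.
# It matches n = 50 observations (df = n - 2), so n is fixed at 50 below.
T_CRIT_DF48 = 2.0106
N = 50


def slope_t_statistic(x, y):
    """t statistic for the slope in a simple linear regression of y on x."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation
    return r * math.sqrt((n - 2) / (1 - r * r))


def estimate_power(beta=0.4, reps=2000, seed=1):
    """Share of simulated data sets in which each analysis rejects at the 5% level."""
    rng = random.Random(seed)
    hits_continuous = hits_split = 0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(N)]
        y = [beta * xi + rng.gauss(0, 1) for xi in x]
        median = statistics.median(x)
        x_split = [1.0 if xi > median else 0.0 for xi in x]  # median split
        if abs(slope_t_statistic(x, y)) > T_CRIT_DF48:
            hits_continuous += 1
        if abs(slope_t_statistic(x_split, y)) > T_CRIT_DF48:
            hits_split += 1
    return hits_continuous / reps, hits_split / reps


power_continuous, power_split = estimate_power()
print(f"estimated power, continuous predictor: {power_continuous:.2f}")
print(f"estimated power, median split:         {power_split:.2f}")
```

Under these assumptions the analysis using the median split rejects the null hypothesis less often than the analysis using the continuous predictor, although the data are the same in both cases.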