ZETETIC GLEANINGS

ISBN-13: 978-0-9728457-0-0
E-book: 836 pages
Language: English
Published: Third Edition (2005)

There are a great many named theoretical distributions but a large proportion of them have no known justification for their existence. There are two main reasons for this. First, it is very easy to create a "new" theoretical distribution or family of distributions. For many years any such "creation" was sufficient to achieve an academic publication. Second, no one ever, to our knowledge, acted upon the most elementary principles that might afford some justification for a proposed suggestion: namely, does the new distribution provide a good fit to some observed data that cannot be fitted by the standard, better known distributions? The result is that periodicals are full of suggested theoretical distributions that in many cases are of breathtaking worthlessness. This set of two books attempts to remedy this situation:

1. A very large number of theoretical distributions is repeatedly fitted to some 200 observed distributions.
2. I do not hesitate to conclude that a distribution is without value in fitting. Examples are McKay's Bessel function distributions, Fisher's quartic exponential distribution, most of Johnson's system, a distribution due to Ramberg et al.
3. I emphasize some theoretical distributions of major usefulness in fitting which do not seem to be so well known among practitioners. Examples are the immensely powerful Kapteyn system, the Burr distribution, the Evered distribution, and the Craig system.
4. My goal throughout is to enable any practitioner to recognize the most likely candidate theoretical distributions for his or her observed distribution and to eliminate the unlikely ones without wasting time.

There are 197 real-world observed distributions that are used as examples of fitting in these two volumes. Many of them are analyzed several times, using different theoretical distributions. A mere listing of all 197 examples would serve no major useful purpose here, but the complete listing is given in vol. 2, Index #1, on pp. 783-787. Here is an excerpt from this index:

absences of male employees
acceptors for TB control
accidents of London bus drivers
accidents to female workers
age at death, Quaker women
ages of Peruvian women, birth of first child
ages of women married in Copenhagen
Armeria maritima, Thomas
artillery shots at target
ash content, peat samples
authors of chemical papers
barometric heights
batting averages
batting averages, best
billiards
blossom midge
bold-face listings, telephone directory
Bose's data
boys in families with three children
breaking strength of cotton
bush clover data, Beall & Rescia

Some theoretical distributions - particularly mixed and generalized distributions - have no corresponding real-world examples. For these distributions, therefore, it was necessary to construct examples. Constructed examples were also used to illustrate heterogeneity and to show some characteristics of Pearson Types VI and XI. All 39 such made-up examples are listed in vol. 2, Index #2, p. 788.

There are 69 theoretical distributions for which at least one fit is given in these two volumes. (This includes dissection, which is not a distribution but which is a method of accomplishing fits.) For these 69 distributions there are 632 fits. All of the information about these fits is given in vol. 2, Index #3, pp. 789 - 796. Here it is of obvious interest to ask which theoretical distributions had the largest numbers of fits. For the tail of the distribution with >= 10 fits we find:

Distribution                       Fits
exponential, Kapteyn, log series   10 each
Johnson's SB                       11
Waring                             12
Yule                               13
Tripathi & Gurland, Type IV        14
dissection                         16
Burr                               19
Evered                             23
gamma (Type III)                   30
Beta (Type I)                      35
Normal                             37
Lognormal                          38
Carver                             53
Katz                               62
Ord                                67

Finally, it may be noted that the Pearson family produced 120 fits.

Quotes from the Books

"The combination of reward - publication - for creating a new theoretical distribution plus no standards whatsoever of relevance produced the present glut of named but forgettable theoretical distributions. For many years any such 'creation' was sufficient to achieve an academic publication so periodicals are full of suggested theoretical distributions that are in many cases of breathtaking worthlessness. We say it is easy to 'create' such a new distribution or family." (p. 558)

"The important question from the point of view of this book - the only question - is whether the new distribution is needed, or is helpful, or is useful for the statistician." (p. 559)

Some Examples

1. Section 121 of the book is on McKay's Bessel Function distributions. This Section ends with the sentence: "Therefore, we are quite confident that no one is likely to find a need for the Bessel function distribution that cannot be better met by a different theoretical distribution."
2. Discussing a theoretical distribution proposed by Ramberg et al. in Technometrics, vol. 20, p. 591: "So what did the authors accomplish by showing that their proposed new distribution fits the data on the coefficients of friction? In our opinion, nothing whatsoever." "The distribution is totally outside the mainstream of useful theoretical distributions. The fact that the distribution fits some observed data means essentially nothing."
3. On Fisher's fourth-degree exponential distribution: "Seeing how different this distribution is from all the others we have dealt with perhaps makes it clearer why, on the one hand, it is strongly bimodal, and why, on the other hand, it is of little use." (p. 630)
4. "In fact, the proponents of this distribution emphasize the bimodality as the major advantage of the curve. But, as we have made clear previously, we totally and completely reject this point of view. Bimodality signifies just one thing: heterogeneity. The researcher who prepared the distribution has included different things in one distribution. In making a frequency distribution, say, of the number of seeds in fruits he has included apples, oranges, and pomegranates in one distribution and then marvels at the bi- or tri-modality. We cannot overemphasize this point. Bimodality invariably and always means that there are two or more populations in one distribution. We leave open the distant possibility that somewhere in this fascinatingly diverse universe there is a genuinely bimodal distribution that is not the result of heterogeneity. However, this distribution, if it exists, has certainly not been presented in any publication known to us." (p. 644)
5. Section 126 discusses some data given by Matz to which he fits the quartic exponential distribution. "This is as unprepossessing a distribution as one is likely ever to see. It appears to have four modes and the long sequences of flat frequencies suggest additional heterogeneity." "Data such as these should be left in total obscurity until the researcher succeeds in defining a single population." "It is not worth wasting further time and space on these wretched data. As the physicist Pauli once said about an article, 'It is not even wrong.'"
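
The link between heterogeneity and bimodality asserted in example 4 can be illustrated with a minimal sketch. It assumes two hypothetical normal populations with made-up parameters (these are not data from the book) and counts the modes of each population separately and of the pooled distribution. It shows only that pooling two distinct populations can produce two modes; it does not by itself establish the stronger claim that bimodality always arises this way.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, components):
    """Density of a mixture; components is a list of (weight, mean, sd) tuples."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)

def count_modes(pdf, lo, hi, steps=2000):
    """Count strict local maxima of pdf on an evenly spaced grid over [lo, hi]."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [pdf(x) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

# Hypothetical populations (illustrative values only).
population_a = [(1.0, 10.0, 2.0)]
population_b = [(1.0, 20.0, 2.0)]
# Pooling both populations in equal proportions into one distribution.
pooled = [(0.5, 10.0, 2.0), (0.5, 20.0, 2.0)]

modes_a = count_modes(lambda x: mixture_pdf(x, population_a), 0.0, 30.0)
modes_b = count_modes(lambda x: mixture_pdf(x, population_b), 0.0, 30.0)
modes_pooled = count_modes(lambda x: mixture_pdf(x, pooled), 0.0, 30.0)
print(modes_a, modes_b, modes_pooled)
```

Each population on its own has a single mode, while the pooled density has two. Moving the two means closer together (to within roughly two standard deviations for equal weights and equal spreads) merges the modes into one, so heterogeneity need not always show up as visible bimodality.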

Coverage of the book, theoretical distributions and fits:

1. The summary statement is made in the Preface that this book covers 69 theoretical distributions, which the book illustrates by means of 632 fits to almost 200 observed real-world frequency distributions.

2. The theoretical distributions are quite thoroughly described, and even specialists will find many with which they are unfamiliar. Consider first the discrete distributions. There are ten theoretical distributions with first-order difference equations with linear coefficients. These are all discussed and illustrated with examples of their application, including the geometric, Poisson, log series, Yule, Waring, the Bardwell & Crow family, the Katz family, the Evered family, and the Tripathi & Gurland family. If you count you will find only nine. This is because there is an additional, minor distribution that is unnamed but is covered and illustrated. I note that the Evered family was so named by T. J. Olney, then a PhD candidate and now an Associate Professor. He named it after my wife, Lisa J. Evered. It turns out to be one of the most useful of these ten distributions, with no fewer than 23 fits given in this book. (There are 62 fits of the Katz family, 10 of the log series, 12 of the Waring, and 13 of the Yule.)

3. J. K. Ord wrote a difference equation that started at r = 0. This represents a whole family of discrete distributions which includes the hypergeometric distribution, the beta-binomial distribution, and the beta-Pascal distribution. There are 67 fits of the Ord family in this book.

4. H. C. Carver wrote a difference equation model intended to be completely analogous to Pearson's differential equation for theoretical continuous distributions. His goal was to be able to use discrete methods in calculating difficult continuous models but his model is of great importance in fitting discrete observed distributions as well. There are 53 fits of the Carver model in this book.

5. We give detailed discussions and fits for 12 generalized distributions, including Polya-Aeppli, Thomas, and Neyman Type A. In addition, we give a similar presentation for three mixed distributions: Sichel's, Fisher's mixed Poisson, and the discrete lognormal distribution (the Poisson distribution mixed by the lognormal distribution).

6. Vol. 2 deals with continuous distributions. Chapter 8 covers theoretical lifetime distributions: exponential, Weibull, gamma, and Rayleigh distributions. It also considers the problem of censored data and the use of hazard rates in identifying theoretical distributions. There are 10 fits of the exponential, 7 fits of the Weibull, and 30 fits of the gamma distribution (Pearson Type III).

7. Chapter 9 considers theoretical distributions of income and wealth. The Pareto distribution has pride of place here, mostly the first kind of Pareto distribution. In addition, the gamma distribution, sometimes called Amoroso's distribution, is often used here. The lognormal distribution is also considered. We consider Champernowne's distribution and the special case called the sech² distribution, which is also a special case of the Burr distribution and is closely related to the logistic distribution. Here we use the U.S. income distribution of 1918, the distribution of incomes in Bohemia in 1933, five different samples given by Kloek and Van Dijk (Netherlands), and the distribution of incomes for Norwegian townsmen in 1930.

8. Chapter 10 considers the central limit theorem and the normal distribution (with 37 fits). It then deals with the lognormal distribution (with 38 fits). Then there is a careful presentation of the immensely powerful system proposed by Kapteyn, an astronomer by profession. However, his comments were far more to the point than are many of the comments of professional statisticians. Thus, he proposed his system because he maintained that Pearson's system was only descriptive and he demanded explanations. He also gives one of the earliest correct remarks about multimodality, namely that it means that two or more different kinds of individuals are being included in one distribution. (We give 10 fits of the Kapteyn transformations.) Finally, in this Chapter we also present the Johnson system. It is my opinion that the Johnson system is not particularly useful except that the lognormal distribution is a member of his system. (We have 11 fits of Johnson's SB and 9 fits of his SU.)

9. Chapter 11 is devoted to the Pearson system. Each of the 12 types is discussed (not including the normal distribution, which would make 13 types). It is sufficient here to note that there are 120 fits of the Pearson theoretical distributions in this book.

10. Chapter 12 is entitled "A Miscellany of Models" and presents the powerful Burr family, McKay's Bessel function distribution, various elaborations of Pearson's basic differential equation, some Carver analogues, some error distributions (normal, Laplace's first law of error, Subbotin's distribution, and Sales Valle's distribution), the quartic exponential distribution, and generalized Rayleigh distributions. We note that there are 19 fits of the Burr distribution.

11. It is important to emphasize that in both vol. 1 and vol. 2 we have given examples of the use of dissection to explain multimodality. There are 16 fits shown that are achieved by dissecting the observed distribution into two or more theoretical distributions. Section 142, the last Section of this book, is a particularly outstanding example, giving the dissection of three different distributions of human mortality. We also note that Section 141 uses dissection into two theoretical distributions to explain the U.S. income distribution of 1972; I have not seen such a neat explanation previously.
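
The "first-order difference equations with linear coefficients" mentioned in item 2 can be made concrete with a minimal sketch for the Katz family. It assumes the usual textbook form of the Katz recurrence, p(r+1)/p(r) = (alpha + beta*r)/(r + 1) for r = 0, 1, 2, ...; the book's own notation and parametrization may differ. The function names, the truncation point `r_max`, and all parameter values are illustrative assumptions, not taken from the book.

```python
import math

def katz_pmf(alpha, beta, r_max):
    """Probabilities p_0..p_{r_max} from p_{r+1} = p_r * (alpha + beta*r) / (r + 1).

    The terms are normalized over 0..r_max, so r_max must be large enough
    that the omitted tail is negligible (or the support must end before it).
    """
    terms = [1.0]
    for r in range(r_max):
        ratio = (alpha + beta * r) / (r + 1)
        if ratio <= 0.0:
            # beta < 0 (binomial-type) case: the support ends at r, pad with zeros.
            terms.extend([0.0] * (r_max - r))
            break
        terms.append(terms[-1] * ratio)
    total = sum(terms)
    return [t / total for t in terms]

def mean_and_variance(pmf):
    """Mean and variance of a distribution on 0, 1, 2, ... given as a list."""
    mean = sum(r * p for r, p in enumerate(pmf))
    var = sum((r - mean) ** 2 * p for r, p in enumerate(pmf))
    return mean, var

# beta = 0 reproduces the Poisson distribution with mean alpha.
poisson_like = katz_pmf(2.5, 0.0, 60)
poisson_exact = [math.exp(-2.5) * 2.5 ** r / math.factorial(r) for r in range(61)]
max_gap = max(abs(a - b) for a, b in zip(poisson_like, poisson_exact))

# 0 < beta < 1 gives a negative binomial shape; beta < 0 gives a binomial shape.
negbin_mean, negbin_var = mean_and_variance(katz_pmf(2.0, 0.5, 200))
binom_mean, binom_var = mean_and_variance(katz_pmf(3.0, -0.5, 10))

print(max_gap)
print(negbin_var / negbin_mean, binom_var / binom_mean)
```

Under this form of the recurrence the variance-to-mean ratio equals 1/(1 - beta), so a sample ratio above, near, or below 1 points toward the negative binomial, Poisson, or binomial member respectively. That is one simple way a practitioner might screen candidates before attempting a fit.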

"David Miller has drawn from his considerable practical experience to give us a most helpful tool for understanding what frequency distributions are about. The book provides valuable insights into a large number of frequency distributions. As a scientist I found the book to be both exciting and helpful. There is some math and numerous examples which supplement the detailed verbal descriptions, but the math presented is generally enlightening rather than tedious. I found that this book provided me with what I consider to be useful understanding in a short period of time. The book is in 2 thick volumes and it is a bargain." -E. W. Johnson, Urbana, IL