There are a great many named
theoretical distributions but
a large proportion of them have
no known justification for their
existence. There are two main
reasons for this. First, it is
very easy to create a "new"
theoretical distribution or family
of distributions. For many years
any such "creation"
was sufficient to achieve an academic
publication. Second, no one ever,
to our knowledge, acted upon the
most elementary principles that
might afford some justification
for a proposed suggestion: namely,
does the new distribution provide
a good fit to some observed data
that cannot be fitted by the standard,
better known distributions? The
result is that periodicals are
full of suggested theoretical
distributions that in many cases
are of breathtaking worthlessness.
This set of two books attempts
to remedy this situation:
1. A very large number of theoretical
distributions is repeatedly fitted
to some 200 observed distributions.
2. I do not hesitate to conclude
that a distribution is without
value in fitting. Examples are
McKay's Bessel function distributions,
Fisher's quartic exponential distribution,
most of Johnson's system, and a distribution
due to Ramberg et al.
3. I emphasize some theoretical
distributions of major usefulness
in fitting which do not seem to
be so well known among practitioners.
Examples are the immensely powerful
Kapteyn system, the Burr distribution,
the Evered distribution, and the
Craig system.
4. My goal throughout is to enable
any practitioner to recognize
the most likely theoretical distributions
for his or her observed distribution and to
eliminate the unlikely ones without
waste of time.
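The test of usefulness stated above, whether a proposed distribution gives a good fit to observed data, can be made concrete with a chi-square comparison of observed and expected frequencies. The following is a minimal Python sketch, not the procedure used in the books: the frequencies are made up for illustration, the function names are hypothetical, and a Poisson distribution fitted by its mean stands in for whichever theoretical distribution is under trial.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of the count k for mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def chi_square_poisson(observed):
    """Fit a Poisson by its mean and return (lam, expected, chi2, dof).

    observed[k] is the number of units with count k. The last cell is
    treated as 'k or more' when expected frequencies are computed, but as
    exactly k when the mean is estimated (a simplification).
    """
    n = sum(observed)
    lam = sum(k * f for k, f in enumerate(observed)) / n
    last = len(observed) - 1
    probs = [poisson_pmf(k, lam) for k in range(last)]
    probs.append(1.0 - sum(probs))          # pool the upper tail into the last cell
    expected = [n * p for p in probs]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = len(observed) - 1 - 1             # cells - 1 - one estimated parameter
    return lam, expected, chi2, dof

# Made-up frequencies for counts 0, 1, 2, 3, 4+ (illustration only)
observed = [50, 30, 15, 4, 1]
lam, expected, chi2, dof = chi_square_poisson(observed)
print(f"mean = {lam:.2f}, chi-square = {chi2:.3f} on {dof} degrees of freedom")
```

In practice, cells with small expected frequencies are usually pooled before the statistic is computed, and the same comparison is repeated for each candidate distribution so that the fits can be ranked.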
There are 197 real-world
observed distributions that are used as
examples of fitting in these two volumes.
Many of them are analyzed several times,
using different theoretical distributions.
A mere listing of all of the 197 examples
would serve no major useful purpose but
the complete listing is given in vol.
2, Index #1, on pp. 783 - 787. Here is
a sample from this index:
absences of male employees
acceptors for TB control
accidents of London bus drivers
accidents to female workers
age at death, Quaker women
ages of Peruvian women, birth of first child
ages of women married in Copenhagen
Armeria maritima, Thomas
artillery shots at target
ash content, peat samples
authors of chemical papers
barometric heights
batting averages
batting averages, best
billiards
blossom midge
bold-face listings, telephone directory
Bose's data
boys in families with three children
breaking strength of cotton
bush clover data, Beall & Rescia
Some theoretical distributions - particularly
mixed and generalized distributions -
have no corresponding real-world examples.
For these distributions, therefore, it
was necessary to construct examples. Constructed
examples were also used to illustrate
heterogeneity and to show some characteristics
of Pearson Types VI and XI. All 39 such
made-up examples are listed in vol. 2,
Index #2, p. 788.
There are 69 theoretical distributions
for which at least one fit is given in
these two volumes. (This includes dissection,
which is not a distribution but which
is a method of accomplishing fits.) For
these 69 distributions there are 632 fits.
All of the information about these fits
is given in vol. 2, Index #3, pp. 789
- 796. Here it is of obvious interest
to ask which theoretical distributions
had the largest numbers of fits. For the
tail of the distribution with >= 10
fits we find:
Distribution                          Fits
exponential, Kapteyn, log series      10 each
Johnson's SB                          11
Waring                                12
Yule                                  13
Tripathi & Gurland, Type IV           14
dissection                            16
Burr                                  19
Evered                                23
gamma (Type III)                      30
Beta (Type I)                         35
Normal                                37
Lognormal                             38
Carver                                53
Katz                                  62
Ord                                   67
Finally, it may be noted that the Pearson
family produced 120 fits.
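The counts listed above can be tallied to see how much of the total of 632 fits is carried by the distributions with ten or more fits. The short Python sketch below uses only the numbers given in this listing; the variable names are illustrative.

```python
# Fit counts copied from the listing above (distributions with >= 10 fits)
fits = {
    "exponential": 10, "Kapteyn": 10, "log series": 10,
    "Johnson's SB": 11, "Waring": 12, "Yule": 13,
    "Tripathi & Gurland, Type IV": 14, "dissection": 16,
    "Burr": 19, "Evered": 23, "gamma (Type III)": 30,
    "Beta (Type I)": 35, "Normal": 37, "Lognormal": 38,
    "Carver": 53, "Katz": 62, "Ord": 67,
}
TOTAL_FITS = 632                      # total stated for all 69 distributions

listed = sum(fits.values())           # fits accounted for by the listing
share = listed / TOTAL_FITS
top3 = sorted(fits, key=fits.get, reverse=True)[:3]

print(f"{len(fits)} distributions account for {listed} of {TOTAL_FITS} fits ({share:.0%})")
print("Most frequently fitted:", ", ".join(top3))
```

The 17 entries listed account for 460 of the 632 fits, so the remaining 52 distributions share 172 fits among them.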
Quotes from the Books
"The combination of reward - publication - for creating a new theoretical distribution
plus no standards whatsoever of relevance produced the present glut of named
but forgettable theoretical distributions. For many years any such 'creation'
was sufficient to achieve an academic publication so periodicals are full of
suggested theoretical distributions that are in many cases of breathtaking worthlessness.
We say it is easy to 'create' such a new distribution or family." (p. 558)
"The important question from the point of view of this book - the only question
- is whether the new distribution is needed, or is helpful, or is useful for
the statistician." (p. 559)
Some Examples
1. Section 121 of the book is on McKay's Bessel Function distributions. This
Section ends with the sentence: "Therefore, we are quite confident that
no one is likely to find a need for the Bessel function distribution that cannot
be better met by a different theoretical distribution."
2. Discussing a theoretical distribution proposed by Ramberg et al. in Technometrics,
vol. 20; p. 591: "So what did the authors accomplish by showing that their
proposed new distribution fits the data on the coefficients of friction? In
our opinion, nothing whatsoever." "The distribution is totally outside
the mainstream of useful theoretical distributions. The fact that the distribution
fits some observed data means essentially nothing."
3. On Fisher's fourth-degree exponential distribution: "Seeing how different
this distribution is from all the others we have dealt with perhaps makes it
clearer why, on the one hand, it is strongly bimodal, and why, on the other
hand, it is of little use." (p. 630)
4. "In fact, the proponents of this distribution emphasize the bimodality
as the major advantage of the curve. But, as we have made clear previously,
we totally and completely reject this point of view. Bimodality signifies just
one thing: heterogeneity. The researcher who prepared the distribution has included
different things in one distribution. In making a frequency distribution, say,
of the number of seeds in fruits he has included apples, oranges, and pomegranates
in one distribution and then marvels at the bi- or tri-modality. We cannot
overemphasize this point. Bimodality invariably and always means that there are
two or more populations in one distribution. We leave open the distant possibility
that somewhere in this fascinatingly diverse universe there is a genuinely bimodal
distribution that is not the result of heterogeneity. However, this distribution,
if it exists, has certainly not been presented in any publication known to us."
(p. 644)
5. Section 126 discusses some data given by Matz to which he fits the quartic
exponential distribution. "This is as unprepossessing a distribution as
one is likely ever to see. It appears to have four modes and the long sequences
of flat frequencies suggest additional heterogeneity." "Data such
as these should be left in total obscurity until the researcher succeeds in
defining a single population." "It is not worth wasting further time
and space on these wretched data. As the physicist Pauli once said about an
article, 'It is not even wrong.'"
Coverage of the book, theoretical distributions and fits:
1. The summary statement is made in the Preface that this book covers 69 theoretical
distributions, which the book illustrates by means of 632 fits to almost 200
observed real-world frequency distributions.
2. The theoretical distributions are quite thoroughly described and even specialists
will find many with which they will be unfamiliar. Consider first, discrete
distributions. There are ten theoretical distributions with first-order difference
equations with linear coefficients. These are all discussed and illustrated
with examples of their application: including the geometric, Poisson, log series,
Yule, Waring, the Bardwell & Crow family, the Katz family, the Evered family,
and the Tripathi & Gurland family. If you count you will find only nine.
This is because there is an additional, minor, distribution that is unnamed
but is covered and illustrated. I note that the Evered family was so named by
T. J. Olney, then a PhD candidate and now an Associate Professor. He named it
after my wife, Lisa J. Evered. It turns out that it is one of the most useful
of these ten distributions, with no fewer than 23 fits given in this book. (There
are 62 fits of the Katz family, 10 of the log series, 12 of the Waring and 13
of the Yule.)
3. J. K. Ord wrote a difference equation that started at r = 0. This represents
a whole family of discrete distributions which includes the hypergeometric distribution,
the beta-binomial distribution, and the beta-Pascal distribution. There are
67 fits of the Ord family in this book.
4. H. C. Carver wrote a difference equation model intended to be completely
analogous to Pearson's differential equation for theoretical continuous distributions.
His goal was to be able to use discrete methods in calculating difficult continuous
models but his model is of great importance in fitting discrete observed distributions
as well. There are 53 fits of the Carver model in this book.
5. We give detailed discussions and fits for 12 generalized distributions,
including Polya-Aeppli, Thomas, and Neyman Type A. In addition, we give a similar
presentation for three mixed distributions: Sichel's, Fisher's mixed
Poisson, and the discrete lognormal distribution (the Poisson distribution mixed
by the lognormal distribution).
6. Vol. 2 deals with continuous distributions. Chapter 8 covers theoretical
lifetime distributions: exponential, Weibull, gamma, and Rayleigh distributions.
It also considers the problem of censored data and the use of hazard rates in
identifying theoretical distributions. There are 10 fits of the exponential,
7 fits of the Weibull, and 30 fits of the gamma distribution (Pearson Type III).
7. Chapter 9 considers theoretical distributions of income and wealth. The
Pareto distribution has pride of place here, mostly the first kind of Pareto
distribution. In addition, the gamma distribution is often used here, sometimes
called Amoroso's distribution. The lognormal distribution is also considered.
We consider Champernowne's distribution and the special case called the sech²
distribution, which is also a special case of the Burr distribution and is closely
related to the logistic distribution. Here we use the U.S. income distribution
of 1918, the distribution of incomes in Bohemia in 1933, five different
samples given by Kloek and Van Dijk (Netherlands), and the distribution of incomes
for Norwegian townsmen in 1930.
8. Chapter 10 considers the central limit theorem and the normal distribution
(with 37 fits). It then deals with the lognormal distribution (with 38 fits).
Then there is a careful presentation of the immensely powerful system proposed
by Kapteyn, an astronomer by profession. However, his comments were far more
to the point than are many of the comments of professional statisticians. Thus,
he proposed his system because he maintained that Pearson's system was only
descriptive and he demanded explanations. He also gives one of the earliest
correct remarks about multimodality, namely that it means that two or more different
kinds of individuals are being included in one distribution. (We give 10 fits
of the Kapteyn transformations.) Finally, in this Chapter we also present the
Johnson system. It is my opinion that the Johnson system is not particularly
useful except that the lognormal distribution is a member of his system. (We
have 11 fits of Johnson's SB and 9 fits of his SU.)
9. Chapter 11 is devoted to the Pearson system. Each of the 12 types is discussed
(not including the normal distribution, which would make 13 types). It is sufficient
here to note that there are 120 fits of the Pearson theoretical distributions
in this book.
10. Chapter 12 is entitled "A Miscellany of Models" and presents
the powerful Burr family, McKay's Bessel function distribution, various elaborations
of Pearson's basic differential equation, some Carver analogues, some error
distributions (normal, Laplace's first law of error, Subottine's distribution,
and Sales Valle's distribution), the quartic exponential distribution, and generalized
Rayleigh distributions. We note that there are 19 fits of the Burr distribution.
11. It is important to emphasize that both in vol. 1 and vol. 2 we have given
examples of the use of dissection to explain multimodality. There are 16 fits
shown that are achieved by dissecting the observed distribution into two or
more theoretical distributions. Section 142, the last Section of this book,
is a particularly outstanding example, giving the dissection of three different
distributions of human mortality. We also note that Section 141 uses dissection
into two theoretical distributions to explain the U.S. income distribution of
1972; I have not seen such a neat explanation previously.
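The first-order difference equations with linear coefficients mentioned in item 2 can be illustrated with the Katz family. The Python sketch below assumes the commonly cited form p(r+1)/p(r) = (alpha + beta*r)/(1 + r) for r = 0, 1, 2, ...; this parameterization, the function name, and the parameter values are assumptions made here and may differ from the notation of the book. With beta = 0 the recurrence gives the Poisson distribution, which the sketch uses as a check.

```python
import math

def katz_pmf(alpha, beta, max_r=200):
    """Probabilities p_0, p_1, ... from p_{r+1} = p_r * (alpha + beta*r) / (1 + r).

    Assumes alpha > 0 and beta < 1. The recurrence is run on unnormalized
    weights, stopped when the ratio is no longer positive (finite support,
    as in the binomial-type case beta < 0) or at max_r, then normalized.
    """
    weights = [1.0]
    for r in range(max_r):
        ratio = (alpha + beta * r) / (1 + r)
        if ratio <= 0:
            break
        weights.append(weights[-1] * ratio)
    total = sum(weights)
    return [w / total for w in weights]

# beta = 0: the recurrence reduces to the Poisson with mean alpha
p = katz_pmf(alpha=2.0, beta=0.0)
poisson = [math.exp(-2.0) * 2.0 ** r / math.factorial(r) for r in range(5)]
for r in range(5):
    print(r, round(p[r], 6), round(poisson[r], 6))

# beta < 0: finite support (here a binomial with n = 4)
print(len(katz_pmf(alpha=2.0, beta=-0.5)), "support points")
```

Values of beta between 0 and 1 give negative binomial distributions under the same recurrence, so a single two-parameter equation covers the three classical discrete types.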