Classification and Discrimination in Models for Ordered Data
- Klassifikation und Diskrimination in Modellen geordneter Daten
Katzur, Alexander Florian; Kamps, Udo (Thesis advisor); Cramer, Erhard (Thesis advisor); Richter, Wolf-Dieter (Thesis advisor)
Aachen : Publikationsserver der RWTH Aachen University (2015)
Dissertation / PhD Thesis
Aachen, Techn. Hochsch., Diss., 2015
The theory of classification and discrimination has gained major attention in the scientific literature since R. A. Fisher (1936) introduced his well-known discriminant function for a data set of three species of iris. Discrimination aims at separating objects from different classes, and classification at assigning objects with unknown class origin to their respective class, based on an observable vector of characteristics. It is assumed that the number of different classes is known. In probability theory, a common assumption is that each class has a different underlying (multivariate) distribution and that the characteristic vector of an object from one class follows this distribution. This assumption makes it possible to rate the methods of discrimination and of classification by means of the expected costs that are caused by a decision or, as a special case, by means of the probability of misclassification. Throughout this thesis, two classes are considered. The classification rule that minimizes the expected costs is the Bayes procedure, and the corresponding discriminant function that separates the two classes is the ratio of the densities of the underlying distributions. In this thesis, methods of discrimination as well as of classification are analyzed for advanced models of ordered data, particularly for sequential order statistics (SOSs) with known baseline distribution, but the given results may also be interpreted in terms of generalized order statistics and the Pfeifer record model with known baseline distribution. These models of ordered data are useful in many practical applications, e.g., when it comes to the modelling of component or machine failure times, or of records in sports. The thesis makes frequent use of the fact that the distribution of the first r SOSs with known baseline distribution forms an exponential family in the model parameters.
Consequently, several methods and results are first presented for general exponential families and then applied to SOSs. In various cases, the special structure of SOSs allows for further inference, especially concerning exact distributional results. The Bayes classification procedure and its expected costs are analyzed when the underlying distributions of the classes are members of the same exponential family. In particular, in the case of SOSs, a closed-form expression for those expected costs is provided, based on the hypoexponential distribution. In the case of unknown class a priori probabilities, a bisection algorithm is stated to obtain the minimax procedure. Furthermore, some methods of classification are proposed, each based on a divergence measure. In the case of SOSs, several results regarding those methods are proven and, additionally, a simulation study is provided to get an impression of the performance of these classification procedures. As a by-product, an assertion regarding the quantiles of the gamma distribution is obtained. Moreover, a classification procedure is proposed for the case in which each class has an underlying set of distributions rather than a single distribution. It is assumed here that those distributions form a left-sided Kullback-Leibler ball. In this context, minimal enclosing Kullback-Leibler balls of a set of distributions that are members of the same exponential family are analyzed. Several illustrations and a simulation study are provided to rate the performance of the method when the underlying exponential family is the model of SOSs. Furthermore, an interesting connection between the minimal enclosing left-sided Kullback-Leibler ball of several distributions and a generalized Chernoff information between those distributions is shown. The scenario in which the parameters of the underlying distributions of the classes are unknown and are therefore replaced by their maximum likelihood estimates is investigated as well.
In the model of SOSs, prior information on the ordering of the unknown parameters is used to construct modified classification rules. The performance of those rules is again analyzed by means of a simulation study. Moreover, the clustering of as yet unclassified data, with underlying distributions that originate from the same exponential family, is examined. Clustering approaches for exponential families have gained major attention in the fields of speech and image recognition. Several known results are transferred to the setting of this thesis and analyzed for SOSs. For example, an agglomerative hierarchical clustering approach for the model of SOSs is provided, and the mixture maximum likelihood approach is discussed briefly. Finally, some tests for class membership for SOSs are introduced. This thesis furthermore provides a relationship between SOSs and a submodel of multivariate normal distributions.
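
As a rough illustration of the two-class Bayes procedure and of a bisection search for a minimax rule mentioned in the abstract, the following Python sketch uses the simplest one-parameter exponential family, the exponential distribution. It is not the algorithm of the thesis. The rates lam1 > lam2, the 0-1 cost structure, and all function names are assumptions made only for this example. Under these assumptions the density-ratio rule reduces to "assign to class 1 if and only if x <= t", and the sketch bisects over the prior probability of class 1 until both misclassification probabilities coincide.

```python
import math


def bayes_threshold(lam1, lam2, pi1):
    """Threshold t of the Bayes rule 'assign to class 1 iff x <= t'.

    Assumes exponential class densities with rates lam1 > lam2, prior
    probability pi1 for class 1, and equal misclassification costs.
    The rule compares pi1 * f1(x) with (1 - pi1) * f2(x) on the log scale.
    """
    log_ratio_at_zero = math.log(lam1 / lam2) + math.log(pi1 / (1.0 - pi1))
    # A non-positive threshold means every x > 0 is assigned to class 2.
    return max(0.0, log_ratio_at_zero / (lam1 - lam2))


def error_probs(lam1, lam2, t):
    """Misclassification probabilities of the threshold rule with cut-off t."""
    e1 = math.exp(-lam1 * t)        # class 1 object assigned to class 2
    e2 = 1.0 - math.exp(-lam2 * t)  # class 2 object assigned to class 1
    return e1, e2


def minimax_prior(lam1, lam2, tol=1e-10):
    """Bisection over pi1 until both error probabilities are (nearly) equal.

    e1 decreases and e2 increases in pi1, so e1 - e2 changes sign exactly once.
    """
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        e1, e2 = error_probs(lam1, lam2, bayes_threshold(lam1, lam2, mid))
        if e1 > e2:
            lo = mid  # threshold too small: put more prior weight on class 1
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    lam1, lam2 = 2.0, 1.0  # hypothetical rates chosen for illustration
    pi_star = minimax_prior(lam1, lam2)
    t_star = bayes_threshold(lam1, lam2, pi_star)
    print(pi_star, t_star, error_probs(lam1, lam2, t_star))
```

The bisection works here because the threshold is monotone in the prior, so the difference of the two error probabilities changes sign exactly once on (0, 1). With unequal costs, the same search would have to equalize the cost-weighted error probabilities instead.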