Data analysis, machine learning and knowledge discovery. Revised versions of selected papers presented at the 36th annual conference of the German Classification Society, Gesellschaft für Klassifikation, GfKl 2012, Hildesheim, Germany, August 1--3, 2012 (Q2441281)

scientific article

    Statements

    24 March 2014
    This book is a collection of selected papers presented at the 36th annual conference of the German Classification Society in August 2012. The topics range from statistics and data analysis, classification in marketing, biostatistics and bioinformatics to interdisciplinary domains such as applications of machine learning in music, and include the workshop on library and information science. The book is organized in seven parts, focused on: (1) classification, cluster analysis, factor analysis and model selection, (2) machine learning techniques applied to social networks, (3) data analysis and classification in marketing, (4) in finance, (5) in biostatistics and bioinformatics, (6) in interdisciplinary domains such as music, education and psychology and (7) classification and subject indexing in library and information science.

The first chapter introduces hot deck methods for the imputation of missing data using the available values, called donors. The authors discuss the advantages of donor restriction in limiting the risk of donor over-usage. An analysis of main effects is also included. The second chapter presents an approach based on factors of influence to determine the most dangerous districts in Dortmund, Germany. The authors discuss the quality and regression results for the factors of influence and propose a danger index using a classification of offences and the corresponding maximum penalties. The third chapter focuses on the benchmarking of classification algorithms on high-performance computing clusters. Two R packages that provide an efficient parallelization scheme, BatchJobs and BatchExperiments, are presented in detail. The fourth chapter discusses visualizations for categorical data and presents a variety of R packages suited to the task. The vcd and vcdExtra packages are presented in detail; they include mosaic plots, association plots, sieve plots, double decker plots and agreement plots. 
For these options the authors also discuss which approaches are best suited to various tasks. The fifth chapter focuses on the problem of determining the optimal number of clusters, using the biological example of determining the number of bee species. Methods such as the Pearson version of Hubert's gamma and bootstrap stability selection are discussed. The sixth chapter introduces two-step linear discriminant analysis (LDA) in a classification context. First, the separable models are presented, followed by the error bound. The methods and their efficiency are discussed for the classification of EEG data. The seventh chapter presents the application of a new validation criterion to the predictive validity of tracking decisions. Commencing with an overview of the development of the validation criterion, the authors fully describe the approach on a representative sample of Luxembourgish secondary school students. The eighth chapter introduces the \(DD\alpha\) classification, a fast, non-parametric classification of \(d\)-dimensional objects, applied to asymmetric and fat-tailed data. The chapter starts with an analysis of data depth, comparing the zonoid and location depths as well as the Tukey and random Tukey depths. Next, their performance and dynamics are studied on simulated data. In the ninth chapter, the authors present a non-parametric invariant method for the automatic classification of multidimensional objects. The \(\alpha\) procedure, based on a geometric representation of two learning classes, is described in detail. The chapter concludes with an analysis of the procedure on simulated data. The tenth chapter proposes a parallelization of support vector machines (SVMs) for large datasets. First, the bagging-like SVMs are introduced, followed by a stepwise bagging procedure. The chapter concludes with the evaluation of the proposed method on seven large datasets. 
The eleventh chapter introduces a novel resampling method for cluster analysis, soft bootstrapping. First, resampling by weighting of the observations is presented. Next, the validation of clusterings by soft bootstrapping is described, and the chapter concludes with a comparison of resampling methods. The twelfth chapter focuses on a dual scaling classification and its application in archaeometry. Following a description of binary classification based on dual scaling, the authors describe in detail the applicability of the method to the classification of stamped tiles from the Roman province Germania Superior. The thirteenth chapter presents gamma-hadron separation applied to data from the MAGIC telescopes operating in stereoscopic mode. The authors focus on the threshold optimization problem when the signal-to-background ratio varies and the labeled training data are not reliable.

The second part of the book presents machine learning and knowledge discovery approaches applied to social network data. It commences with a chapter on inductive learning for cooperative query answering. For the proposed method, the effects of the query length and of the length of the rule's body on the response time are discussed in detail. The effect of the size of the knowledge base on the execution time is also investigated. The fifteenth chapter discusses stream clustering techniques applied to large datasets. Following an overview of conventional clustering techniques, the authors present the steps of data stream clustering and conclude with a comparison of different clustering methods. The sixteenth chapter focuses on prediction algorithms for blog feedback. The authors start with an introduction of domain-specific concepts, followed by a formulation of the problem. The efficiency of various algorithms such as MLP, kNN, RBFnet and REP tree is tested on data collected from Hungarian blogs. 
The seventeenth chapter introduces spectral clustering and discusses the choice of the Gaussian parameter. The authors commence with an interpretation using PDE tools and continue with a description of the link between the Gaussian affinity and the heat kernel. Next, the clustering property is linked to the heat equation, and the discretization with finite elements is presented. The chapter concludes with a geometrical example. The eighteenth chapter discusses error propagation in classifier chains for multi-label classification. The authors commence with a description of the concepts involved and then discuss the effect of the length of the chain, its order, and the accuracy of the classification results. The chapter concludes with two experiments on real and synthetic data. The nineteenth chapter introduces a statistical comparison of classifiers for multi-objective feature selection applied to instrument recognition. Following an overview of statistical tests used in music classification, the authors present a comparative analysis of random forests (RF), SVMs and naive Bayes (NB).

The third part of the book focuses on data mining and classification methods applied in marketing. The twentieth chapter discusses the use of decision support systems (DSS) for brand positioning. Starting with the conceptual model, the authors present the selected measures and conclude with an analysis of real data. The twenty-first chapter describes the use of multinomial SVMs for recommendations in repeat-buying scenarios. The authors introduce the multinomial manifold, the geodesic distance and the multinomial kernel as components of the SVM. Next, the approach is tested on real data and the results are discussed in detail. The twenty-second chapter proposes the use of fuzzy clustering for predicting changes in market segments based on customer behavior. First, the analysis of gradual changes is introduced, followed by an in-depth description of local measures and covariance matrices. 
The chapter concludes with an example of the proposed method; the authors also include a discussion of cluster volume and cluster alignment. The twenty-third chapter examines the symbolic cluster ensemble based on co-association matrices in the presence of noisy variables and outliers. It consists of an overview of ensemble learning for symbolic data followed by results of simulation studies. In the twenty-fourth chapter the authors combine \(k\)-means, fuzzy \(c\)-means and latent class analysis for image feature selection in market segmentation. Following an overview of clustering and feature selection methods, the authors proceed to the evaluation of the method on real and synthetic data. The twenty-fifth chapter focuses on the validity of conjoint analyses (CAs) in commercial studies. Following a presentation of databases of recent commercial CAs, the authors discuss the validity and variance approaches. The twenty-sixth chapter presents the use of stochastic programming for solving product line design optimization problems. Following an overview of \textit{R. Kohli} and \textit{R. Sukumar} [``Heuristics for product-line design using conjoint analyses'', Manage. Sci. 36, No. 12, 1464--1478 (1990), \url{http://www.jstor.org/stable/2661545}] and \textit{W. Gaul} et al. [``Gewinnorientierte Produktliniengestaltung unter Berücksichtigung des Kundennutzens'', Zeitschrift für Betriebswirtschaftslehre, 65, 835--855 (1995)], the authors describe in detail the method based on stochastic programming. The chapter concludes with results and applications of the new approach.

The fourth part of the book is dedicated to data analysis in finance. The twenty-seventh chapter uses logistic regression with variable selection to determine the discriminative power of credit scoring systems trained on independent samples. 
Starting with an overview of dependent structures and sampling algorithms, the authors present four algorithms: two versions of ``one month per client sampling'', ``each month as a different sample'' and ``\(\mathrm{mod} h\) sampling''. The chapter concludes with experimental results for the four approaches. The twenty-eighth chapter describes a practical method, based on the optimization of a target function, for determining longevity and premature death risk aversion in households. Following a set of definitions and a description of the assumptions, the authors present the optimization approach. Next, the choice of scheme and model calibration are introduced. The chapter concludes with a numerical example of various gap-filling schemes. The twenty-ninth chapter focuses on the correlation of outliers in multivariate data. The chapter commences with an overview of correlations: the exceedance correlation and the correlation of multivariate extremes. Next, the author presents an empirical analysis, with methodology and results, based on the assumption that the Mahalanobis distance exceeds the quantile of the chi-square distribution. The thirtieth chapter focuses on the statistical power of specific backtesting procedures for value at risk (VaR), using loss functions as an example. Following an overview of test-based approaches for the frequency of failures, for multiple VaR levels and for loss functions, the author continues with an empirical study on simulated data.

The fifth part of the book focuses on the analysis of methods applied in biostatistics and bioinformatics. The thirty-first chapter proposes a rank aggregation method for candidate gene identification. The author first introduces the Borda and Copeland scores and then presents the Kolde robust rank aggregation method coupled with Spearman's footrule and the Canberra distance. The chapter concludes with the analysis of experimental data. 
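As a rough illustration of the Borda score mentioned for the thirty-first chapter, the following minimal sketch aggregates several ranked lists into a consensus ranking. It is not taken from the book: the function name `borda_aggregate`, the gene labels and the convention that items absent from a list earn no points are all assumptions made for this example.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Combine ranked lists (best item first) into one ranking by Borda score."""
    items = set().union(*rankings)
    n = len(items)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking):
            # The top position earns n - 1 points, the next n - 2, and so on.
            # Assumption: an item missing from a list earns nothing from it.
            scores[item] += n - 1 - position
    # Highest total score first; ties are broken alphabetically.
    return sorted(items, key=lambda item: (-scores[item], item))


# Hypothetical gene rankings from three experiments.
example_rankings = [
    ["g1", "g2", "g3"],
    ["g2", "g1", "g3"],
    ["g2", "g3", "g1"],
]
consensus = borda_aggregate(example_rankings)
print(consensus)  # ['g2', 'g1', 'g3']
```

Here g2 collects 5 points, g1 collects 3 and g3 collects 1, so g2 leads the consensus ranking.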
The thirty-second chapter introduces unsupervised dimension reduction methods for protein sequence classification. The authors discuss principal component analysis, isomap, \(t\)-SNE (\(t\)-distributed stochastic neighbor embedding) and Interpol methods on two synthetic datasets. The thirty-third chapter introduces three transductive versions of set covering machines used for classification problems in a molecular high-throughput setting. Following an overview of the approach and general notions of set covering machines with data-dependent rays, the authors proceed with the description of the experimental setup and conclude with results obtained on artificial datasets.

The sixth part of the book contains data analysis approaches illustrated with examples from music, education and psychology. The thirty-fourth chapter describes tone onset detection using an auditory model. The methods described include feature evaluation and classification of sounds, intensity and notes. The chapter concludes with a comparison of auditory-image-based and original-signal-based onset detection. The thirty-fifth chapter presents a unifying framework for ground penetrating radar (GPR) image reconstruction. The authors describe the componentwise decomposition of the images and conclude with an evaluation of the framework's robustness to noise and its applicability to real-world data. The thirty-sixth chapter introduces the use of LDA, MDA, SVMs, random forests and boosting for the recognition of musical instruments in intervals and chords. First, the features are introduced; the authors then discuss the classification task, its steps and the results in terms of a blockwise approach of common features. The thirty-seventh chapter presents the use of ANOVA for psychology data. Starting with three flavors of ANOVA, the authors evaluate the type I error rates and the power of the tests. 
In the thirty-eighth chapter the authors use statistical tests such as the Kolmogorov-Smirnov test to test models of medieval settlement locations. The chapter introduces the least cost distance and the use of accessibility maps to test the hypothesis, suggested by the analysis of the accessibility maps, that the medieval settlements are close to the linear targets. The thirty-ninth chapter introduces a structural framework for the selection of statistical techniques. Following a description of the methodology, the author presents the methods included in the framework and briefly outlines how they operate. The fortieth chapter focuses on using alignment methods for folk music classification. The authors first present a music representation approach and continue with the description of the edit distance and the use of \(n\)-gram models. The forty-first chapter focuses on regression approaches such as weighted linear models and non-compensatory heuristic strategies in modeling compensatory and non-compensatory judgment formation. The results are presented on simulated datasets. In the forty-second chapter, the authors present a sensitivity analysis for the mixed coefficients multinomial logit model. Focusing on the PISA 2009 study, the authors compare the MNSQ and NMAR approaches and conclude that one cannot be favored over the other based on the ANOVA results. In the forty-third chapter, the author introduces confidence measures for music classification. Following an overview of definitions and requirements, the author introduces an approach for the estimation of confidence measures. Next, the feature extraction and the frame-wise and song-wise classifications are discussed. The forty-fourth chapter focuses on the use of latent class models with random effects for investigating local dependence. First, the basics of learning space theory are introduced, followed by an application of local independence and latent classes to the PISA 2009 study. 
Continuing the analysis of the PISA data in the forty-fifth chapter, the authors review the basic psychometric concepts incorporated in the study. The scaling procedure and the student score generation, together with the proficiency scale construction and proficiency levels, are discussed in detail. In the forty-sixth chapter, music genre prediction by low-level and high-level characteristics is presented. Using a set of high-level and low-level harmonic features, classified using random forests, the authors propose a classification of Last.fm tags.

The seventh part of the book contains papers presented at the LIS workshop on classification and subject indexing in library and information science. The forty-seventh chapter proposes a clustering approach across union catalogs. The author uses a matching algorithm and evaluates it on real data. The forty-eighth chapter describes a text mining approach for ontology construction. The authors discuss in detail the identification of concept candidates, the finding of synonymous expressions and the use of automatic annotation. The chapter includes the presentation of the NanOn ontology and the evaluation of the approach. The forty-ninth chapter discusses data enrichment in the context of discovery systems using linked data. The authors present the architecture of the system, describing both the server and the client side, and also propose a link with Wikipedia.

The book is a very interesting collection of papers describing various approaches to data mining and machine learning, with applications ranging from bioinformatics to music classification. It is an excellent addition to the field and can be used as a starting point for projects from undergraduate to post-graduate level.
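To illustrate the donor restriction mentioned for the first chapter, here is a minimal sketch of a random hot deck in which each observed value (donor) may fill at most a fixed number of missing entries. It is an illustrative assumption, not the procedure from the book: the function name `hot_deck_impute`, the use of `None` to mark missing values and the single imputation class are all choices made for this example.

```python
import random


def hot_deck_impute(values, donor_limit, seed=0):
    """Fill None entries with observed values, using each donor at most donor_limit times."""
    rng = random.Random(seed)
    donors = [i for i, v in enumerate(values) if v is not None]
    recipients = [i for i, v in enumerate(values) if v is None]
    if len(recipients) > donor_limit * len(donors):
        raise ValueError("donor limit too strict for the number of missing values")
    usage = {d: 0 for d in donors}
    imputed = list(values)
    for r in recipients:
        # Only donors that have not yet reached the limit remain eligible.
        eligible = [d for d in donors if usage[d] < donor_limit]
        chosen = rng.choice(eligible)
        usage[chosen] += 1
        imputed[r] = values[chosen]
    return imputed, usage


# Hypothetical data: three observed values and three missing ones.
data = [1.0, None, 2.0, None, None, 3.0]
completed, donor_usage = hot_deck_impute(data, donor_limit=1)
print(completed)
print(donor_usage)
```

With a limit of one, every donor in this example is used exactly once, so no single observed value dominates the imputed entries; a larger limit trades that protection for more freedom in choosing donors.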
    hot deck methods
    missing data
    factors of influence
    classification algorithms
    algorithm parallelization
    principal component analysis
    visualizing categorical data
    clustering
    linear discriminant analysis
    classification
    feature selection
    linear subspaces
    support vector machines
    soft bootstrapping
    cluster analysis
    resampling methods
    threshold optimization
    cooperative query answering
    first order predicate logic
    unsupervised learning
    feedback prediction
    error propagation in classifier chains
    decision support system
    trend detection
    re-clustering of current objects
    \(k\)-means
    fuzzy \(c\)-means
    latent class analysis
    variable selection methods
    stochastic programming
    functional dependence between observations
    Kolmogorov-Smirnov statistics
    model calibration
    power of tests
    rank aggregation
    unsupervised dimension reduction methods
    set covering machine
    component based image reconstruction
    component-wise decomposition
    semiautomatic ontology construction
    data enrichment
