# StatScale Workshop 2019 Abstracts


Sylvain Arlot, Paris-Sud University


Title: Consistent change-point detection with kernels

Abstract: We tackle the change-point problem with data belonging to a general set. We propose a penalty for choosing the number of change-points in the kernel-based method of Harchaoui and Cappé (2007). This penalty generalizes the one proposed for one-dimensional signals by Lebarbier (2005).


By showing a new concentration result in Hilbert spaces, we prove that it satisfies a non-asymptotic oracle inequality. Furthermore, our procedure retrieves the correct number of change-points with high probability, provided the penalty is well chosen, and it estimates the change-point locations at the optimal rate. As a consequence, when using a characteristic kernel, the kernel change-point procedure (KCP) detects all kinds of changes in the distribution (not only changes in the mean or the variance), and it is able to do so for complex structured data (not necessarily in R^d). Most of the analysis is conducted assuming that the kernel is bounded; part of the results can be extended when we only assume a finite second-order moment.

Experiments on synthetic and real data illustrate the accuracy of our method, showing it can detect changes in the whole distribution of data, even when the mean and variance are constant.

Based upon joint work with Alain Celisse, Damien Garreau and Zaid Harchaoui.
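The penalised kernel criterion described in the abstract can be sketched in a few lines: each segment is scored by its within-segment kernel least-squares cost, and a Lebarbier-style penalty in the number of segments D is added. The Gaussian kernel and the constants `c1`, `c2` below are illustrative choices, not the calibrated ones from the paper.

```python
import math

def rbf(x, y, gamma=1.0):
    # Gaussian kernel; any positive-definite kernel could be used instead.
    return math.exp(-gamma * (x - y) ** 2)

def segment_cost(xs, kernel):
    # Within-segment kernel least-squares cost:
    #   sum_i k(x_i, x_i) - (1/|seg|) sum_{i,j} k(x_i, x_j)
    n = len(xs)
    diag = sum(kernel(x, x) for x in xs)
    gram = sum(kernel(a, b) for a in xs for b in xs)
    return diag - gram / n

def kcp_objective(xs, breaks, kernel=rbf, c1=1.0, c2=1.0):
    # Penalised criterion for the segmentation whose segments start at the
    # indices in `breaks`; D is the resulting number of segments.
    n = len(xs)
    bounds = [0] + list(breaks) + [n]
    fit = sum(segment_cost(xs[a:b], kernel) for a, b in zip(bounds, bounds[1:]))
    D = len(bounds) - 1
    penalty = D / n * (c1 * math.log(n / D) + c2)  # Lebarbier-style shape
    return fit + penalty
```

In practice the criterion is minimised over all segmentations (for each D) by dynamic programming on the Gram matrix.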


Haeran Cho, University of Bristol

Title: Localised pruning for data segmentation based on a multiscale change point procedure

Abstract: The segmentation of a time series into piecewise stationary segments is an important problem both in time series analysis and signal processing. It requires the estimation of the total number of change points, which amounts to a model selection problem, as well as their locations. Many algorithms exist for the detection and estimation of multiple change points in the mean of univariate data, from those that find a global minimiser of an information criterion, to multiscale methods that achieve good adaptivity in the localisation of change points via multiple scanning. For the latter, the problem of duplicate and false positive estimators needs to be tackled, for which we propose a localised application of the Schwarz information criterion. As a generic methodology, this new approach is applicable with any multiscale change point method fulfilling mild assumptions, to prune down the set of candidate change point estimators. We establish the theoretical consistency of the proposed localised pruning method in estimating the number and locations of multiple change points under general conditions permitting heavy tails and dependence. In particular, we show that it achieves minimax optimality in change point localisation, in combination with a multiscale extension of the moving sum-based procedure, when there are a finite number of change points. Extensive simulation studies and real data applications confirm the good numerical performance of the proposed methodology.

This is joint work with Claudia Kirch (OvGU Magdeburg).
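The localised pruning step can be illustrated with a toy version: around each candidate, compare the Schwarz criterion of a no-change fit against a one-change fit on a local window, and keep the candidate only if the latter wins. The window choice, parameter counts and penalty below are simplified placeholders for the actual procedure.

```python
import math

def rss(xs):
    # residual sum of squares around the segment mean
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def sic(r, n, k):
    # Schwarz information criterion for a Gaussian mean model:
    #   n * log(RSS / n) + k * log(n)
    return n * math.log(r / n) + k * math.log(n)

def keep_candidate(xs, cp, left, right):
    # Locally decide whether candidate `cp` survives on the window
    # xs[left:right]: one mean (k = 1) vs. two means plus a location (k = 3).
    n = right - left
    sic0 = sic(rss(xs[left:right]), n, 1)
    sic1 = sic(rss(xs[left:cp]) + rss(xs[cp:right]), n, 3)
    return sic1 < sic0
```

Applied to every candidate produced by a multiscale scan, this removes duplicates and false positives while retaining genuine change points.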


Solt Kovács, ETH Zurich

Title: Seeded binary segmentation and a novel non-parametric change point detection method

Abstract: Throughout the talk, we consider offline change point detection problems. First, we introduce seeded binary segmentation (SBS), a modification that eliminates the weaknesses of classical binary segmentation in terms of estimation performance while retaining its advantages (e.g. computation that is fast independently of the number of change points, and the flexibility to be used in numerous change point detection setups).

In the second part of the talk we introduce a novel non-parametric change point detection method based on Random Forests. As Random Forests are computationally expensive to fit, we consider approaches that lead to good empirical performance with a relatively small number of fits. In particular, we will see the advantages of SBS in this non-parametric scenario as well.

For both SBS and the non-parametric change point detection method we will show extensive simulation results and also some theoretical results.
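A seeded-interval construction in the spirit of SBS can be sketched as follows: deterministic intervals whose lengths decay geometrically across scales, with roughly 50% overlap within each scale, giving O(n) intervals in total for decay 1/2. The exact construction in the paper differs in details; this is an illustrative variant.

```python
def seeded_intervals(n, decay=0.5):
    # Start from the full interval; at each scale shrink the length by
    # `decay` and slide with ~50% overlap; stop at very short intervals.
    intervals = [(0, n)]
    length = n
    while length * decay >= 2:
        length = int(length * decay)
        shift = max(1, length // 2)
        start = 0
        while start + length <= n:
            intervals.append((start, start + length))
            start += shift
        if intervals[-1][1] < n:  # make sure the right edge is covered
            intervals.append((n - length, n))
    return sorted(set(intervals))
```

Each interval is then scanned with a single change point test (e.g. CUSUM), and the resulting candidates are combined and pruned; because the seeds are deterministic, the computational cost does not depend on the number of change points.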


Emilie Lebarbier, Université Paris-Nanterre

Title: A factor model approach for the joint segmentation of correlated series

Abstract: The objective of segmentation methods is to detect abrupt changes, called breakpoints, in the distribution of a signal. Such segmentation problems arise in many areas, e.g. in biology, climatology and geodesy. The inference of segmentation models requires a search over the space of all possible segmentations, which is prohibitive in terms of computational time when performed in a naive way. The dynamic programming (DP) strategy (together with its pruned versions) is the only one that retrieves the exact solution in a fast way, but it only applies when the contrast to be optimized (e.g. the log-likelihood) is additive with respect to the segments. However, this is not the case in the presence of dependence. When dealing with multiple series, it is likely that some dependence between the series exists (such as spatial correlation). The correlation between the series is assumed to be constant over time but is allowed to take an arbitrary form. We show that such a dependence structure can be encoded in a factor model. Thanks to this representation, the inference of the breakpoints can then be achieved via DP. A model selection procedure is proposed to determine both the number of breakpoints and the number of factors.
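When the contrast is additive across segments, as the abstract emphasises, dynamic programming recovers the exact optimum. A minimal version with a squared-error contrast (one mean per segment) illustrates the recursion; in the factor-model setting the same recursion would run on the decorrelated series.

```python
def dp_segment(data, K):
    # Exact segmentation of `data` into K segments minimising the total
    # within-segment squared error, via the classical DP recursion
    #   C[k][j] = min_i C[k-1][i] + cost(i, j).
    n = len(data)
    p1, p2 = [0.0], [0.0]
    for x in data:
        p1.append(p1[-1] + x)
        p2.append(p2[-1] + x * x)

    def cost(i, j):
        # squared-error cost of fitting one mean on data[i:j]
        s = p1[j] - p1[i]
        return p2[j] - p2[i] - s * s / (j - i)

    INF = float("inf")
    C = [[INF] * (n + 1) for _ in range(K + 1)]
    back = [[0] * (n + 1) for _ in range(K + 1)]
    C[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = C[k - 1][i] + cost(i, j)
                if c < C[k][j]:
                    C[k][j], back[k][j] = c, i
    # backtrack to recover the segment boundaries
    breaks, j = [], n
    for k in range(K, 0, -1):
        j = back[k][j]
        breaks.append(j)
    return sorted(breaks)[1:], C[K][n]  # drop the leading 0
```

The additivity of the contrast is exactly what makes the inner minimisation over `i` legitimate; the factor representation restores this additivity in the correlated case.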


Housen Li, Georg-August-Universität Göttingen

Title: The essential histogram for data exploration


Abstract: The histogram is widely used as a simple, exploratory display of data, but it is usually not clear how to choose the number and size of bins. We construct a confidence set of distribution functions that optimally address the two main tasks of the histogram: estimating probabilities and detecting features such as increases and modes in the distribution. We define the essential histogram as the histogram in the confidence set with the fewest bins. Thus the essential histogram is the simplest visualization of the data that optimally achieves the main tasks of the histogram. The only assumption we make is that the data are independent and identically distributed. We provide a fast algorithm for the essential histogram, and illustrate our methodology with examples.

This is joint work with Axel Munk (Georg-August-Universität Göttingen), Hannes Sieling (Blue Yonder GmbH), and Guenther Walther (Stanford University).
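The idea of "fewest bins subject to a confidence constraint" can be illustrated with a much simpler stand-in for the paper's multiscale confidence set: accept the smallest number of equal-width bins whose piecewise-linear histogram CDF stays within a DKW-type band of the empirical CDF. The actual essential histogram uses a sharper multiscale construction and allows unequal bins.

```python
import math

def essential_bins(data, alpha=0.05, max_bins=50):
    # Smallest B such that the B-bin equal-width histogram's CDF stays
    # within a DKW band of the empirical CDF (illustrative criterion only).
    xs = sorted(data)
    n = len(xs)
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))  # DKW half-width
    lo, hi = xs[0], xs[-1]
    span = (hi - lo) or 1.0
    for B in range(1, max_bins + 1):
        width = span / B
        counts = [0] * B
        for x in xs:
            counts[min(int((x - lo) / width), B - 1)] += 1
        cum = [0]
        for c in counts:
            cum.append(cum[-1] + c)
        # the histogram CDF is linear inside each bin; compare it with the
        # empirical CDF at every data point
        ok = True
        for i, x in enumerate(xs):
            k = min(int((x - lo) / width), B - 1)
            hcdf = (cum[k] + counts[k] * (x - lo - k * width) / width) / n
            if abs(hcdf - (i + 1) / n) > eps:
                ok = False
                break
        if ok:
            return B
    return None
```

Unimodal, spread-out data already pass with a single bin, whereas clustered data force more bins before the histogram is an acceptable summary.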


George Michailidis, University of Florida


Title: Sequential change-point detection in high-dimensional Gaussian graphical models


Abstract: High dimensional piecewise stationary graphical models represent a versatile class for modelling time varying networks arising in diverse application areas, including biology, economics, and social sciences. There has been recent work on offline detection and estimation of regime changes in the topology of sparse graphical models. However, the online setting remains largely unexplored, despite its high relevance to applications in sensor networks and other engineering monitoring systems, as well as financial markets. To that end, this work introduces a novel scalable online algorithm for detecting an unknown number of abrupt changes in the inverse covariance matrix of sparse Gaussian graphical models with small delay. The proposed algorithm is based upon monitoring the conditional log-likelihood of all nodes in the network and can be extended to a large class of continuous and discrete graphical models. We also investigate asymptotic properties of our procedure under certain mild regularity conditions on the graph size, sparsity level, number of samples, and the pre- and post-change topologies of the network. Numerical experiments on both synthetic and real data illustrate the good performance of the proposed methodology in terms of both computational and statistical efficiency across numerous experimental settings.
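The monitoring idea — watch a running statistic of the nodes' conditional log-likelihoods and raise an alarm with small delay — can be sketched with a one-sided CUSUM on a scalar likelihood stream. The network-wide statistic, thresholds and delay guarantees of the talk are of course more involved; `mu0`, `drift` and `threshold` below are illustrative tuning constants.

```python
def cusum_alarm(stream, mu0, drift=0.2, threshold=2.0):
    # One-sided CUSUM for a downward shift in the mean of `stream`
    # (e.g. the average per-node conditional log-likelihood): accumulate
    # evidence mu0 - x - drift and alarm once it exceeds `threshold`.
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + (mu0 - x) - drift)
        if s > threshold:
            return i  # alarm time
    return None
```

Raising `threshold` lowers the false alarm rate at the price of a longer detection delay, the same trade-off quantified in the talk's asymptotic analysis.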


Peter Rousseeuw, KU Leuven

Title: Detecting Deviating Data Cells


Abstract: A multivariate dataset consists of n cases in d dimensions, and is often stored in an n by d data matrix. It is well known that real data may contain outliers. Depending on the situation, outliers may be (a) undesirable errors which can adversely affect the data analysis, or (b) valuable nuggets of unexpected information. In statistics and data analysis the word outlier usually refers to a row of the data matrix, and the methods to detect such outliers only work when at least half the rows are clean. But often many rows have a few contaminated cell values, which may not be visible by looking at each variable (column) separately. We propose the first method to detect deviating data cells in a multivariate sample that takes the correlations between the variables into account. It has no restriction on the number of clean rows, and can deal with high dimensions. Further advantages are that it provides predicted values for the outlying cells and imputes missing values at the same time. We illustrate the method on several real data sets, where it uncovers more structure than found by purely columnwise methods or purely rowwise methods. The proposed method can help to diagnose why a certain row is outlying, e.g. in process control.


Rajen Shah, University of Cambridge


Title: Goodness-of-fit testing in high-dimensional generalized linear models


Abstract: We propose a family of tests to assess the goodness-of-fit of a high-dimensional generalized linear model. Our framework is flexible and may be used to construct an omnibus test, or directed towards testing specific non-linearities and interaction effects, or for testing the significance of groups of variables. The methodology is based on extracting left-over signal in the residuals from an initial fit of a generalized linear model. This can be achieved by predicting this signal from the residuals using modern powerful regression or machine learning methods such as random forests or boosted trees. Under the null hypothesis that the generalized linear model is correct, no signal is left in the residuals and our test statistic has a Gaussian limiting distribution, translating to asymptotic control of type I error. Under a local alternative, we establish a guarantee on the power of the test. We illustrate the effectiveness of the methodology on simulated and real data examples by testing goodness-of-fit in logistic regression models.
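The residual-prediction idea can be sketched end to end: fit a logistic regression, form residuals y − p̂, try to predict them with a more flexible learner, and compare sqrt(n) times the resulting correlation to a standard normal. For brevity the "flexible learner" here is just a quadratic feature; the talk's proposal uses machine learning methods such as random forests, and a more careful studentisation.

```python
import math
import random

def sigmoid(t):
    # numerically stable logistic function
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_logistic(x, y, iters=50):
    # Newton-Raphson for the model logit P(y = 1) = b0 + b1 * x
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01 + 1e-12
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    da = math.sqrt(sum((u - ma) ** 2 for u in a))
    db = math.sqrt(sum((v - mb) ** 2 for v in b))
    return num / (da * db)

def gof_statistic(x, y):
    # sqrt(n) * correlation between the residuals and a quadratic feature;
    # roughly N(0, 1) when the linear logistic model is correct
    b0, b1 = fit_logistic(x, y)
    resid = [yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, y)]
    z = [xi * xi for xi in x]
    return math.sqrt(len(x)) * corr(z, resid)
```

When the data contain a quadratic effect missing from the fitted model, left-over signal remains in the residuals and the statistic is driven far from zero; under the null it stays of order one.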


Yi Yu, University of Warwick


Title: Localizing Changes in High-Dimensional Vector Autoregressive Processes


Abstract: Autoregressive models capture stochastic processes in which past realizations determine the generative distribution of new data; they arise naturally in a variety of industrial, biomedical, and financial settings. Often, a key challenge when working with such data is to determine when the underlying generative model has changed, as this can offer insights into distinct operating regimes of the underlying system. This paper describes a novel dynamic programming approach to localizing changes in high-dimensional autoregressive processes and associated error rates that improve upon the prior state of the art. When the model parameters are piecewise constant over time and the corresponding process is piecewise stable, the proposed dynamic programming algorithm consistently localizes change points even as the dimensionality, the sparsity of the coefficient matrices, the temporal spacing between two consecutive change points, and the magnitude of the difference of two consecutive coefficient matrices are allowed to vary with the sample size. Furthermore, initial, coarse change point localization estimates can be improved via a computationally efficient refinement algorithm that offers further improvements on the localization error rate. At the heart of the theoretical analysis lies a general framework for high-dimensional change point localization in regression settings that unveils key ingredients of localization consistency in a broad range of settings. The autoregressive model is a special case of this framework. Byproducts of this analysis are new, sharper rates for high-dimensional change point localization in linear regression settings that may be of independent interest. Preprint: https://arxiv.org/abs/1909.0635
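The refinement step can be illustrated in one dimension with an AR(1) stand-in for the high-dimensional VAR: around a preliminary estimate, re-fit the autoregressive coefficient on each side of every candidate split and keep the split with the smallest total squared prediction error. In the paper the per-segment fits are regularised VAR estimates; the grid search below is the same idea in miniature.

```python
import random

def ar1_rss(seg):
    # least-squares AR(1) fit on one segment and its residual sum of squares
    num = sum(seg[t] * seg[t - 1] for t in range(1, len(seg)))
    den = sum(seg[t - 1] ** 2 for t in range(1, len(seg)))
    a = num / den if den else 0.0
    return sum((seg[t] - a * seg[t - 1]) ** 2 for t in range(1, len(seg)))

def refine(x, guess, radius):
    # scan a window around the preliminary estimate `guess` and return the
    # split minimising the two-sided prediction error
    cands = range(max(2, guess - radius), min(len(x) - 2, guess + radius) + 1)
    return min(cands, key=lambda t: ar1_rss(x[:t]) + ar1_rss(x[t:]))
```

Because the scan is confined to a small window, the refinement costs only a constant number of per-segment fits, which is what makes it computationally attractive in the high-dimensional case.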