##### STATSCALE SEMINARS

##### PREVIOUS SEMINARS 2020

4th December 2020 - Priyanga Dilini Talagala (University of Moratuwa)

Title: Anomaly Detection in Streaming Time Series Data

Abstract: The first part of the talk introduces a framework that provides early detection of anomalous series within a large collection of nonstationary streaming time-series data. We define an anomaly as an observation that is very unlikely given the recent distribution of a given system. The proposed framework first calculates a boundary for the system’s typical behaviour using extreme value theory. Then a sliding window is used to test for anomalous series within a newly arrived collection of series. The model uses time series features as inputs, and a density-based comparison to detect any significant changes in the distribution of the features. We show that the proposed algorithm can work well in the presence of noisy nonstationary data within multiple classes of time series.

The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it suffers from some limitations that significantly hinder its performance under certain circumstances. The second part of the talk introduces an algorithm that addresses these limitations. We define an anomaly as an observation whose k-nearest neighbour distance with the maximum gap is significantly different from what we would expect if the distribution of k-nearest neighbours with the maximum gap is in the maximum domain of attraction of the Gumbel distribution. An approach based on extreme value theory is used for the anomalous threshold calculation. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithms. These frameworks are implemented in the open source R packages oddstream and stray.

20th November 2020 - Florian Pein (University of Cambridge)

Title: About the loss function for cross-validation in change-point regression

Abstract: Cross-validation is a major tool in non-parametric regression, in high-dimensional regression and in machine learning for model selection, for tuning parameter selection and for assessing estimation accuracy. In contrast, cross-validation has not been used much in change-point regression. A main reason is the large interest in estimating the number of change-points accurately, whereas cross-validation focuses on minimizing the prediction error. Thus, it is widely believed that cross-validation has a tendency to overestimate the number of change-points. However, Zou et al. (2020) have recently shown that the cross-validation procedure COPPS estimates the number of change-points consistently under certain assumptions. In this work, we show that cross-validation using L2-loss can be problematic. Not only does it tend to overestimate the number of change-points in some examples, it also underestimates the number of change-points in other examples. Consequently, even L2-consistency cannot be guaranteed. Those flaws can be explained by the fact that we have no information to identify where a change-point is between two observations and hence out-of-sample prediction errors can be large around change-points. We will discuss these points theoretically and in simulated examples. We will then propose a modified cross-validation criterion for which consistent estimation of the change-points can be shown again. Moreover, we will argue and verify by simulations that cross-validation using L1-loss can be a good alternative.

This is joint work with Rajen Shah.

6th November 2020 - Alex Aue (UC Davis)

Title: Random matrix theory aids statistical inference in high dimensions

Abstract: The first part of the talk is on bootstrapping spectral statistics in high dimensions. Spectral statistics play a central role in many multivariate testing problems. It is therefore of interest to approximate the distribution of functions of the eigenvalues of sample covariance matrices. Although bootstrap methods are an established approach to approximating the laws of spectral statistics in low-dimensional problems, these methods are relatively unexplored in the high-dimensional setting. The aim of this talk is to focus on linear spectral statistics (LSS) as a class of "prototype statistics" for developing a new bootstrap method in the high-dimensional setting. In essence, the method originates from the parametric bootstrap, and is motivated by the notion that, in high dimensions, it is difficult to obtain a non-parametric approximation to the full data-generating distribution. From a practical standpoint, the method is easy to use, and allows the user to circumvent the difficulties of complex asymptotic formulas for LSS. In addition to proving the consistency of the proposed method, I will discuss encouraging empirical results in a variety of settings. Lastly, and perhaps most interestingly, simulations indicate that the method can be applied successfully to statistics outside the class of LSS, such as the largest sample eigenvalue and others.

The second part of the talk briefly highlights two-sample tests in high dimensions by discussing a ridge-regularized generalization of Hotelling's T^2. The main novelty of this work is in devising a method for selecting the regularization parameter based on the idea of maximizing power within a class of local alternatives. The performance of the proposed test procedures will be illustrated through an application to a breast cancer data set where the goal is to detect the pathways with different DNA copy number alterations across breast cancer subtypes.

23rd October 2020 - Yoav Zemel (University of Cambridge)

Title: Probabilistic approximations to optimal transport

Abstract: Optimal transport is now a popular tool in statistics, machine learning, and data science. A major challenge in applying optimal transport to large-scale problems is its excessive computational cost. We propose a simple subsampling scheme for fast randomized approximate computation of optimal transport distances on finite spaces. This scheme operates on a random subset of the full data and can use any exact algorithm as a black-box back-end, including state-of-the-art solvers and entropically penalized versions. We give non-asymptotic deviation bounds for its accuracy in the case of discrete optimal transport problems, and show that in many important instances, including images (2D-histograms), the approximation error is independent of the size of the full problem. We present numerical experiments demonstrating very good approximation can be obtained while decreasing the computation time by several orders of magnitude.

We will also discuss further, recently obtained results on the limiting distribution of the optimal transport plan.
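As a rough illustration of the subsampling idea in this abstract (not the speaker's implementation), the sketch below estimates an optimal transport distance from random subsamples and passes the subsamples to an exact solver used as a black box. The function names, the one-dimensional setting, the sample size and the number of repetitions are all assumptions chosen for brevity; on the real line the exact 1-Wasserstein distance between two equal-size samples reduces to matching sorted values.

```python
import random


def wasserstein_1d(xs, ys):
    """Exact 1-Wasserstein distance between two equal-size samples on the line."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def subsampled_ot(xs, ys, sample_size, repeats, solver, rng):
    """Average the solver's output over random subsamples drawn with replacement."""
    estimates = []
    for _ in range(repeats):
        sub_x = rng.choices(xs, k=sample_size)
        sub_y = rng.choices(ys, k=sample_size)
        estimates.append(solver(sub_x, sub_y))
    return sum(estimates) / repeats


rng = random.Random(0)
xs = [i / 1000 for i in range(1000)]   # uniform weights on a grid in [0, 1)
ys = [x + 2.0 for x in xs]             # the same measure shifted by 2 (toy data)

exact = wasserstein_1d(xs, ys)
approx = subsampled_ot(xs, ys, sample_size=200, repeats=5,
                       solver=wasserstein_1d, rng=rng)
print(exact, approx)
```

Because the solver is passed in as an argument, any other exact or regularized back-end with the same call signature could be swapped in without changing the subsampling loop.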

9th October 2020 - Solt Kovacs (ETH Zurich)

Title: Optimistic search strategy: change point detection for large-scale data via adaptive logarithmic queries

Abstract: Change point detection is often formulated as a search for the maximum of a gain function describing improved fits when segmenting the data. Searching through all candidate split points on the grid to find the best one requires O(T) evaluations of the gain function for an interval with T observations. If each evaluation is computationally demanding (e.g. in high-dimensional models), this can be computationally infeasible. Instead, we propose “optimistic” strategies with O(log T) evaluations that exploit specific structure of the gain function. Towards a solid understanding of our strategies, we investigate in detail the classical univariate Gaussian change in mean setup. For some of our proposals we prove asymptotic minimax optimality for single and multiple change point scenarios, for the latter in combination with the computationally efficient seeded binary segmentation algorithm. In simulations we demonstrate competitive estimation performance with significantly reduced computational complexity. Our search strategies generalize far beyond the theoretically analyzed univariate setup. As a promising example, we demonstrate massive computational speedup in change point detection for high-dimensional Gaussian graphical models. This talk is based on joint work with Housen Li (University of Göttingen), Lorenz Haubner (ETH Zurich), Axel Munk (University of Göttingen) and Peter Bühlmann (ETH Zurich).
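To make the contrast between O(T) and O(log T) gain evaluations concrete, here is a minimal sketch for the univariate change-in-mean setup. It compares a full grid search with a plain ternary search over the split point. The ternary search is a simplified stand-in chosen for illustration, not the speakers' optimistic search procedure, and it is exact only when the gain is unimodal (as in the noiseless toy series below). All function names are hypothetical.

```python
from itertools import accumulate


def make_gain(x):
    """Build the CUSUM-type gain for splitting x into x[:s] and x[s:]."""
    n = len(x)
    prefix = [0.0] + list(accumulate(x))  # prefix[s] = sum of x[:s]

    def gain(s):
        left_mean = prefix[s] / s
        right_mean = (prefix[n] - prefix[s]) / (n - s)
        return s * (n - s) / n * (left_mean - right_mean) ** 2

    return gain


def count_calls(f):
    """Wrap f so that the number of evaluations can be inspected."""
    def wrapped(s):
        wrapped.calls += 1
        return f(s)
    wrapped.calls = 0
    return wrapped


def full_search(gain, n):
    """Evaluate every candidate split 1..n-1: O(T) gain evaluations."""
    return max(range(1, n), key=gain)


def ternary_search(gain, n):
    """Discard about a third of the interval per step: O(log T) evaluations.

    Exact only if the gain is unimodal; with noisy data it is a heuristic.
    """
    lo, hi = 1, n - 1
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if gain(m1) < gain(m2):
            lo = m1 + 1   # the maximum lies to the right of m1
        else:
            hi = m2 - 1   # the maximum lies to the left of m2
    return max(range(lo, hi + 1), key=gain)


x = [0.0] * 60 + [1.0] * 40            # noiseless toy series, change after 60 points
g_full = count_calls(make_gain(x))
g_fast = count_calls(make_gain(x))
split_full = full_search(g_full, len(x))
split_fast = ternary_search(g_fast, len(x))
print(split_full, g_full.calls)
print(split_fast, g_fast.calls)
```

The saving matters most when a single gain evaluation is expensive, for example when each evaluation requires fitting a high-dimensional model on both sides of the split.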

17th July 2020 - Tobias Kley (University of Bristol)

Title: A new approach for open-end sequential change point monitoring

Abstract: We propose a new sequential monitoring scheme for changes in the parameters of a multivariate time series. In contrast to procedures proposed in the literature, which compare an estimator from the training sample with an estimator calculated from the remaining data, we suggest dividing the sample at each time point after the training sample. Estimators from the sample before and after all separation points are then continuously compared by calculating a maximum of norms of their differences. For open-end scenarios our approach yields an asymptotic level $\alpha$ procedure, which is consistent under the alternative of a change in the parameter. By means of a simulation study it is demonstrated that the new method outperforms the commonly used procedures with respect to power, and the feasibility of our approach is illustrated by analyzing two data examples. This is joint work with Josua Gösmann and Holger Dette.

3rd July 2020 - Claudia Kirch (Otto-von-Guericke University)

Title: Functional change point detection for fMRI data

Abstract: Functional magnetic resonance imaging (fMRI) is now a well-established technique for studying the brain. However, in many situations, such as when data are acquired in a resting state, the statistical analyses depend crucially on stationarity, which could easily be violated. We introduce tests for the detection of deviations from this assumption by making use of change point alternatives, where changes in the mean as well as covariance structure of functional time series are considered. Because of the very high dimensionality of the data, an approach based on a general covariance structure is not feasible, so computations will be conducted by making use of a multidimensional separable functional covariance structure. Using the developed methods, a large study of resting state fMRI data is conducted to determine whether the subjects undertaking the resting scan have nonstationarities present in their time courses. It is found that a sizeable proportion of the subjects studied are not stationary. This is joint work with Christina Stoehr (Ruhr-Universität Bochum) and John Aston (University of Cambridge).

19th June 2020 - Martin Tveten (Dept. of Mathematics, University of Oslo)

Title: Scalable changepoint and anomaly detection in cross-correlated data

Abstract: In the seminar, I will present ongoing work in collaboration with the Statscale group on detecting changes or anomalies in the mean of a subset of variables in cross-correlated data. The maximum likelihood solution of both problems scales exponentially in the number of variables, so not many variables are needed before an approximation is necessary. We propose an approximation in terms of a binary quadratic program and derive a dynamic programming algorithm for computing its solution in linear time in the number of variables, given that the precision matrix is banded. Our simulations indicate that little power is lost by using the approximation in place of the exact maximum likelihood, and that our method performs well even if the sparsity structure of the precision matrix estimate is misspecified. Through the simulation study, we also aim to understand when it is worth the effort to incorporate correlations rather than assuming all variables to be independent, and to find out how our method compares to competing methods in terms of power and estimation accuracy in a range of scenarios. Finally, results from an application of the method to detect known faults on a pump monitored by sensors will be shown.

5th June 2020 - Yudong Chen (University of Cambridge)

Title: High-dimensional, multiscale online changepoint detection

Abstract: We introduce a new method for high-dimensional, online changepoint detection in settings where a p-variate Gaussian data stream may undergo a change in mean. The procedure works by performing likelihood ratio tests against simple alternatives of different scales in each coordinate, and then aggregating test statistics across scales and coordinates. The algorithm is online in the sense that its worst-case computational complexity per new observation, namely O(p^2 log(ep)), is independent of the number of previous observations; in practice, it may even be significantly faster than this. We prove that the patience, or average run length under the null, of our procedure is at least at the desired nominal level, and provide guarantees on its response delay under the alternative that depend on the sparsity of the vector of mean change. Simulations confirm the practical effectiveness of our proposal. This talk is based on joint work with Tengyao Wang and Richard Samworth.
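The general pattern described in this abstract, one likelihood ratio statistic per coordinate and per scale, updated recursively and then aggregated, can be sketched as follows. This is only an illustration under simplifying assumptions: unit-variance Gaussian coordinates, Page-type CUSUM recursions, a hand-picked grid of scales, and a plain maximum as a stand-in for the aggregation step. It is not the speaker's procedure, and the names `make_detector` and `update` are hypothetical.

```python
def make_detector(p, scales, threshold):
    """Return an update function that processes one p-variate observation."""
    # One Page-type statistic per coordinate and per signed scale.
    signed_scales = [s for b in scales for s in (b, -b)]
    stats = [[0.0] * len(signed_scales) for _ in range(p)]

    def update(x):
        """Update all statistics with observation x; return (max statistic, alarm)."""
        best = 0.0
        for j in range(p):
            for k, b in enumerate(signed_scales):
                # Log-likelihood ratio of N(b, 1) against N(0, 1) for x[j],
                # accumulated and reset at zero (Page's CUSUM recursion).
                stats[j][k] = max(stats[j][k] + b * (x[j] - b / 2), 0.0)
                best = max(best, stats[j][k])
        return best, best > threshold

    return update


update = make_detector(p=3, scales=[0.5, 1.0, 2.0], threshold=4.0)
stream = [[0.0, 0.0, 0.0]] * 50 + [[1.0, 0.0, 0.0]] * 30   # noiseless toy stream
alarm_time = None
for t, x in enumerate(stream, start=1):
    _, alarm = update(x)
    if alarm:
        alarm_time = t
        break
print(alarm_time)
```

Each call to `update` touches only the stored statistics, so the work per new observation depends on the number of coordinates and scales but not on how many observations have already been seen, which is the sense of "online" used in the abstract.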