StatScale Workshop Talks

Monday 12th June

Jean-Philippe Vert, Mines ParisTech

Title: Learning on the symmetric group

Abstract: Many data can be represented as rankings or permutations, raising the question of developing machine learning models on the symmetric group. When the number of items in the permutations gets large, manipulating permutations can quickly become computationally intractable. I will discuss two computationally efficient embeddings of the symmetric groups in Euclidean spaces leading to fast machine learning algorithms, and illustrate their relevance on biological applications and image classification.

Tengyao Wang, University of Cambridge

Title: High-dimensional changepoint estimation via sparse projection

Abstract: Changepoints are a very common feature of Big Data that arrive in the form of a data stream. In this paper, we study high-dimensional time series in which, at certain time points, the mean structure changes in a sparse subset of the coordinates. The challenge is to borrow strength across the coordinates in order to detect smaller changes than could be observed in any individual component series. We propose a two-stage procedure called 'inspect' for estimation of the changepoints: first, we argue that a good projection direction can be obtained as the leading left singular vector of the matrix that solves a convex optimisation problem derived from the CUSUM transformation of the time series. We then apply an existing univariate changepoint detection algorithm to the projected series. Our theory provides strong guarantees on both the number of estimated changepoints and the rates of convergence of their locations, and our numerical studies validate its highly competitive empirical performance for a wide range of data generating mechanisms.
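
As a rough illustration of the two-stage idea in this abstract, the Python sketch below handles the single-changepoint case. It is not the authors' implementation: the convex optimisation step is replaced here by entrywise soft-thresholding of the CUSUM matrix, which is an assumption made for brevity, and the function names and the threshold `lam` are hypothetical.

```python
import numpy as np


def cusum_transform(x):
    """CUSUM transformation of a (p, n) array; returns a (p, n-1) array.

    Column t-1 compares the mean of the first t observations with the
    mean of the remaining n-t observations, for t = 1, ..., n-1.
    """
    _, n = x.shape
    t = np.arange(1, n)
    csum = np.cumsum(x, axis=1)
    total = csum[:, -1][:, None]
    left_mean = csum[:, :-1] / t
    right_mean = (total - csum[:, :-1]) / (n - t)
    return np.sqrt(t * (n - t) / n) * (right_mean - left_mean)


def soft_threshold(a, lam):
    """Shrink every entry of a towards zero by lam (assumed stand-in
    for the convex optimisation step mentioned in the abstract)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def estimate_single_changepoint(x, lam):
    """Stage 1: projection direction; stage 2: univariate CUSUM argmax."""
    t_mat = cusum_transform(x)
    u, _, _ = np.linalg.svd(soft_threshold(t_mat, lam), full_matrices=False)
    direction = u[:, 0]                      # leading left singular vector
    projected = direction @ x                # univariate series of length n
    proj_cusum = cusum_transform(projected[None, :])[0]
    # abs() because the sign of a singular vector is arbitrary
    location = int(np.argmax(np.abs(proj_cusum))) + 1
    return location, direction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((50, 200))
    data[:5, 120:] += 1.5                    # sparse mean change after t = 120
    print(estimate_single_changepoint(data, lam=2.0)[0])
```

The returned location t means the first t observations are treated as pre-change. Multiple changepoints would need an outer loop such as binary segmentation, which this sketch omits.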

Chao Zheng, University of Melbourne

Title: A Nonparametric Procedure to Detect Spurious Discoveries with Sparse Signals

Abstract: In the past decades, many data mining and machine learning approaches have been proposed to identify a subset of covariates associated with the response variable. However, the discoveries made by these approaches can be spurious when the dimensionality is large compared to the sample size. In this work we develop a statistical measure of maximum rank-based spurious correlation given a number of predictors. We derive the asymptotic distribution of this spurious correlation and give a consistent estimator of it via multiplier bootstrap procedures. We apply our methods to the genomic analysis of detecting responsive biomarkers. The detection of spurious findings provides a statistical explanation of why some markers identified by methods like FDR are not really significant, and shows that a two-stage or even multi-stage approach should be considered.

Tom Berrett, University of Cambridge

Title: Efficient multivariate entropy estimation via k-nearest neighbour distances

Abstract: Many widely-used statistical procedures, including methods for goodness-of-fit tests, feature selection and changepoint analysis, rely critically on the estimation of the entropy of a distribution. I will initially present new results on a commonly used generalisation of the estimator originally proposed by Kozachenko and Leonenko (1987), which is based on the k-nearest neighbour distances of a sample of independent and identically distributed random vectors. These results show that, in up to 3 dimensions and under regularity conditions, the estimator is efficient for certain choices of k, in the sense of achieving the local asymptotic minimax lower bound. However, they also show that in higher dimensions a non-trivial bias precludes its efficiency regardless of the choice of k. This motivates us to consider a new entropy estimator, formed as a weighted average of Kozachenko-Leonenko estimators for different values of k. A careful choice of weights enables us to reduce the bias of the first estimator and thus obtain an efficient estimator in arbitrary dimensions, given sufficient smoothness. Our results provide theoretical insight and have important methodological implications.
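
To make the k-nearest neighbour construction concrete, here is a minimal Python sketch of one common form of the Kozachenko-Leonenko estimator and of a weighted average over several values of k. The function names are hypothetical, the brute-force distance computation is only suitable for small samples, and the weights are supplied by the caller: the specific bias-reducing weights referred to in the abstract are not reproduced here.

```python
import math

import numpy as np

EULER_GAMMA = 0.5772156649015329


def digamma_int(k):
    """Digamma function at a positive integer k."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, k))


def knn_distances(x, k_max):
    """(n, k_max) array of distances to the 1st, ..., k_max-th neighbours."""
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbour
    return np.sort(dist, axis=1)[:, :k_max]


def kl_entropy(x, k):
    """Kozachenko-Leonenko entropy estimate (in nats) from an (n, d) sample."""
    n, d = x.shape
    rho = knn_distances(x, k)[:, k - 1]      # k-th nearest neighbour distances
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    return float(d * np.mean(np.log(rho)) + log_unit_ball
                 + math.log(n - 1) - digamma_int(k))


def weighted_kl_entropy(x, weights):
    """Weighted average of estimates; weights[j] is applied to k = j + 1.

    The weights are assumed to sum to one and are chosen by the caller.
    """
    return sum(w * kl_entropy(x, j + 1) for j, w in enumerate(weights) if w != 0)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sample = rng.standard_normal((1000, 2))
    print(kl_entropy(sample, 3), math.log(2 * math.pi * math.e))
```

For a standard bivariate normal sample the true entropy is log(2*pi*e), which the demo prints next to the estimate for comparison.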

Nick Heard, Imperial College London

Title: Adaptive Sequential Monte Carlo for Multiple Changepoint Analysis

Abstract: Process monitoring and control requires detection of structural changes in a data stream in real time. This talk presents an efficient sequential Monte Carlo algorithm for learning unknown changepoints in continuous time. The method is intuitively simple: new changepoints for the latest window of data are proposed by conditioning only on data observed since the most recent estimated changepoint, as these observations carry most of the information about the current state of the process. The method shows improved performance over the previous state of the art.

Another advantage of this simple algorithm is that it can be made adaptive, varying the number of particles according to the apparent local complexity of the target changepoint probability distribution. This saves computing time when changes in the changepoint distribution are negligible, and enables re-balancing of the importance weights of existing particles when a significant change in the target distribution is encountered.

Tuesday 13th June

Phil Jonathan, Shell

Title: Real-time data in real-world decision-making

Abstract: Almost every aspect of life can now be captured and stored in digital form. Ubiquitous electronic communication means that even everyday objects at home, at work, at play can share data. A world of real-time global-scale exchange of information between huge numbers of interconnected digital devices is becoming our reality. The availability of endless real-time measurements can in principle improve decision-making. This talk will take the form of a gentle overview of different ways in which digitalisation and real-time data are affecting the way statistical science and the suddenly in-vogue statistician (or "data scientist") can impact a global organisation like Shell. The talk will be illustrated by applications of statistical modelling involving real-time data for wind power forecasting, monitoring of seismic hazards, down-hole flow characterisation by acoustic sensing, remote sensing in carbon capture and storage, telematics for improved engine performance and driver experience, and trouble-shooting for manufacturing. The increased importance of reliable real-world model deployment (IT architecture, software, connectivity, interfaces) and assessment of model performance (outliers, validation, prediction, control) will be emphasised.

Guillem Rigaill, INRA

Title: Changepoint Detection in the Presence of Outliers

Abstract: Many traditional methods for identifying changepoints can struggle in the presence of outliers, or when the noise is heavy-tailed. Often they will infer additional changepoints in order to fit the outliers. To overcome this problem, data often needs to be pre-processed to remove outliers, though this is not feasible in applications where the data needs to be analysed online. We present an approach to changepoint detection that is robust to the presence of outliers. The idea is to adapt existing penalised cost approaches for detecting changes so that they use cost functions that are less sensitive to outliers. We argue that cost functions that are bounded, such as the classical biweight cost, are particularly suitable, as we show that only bounded cost functions are robust to arbitrarily extreme outliers. We present a novel and efficient dynamic programming algorithm that can then find the optimal segmentation under our penalised cost criteria. Importantly, this algorithm can be used in settings where the data needs to be analysed online. We present theoretical bounds on the worst-case complexity of this algorithm, and show empirically that its average computational cost is linear in the amount of data. We show the usefulness of our approach for applications such as analysing well-log data, detecting copy number variation, and detecting tampering of wireless devices.
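
The role of a bounded cost can be seen in a few lines of Python. This sketch takes the biweight cost to be the squared error capped at K^2, which is an assumption about the exact form used in the talk. The function names and the cap `k` are hypothetical, and the dynamic programming algorithm itself is not shown.

```python
import numpy as np


def squared_cost(y, theta):
    """Unbounded segment cost: sum of squared residuals about theta."""
    y = np.asarray(y, dtype=float)
    return float(((y - theta) ** 2).sum())


def biweight_cost(y, theta, k):
    """Bounded segment cost: each squared residual is capped at k**2."""
    y = np.asarray(y, dtype=float)
    return float(np.minimum((y - theta) ** 2, k ** 2).sum())


if __name__ == "__main__":
    segment = [0.0, 0.0, 0.0, 100.0]         # one extreme outlier
    print(squared_cost(segment, 0.0))        # the outlier dominates
    print(biweight_cost(segment, 0.0, 3.0))  # the outlier contributes k**2
```

However extreme the outlier, it adds at most `k**2` to the bounded cost. Fitting it with an extra pair of changepoints can therefore save at most `k**2`, so a sufficiently large penalty discourages doing so.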

Rebecca Killick, Lancaster University

Title: Online forecasting of locally stationary time series

Abstract: Within many fields, online forecasting is an important statistical tool. Traditional statistical techniques often assume stationarity of the past in order to produce accurate forecasts. For data arising from the energy sector and others, this stationarity assumption is often violated but forecasts still need to be produced.

This talk will highlight the potential issues when moving from forecasting stationary to nonstationary data and propose a new estimator, the local partial autocorrelation function, which will aid us in forecasting locally stationary data. We introduce the lpacf alongside associated theory and examples demonstrating its use as a modelling tool. Following this we illustrate the new estimator embedded within a forecasting method and show improved forecasting performance using this new technique in an online setting.

Arnoldo Frigessi, University of Oslo

Title: Probabilistic preference learning with the Mallows rank model for incomplete data

Abstract: Ranking and comparing items is crucial for collecting information about preferences in many areas, from marketing to politics. The Mallows rank model is among the most successful approaches to analyse rank data, but its computational complexity has limited its use to a particular form based on Kendall distance. We develop new computationally tractable methods for Bayesian inference in Mallows models that work with any right-invariant distance. Our method performs inference on the consensus ranking of the items, even when based on partial rankings, such as top-k items or pairwise comparisons. When assessors are many and heterogeneous, we propose a mixture model for clustering them in homogeneous subgroups, with cluster-specific consensus rankings. We develop approximate stochastic algorithms that allow a fully probabilistic analysis. We make probabilistic predictions on the class membership of assessors based on their ranking of just some items, and predict missing individual preferences, as needed in recommendation systems. We test our approach using several experimental and benchmark data sets. I discuss the scalability issues that arise when the numbers of assessors and/or items are very large. This is joint work with Valeria Vitelli, Marta Crispino, Øystein Sørensen, Elja Arjas and Sylvia Qinghua Liu.
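
For readers unfamiliar with the Mallows model, the following Python sketch evaluates it by brute force on a handful of items, with the Kendall and footrule distances as two examples of right-invariant distances. The parameterisation exp(-alpha * d(r, rho)) and all function names are assumptions made for illustration. This is not the inference method of the talk, and enumerating all n! rankings is feasible only for very small n.

```python
import math
from itertools import combinations, permutations


def kendall_distance(r, rho):
    """Number of item pairs that rank vectors r and rho order differently."""
    return sum(1 for i, j in combinations(range(len(r)), 2)
               if (r[i] - r[j]) * (rho[i] - rho[j]) < 0)


def footrule_distance(r, rho):
    """Sum of absolute rank differences between r and rho."""
    return sum(abs(a - b) for a, b in zip(r, rho))


def mallows_pmf(rho, alpha, distance):
    """Exact Mallows probabilities for every ranking of len(rho) items.

    The normalising constant is a sum over n! rankings, so this only
    works for a handful of items.
    """
    rankings = list(permutations(range(1, len(rho) + 1)))
    weights = [math.exp(-alpha * distance(r, rho)) for r in rankings]
    total = sum(weights)
    return {r: w / total for r, w in zip(rankings, weights)}


if __name__ == "__main__":
    consensus = (1, 2, 3, 4)
    pmf = mallows_pmf(consensus, alpha=1.0, distance=footrule_distance)
    print(max(pmf, key=pmf.get))             # the consensus ranking
```

The explicit sum over all rankings is what makes the normalising constant expensive for general distances once the number of items grows.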