##### STATSCALE SEMINARS

##### PREVIOUS SEMINARS 2021

3rd December 2021 - Housen Li (University of Göttingen)

Title: Distributional limits of graph cuts on discretized samples

Abstract - Graph cuts are well-established tools for clustering and classification analysis, with prominent applications in a plethora of scientific fields, e.g. statistics, computer science and machine learning. In particular, they can be seen as a change point problem on graphs. Distributional limits are fundamental to understanding and designing statistical procedures on randomly sampled data, but no such results are known for graph cuts in the literature. To fill this gap, we provide explicit limiting distributions for balanced graph cuts in general on a fixed but arbitrary discretization. In particular, we show that the Minimum Cut, Ratio Cut and Normalized Cut are asymptotically normal as the sample size increases. Moreover, our results reveal an interesting dichotomy for the Cheeger Cut: the limiting distribution for a partition is normal when the balancing term is differentiable, while otherwise it is a random mixture of normals (i.e. a mixture of normals with non-deterministic weights). We verify and support these theoretical findings by means of simulation, pointing out differences between the cuts and the dependency on the underlying distribution. This is joint work with Axel Munk (University of Göttingen) and Leo Suchan (University of Göttingen).
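For orientation, the balanced cuts compared in the talk differ only in their balancing terms. A minimal sketch (our illustration, not the speakers' code) of the Ratio Cut and Normalized Cut objectives for a weighted adjacency matrix `W` and a two-way partition indicator `S`:

```python
import numpy as np

def cut_value(W, S):
    """Total weight of edges crossing the partition S vs its complement."""
    S = np.asarray(S, dtype=bool)
    return W[np.ix_(S, ~S)].sum()

def ratio_cut(W, S):
    """Cut weight balanced by the sizes of the two parts."""
    S = np.asarray(S, dtype=bool)
    c = cut_value(W, S)
    return c / S.sum() + c / (~S).sum()

def normalized_cut(W, S):
    """Cut weight balanced by the volumes (degree sums) of the two parts."""
    S = np.asarray(S, dtype=bool)
    deg = W.sum(axis=1)
    c = cut_value(W, S)
    return c / deg[S].sum() + c / deg[~S].sum()

# demo: two tight triangles joined by a single light edge of weight 0.1
W = np.array([[0.0, 1.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 0.0]])
S = [True, True, True, False, False]
rc, nc = ratio_cut(W, S), normalized_cut(W, S)
```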


19th November 2021 - Guo Yu (University of California, Santa Barbara)

Title: Reluctant interaction modeling in generalized linear models

Abstract - Analyzing contemporary high-dimensional datasets often leads to extremely large-scale interaction modeling problems, where the challenge is to identify important interactions among billions of candidate pairwise interactions. While several methods have recently been proposed to tackle this challenge, they are mostly designed by (1) focusing on linear models with interactions and/or (2) imposing the hierarchy assumption. In practice, however, neither of these two building blocks has to hold. We propose an interaction modeling framework in generalized linear models (GLMs) which is free of any assumptions on hierarchy. The basic premise is a non-trivial extension of the reluctant interaction modeling framework in linear models (Yu et al., 2019), where main effects are preferred over interactions if all else is equal, to the GLM setting. The proposed method is easy to implement and highly scalable to large-scale datasets. Theoretically, we show that the proposed method recovers all the important interactions with high probability. Both the favorable computational and statistical properties are demonstrated through comprehensive empirical studies.
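The "reluctant" premise can be illustrated in the linear case with a two-stage sketch (our simplification, not the speaker's GLM implementation): fit main effects first, then screen candidate pairwise interactions by their association with the residual.

```python
# Illustrative sketch of the reluctant two-stage idea (linear case):
# stage 1 fits main effects only; stage 2 scores each candidate
# interaction by its correlation with the stage-1 residual.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
# one main effect (feature 0) and one pure interaction (features 1 x 2)
y = X[:, 0] + 2.0 * X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=n)

# stage 1: main-effects-only least squares fit
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# stage 2: score all pairwise interactions against the residual
scores = {(j, k): abs(np.corrcoef(X[:, j] * X[:, k], resid)[0, 1])
          for j, k in combinations(range(p), 2)}
best = max(scores, key=scores.get)   # should be the true pair (1, 2)
```

Main effects are thus given the first chance to explain the response; interactions only enter to the extent the residual demands them.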


5th November 2021 - Francesco Sanna Passino (Imperial College London)

Title: Mutually exciting point process graphs for modelling dynamic networks


Abstract - A new class of models for dynamic networks is proposed, called mutually exciting point process graphs (MEG), motivated by a practical application in computer network security. MEG is a scalable network-wide statistical model for point processes with dyadic marks, which can be used for anomaly detection when assessing the significance of previously unobserved connections. The model combines mutually exciting point processes to estimate dependencies between events and latent space models to infer relationships between the nodes. The intensity functions for each network edge are parameterised exclusively by node-specific parameters, which allows information to be shared across the network. Fast inferential procedures using modern gradient ascent algorithms are exploited. The model is tested on simulated graphs and real-world computer network datasets, demonstrating excellent performance. Joint work with Professor Nick Heard (Imperial College London).


22nd October 2021 - Ichiro Takeuchi (Nagoya Institute of Technology)

Title: More powerful and general conditional selective inference by parametric programming and its application to multi-dimensional change-point detection


Abstract - A conditional selective inference (SI) framework was introduced as a statistical inference method for features selected by the Lasso (Lee et al., 2016). This framework allows us to derive the exact conditional sampling distribution of the selected test statistic when the selection event is characterized by a polyhedron. In fact, this framework is not only useful for the Lasso but also generally applicable to a certain class of data-driven hypotheses. A common limitation of existing conditional SI studies is that the hypothesis selection event must be characterized in a simple tractable form, such as a set of linear or quadratic inequalities. This limitation causes the so-called over-conditioning problem, which leads to a loss of power. Furthermore, it makes the conditional SI framework applicable only to relatively simple problems. To overcome this limitation, we propose a new computational method for conditional SI using parametric programming (PP), which we call PP-based SI. PP-based SI allows us to avoid the aforementioned over-conditioning problem and to apply the conditional SI framework to more complex problems. In this talk, after briefly reviewing the conditional SI framework, we introduce the PP-based SI approach and show that it is possible to improve the power of conditional SI for the Lasso and other feature selection methods. Furthermore, as an example of how PP-based SI can extend the applicability of conditional SI, we present our recent work on conditional SI for multi-dimensional change-point detection.

This is joint work with my PhD student Vo Nguyen Le Duy.

References:

[1] Duy et al. (NeurIPS 2020): https://proceedings.neurips.cc/paper/2020/file/82b04cd5aa016d979fe048f3ddf0e8d3-Paper.pdf
[2] Duy et al. (AISTATS 2021): http://proceedings.mlr.press/v130/nguyen-le-duy21a/nguyen-le-duy21a.pdf
[3] Sugiyama et al. (ICML 2021): http://proceedings.mlr.press/v139/sugiyama21a/sugiyama21a.pdf
[4] Duy et al. (arXiv): https://arxiv.org/pdf/2010.01823.pdf
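As background for the "simple tractable form" the abstract refers to, the polyhedral characterization of Lee et al. (2016) can be sketched in a few lines: when the selection event is {A y <= b} and y is isotropic Gaussian, the test statistic eta'y conditional on selection follows a normal distribution truncated to an interval [V-, V+]. This is our illustration of that lemma, not the speaker's PP-based method.

```python
import numpy as np
from math import erf, sqrt, inf

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def polyhedral_pvalue(A, b, y, eta, sigma2=1.0):
    """One-sided selective p-value for H0: eta'mu = 0 given {A y <= b},
    assuming y ~ N(mu, sigma2 * I)."""
    eta_y = float(eta @ y)
    c = eta / (eta @ eta)          # direction along which eta'y varies
    z = y - c * eta_y              # component of y unaffected by eta'y
    rho, resid = A @ c, b - A @ z
    neg, pos = rho < -1e-12, rho > 1e-12
    vminus = np.max(resid[neg] / rho[neg], initial=-inf)   # lower truncation
    vplus = np.min(resid[pos] / rho[pos], initial=inf)     # upper truncation
    sd = sqrt(sigma2 * float(eta @ eta))
    num = Phi(vplus / sd) - Phi(eta_y / sd)
    den = Phi(vplus / sd) - Phi(vminus / sd)
    return num / den

# demo: selection event "y[0] >= 0", written as -y[0] <= 0
A = np.array([[-1.0, 0.0]])
b = np.array([0.0])
p = polyhedral_pvalue(A, b, np.array([1.5, 0.3]), np.array([1.0, 0.0]))
```

Over-conditioning arises when the true selection event is more complex than a single polyhedron and one conditions on a smaller sub-event to keep this formula tractable, which is exactly what the PP-based approach avoids.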


2nd July 2021 - Kevin Lin (University of Pennsylvania)

Title: Time-varying stochastic block models, with application to understanding the dynamics of gene co-expression


Abstract - Single-cell data enables us to investigate how gene co-expression patterns change as cells develop across time. While this question is multifaceted, in this talk we focus on understanding the theory behind a particular subtask: clustering nodes across many undirected labeled graphs indexed by time, also known as multilayer networks. Specifically, we discuss two stochastic block model (SBM) settings: one where the true connectivities among the clusters change over time while the true nodes' cluster memberships are held fixed, and another where both the SBMs' true node memberships and cluster connectivities vary smoothly across time. Our estimator is based on averaging the appropriately-debiased squared adjacency matrices followed by spectral clustering, and we demonstrate how our theoretical results improve upon the existing regimes of consistency (in terms of clustering error) as well as the rates of convergence in the literature. These results demonstrate the interplay among the number of nodes and graphs, the graph sparsity, and the rate of change in true cluster memberships or cluster connectivities across layers. We then demonstrate how our estimator performs empirically on single-cell data. This is joint work with Jing Lei.
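The general recipe described above can be sketched as follows (a rough illustration under our own choices, not the speakers' code: the diagonal-zeroing debiasing and the tiny k-means are simplifications).

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    """Minimal Lloyd's algorithm; enough for a well-separated embedding."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def multilayer_spectral(A_list, K):
    """Average bias-adjusted squared adjacency matrices, then cluster."""
    n = A_list[0].shape[0]
    S = np.zeros((n, n))
    for A in A_list:
        M = A @ A                  # squared adjacency of one layer
        np.fill_diagonal(M, 0.0)   # crude debiasing: drop the inflated diagonal
        S += M / len(A_list)
    _, vecs = np.linalg.eigh(S)
    return kmeans(vecs[:, -K:], K)  # rows of the top-K eigenvectors

# demo: 20 layers of a 2-block SBM with fixed memberships
rng = np.random.default_rng(1)
n, truth = 60, np.repeat([0, 1], 30)
P = np.where(truth[:, None] == truth[None, :], 0.30, 0.05)
layers = []
for _ in range(20):
    U = rng.random((n, n)) < P
    A = np.triu(U, 1)
    layers.append((A + A.T).astype(float))
labels = multilayer_spectral(layers, 2)
```

Squaring before averaging is what lets very sparse layers be pooled: it aggregates two-hop co-connection information whose expectation retains the block structure.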


18th June 2021 - Runmin Wang (Southern Methodist University)

Title: Dating the Break in High Dimensional Data


Abstract - This talk is concerned with estimation and inference for the location of a change point in the mean of independent high-dimensional data. Our change point location estimator maximizes a new U-statistic based objective function, and its convergence rate and asymptotic distribution after suitable centering and normalization are obtained under mild assumptions. Our estimator turns out to have better efficiency than the least squares based counterpart in the literature. Based on the asymptotic theory, we construct a confidence interval by plugging in consistent estimates of several quantities in the normalization. We also provide a bootstrap-based confidence interval and state its asymptotic validity under suitable conditions. Through simulation studies, we demonstrate favorable finite sample performance of the new change point location estimator as compared to its least squares based counterpart, and of our bootstrap-based confidence intervals as compared to several existing competitors. The usefulness of our bootstrap-based confidence intervals is illustrated on a genomics data set.
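To fix ideas, the least squares based counterpart mentioned in the abstract amounts to a CUSUM-type scan; the talk's estimator replaces the squared CUSUM objective below with a U-statistic based one to gain efficiency. A minimal sketch of the baseline:

```python
import numpy as np

def cusum_changepoint(X):
    """Least-squares location estimator for a single mean change in the
    rows of X: maximize the weighted squared between-segment mean
    difference over candidate splits. Returns k, so the change is
    estimated to occur between X[k-1] and X[k]."""
    n = X.shape[0]
    total, cum = X.sum(axis=0), np.cumsum(X, axis=0)
    best_k, best_val = 1, -np.inf
    for k in range(1, n):
        diff = cum[k - 1] / k - (total - cum[k - 1]) / (n - k)
        val = (k * (n - k) / n) * (diff ** 2).sum()
        if val > best_val:
            best_k, best_val = k, val
    return best_k

# demo: 100 observations in 20 dimensions with a mean shift after t = 50
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
X[50:] += 1.0
```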


4th June 2021 - Abolfazl Safikhani (University of Florida)

Title: Multiple Change Point Detection in Reduced Rank High Dimensional Vector Autoregressive Models


Abstract: In this talk, we discuss the problem of detecting and locating change points in high-dimensional Vector Autoregressive (VAR) models whose transition matrices exhibit low rank plus sparse structure. We first address the problem of detecting a single change point using an exhaustive search algorithm and establish a finite sample error bound for its accuracy. Next, we extend the results to the case of multiple change points, whose number can grow as a function of the sample size. Their detection is based on a two-step algorithm: in the first step, an exhaustive search for a candidate change point is applied over overlapping windows, and subsequently a backward elimination procedure screens out redundant candidates. The two-step strategy yields consistent estimates of the number and the locations of the change points. To reduce computational cost, we also investigate conditions under which a surrogate VAR model with a weakly sparse transition matrix can accurately estimate the change points and their locations for data generated by the original model. The effectiveness of the proposed algorithms and methodology is illustrated on both synthetic and real data sets. This is joint work with Peiliang Bai and George Michailidis.
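The two-step strategy can be illustrated on a deliberately simpler problem, mean changes in a univariate series rather than VAR transition matrices (our toy version, with an ad hoc SSE penalty in place of the paper's criteria):

```python
import numpy as np

def single_cp(x):
    """Best single split of a segment by squared-error reduction."""
    n = len(x)
    best_k, best_gain = None, 0.0
    sse_full = ((x - x.mean()) ** 2).sum()
    for k in range(2, n - 1):
        sse = ((x[:k] - x[:k].mean()) ** 2).sum() \
            + ((x[k:] - x[k:].mean()) ** 2).sum()
        if sse_full - sse > best_gain:
            best_k, best_gain = k, sse_full - sse
    return best_k, best_gain

def two_step(x, window=40, step=20, penalty=15.0):
    """Step 1: exhaustive single-change search in overlapping windows
    yields candidates. Step 2: backward elimination drops any candidate
    whose removal raises the total SSE by less than the penalty."""
    n = len(x)
    cands = set()
    for s in range(0, n - window + 1, step):
        k, gain = single_cp(x[s:s + window])
        if k is not None and gain > penalty:
            cands.add(s + k)
    cands = sorted(cands)

    def total_sse(cps):
        cost, prev = 0.0, 0
        for c in list(cps) + [n]:
            seg = x[prev:c]
            cost += ((seg - seg.mean()) ** 2).sum()
            prev = c
        return cost

    improved = True
    while improved and cands:
        improved = False
        for c in list(cands):
            rest = [t for t in cands if t != c]
            if total_sse(rest) - total_sse(cands) < penalty:
                cands, improved = rest, True
                break
    return cands

# demo: mean changes at t = 100 and t = 200
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 100),
                    rng.normal(3, 1, 100),
                    rng.normal(0, 1, 100)])
cps = two_step(x)
```

Overlapping windows keep the exhaustive search cheap (each window sees at most one change), while the elimination pass removes the near-duplicate candidates that adjacent windows inevitably produce.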


21st May 2021 - Hao Ni (University College London)

Title: Sig-Wasserstein Generative models to generate realistic synthetic time series.


Abstract: Wasserstein generative adversarial networks (WGANs) have been very successful in generating samples from seemingly high-dimensional probability measures. However, these methods struggle to capture the temporal dependence of joint probability distributions induced by time-series data. Moreover, training WGANs is computationally expensive due to the min-max formulation of the loss function. To overcome these challenges, we integrate Wasserstein GANs with a mathematically principled and efficient path feature extraction called the signature of a path. The signature of a path is a graded sequence of statistics that provides a universal and principled description for a stream of data, and its expected value characterises the law of the time-series model. In particular, we develop a new metric, (conditional) Sig-W1, that captures the (conditional) joint law of time-series models, and use it as a discriminator. The signature feature space enables an explicit representation of the proposed discriminators, which alleviates the need for expensive training. We validate our method on both synthetic and empirical datasets, and it achieves superior performance over other state-of-the-art benchmark methods. This is joint work with Lukasz Szpruch (University of Edinburgh), Magnus Wiese (University of Kaiserslautern), Shujian Liao (UCL) and Baoren Xiao (UCL).
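The "graded sequence of statistics" can be made concrete: for a piecewise-linear path, the first two signature levels are iterated integrals computable directly from the increments via Chen's relation. A minimal sketch (not the Sig-WGAN training code):

```python
import numpy as np

def signature_level2(path):
    """First- and second-level signature of a piecewise-linear path in R^d,
    given as an array of shape (T, d)."""
    dx = np.diff(path, axis=0)       # straight-line increments
    d = path.shape[1]
    S1, S2 = np.zeros(d), np.zeros((d, d))
    for inc in dx:
        # Chen's relation: concatenating a linear segment updates level 2
        # by S1 (outer) inc plus the segment's own term 0.5 * inc (outer) inc
        S2 += np.outer(S1, inc) + 0.5 * np.outer(inc, inc)
        S1 += inc
    return S1, S2

# demo: a 1-D path with two segments, 0 -> 1 -> 3
path = np.array([[0.0], [1.0], [3.0]])
S1, S2 = signature_level2(path)
```

Level 1 is just the total increment; for a 1-D path level 2 collapses to half its square, which is why higher dimensions (and higher levels) are needed to capture ordering information.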


7th May 2021 - Lynna Chu (Iowa State University)

Title: Sequential Change-point Detection for High-Dimensional and non-Euclidean Data


Abstract: In many modern applications, high-dimensional/non-Euclidean data sequences are collected to study complex phenomena over time and it is often of scientific significance to detect anomaly events as data is continually being collected. We study a nonparametric framework that utilizes nearest neighbor information among the observations and can be applied to various data types to detect changes in an online setting. We consider new test statistics under this framework that can detect anomaly events more effectively than the existing test with the false discovery rate controlled at the same level. Analytical formulas to determine the threshold of claiming a change are also provided, making the approach easily applicable for real data applications.
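The nearest neighbor information the framework relies on can be illustrated in a simplified offline form (our two-sample illustration, not the speaker's online test): count k-NN edges whose endpoints fall on opposite sides of a candidate split; unusually few cross-edges indicate a distributional change.

```python
import numpy as np

def nn_cross_edges(X, t, k=1):
    """Number of directed k-NN edges crossing the split
    {observations before t} vs {observations from t onwards}."""
    n = len(X)
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)              # no self-neighbours
    nbrs = np.argsort(D, axis=1)[:, :k]
    return int(sum((i < t) != (j < t)
                   for i in range(n) for j in nbrs[i]))

# demo: a clear distribution change after observation 20
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(5, 1, (20, 2))])
```

Because only pairwise distances enter, the same count is well defined for non-Euclidean data whenever a dissimilarity is available, which is what makes the framework applicable to various data types.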


26th March 2021 - Hao Chen (UC Davis)

Title: A universal event detection framework for neuropixels data


Abstract: Neuropixels probes present exciting new opportunities for neuroscience, but such large-scale high-density recordings also introduce unprecedented challenges in data analysis. Neuropixels data usually consist of hundreds or thousands of long stretches of sequential spiking activities that evolve non-stationarily over time and are often governed by complex, unknown dynamics. Extracting meaningful information from the Neuropixels recordings is a non-trivial task. Here we introduce a general-purpose, graph-based statistical framework that, without imposing any parametric assumptions, detects points in time at which population spiking activity exhibits simultaneous changes as well as changes that only occur in a subset of the neural population, referred to as "change-points". The sequence of change-point events can be interpreted as a footprint of neural population activities, which allows us to relate behavior to simultaneously recorded high-dimensional neural activities across multiple brain regions. We demonstrate the effectiveness of our method with an analysis of Neuropixels recordings during spontaneous behavior of an awake mouse in darkness. We observe that change-point dynamics in some brain regions display biologically interesting patterns that hint at functional pathways, as well as temporally-precise coordination with behavioral dynamics. We hypothesize that neural activities underlying spontaneous behavior, though distributed brainwide, show evidence for network modularity. Moreover, we envision the proposed framework to be a useful off-the-shelf analysis tool for the neuroscience community as new electrophysiological recording techniques continue to drive an explosive proliferation in the number and size of data sets.


12th March 2021 - Sumanta Basu (Cornell University)

Title: Learning Financial Networks with Graphical Models of Time Series Data


Abstract: After the 2007-09 financial crisis, there has been growing interest in measuring systemic risk, broadly defined as the risk of widespread failure of the entire financial system. In a highly interlinked financial market, a large body of recent work has proposed using network connectivity amongst financial institutions to assess their systemic importance. In this work, we will present some graphical modeling techniques for learning interactions among the components of a large dynamic system from multivariate time series data, where the core idea is to learn from lead-lag relationships (commonly known as Granger causality) between time series in addition to their co-movements. In the context of modeling networks of interactions amongst financial institutions and measuring systemic risk, we will demonstrate how linear and quantile-based Granger causality analyses using vector autoregressive (VAR) models can provide insight. We will present some non-asymptotic statistical theory for our proposed algorithms, estimate these graphical models using stock returns of large financial institutions in the U.S. and India, and demonstrate their usefulness in detecting systemically risky periods and institutions.
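A stripped-down version of the lead-lag idea: fit a VAR(1) by least squares and read a directed network from the large entries of the transition matrix. This is illustrative only; the talk concerns regularized and quantile-based Granger causality for high-dimensional panels.

```python
import numpy as np

def var1_network(X, thresh=0.1):
    """Directed edge j -> i if series j at time t-1 predicts series i
    at time t, judged by the magnitude of the VAR(1) coefficient."""
    Y, Z = X[1:], X[:-1]
    A, *_ = np.linalg.lstsq(Z, Y, rcond=None)   # solves Z @ A ~ Y
    A = A.T                                     # A[i, j]: effect of j on i
    return (np.abs(A) > thresh).astype(int)

# demo: simulate a 3-series VAR(1) with one cross-series (Granger) link
A_true = np.array([[0.5, 0.3, 0.0],
                   [0.0, 0.5, 0.0],
                   [0.0, 0.0, 0.5]])
rng = np.random.default_rng(0)
X = np.zeros((2000, 3))
for t in range(1, 2000):
    X[t] = A_true @ X[t - 1] + rng.normal(size=3)
net = var1_network(X)
```

Co-movement (contemporaneous correlation) would show up in the noise covariance instead; the transition matrix isolates the temporal, directional part of the dependence.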


26th February 2021 - Eric Kolaczyk (Boston University)

Title: How hard is it to work with a network `average'?


Abstract: I will briefly summarize work with colleagues that places the question raised in the title into an appropriate context, with respect to the corresponding geometry and probability, in order to motivate principled notions of network averages with provably analogous behavior to that of your standard scalar and vector averages. Leveraging these results for the purposes of statistical inference in practice, however, requires addressing various computational challenges, particularly in the case of unlabeled networks.


12th February 2021 - Matteo Barigozzi (Università di Bologna)

Title: Quasi Maximum Likelihood Estimation and Inference of Large Approximate Dynamic Factor Models via the EM algorithm


Abstract: This paper studies Quasi Maximum Likelihood estimation of dynamic factor models for large panels of time series. Specifically, we consider the case in which the autocorrelation of the factors is explicitly accounted for and therefore the model has a state-space form. Estimation of the factors and of their loadings is implemented by means of the Expectation Maximization (EM) algorithm, jointly with the Kalman smoother. We prove that, as both the dimension of the panel n and the sample size T diverge to infinity: (i) the estimated loadings are sqrt(T)-consistent and asymptotically normal if sqrt(T)/n → 0; (ii) the estimated factors are sqrt(n)-consistent and asymptotically normal if sqrt(n)/T → 0; (iii) the estimated common component is min(sqrt(T), sqrt(n))-consistent and asymptotically normal regardless of the relative rate of divergence of n and T. Although the model is estimated as if the idiosyncratic terms were cross-sectionally and serially uncorrelated, we show that these mis-specifications do not affect consistency. Moreover, the estimated loadings are asymptotically as efficient as those obtained with the Principal Components estimator, whereas numerical results show that the loss in efficiency of the estimated factors becomes negligible as n and T increase. We then propose robust estimators of the asymptotic covariances, which can be used to conduct inference on the loadings and to compute confidence intervals for the factors and common components. In a Monte Carlo simulation exercise and an analysis of US macroeconomic data, we study the performance of our estimators and we compare them with the traditional Principal Components approach.
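The Principal Components benchmark against which the paper compares can be written in a few lines; the EM-plus-Kalman-smoother approach of the talk refines it when factor autocorrelation is modeled. A sketch under the usual normalization L'L/n = I:

```python
import numpy as np

def pc_factors(X, r):
    """Principal-components estimates of loadings L (n x r) and factors
    F (T x r) for an approximate factor model X = F L' + noise, where X
    is a T x n panel."""
    T, n = X.shape
    vals, vecs = np.linalg.eigh(X.T @ X / T)    # n x n sample covariance
    L = vecs[:, -r:] * np.sqrt(n)               # top-r eigenvectors as loadings
    F = X @ L / n                               # implied factor estimates
    return F, L

# demo: a one-factor panel with T = 500 periods and n = 100 series
rng = np.random.default_rng(0)
T, n, r = 500, 100, 1
F0 = rng.normal(size=(T, r))
L0 = rng.normal(size=(n, r))
X = F0 @ L0.T + 0.5 * rng.normal(size=(T, n))
F, L = pc_factors(X, r)
```

Factors and loadings are only identified up to rotation and sign, so accuracy is best judged on the common component F L', which is invariant to that indeterminacy.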


29th January 2021 - Holger Dette (Ruhr-Universitaet Bochum)

Title: Testing relevant hypotheses in functional time series via self-normalization

Abstract: In this paper we develop methodology for testing relevant hypotheses in a tuning-free way. Our main focus is on functional time series, but extensions to other settings are also discussed. Instead of testing for exact equality, for example the equality of two mean functions from two independent time series, we propose to test for a *relevant* deviation under the null hypothesis. In the two-sample problem this means that an $L^2$-distance between the two mean functions is smaller than a pre-specified threshold. For such hypotheses self-normalization, which was introduced by Shao (2010) and is commonly used to avoid the estimation of nuisance parameters, is not directly applicable. We develop new self-normalized procedures for testing relevant hypotheses and demonstrate the particular advantages of this approach in the comparison of eigenvalues and eigenfunctions.

Reference:

Holger Dette, Kevin Kokot and Stanislav Volgushev (2020). Testing relevant hypotheses in functional time series via self-normalization. Journal of the Royal Statistical Society, Series B, 82(3), 629-660.