MAIN | PROGRAM | INVITED SPEAKERS | ABSTRACTS | PROCEEDINGS

Sponsored by PASCAL2

ABSTRACTS OF THE TALKS


Peter Grünwald

MDL Tutorial

We give a self-contained tutorial on the Minimum Description Length (MDL) approach to modeling, learning and prediction. We focus on the recent (post 1995) formulations of MDL, which can be quite different from the older methods that are often still called 'MDL' in the machine learning and UAI communities.

In its modern guise, MDL is based on the concept of a `universal model'. We explain this concept at length. We show that previous versions of MDL (based on so-called two-part codes), Bayesian model selection and predictive validation (a variation of cross-validation) can all be interpreted as approximations to model selection based on 'universal models'. Modern MDL prescribes the use of a certain `optimal' universal model, the so-called `normalized maximum likelihood model' or `Shtarkov distribution'. This is related to (yet different from) Bayesian model selection with non-informative priors. It leads to a penalization of `complex' models that can be given an intuitive differential-geometric interpretation. Roughly speaking, the complexity of a parametric model is directly related to the number of distinguishable probability distributions that it contains. We also discuss some recent extensions such as the 'luckiness principle', which can be used if the Shtarkov distribution is undefined, and the 'switch distribution', which allows for a resolution of the AIC-BIC dilemma.

The talk assumes no prior knowledge of information theory. The menu is as follows:

1. Codes and Probability Distributions
2. Universal Coding
3. The Bayes, 2-part and Normalized Maximum LIkelihood Universal Model
4. MDL Model Selection
5. Relation to Bayes factor model selection and Cross-Validation
6. The Luckiness Principle, The Switch Distribution


Petri Myllymäki

Fast computation of NML for Bayesian networks

Bayesian networks are parametric models for multidimensional domains exhibiting complex dependencies between the dimensions (domain variables). A central problem in learning such models is how to regularize the number of parameters; in other words, how to determine which dependencies are significant and which are not. The normalized maximum likelihood (NML) distribution or code offers an information-theoretic solution to this problem. Unfortunately, computing it for arbitrary Bayesian network models appears to be computationally infeasible, but we show how it can be computed efficiently for certain restricted type of Bayesian networks.

P. Kontkanen and P. Myllymäki, A linear-time algorithm for computing the multinomial stochastic complexity. Information Processing Letters 103 (2007) 6 (September), 227–233.

P. Kontkanen and P. Myllymäki, MDL histogram density estimation. In Proc. 11th International Conference on Artificial Intelligence and Statistics (AISTATS 2007), Puerto Rico, March 2007.


Steven de Rooij

Nonparametric density estimation by switching

According to standard MDL and Bayesian model selection, we should (roughly) prefer the model that minimises overall prediction error. But if the goal is to predict well, it may well depend on the sample size which model is most useful to predict the next outcome. By re-interpreting the Bayesian prediction strategies associated with the models as "experts", we can use the various algorithms for "expert tracking" to improve model selection for prediction without introducing a substantial computational overhead.

T. van Erven, P.D. Grünwald, and S. de Rooij. Catching up faster in Bayesian model selection and model averaging. Advances in Neural Information Processing Systems 20 (NIPS 2007)


Janne Ojanen

Extensions to MDL denoising

Joint work with J. Heikkonen

The minimum description length principle in wavelet denoising can be extended from the standard linear-quadratic setting in several ways. We describe briefly three extensions: soft thresholding, histogram modeling and a multicomponent approach. The MDL hard thresholding approach based on the normalized maximum likelihood universal modeling can be extended to include soft thresholding shrinkage, which can be considered to give better results in some applications. In MDL histogram denoising approach the assumptions of the parametric density models for the data can be relaxed. The informative and noise components of the data are modeled with equal bin width histograms. The method can cope with different noise distributions. In multicomponent approach more than one non-noise components are included in the model, because it is possible that in addition to the random noise there may be other disturbing signal elements, or that the informative signal is comprised of several different components which we may want to observe, separate or remove. In these cases adding informative components in the model may result result in better performance than in the NML denoising approach.

V. Kumar, J. Heikkonen, J. Rissanen, and K. Kaski. Minimum description length denoising with histogram models. IEEE Transactions on Signal Processing 54(8), pages 2922–2928, 2006.


Tomi Silander

Sequential and factorized NML models

Currently the most popular model selection criterion for learning Bayesian networks is the Bayesian mixture with a conjugate prior. This method has recently been reported to be very sensitive to the choice of prior hyper-parameters. On the other hand, the general model selection criteria, AIC and BIC are derived through asymptotics and their behavior is suboptimal for small sample sizes.

In this work we introduce a new effective scoring criterion for learning Bayesian network structures, the factorized normalized maximum likelihood. This score features no tunable parameters thus avoiding the sensitivity problems of Bayesian scores. The new scoring method also suggests a parametrization of the Bayesian network that is based on the conditional normalized maximum likelihood predictive distribution.

J. Rissanen, and T. Roos, (2007). Conditional NML universal models, 2007 Information Theory and Applications Workshop (ITA-07), pp. 337–341.

T. Roos, T. Silander, P. Kontkanen, and P. Myllymäki, (2008). Bayesian network structure learning using factorized NML universal models, 2008 Information Theory and Applications Workshop (ITA-08).


Tong Zhang

Generalization Theory of Two-part Code MDL Estimator

I will present a finite-sample generalization analysis of two-part code MDL estimator. This method selects a model that minimizes the sum of the model description length plus the data description length given the model. It can be shown that under various conditions, optimal rate of convergence can be achieved through an extended family of two-part code MDL that over-penalize the model description length.

As an example, we apply MDL to learning sparse linear representations when the system dimension is much larger than the number of training examples. This is a problem that has attracted considerable attention in recent years. The generalization performance of a two-part code MDL estimator is calculated based on our theory, and it compares favorably to other methods such as 1-norm regularization.


Ioan Tabus

Segmentation of DNA sequences using Normalized Maximum Likelihood models for uncovering gene duplications

The normalized maximum likelihood (NML) model (Rissanen, 1996; Rissanen, 2001, Shtarkov, 1987) for a class of Markov sources (Tabus and Korodi, 2008) was recently used for the compression of full genomes, obtaining for the human genome the best existing compression results (Korodi and Tabus, 2007). We show that one of the underlying biological features that the compression algorithm implicitly uncovers is the existence of approximate gene duplication. We proposed a refined method based on the same NML models for the segmentation of DNA sequences for uncovering gene duplications (Tabus, Yang, and Astola, 2008). Several analysis tasks in genomic sequences involve preliminary segmentation or clustering of the data, which can be performed by a number of techniques, based on various similarity measures. Here we review and further pursue the application of MDL techniques for genomic sequence analysis. The process of sequence matching will be used for solving the problem of uncovering gene duplications with the help of a preliminary segmentation of a complex DNA locus, known to have evolved through a series of duplications.

G. Korodi, I. Tabus, ``Normalized maximum likelihood model of order-1 for the compression of DNA sequences'', in Proc. IEEE Data Compression Conference, DCC'07, pp. 33–42, Snowbird, 27-29 March 2007.

J. Rissanen, ``Fisher information and stochastic complexity'', IEEE Transactions on Information Theory, vol. IT-42, pp. 40–47, Jan. 1996.

J. Rissanen, ``Strong optimality of the normalized ML models as universal codes and information in data'', IEEE Transactions on Information Theory, vol. IT-47 (5), pp.\ 1712–1717, 2001.

Y.M. Shtarkov, ``Universal sequential coding of single messages''. Translated from Problems of Information Transmission, Vol. 23, No. 3, 3–17, July-September 1987.

I. Tabus, Y. Yang, J. Astola, ``Universal models with memory for genomic sequence analysis'', 3rd International Symposium on Communications, Control and Signal Processing, ISCCSP 2008, March 12–14, St. Julians, Malta, 2008.

I. Tabus, G. Korodi, ``Genome compression using normalized maximum likelihood models for constrained Markov sources'', IEEE Information Theory Workshop, Porto, Portugal, May 5–9, 2008.


Matthias Seeger

Information Consistency of Nonparametric Gaussian Process Methods

Joint work with S. Kakade and D. Foster

We present information consistency results for nonparametric sequential prediction with Gaussian processes. The connection to nonparametric MDL is through the prequential approach, as detailed in Gruenwald's 2007 book, Sect. 13.5. Our proof technique is elementary, making use of a convex duality previously useful to obtain PAC-Bayesian bounds. We also obtain precise information consistency rates for a wide range of kernels and input distributions, using kernel eigenvalue asymptotics. In all these cases, the linear expert space is an infinite-dimensional function space, but still very reasonable rates are obtained.

Seeger, Kakade, Foster, Information Consistency of Nonparametric Gaussian Process Methods, IEEE Transactions on Information Theory 54(5), 2376–2382 (2008)