Bayesian Reasoning and Machine Learning

Language: English

ISBN: Unknown

Front Matter

Notation List

Preface

BRML toolbox

Contents

I Inference in Probabilistic Models

Probabilistic Reasoning

Probability Refresher

Interpreting Conditional Probability

Probability Tables

Probabilistic Reasoning

Prior, Likelihood and Posterior

Two dice : what were the individual scores?

Summary

Code

Basic Probability code

General utilities

An example

Exercises

Basic Graph Concepts

Graphs

Numerically Encoding Graphs

Edge list

Adjacency matrix

Clique matrix

Summary

Code

Utility routines

Exercises

Belief Networks

The Benefits of Structure

Modelling independencies

Reducing the burden of specification

Uncertain and Unreliable Evidence

Uncertain evidence

Unreliable evidence

Belief Networks

Conditional independence

The impact of collisions

Graphical path manipulations for independence

d-Separation

Graphical and distributional in/dependence

Markov equivalence in belief networks

Belief networks have limited expressibility

Causality

Simpson's paradox

The do-calculus

Influence diagrams and the do-calculus

Summary

Code

Naive inference demo

Conditional independence demo

Utility routines

Exercises

Graphical Models

Graphical Models

Markov Networks

Markov properties

Markov random fields

Hammersley-Clifford Theorem

Conditional independence using Markov networks

Lattice Models

Chain Graphical Models

Factor Graphs

Conditional independence in factor graphs

Expressiveness of Graphical Models

Summary

Code

Exercises

Efficient Inference in Trees

Marginal Inference

Variable elimination in a Markov chain and message passing

The sum-product algorithm on factor graphs

Dealing with Evidence

Computing the marginal likelihood

The problem with loops

Other Forms of Inference

Max-Product

Finding the N most probable states

Most probable path and shortest path

Mixed inference

Inference in Multiply Connected Graphs

Bucket elimination

Loop-cut conditioning

Message Passing for Continuous Distributions

Summary

Code

Factor graph examples

Most probable and shortest path

Bucket elimination

Message passing on Gaussians

Exercises

The Junction Tree Algorithm

Clustering Variables

Reparameterisation

Clique Graphs

Absorption

Absorption schedule on clique trees

Junction Trees

The running intersection property

Constructing a Junction Tree for Singly-Connected Distributions

Moralisation

Forming the clique graph

Forming a junction tree from a clique graph

Assigning potentials to cliques

Junction Trees for Multiply-Connected Distributions

Triangulation algorithms

The Junction Tree Algorithm

Remarks on the JTA

Computing the normalisation constant of a distribution

The marginal likelihood

Some small JTA examples

Shafer-Shenoy propagation

Finding the Most Likely State

Reabsorption : Converting a Junction Tree to a Directed Network

The Need For Approximations

Bounded width junction trees

Summary

Code

Utility routines

Exercises

Making Decisions

Expected Utility

Utility of money

Decision Trees

Extending Bayesian Networks for Decisions

Syntax of influence diagrams

Solving Influence Diagrams

Messages on an ID

Using a junction tree

Markov Decision Processes

Maximising expected utility by message passing

Bellman's equation

Temporally Unbounded MDPs

Value iteration

Policy iteration

A curse of dimensionality

Variational Inference and Planning

Financial Matters

Options pricing and expected utility

Binomial options pricing model

Optimal investment

Further Topics

Partially observable MDPs

Reinforcement learning

Summary

Code

Sum/Max under a partial order

Junction trees for influence diagrams

Party-Friend example

Chest Clinic with Decisions

Markov decision processes

Exercises

II Learning in Probabilistic Models

Statistics for Machine Learning

Representing Data

Categorical

Ordinal

Numerical

Distributions

The Kullback-Leibler Divergence KL(q|p)

Entropy and information

Classical Distributions

Multivariate Gaussian

Completing the square

Conditioning as system reversal

Whitening and centering

Exponential Family

Conjugate priors

Learning distributions

Properties of Maximum Likelihood

Training assuming the correct model class

Training when the assumed model is incorrect

Maximum likelihood and the empirical distribution

Learning a Gaussian

Maximum likelihood training

Bayesian inference of the mean and variance

Gauss-Gamma distribution

Summary

Code

Exercises

Learning as Inference

Learning as Inference

Learning the bias of a coin

Making decisions

A continuum of parameters

Decisions based on continuous intervals

Bayesian methods and ML-II

Maximum Likelihood Training of Belief Networks

Bayesian Belief Network Training

Global and local parameter independence

Learning binary variable tables using a Beta prior

Learning multivariate discrete tables using a Dirichlet prior

Structure learning

PC algorithm

Empirical independence

Network scoring

Chow-Liu Trees

Maximum Likelihood for Undirected models

The likelihood gradient

General tabular clique potentials

Decomposable Markov networks

Exponential form potentials

Conditional random fields

Pseudo likelihood

Learning the structure

Summary

Code

PC algorithm using an oracle

Demo of empirical conditional independence

Bayes Dirichlet structure learning

Exercises

Naive Bayes

Naive Bayes and Conditional Independence

Estimation using Maximum Likelihood

Binary attributes

Multi-state variables

Text classification

Bayesian Naive Bayes

Tree Augmented Naive Bayes

Learning tree augmented Naive Bayes networks

Summary

Code

Exercises

Learning with Hidden Variables

Hidden Variables and Missing Data

Why hidden/missing variables can complicate proceedings

The missing at random assumption

Maximum likelihood

Identifiability issues

Expectation Maximisation

Variational EM

Classical EM

Application to Belief networks

General case

Convergence

Application to Markov networks

Extensions of EM

Partial M step

Partial E-step

A failure case for EM

Variational Bayes

EM is a special case of variational Bayes

An example: VB for the Asbestos-Smoking-Cancer network

Optimising the Likelihood by Gradient Methods

Undirected models

Summary

Code

Exercises

Bayesian Model Selection

Comparing Models the Bayesian Way

Illustrations : coin tossing

A discrete parameter space

A continuous parameter space

Occam's Razor and Bayesian Complexity Penalisation

A continuous example : curve fitting

Approximating the Model Likelihood

Laplace's method

Bayes information criterion (BIC)

Bayesian Hypothesis Testing for Outcome Analysis

Outcome analysis

H_indep : model likelihood

H_same : model likelihood

Dependent outcome analysis

Is classifier A better than B?

Summary

Code

Exercises

III Machine Learning

Machine Learning Concepts

Styles of Learning

Supervised learning

Unsupervised learning

Anomaly detection

Online (sequential) learning

Interacting with the environment

Semi-supervised learning

Supervised Learning

Utility and Loss

Using the empirical distribution

Bayesian decision approach

Bayes versus Empirical Decisions

Summary

Exercises

Nearest Neighbour Classification

Do As Your Neighbour Does

K-Nearest Neighbours

A Probabilistic Interpretation of Nearest Neighbours

When your nearest neighbour is far away

Summary

Code

Exercises

Unsupervised Linear Dimension Reduction

High-Dimensional Spaces – Low Dimensional Manifolds

Principal Components Analysis

Deriving the optimal linear reconstruction

Maximum variance criterion

PCA algorithm

PCA and nearest neighbours classification

Comments on PCA

High Dimensional Data

Eigen-decomposition for N < D

PCA via Singular value decomposition

Latent Semantic Analysis

Information retrieval

PCA With Missing Data

Finding the principal directions

Collaborative filtering using PCA with missing data

Matrix Decomposition Methods

Probabilistic latent semantic analysis

Extensions and variations

Applications of PLSA/NMF

Kernel PCA

Canonical Correlation Analysis

SVD formulation

Summary

Code

Exercises

Supervised Linear Dimension Reduction

Supervised Linear Projections

Fisher's Linear Discriminant

Canonical Variates

Dealing with the nullspace

Summary

Code

Exercises

Linear Models

Introduction: Fitting A Straight Line

Linear Parameter Models for Regression

Vector outputs

Regularisation

Radial basis functions

The Dual Representation and Kernels

Regression in the dual-space

Linear Parameter Models for Classification

Logistic regression

Beyond first order gradient ascent

Avoiding overconfident classification

Multiple classes

The Kernel Trick for Classification

Support Vector Machines

Maximum margin linear classifier

Using kernels

Performing the optimisation

Probabilistic interpretation

Soft Zero-One Loss for Outlier Robustness

Summary

Code

Exercises

Bayesian Linear Models

Regression With Additive Gaussian Noise

Bayesian linear parameter models

Determining hyperparameters: ML-II

Learning the hyperparameters using EM

Hyperparameter optimisation : using the gradient

Validation likelihood

Prediction and model averaging

Sparse linear models

Classification

Hyperparameter optimisation

Laplace approximation

Variational Gaussian approximation

Local variational approximation

Relevance vector machine for classification

Multi-class case

Summary

Code

Exercises

Gaussian Processes

Non-Parametric Prediction

From parametric to non-parametric

From Bayesian linear models to Gaussian processes

A prior on functions

Gaussian Process Prediction

Regression with noisy training outputs

Covariance Functions

Making new covariance functions from old

Stationary covariance functions

Non-stationary covariance functions

Analysis of Covariance Functions

Smoothness of the functions

Mercer kernels

Fourier analysis for stationary kernels

Gaussian Processes for Classification

Binary classification

Laplace's approximation

Hyperparameter optimisation

Multiple classes

Summary

Code

Exercises

Mixture Models

Density Estimation Using Mixtures

Expectation Maximisation for Mixture Models

Unconstrained discrete tables

Mixture of product of Bernoulli distributions

The Gaussian Mixture Model

EM algorithm

Practical issues

Classification using Gaussian mixture models

The Parzen estimator

K-Means

Bayesian mixture models

Semi-supervised learning

Mixture of Experts

Indicator Models

Joint indicator approach: factorised prior

Polya prior

Mixed Membership Models

Latent Dirichlet allocation

Graph based representations of data

Dyadic data

Monadic data

Cliques and adjacency matrices for monadic binary data

Summary

Code

Exercises

Latent Linear Models

Factor Analysis

Finding the optimal bias

Factor Analysis : Maximum Likelihood

Eigen-approach likelihood optimisation

Expectation maximisation

Interlude: Modelling Faces

Probabilistic Principal Components Analysis

Canonical Correlation Analysis and Factor Analysis

Independent Components Analysis

Summary

Code

Exercises

Latent Ability Models

The Rasch Model

Maximum likelihood training

Bayesian Rasch models

Competition Models

Bradley-Terry-Luce model

Elo ranking model

Glicko and TrueSkill

Summary

Code

Exercises

IV Dynamical Models

Discrete-State Markov Models

Markov Models

Equilibrium and stationary distribution of a Markov chain

Fitting Markov models

Mixture of Markov models

Hidden Markov Models

The classical inference problems

Filtering p(h_t | v_{1:t})

Parallel smoothing p(h_t | v_{1:T})

Correction smoothing

Sampling from p(h_{1:T} | v_{1:T})

Most likely joint state

Prediction

Self localisation and kidnapped robots

Natural language models

Learning HMMs

EM algorithm

Mixture emission

The HMM-GMM

Discriminative training

Related Models

Explicit duration model

Input-Output HMM

Linear chain CRFs

Dynamic Bayesian networks

Applications

Object tracking

Automatic speech recognition

Bioinformatics

Part-of-speech tagging

Summary

Code

Exercises

Continuous-state Markov Models

Observed Linear Dynamical Systems

Stationary distribution with noise

Auto-Regressive Models

Training an AR model

AR model as an OLDS

Time-varying AR model

Time-varying variance AR models

Latent Linear Dynamical Systems

Inference

Filtering

Smoothing : Rauch-Tung-Striebel correction method

The likelihood

Most likely state

Time independence and Riccati equations

Learning Linear Dynamical Systems

Identifiability issues

EM algorithm

Subspace Methods

Structured LDSs

Bayesian LDSs

Switching Auto-Regressive Models

Inference

Maximum likelihood learning using EM

Summary

Code

Autoregressive models

Exercises

Switching Linear Dynamical Systems

Introduction

The Switching LDS

Exact inference is computationally intractable

Gaussian Sum Filtering

Continuous filtering

Discrete filtering

The likelihood p(v_{1:T})

Collapsing Gaussians

Relation to other methods

Gaussian Sum Smoothing

Continuous smoothing

Discrete smoothing

Collapsing the mixture

Using mixtures in smoothing

Relation to other methods

Reset Models

A Poisson reset model

Reset-HMM-LDS

Summary

Code

Exercises

Distributed Computation

Introduction

Stochastic Hopfield Networks

Learning Sequences

A single sequence

Multiple sequences

Boolean networks

Sequence disambiguation

Tractable Continuous Latent Variable Models

Deterministic latent variables

An augmented Hopfield network

Neural Models

Stochastically spiking neurons

Hopfield membrane potential

Dynamic synapses

Leaky integrate and fire models

Summary

Code

Exercises

V Approximate Inference

Sampling

Introduction

Univariate sampling

Rejection sampling

Multivariate sampling

Ancestral Sampling

Dealing with evidence

Perfect sampling for a Markov network

Gibbs Sampling

Gibbs sampling as a Markov chain

Structured Gibbs sampling

Remarks

Markov Chain Monte Carlo (MCMC)

Markov chains

Metropolis-Hastings sampling

Auxiliary Variable Methods

Hybrid Monte Carlo

Swendsen-Wang

Slice sampling

Importance Sampling

Sequential importance sampling

Particle filtering as an approximate forward pass

Summary

Code

Exercises

Deterministic Approximate Inference

Introduction

The Laplace approximation

Properties of Kullback-Leibler Variational Inference

Bounding the normalisation constant

Bounding the marginal likelihood

Bounding marginal quantities

Gaussian approximations using KL divergence

Marginal and moment matching properties of minimising KL(p|q)

Variational Bounding Using KL(q|p)

Pairwise Markov random field

General mean field equations

Asynchronous updating guarantees approximation improvement

Structured variational approximation

Local and KL Variational Approximations

Local approximation

KL variational approximation

Mutual Information Maximisation : A KL Variational Approach

The information maximisation algorithm

Linear Gaussian decoder

Loopy Belief Propagation

Classical BP on an undirected graph

Loopy BP as a variational procedure

Expectation Propagation

MAP for Markov networks

Pairwise Markov networks

Attractive binary Markov networks

Potts model

Further Reading

Summary

Code

Exercises

End Matter

VI Appendix

Background Mathematics

Linear Algebra

Vector algebra

The scalar product as a projection

Lines in space

Planes and hyperplanes

Matrices

Linear transformations

Determinants

Matrix inversion

Computing the matrix inverse

Eigenvalues and eigenvectors

Matrix decompositions

Multivariate Calculus

Interpreting the gradient vector

Higher derivatives

Matrix calculus

Inequalities

Convexity

Jensen's inequality

Optimisation

Multivariate Optimisation

Gradient descent with fixed stepsize

Gradient descent with line searches

Minimising quadratic functions using line search

Gram-Schmidt construction of conjugate vectors

The conjugate vectors algorithm

The conjugate gradients algorithm

Newton's method

Constrained Optimisation using Lagrange Multipliers

Lagrange Dual

Bibliography

Index
