Neural Network Learning: Theoretical Foundations

Martin Anthony and Peter L. Bartlett


This book describes recent theoretical advances in the study of artificial neural networks. It explores probabilistic models of supervised learning problems and addresses the key statistical and computational questions. The book surveys research on pattern classification with binary-output networks, discussing the relevance of the Vapnik-Chervonenkis dimension and estimating this dimension for several neural network models. A model of classification by real-output networks is developed, and the usefulness of classification with a 'large margin' is demonstrated. The book explains the role of scale-sensitive versions of the Vapnik-Chervonenkis dimension in large margin classification and in real prediction. It also discusses the computational complexity of neural network learning, describing a variety of hardness results and outlining two efficient constructive learning algorithms. The book is essentially self-contained, introducing the necessary background material on probability, statistics, combinatorics and computational complexity. It is suitable for researchers and graduate students in computer science, engineering, and mathematics.
Cambridge University Press, Cambridge and New York, 1999.
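As a rough indication of the flavour of results developed in Parts one and two (a loose statement for orientation only, with constants and the precise learning model suppressed; the exact theorems appear in Chapters 4 and 5): if a class of binary-valued functions has finite Vapnik-Chervonenkis dimension d, then a sample-error-minimizing learner that sees m random examples produces, with probability at least 1 - delta, a hypothesis whose error is within epsilon of the best in the class, provided roughly

    m \;=\; O\!\left( \frac{1}{\epsilon^{2}} \Bigl( d \ln\frac{1}{\epsilon} + \ln\frac{1}{\delta} \Bigr) \right),

and Chapter 5 gives a lower bound of the same order up to the logarithmic factor, so the VC-dimension essentially governs the sample complexity of learning in this setting.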

Table of Contents

1. Introduction

1.1. Supervised Learning
1.2. Artificial Neural Networks
1.3. Outline of the Book
1.4. Bibliographical Notes

Part one: Pattern Recognition with Binary-Output Neural Networks

2. The pattern recognition problem

2.1. The Learning Problem
2.2. Learning Finite Function Classes
2.3. Applications to Perceptrons
2.4. The Restricted Model
2.5. Remarks
2.6. Bibliographical Notes

3. VC-dimension

3.1. Introduction
3.2. The Growth Function
3.3. The Vapnik-Chervonenkis Dimension
3.4. Bibliographical Notes

4. General upper bounds

4.1. Learning by Minimizing Sample Error
4.2. Uniform Convergence and Learnability
4.3. Proof of Uniform Convergence Result
4.4. Application to the Perceptron
4.5. The Restricted Model
4.6. Remarks
4.7. Bibliographical Notes

5. General lower bounds

5.1. Introduction
5.2. A Lower Bound for Learning
5.3. The Restricted Model
5.4. VC-dimension Quantifies Sample Complexity and Estimation Error
5.5. Remarks
5.6. Bibliographical Notes

6. Linear threshold networks

6.1. Feed-forward Neural Networks
6.2. Upper Bound
6.3. Lower Bounds
6.4. Bibliographical Notes

7. Geometric techniques

7.1. Introduction
7.2. The Need for Conditions on the Activation Functions
7.3. A Bound on the Growth Function
7.4. Proof of the Growth Function Bound
7.5. More on Solution Set Components Bounds
7.6. Bibliographical Notes

8. VC-dimension bounds for neural networks

8.1. Introduction
8.2. Function Classes that are Polynomial in their Parameters
8.3. Piecewise Polynomial Networks
8.4. Standard Sigmoid Networks
8.5. Remarks
8.6. Bibliographical Notes

Part two: Pattern Recognition with Real-Output Networks

9. Classification with real functions

9.1. Introduction
9.2. Large Margin Classifiers
9.3. Remarks
9.4. Bibliographical Notes

10. Covering numbers and convergence

10.1. Introduction
10.2. Covering Numbers
10.3. A Uniform Convergence Result
10.4. Covering Numbers in General
10.5. Remarks
10.6. Bibliographical Notes

11. The pseudo-dimension and fat-shattering dimension

11.1. Introduction
11.2. The Pseudo-dimension
11.3. The Fat-shattering Dimension
11.4. Bibliographical Notes

12. Bounding covering numbers with dimensions

12.1. Introduction
12.2. Packing Numbers
12.3. Bounding with the Pseudo-dimension
12.4. Bounding with the Fat-shattering Dimension
12.5. Comparing the Two Approaches
12.6. Remarks
12.7. Bibliographical Notes

13. Sample complexity of classification learning

13.1. Large Margin SEM Algorithms
13.2. Large Margin SEM Algorithms as Classification Learning Algorithms
13.3. Lower Bounds for Certain Function Classes
13.4. Using the Pseudo-dimension
13.5. Remarks
13.6. Bibliographical Notes

14. Neural network dimensions

14.1. Introduction
14.2. Pseudo-dimension of Neural Networks
14.3. Bounds on the Fat-shattering Dimension in terms of Number of Parameters
14.4. Bounds on the Fat-shattering Dimension in terms of Size of Parameters
14.5. Remarks
14.6. Bibliographical Notes

15. Model selection

15.1. Introduction
15.2. Model Selection Results
15.3. Proofs of the Results
15.4. Remarks
15.5. Bibliographical Notes

Part three: Learning Real-Valued Functions

16. Learning classes of real functions

16.1. Introduction
16.2. The Learning Framework for Real Estimation
16.3. Learning Finite Classes of Real Functions
16.4. A Substitute for Finiteness
16.5. Remarks
16.6. Bibliographical Notes

17. Uniform convergence for real classes

17.1. Uniform Convergence for Real Functions
17.2. Remarks
17.3. Bibliographical Notes

18. Bounding covering numbers

18.1. Introduction
18.2. Bounding with the Fat-shattering Dimension
18.3. Bounding with the Pseudo-dimension
18.4. Comparing the Different Approaches
18.5. Remarks
18.6. Bibliographical Notes

19. Sample complexity of learning function classes

19.1. Introduction
19.2. Classes with Finite Fat-shattering Dimension
19.3. Classes of Finite Pseudo-dimension
19.4. Results for Neural Networks
19.5. Lower Bounds
19.6. Remarks
19.7. Bibliographical Notes

20. Convex classes

20.1. Introduction
20.2. Lower Bounds for Non-Convex Classes
20.3. Upper Bound for Convex Classes
20.4. Remarks
20.5. Bibliographical Notes

21. Other learning problems

21.1. Loss Functions in General
21.2. Convergence for General Loss Functions
21.3. Learning in Multiple-Output Networks
21.4. Interpolation Models
21.5. Remarks
21.6. Bibliographical Notes

Part four: Algorithmics

22. Efficient learning

22.1. Introduction
22.2. Graded Function Classes
22.3. Efficient Learning
22.4. General Classes of Efficient Learning Algorithms
22.5. Efficient Learning in the Restricted Model
22.6. Bibliographical Notes

23. Learning as optimization

23.1. Introduction
23.2. Randomized Algorithms
23.3. Learning as Randomized Optimization
23.4. A Characterization of Efficient Learning
23.5. The 'Hardness' of Learning
23.6. Remarks
23.7. Bibliographical Notes

24. The Boolean perceptron

24.1. Introduction
24.2. Learning is Hard for the Simple Perceptron
24.3. Learning is Easy for Fixed Fan-in
24.4. Perceptron Learning in the Restricted Model
24.5. Remarks
24.6. Bibliographical Notes

25. Hardness results

25.1. Introduction
25.2. Linear Threshold Networks with Binary Inputs
25.3. Linear Threshold Networks with Real Inputs
25.4. Sigmoid Networks
25.5. Remarks
25.6. Bibliographical Notes

26. Constructive algorithms

26.1. Introduction
26.2. Real Estimation with Convex Combinations of Basis Functions
26.3. Real Classification using Boosting
26.4. Bibliographical Notes

Appendix

A1. Useful Results

A1.1. Inequalities
A1.2. Tails of Distributions
A1.3. Sard's Theorem

Bibliography

Author index

Subject index


NIPS'98 Tutorial based on the book

