Book Description
This book offers a comprehensive introduction to the central ideas that underpin deep learning. It is intended both for newcomers to machine learning and for those already experienced in the field. Covering key concepts relating to contemporary architectures and techniques, this essential book equips readers with a robust foundation for potential future specialization. The field of deep learning is undergoing rapid evolution, and therefore this book focuses on ideas that are likely to stand the test of time.

The book is organized into numerous bite-sized chapters, each exploring a distinct topic, and the narrative follows a linear progression, with each chapter building upon content from its predecessors. This structure is well suited to teaching a two-semester undergraduate or postgraduate machine learning course, while remaining equally relevant to those engaged in active research or in self-study.

A full understanding of machine learning requires some mathematical background, and so the book includes a self-contained introduction to probability theory. However, the focus of the book is on conveying a clear understanding of ideas, with emphasis on the real-world practical value of techniques rather than on abstract theory. Complex concepts are therefore presented from multiple complementary perspectives, including textual descriptions, diagrams, mathematical formulae, and pseudo-code.

Chris Bishop is a Technical Fellow at Microsoft and the Director of Microsoft Research AI4Science. He is a Fellow of Darwin College Cambridge, a Fellow of the Royal Academy of Engineering, and a Fellow of the Royal Society.

Hugh Bishop is an Applied Scientist at Wayve, a deep learning autonomous driving company in London, where he designs and trains deep neural networks. Before joining Wayve, he completed his MPhil in Machine Learning and Machine Intelligence in the Engineering Department at Cambridge University.

"Chris Bishop wrote a terrific textbook on neural networks in 1995 and has a deep knowledge of the field and its core ideas. His many years of experience in explaining neural networks have made him extremely skillful at presenting complicated ideas in the simplest possible way, and it is a delight to see these skills applied to the revolutionary new developments in the field." -- Geoffrey Hinton

"This excellent and very educational book will bring the reader up to date with the main concepts and advances in deep learning with a solid anchoring in probability. These concepts are powering current industrial AI systems and are likely to form the basis of further advances towards artificial general intelligence." -- Yoshua Bengio

"With the recent explosion of deep learning and AI as a research topic, and the quickly growing importance of AI applications, a modern textbook on the topic was badly needed. The 'New Bishop' masterfully fills the gap, covering fundamental topics in linear algebra, probability theory, and function optimisation, learning algorithms for supervised and unsupervised learning, modern deep learning architecture families, as well as how to apply all of this to various application areas. The book is poised to have a similarly large impact as Chris Bishop's 1995 book on neural networks." -- Yann LeCun
Table of Contents

Preface
1 The Deep Learning Revolution
  1.1 The Impact of Deep Learning
    1.1.1 Medical diagnosis
    1.1.2 Protein structure
    1.1.3 Image synthesis
    1.1.4 Large language models
  1.2 A Tutorial Example
    1.2.1 Synthetic data
    1.2.2 Linear models
    1.2.3 Error function
    1.2.4 Model complexity
    1.2.5 Regularization
    1.2.6 Model selection
  1.3 A Brief History of Machine Learning
    1.3.1 Single-layer networks
    1.3.2 Backpropagation
    1.3.3 Deep networks
2 Probabilities
  2.1 The Rules of Probability
    2.1.1 A medical screening example
    2.1.2 The sum and product rules
    2.1.3 Bayes' theorem
    2.1.4 Medical screening revisited
    2.1.5 Prior and posterior probabilities
    2.1.6 Independent variables
  2.2 Probability Densities
    2.2.1 Example distributions
    2.2.2 Expectations and covariances
  2.3 The Gaussian Distribution
    2.3.1 Mean and variance
    2.3.2 Likelihood function
    2.3.3 Bias of maximum likelihood
    2.3.4 Linear regression
  2.4 Transformation of Densities
    2.4.1 Multivariate distributions
  2.5 Information Theory
    2.5.1 Entropy
    2.5.2 Physics perspective
    2.5.3 Differential entropy
    2.5.4 Maximum entropy
    2.5.5 Kullback-Leibler divergence
    2.5.6 Conditional entropy
    2.5.7 Mutual information
  2.6 Bayesian Probabilities
    2.6.1 Model parameters
    2.6.2 Regularization
    2.6.3 Bayesian machine learning
  Exercises
3 Standard Distributions
  3.1 Discrete Variables
    3.1.1 Bernoulli distribution
    3.1.2 Binomial distribution
    3.1.3 Multinomial distribution
  3.2 The Multivariate Gaussian
    3.2.1 Geometry of the Gaussian
    3.2.2 Moments
    3.2.3 Limitations
    3.2.4 Conditional distribution
    3.2.5 Marginal distribution
    3.2.6 Bayes' theorem
    3.2.7 Maximum likelihood
    3.2.8 Sequential estimation
    3.2.9 Mixtures of Gaussians
  3.3 Periodic Variables
    3.3.1 Von Mises distribution
  3.4 The Exponential Family
    3.4.1 Sufficient statistics
  3.5 Nonparametric Methods
    3.5.1 Histograms
    3.5.2 Kernel densities
    3.5.3 Nearest-neighbours
  Exercises
4 Single-layer Networks: Regression
  4.1 Linear Regression
    4.1.1 Basis functions
    4.1.2 Likelihood function
    4.1.3 Maximum likelihood
    4.1.4 Geometry of least squares
    4.1.5 Sequential learning
    4.1.6 Regularized least squares
    4.1.7 Multiple outputs
  4.2 Decision Theory
  4.3 The Bias-Variance Trade-off
  Exercises
5 Single-layer Networks: Classification
  5.1 Discriminant Functions
    5.1.1 Two classes
    5.1.2 Multiple classes
    5.1.3 1-of-K coding
    5.1.4 Least squares for classification
  5.2 Decision Theory
    5.2.1 Misclassification rate
    5.2.2 Expected loss
    5.2.3 The reject option
    5.2.4 Inference and decision
    5.2.5 Classifier accuracy
    5.2.6 ROC curve
  5.3 Generative Classifiers
    5.3.1 Continuous inputs
    5.3.2 Maximum likelihood solution
    5.3.3 Discrete features
    5.3.4 Exponential family
  5.4 Discriminative Classifiers
    5.4.1 Activation functions
    5.4.2 Fixed basis functions
    5.4.3 Logistic regression
    5.4.4 Multi-class logistic regression
    5.4.5 Probit regression
    5.4.6 Canonical link functions
  Exercises
6 Deep Neural Networks
  6.1 Limitations of Fixed Basis Functions
    6.1.1 The curse of dimensionality
    6.1.2 High-dimensional spaces
    6.1.3 Data manifolds
    6.1.4 Data-dependent basis functions
  6.2 Multilayer Networks
    6.2.1 Parameter matrices
    6.2.2 Universal approximation
    6.2.3 Hidden unit activation functions
    6.2.4 Weight-space symmetries
  6.3 Deep Networks
    6.3.1 Hierarchical representations
    6.3.2 Distributed representations
    6.3.3 Representation learning
    6.3.4 Transfer learning
    6.3.5 Contrastive learning
    6.3.6 General network architectures
    6.3.7 Tensors
  6.4 Error Functions
    6.4.1 Regression
    6.4.2 Binary classification
    6.4.3 Multiclass classification
  6.5 Mixture Density Networks
    6.5.1 Robot kinematics example
    6.5.2 Conditional mixture distributions
    6.5.3 Gradient optimization
    6.5.4 Predictive distribution
  Exercises
7 Gradient Descent
  7.1 Error Surfaces
    7.1.1 Local quadratic approximation
  7.2 Gradient Descent Optimization
    7.2.1 Use of gradient information
    7.2.2 Batch gradient descent
    7.2.3 Stochastic gradient descent
    7.2.4 Mini-batches
    7.2.5 Parameter initialization
  7.3 Convergence
    7.3.1 Momentum
    7.3.2 Learning rate schedule
    7.3.3 RMSProp and Adam
  7.4 Normalization
    7.4.1 Data normalization
    7.4.2 Batch normalization
    7.4.3 Layer normalization
  Exercises
8 Backpropagation
  8.1 Evaluation of Gradients
    8.1.1 Single-layer networks
    8.1.2 General feed-forward networks
    8.1.3 A simple example
    8.1.4 Numerical differentiation
    8.1.5 The Jacobian matrix
    8.1.6 The Hessian matrix
  8.2 Automatic Differentiation
    8.2.1 Forward-mode automatic differentiation
    8.2.2 Reverse-mode automatic differentiation
  Exercises
9 Regularization
  9.1 Inductive Bias
    9.1.1 Inverse problems
    9.1.2 No free lunch theorem
    9.1.3 Symmetry and invariance
    9.1.4 Equivariance
  9.2 Weight Decay
    9.2.1 Consistent regularizers
    9.2.2 Generalized weight decay
  9.3 Learning Curves
    9.3.1 Early stopping
    9.3.2 Double descent
  9.4 Parameter Sharing
    9.4.1 Soft weight sharing
  9.5 Residual Connections
  9.6 Model Averaging
    9.6.1 Dropout
  Exercises
10 Convolutional Networks
  10.1 Computer Vision
    10.1.1 Image data
  10.2 Convolutional Filters
    10.2.1 Feature detectors
    10.2.2 Translation equivariance
    10.2.3 Padding
    10.2.4 Strided convolutions
    10.2.5 Multi-dimensional convolutions
    10.2.6 Pooling
    10.2.7 Multilayer convolutions
    10.2.8 Example network architectures
  10.3 Visualizing Trained CNNs
    10.3.1 Visual cortex
    10.3.2 Visualizing trained filters
    10.3.3 Saliency maps
    10.3.4 Adversarial attacks
    10.3.5 Synthetic images
  10.4 Object Detection
    10.4.1 Bounding boxes
    10.4.2 Intersection-over-union
    10.4.3 Sliding windows
    10.4.4 Detection across scales
    10.4.5 Non-max suppression
    10.4.6 Fast region CNNs
  10.5 Image Segmentation
    10.5.1 Convolutional segmentation
    10.5.2 Up-sampling
    10.5.3 Fully convolutional networks
    10.5.4 The U-net architecture
  10.6 Style Transfer
  Exercises
11 Structured Distributions
  11.1 Graphical Models
    11.1.1 Directed graphs
    11.1.2 Factorization
    11.1.3 Discrete variables
    11.1.4 Gaussian variables
    11.1.5 Binary classifier
    11.1.6 Parameters and observations
    11.1.7 Bayes' theorem
  11.2 Conditional Independence
    11.2.1 Three example graphs
    11.2.2 Explaining away
    11.2.3 D-separation
    11.2.4 Naive Bayes
    11.2.5 Generative models
    11.2.6 Markov blanket
    11.2.7 Graphs as filters
  11.3 Sequence Models
    11.3.1 Hidden variables
  Exercises
12 Transformers
  12.1 Attention
    12.1.1 Transformer processing
    12.1.2 Attention coefficients
    12.1.3 Self-attention
    12.1.4 Network parameters
    12.1.5 Scaled self-attention
    12.1.6 Multi-head attention
    12.1.7 Transformer layers
    12.1.8 Computational complexity
    12.1.9 Positional encoding
  12.2 Natural Language
    12.2.1 Word embedding
    12.2.2 Tokenization
    12.2.3 Bag of words
    12.2.4 Autoregressive models
    12.2.5 Recurrent neural networks
    12.2.6 Backpropagation through time
  12.3 Transformer Language Models
    12.3.1 Decoder transformers
    12.3.2 Sampling strategies
    12.3.3 Encoder transformers
    12.3.4 Sequence-to-sequence transformers
    12.3.5 Large language models
  12.4 Multimodal Transformers
    12.4.1 Vision transformers
    12.4.2 Generative image transformers
  …
Trade Policy: Notes for Buyers
About the products:
- ● Guaranteed authentic: This website is affiliated with China International Book Trading Corporation (中国国际图书贸易集团公司), and all books sold are guaranteed to be 100% genuine.
- ● Eco-friendly paper: Most imported books are printed on eco-friendly lightweight paper, which is slightly yellowish in colour and relatively light in weight.
- ● Deckle-edge editions: The fore edge of the pages is deliberately left rough and uneven. Such copies are usually hardcover editions and have greater collectible value.
About returns and exchanges:
- Because of the special nature of pre-ordered products, once a purchase order has been formally placed, the buyer may not cancel all or part of the order without due cause.
- Because of the special nature of imported books, if any of the following occurs, please refuse the delivery directly and have the courier return it:
- ● Damaged outer packaging / wrong item shipped / missing items / damaged book exterior / incomplete accessories (e.g., CDs)
Then please contact us by phone at 400-008-1110 on a working day.
- If any of the following is found after you have signed for the delivery, please contact customer service within 5 working days to arrange a return or exchange:
- ● Missing pages / wrong or misordered pages / misprints / loose binding
About dispatch times:
- Under normal circumstances:
- ● [In stock]: Dispatched by courier from our Beijing warehouse within 48 hours of the order being placed.
- ● [Pre-order] / [Pre-sale]: Shipped from overseas after the order is placed; expected arrival is roughly 5-8 weeks. The shop's default courier is ZTO Express; if you require SF Express, the shipping fee is payable on delivery.
- ● For customers who need an invoice, dispatch may be delayed by an additional 1-2 working days (for urgent invoice requests, please call 010-68433105/3213).
- ● If other special circumstances affect dispatch times, we will post a notice on the website as soon as possible; please watch for announcements.
About delivery times:
- Imported books are dispatched by third-party couriers after they have cleared customs and entered our warehouse, so we can only guarantee that orders are dispatched within the stated time; we cannot guarantee an exact delivery date.
- ● Major cities: usually 2-4 days
- ● Remote areas: usually 4-7 days
About phone support hours:
- Our enquiry lines 010-68433105/3213 are answered Monday to Friday, 8:30 am to 5:00 pm. We cannot take calls on weekends or public holidays; thank you for your understanding.
- At other times you can also contact us by email at customer@readgo.cn; emails are handled with priority on working days.
About couriers:
- ● Paid orders: Delivered mainly by ZTO Express and ZJS Express; to check the status of an order, please call 010-68433105/3213.