Book Description
This book offers a comprehensive introduction to the central ideas that underpin deep learning. It is intended both for newcomers to machine learning and for those already experienced in the field. Covering key concepts relating to contemporary architectures and techniques, this essential book equips readers with a robust foundation for potential future specialization. The field of deep learning is undergoing rapid evolution, and therefore this book focuses on ideas that are likely to stand the test of time.

The book is organized into numerous bite-sized chapters, each exploring a distinct topic, and the narrative follows a linear progression, with each chapter building upon content from its predecessors. This structure is well suited to teaching a two-semester undergraduate or postgraduate machine learning course, while remaining equally relevant to those engaged in active research or in self-study.

A full understanding of machine learning requires some mathematical background, and so the book includes a self-contained introduction to probability theory. However, the focus of the book is on conveying a clear understanding of ideas, with emphasis on the real-world practical value of techniques rather than on abstract theory. Complex concepts are therefore presented from multiple complementary perspectives, including textual descriptions, diagrams, mathematical formulae, and pseudo-code.

Chris Bishop is a Technical Fellow at Microsoft and the Director of Microsoft Research AI4Science. He is a Fellow of Darwin College Cambridge, a Fellow of the Royal Academy of Engineering, and a Fellow of the Royal Society.

Hugh Bishop is an Applied Scientist at Wayve, a deep learning autonomous driving company in London, where he designs and trains deep neural networks. Before joining Wayve, he completed his MPhil in Machine Learning and Machine Intelligence in the Engineering Department at Cambridge University.

"Chris Bishop wrote a terrific textbook on neural networks in 1995 and has a deep knowledge of the field and its core ideas. His many years of experience in explaining neural networks have made him extremely skillful at presenting complicated ideas in the simplest possible way, and it is a delight to see these skills applied to the revolutionary new developments in the field." -- Geoffrey Hinton

"This excellent and very educational book will bring the reader up to date with the main concepts and advances in deep learning with a solid anchoring in probability. These concepts are powering current industrial AI systems and are likely to form the basis of further advances towards artificial general intelligence." -- Yoshua Bengio

"With the recent explosion of deep learning and AI as a research topic, and the quickly growing importance of AI applications, a modern textbook on the topic was badly needed. The 'New Bishop' masterfully fills the gap, covering fundamental topics in linear algebra, probability theory, and function optimisation, learning algorithms for supervised and unsupervised learning, modern deep learning architecture families, as well as how to apply all of this to various application areas. The book is poised to have a similarly large impact as Chris Bishop's 1995 book on neural networks." -- Yann LeCun
Table of Contents

Preface
1 The Deep Learning Revolution
  1.1 The Impact of Deep Learning
    1.1.1 Medical diagnosis
    1.1.2 Protein structure
    1.1.3 Image synthesis
    1.1.4 Large language models
  1.2 A Tutorial Example
    1.2.1 Synthetic data
    1.2.2 Linear models
    1.2.3 Error function
    1.2.4 Model complexity
    1.2.5 Regularization
    1.2.6 Model selection
  1.3 A Brief History of Machine Learning
    1.3.1 Single-layer networks
    1.3.2 Backpropagation
    1.3.3 Deep networks
2 Probabilities
  2.1 The Rules of Probability
    2.1.1 A medical screening example
    2.1.2 The sum and product rules
    2.1.3 Bayes' theorem
    2.1.4 Medical screening revisited
    2.1.5 Prior and posterior probabilities
    2.1.6 Independent variables
  2.2 Probability Densities
    2.2.1 Example distributions
    2.2.2 Expectations and covariances
  2.3 The Gaussian Distribution
    2.3.1 Mean and variance
    2.3.2 Likelihood function
    2.3.3 Bias of maximum likelihood
    2.3.4 Linear regression
  2.4 Transformation of Densities
    2.4.1 Multivariate distributions
  2.5 Information Theory
    2.5.1 Entropy
    2.5.2 Physics perspective
    2.5.3 Differential entropy
    2.5.4 Maximum entropy
    2.5.5 Kullback-Leibler divergence
    2.5.6 Conditional entropy
    2.5.7 Mutual information
  2.6 Bayesian Probabilities
    2.6.1 Model parameters
    2.6.2 Regularization
    2.6.3 Bayesian machine learning
  Exercises
3 Standard Distributions
  3.1 Discrete Variables
    3.1.1 Bernoulli distribution
    3.1.2 Binomial distribution
    3.1.3 Multinomial distribution
  3.2 The Multivariate Gaussian
    3.2.1 Geometry of the Gaussian
    3.2.2 Moments
    3.2.3 Limitations
    3.2.4 Conditional distribution
    3.2.5 Marginal distribution
    3.2.6 Bayes' theorem
    3.2.7 Maximum likelihood
    3.2.8 Sequential estimation
    3.2.9 Mixtures of Gaussians
  3.3 Periodic Variables
    3.3.1 Von Mises distribution
  3.4 The Exponential Family
    3.4.1 Sufficient statistics
  3.5 Nonparametric Methods
    3.5.1 Histograms
    3.5.2 Kernel densities
    3.5.3 Nearest-neighbours
  Exercises
4 Single-layer Networks: Regression
  4.1 Linear Regression
    4.1.1 Basis functions
    4.1.2 Likelihood function
    4.1.3 Maximum likelihood
    4.1.4 Geometry of least squares
    4.1.5 Sequential learning
    4.1.6 Regularized least squares
    4.1.7 Multiple outputs
  4.2 Decision Theory
  4.3 The Bias-Variance Trade-off
  Exercises
5 Single-layer Networks: Classification
  5.1 Discriminant Functions
    5.1.1 Two classes
    5.1.2 Multiple classes
    5.1.3 1-of-K coding
    5.1.4 Least squares for classification
  5.2 Decision Theory
    5.2.1 Misclassification rate
    5.2.2 Expected loss
    5.2.3 The reject option
    5.2.4 Inference and decision
    5.2.5 Classifier accuracy
    5.2.6 ROC curve
  5.3 Generative Classifiers
    5.3.1 Continuous inputs
    5.3.2 Maximum likelihood solution
    5.3.3 Discrete features
    5.3.4 Exponential family
  5.4 Discriminative Classifiers
    5.4.1 Activation functions
    5.4.2 Fixed basis functions
    5.4.3 Logistic regression
    5.4.4 Multi-class logistic regression
    5.4.5 Probit regression
    5.4.6 Canonical link functions
  Exercises
6 Deep Neural Networks
  6.1 Limitations of Fixed Basis Functions
    6.1.1 The curse of dimensionality
    6.1.2 High-dimensional spaces
    6.1.3 Data manifolds
    6.1.4 Data-dependent basis functions
  6.2 Multilayer Networks
    6.2.1 Parameter matrices
    6.2.2 Universal approximation
    6.2.3 Hidden unit activation functions
    6.2.4 Weight-space symmetries
  6.3 Deep Networks
    6.3.1 Hierarchical representations
    6.3.2 Distributed representations
    6.3.3 Representation learning
    6.3.4 Transfer learning
    6.3.5 Contrastive learning
    6.3.6 General network architectures
    6.3.7 Tensors
  6.4 Error Functions
    6.4.1 Regression
    6.4.2 Binary classification
    6.4.3 Multiclass classification
  6.5 Mixture Density Networks
    6.5.1 Robot kinematics example
    6.5.2 Conditional mixture distributions
    6.5.3 Gradient optimization
    6.5.4 Predictive distribution
  Exercises
7 Gradient Descent
  7.1 Error Surfaces
    7.1.1 Local quadratic approximation
  7.2 Gradient Descent Optimization
    7.2.1 Use of gradient information
    7.2.2 Batch gradient descent
    7.2.3 Stochastic gradient descent
    7.2.4 Mini-batches
    7.2.5 Parameter initialization
  7.3 Convergence
    7.3.1 Momentum
    7.3.2 Learning rate schedule
    7.3.3 RMSProp and Adam
  7.4 Normalization
    7.4.1 Data normalization
    7.4.2 Batch normalization
    7.4.3 Layer normalization
  Exercises
8 Backpropagation
  8.1 Evaluation of Gradients
    8.1.1 Single-layer networks
    8.1.2 General feed-forward networks
    8.1.3 A simple example
    8.1.4 Numerical differentiation
    8.1.5 The Jacobian matrix
    8.1.6 The Hessian matrix
  8.2 Automatic Differentiation
    8.2.1 Forward-mode automatic differentiation
    8.2.2 Reverse-mode automatic differentiation
  Exercises
9 Regularization
  9.1 Inductive Bias
    9.1.1 Inverse problems
    9.1.2 No free lunch theorem
    9.1.3 Symmetry and invariance
    9.1.4 Equivariance
  9.2 Weight Decay
    9.2.1 Consistent regularizers
    9.2.2 Generalized weight decay
  9.3 Learning Curves
    9.3.1 Early stopping
    9.3.2 Double descent
  9.4 Parameter Sharing
    9.4.1 Soft weight sharing
  9.5 Residual Connections
  9.6 Model Averaging
    9.6.1 Dropout
  Exercises
10 Convolutional Networks
  10.1 Computer Vision
    10.1.1 Image data
  10.2 Convolutional Filters
    10.2.1 Feature detectors
    10.2.2 Translation equivariance
    10.2.3 Padding
    10.2.4 Strided convolutions
    10.2.5 Multi-dimensional convolutions
    10.2.6 Pooling
    10.2.7 Multilayer convolutions
    10.2.8 Example network architectures
  10.3 Visualizing Trained CNNs
    10.3.1 Visual cortex
    10.3.2 Visualizing trained filters
    10.3.3 Saliency maps
    10.3.4 Adversarial attacks
    10.3.5 Synthetic images
  10.4 Object Detection
    10.4.1 Bounding boxes
    10.4.2 Intersection-over-union
    10.4.3 Sliding windows
    10.4.4 Detection across scales
    10.4.5 Non-max suppression
    10.4.6 Fast region CNNs
  10.5 Image Segmentation
    10.5.1 Convolutional segmentation
    10.5.2 Up-sampling
    10.5.3 Fully convolutional networks
    10.5.4 The U-net architecture
  10.6 Style Transfer
  Exercises
11 Structured Distributions
  11.1 Graphical Models
    11.1.1 Directed graphs
    11.1.2 Factorization
    11.1.3 Discrete variables
    11.1.4 Gaussian variables
    11.1.5 Binary classifier
    11.1.6 Parameters and observations
    11.1.7 Bayes' theorem
  11.2 Conditional Independence
    11.2.1 Three example graphs
    11.2.2 Explaining away
    11.2.3 D-separation
    11.2.4 Naive Bayes
    11.2.5 Generative models
    11.2.6 Markov blanket
    11.2.7 Graphs as filters
  11.3 Sequence Models
    11.3.1 Hidden variables
  Exercises
12 Transformers
  12.1 Attention
    12.1.1 Transformer processing
    12.1.2 Attention coefficients
    12.1.3 Self-attention
    12.1.4 Network parameters
    12.1.5 Scaled self-attention
    12.1.6 Multi-head attention
    12.1.7 Transformer layers
    12.1.8 Computational complexity
    12.1.9 Positional encoding
  12.2 Natural Language
    12.2.1 Word embedding
    12.2.2 Tokenization
    12.2.3 Bag of words
    12.2.4 Autoregressive models
    12.2.5 Recurrent neural networks
    12.2.6 Backpropagation through time
  12.3 Transformer Language Models
    12.3.1 Decoder transformers
    12.3.2 Sampling strategies
    12.3.3 Encoder transformers
    12.3.4 Sequence-to-sequence transformers
    12.3.5 Large language models
  12.4 Multimodal Transformers
    12.4.1 Vision transformers
    12.4.2 Generative image transformers
  …
Trade Policy: Notes for Buyers
About the products:
- ● Guaranteed authentic: This website is affiliated with China International Book Trading Corporation (中国国际图书贸易集团公司), and all books sold are guaranteed to be 100% genuine.
- ● Eco-friendly paper: Most imported books are printed on eco-friendly lightweight paper, which is slightly yellowish in colour and relatively light in weight.
- ● Deckle-edge editions: The fore edge of the pages is deliberately left rough and uneven. Such copies are usually hardcover editions and have greater collectible value.
About returns and exchanges:
- Because of the special nature of pre-ordered products, once a purchase order has been formally placed, the buyer may not cancel all or part of the order without due cause.
- Because of the special nature of imported books, if any of the following occurs, please refuse the delivery directly and have the courier return it:
- ● Damaged outer packaging / wrong item shipped / missing items / damaged book exterior / incomplete accessories (e.g., CDs)
Then please contact us by phone at 400-008-1110 on a working day.
- If any of the following is found after you have signed for the delivery, please contact customer service within 5 working days to arrange a return or exchange:
- ● Missing pages / wrong or misordered pages / misprints / loose binding
About dispatch times:
- Under normal circumstances:
- ● [In stock]: Dispatched by courier from our Beijing warehouse within 48 hours of the order being placed.
- ● [Pre-order] / [Pre-sale]: Shipped from overseas after the order is placed; expected arrival is roughly 5-8 weeks. The shop's default courier is ZTO Express; if you require SF Express, the shipping fee is payable on delivery.
- ● For customers who need an invoice, dispatch may be delayed by an additional 1-2 working days (for urgent invoice requests, please call 010-68433105/3213).
- ● If other special circumstances affect dispatch times, we will post a notice on the website as soon as possible; please watch for announcements.
About delivery times:
- Imported books are dispatched by third-party couriers after they have cleared customs and entered our warehouse, so we can only guarantee that orders are dispatched within the stated time; we cannot guarantee an exact delivery date.
- ● Major cities: usually 2-4 days
- ● Remote areas: usually 4-7 days
About phone support hours:
- Our enquiry lines 010-68433105/3213 are answered Monday to Friday, 8:30 am to 5:00 pm. We cannot take calls on weekends or public holidays; thank you for your understanding.
- At other times you can also contact us by email at customer@readgo.cn; emails are handled with priority on working days.
About couriers:
- ● Paid orders: Delivered mainly by ZTO Express and ZJS Express; to check the status of an order, please call 010-68433105/3213.