Bayesian optimization (BayesOpt) is a powerful tool widely used for global optimization tasks, such as hyperparameter tuning, protein engineering, synthetic chemistry, robot learning, and even baking cookies. BayesOpt is a great strategy for these problems because they all involve optimizing black-box functions that are expensive to evaluate. A black-box function's underlying mapping from inputs (configurations of the thing we want to optimize) to outputs (a measure of performance) is unknown. However, we can attempt to understand its internal workings by evaluating the function for different combinations of inputs. Because each evaluation can be computationally expensive, we need to find the best inputs in as few evaluations as possible. BayesOpt works by repeatedly constructing a surrogate model of the black-box function and strategically evaluating the function at the most promising or informative input location, given the information observed so far.
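To make the loop concrete, here is a minimal sketch of this construct-a-surrogate-then-evaluate cycle. The toy objective `expensive_black_box`, the Gaussian-process surrogate from scikit-learn, and the upper-confidence-bound acquisition rule are illustrative stand-ins, not the setup used in this work.

```python
# Minimal Bayesian-optimization loop (illustrative sketch only; objective,
# surrogate, and acquisition rule are stand-ins, not the paper's setup).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expensive_black_box(x):
    # Stand-in for an expensive evaluation (e.g., training a model with config x).
    return -(x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)   # search space
X = list(np.random.uniform(0.0, 1.0, size=(3, 1)))        # initial random designs
y = [expensive_black_box(x[0]) for x in X]

for _ in range(10):
    # Construct a surrogate model of the black-box function from data so far.
    surrogate = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    surrogate.fit(np.array(X), np.array(y))
    mean, std = surrogate.predict(candidates, return_std=True)
    ucb = mean + 2.0 * std                 # upper-confidence-bound acquisition
    x_next = candidates[np.argmax(ucb)]    # most promising/informative input
    X.append(x_next)
    y.append(expensive_black_box(x_next[0]))

print("best input found:", X[int(np.argmax(y))], "value:", max(y))
```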
Gaussian processes are popular surrogate models for BayesOpt because they are easy to use, can be updated with new data, and provide a confidence level about each of their predictions. The Gaussian process model constructs a probability distribution over possible functions. This distribution is specified by a mean function (what these possible functions look like on average) and a kernel function (how much these functions can vary across inputs). The performance of BayesOpt depends on whether the confidence intervals predicted by the surrogate model contain the black-box function. Traditionally, experts use domain knowledge to quantitatively define the mean and kernel parameters (e.g., the range or smoothness of the black-box function) to express their expectations about what the black-box function should look like. However, for many real-world applications like hyperparameter tuning, it is very difficult to understand the landscapes of the tuning objectives. Even for experts with relevant experience, it can be challenging to narrow down appropriate model parameters.
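As a small illustration of how a hand-specified mean and kernel determine those confidence levels, the sketch below conditions a zero-mean Gaussian process with an RBF kernel (both arbitrary example choices, with made-up lengthscale and observations) on a few data points and reports a 95% confidence band at new inputs.

```python
# Sketch of how a hand-specified mean and kernel define a Gaussian process
# posterior (zero mean and an RBF kernel are illustrative choices).
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    # Kernel: how strongly function values at two inputs co-vary.
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    # Condition the GP prior on observed data to get mean and confidence levels.
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_star = rbf_kernel(x_test, x_train)
    K_ss = rbf_kernel(x_test, x_test)
    mean = K_star @ np.linalg.solve(K, y_train)           # prior mean is zero here
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

x_obs = np.array([0.1, 0.4, 0.8])
y_obs = np.sin(6 * x_obs)                                  # pretend black-box observations
x_new = np.linspace(0, 1, 5)
mu, sd = gp_posterior(x_obs, y_obs, x_new)
print("95% band:", np.round(mu - 1.96 * sd, 2), np.round(mu + 1.96 * sd, 2))
```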
In “Pre-trained Gaussian processes for Bayesian optimization”, we consider the challenge of hyperparameter optimization for deep neural networks using BayesOpt. We propose Hyper BayesOpt (HyperBO), a highly customizable interface with an algorithm that removes the need for quantifying model parameters for Gaussian processes in BayesOpt. For new optimization problems, experts can simply select previous tasks that are relevant to the current task they are trying to solve. HyperBO pre-trains a Gaussian process model on data from those selected tasks, and automatically defines the model parameters before running BayesOpt. HyperBO enjoys theoretical guarantees on the alignment between the pre-trained model and the ground truth, as well as on the quality of its solutions for black-box optimization. We share strong results of HyperBO both on our new tuning benchmark for near–state-of-the-art deep learning models and on classic multi-task black-box optimization benchmarks (HPO-B). We also demonstrate that HyperBO is robust to the selection of relevant tasks and has low requirements on the amount of data and tasks needed for pre-training.
Loss functions for pre-training
We pre-train a Gaussian process model by minimizing the Kullback–Leibler divergence (a commonly used divergence measure) between the ground truth model and the pre-trained model. Since the ground truth model is unknown, we cannot directly compute this loss function. To address this, we introduce two data-driven approximations: (1) Empirical Kullback–Leibler divergence (EKL), the divergence between an empirical estimate of the ground truth model and the pre-trained model; and (2) Negative log likelihood (NLL), the sum of negative log likelihoods of the pre-trained model over all training functions. The computational cost of EKL or NLL scales linearly with the number of training functions. Moreover, stochastic gradient–based methods like Adam can be employed to optimize the loss functions, which further lowers the cost of computation. In well-controlled environments, optimizing EKL and NLL leads to the same result, but their optimization landscapes can be very different. For example, in the simplest case where the function has only one possible input, its Gaussian process model becomes a Gaussian distribution, described by the mean (m) and variance (s). Hence the loss function has only these two parameters, m and s, and we can visualize EKL and NLL as functions of them.
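The sketch below writes out both losses for exactly this single-input case; the observed values, one per training function, are made up for illustration.

```python
# Sketch of the two pre-training losses in the single-input case described
# above, where the GP reduces to a Gaussian with mean m and variance s.
# Each "training function" contributes one observed value at that input (made up here).
import numpy as np

y = np.array([0.9, 1.1, 1.4, 0.7, 1.0])    # one observation per training function

def nll(m, s):
    # Sum of negative log likelihoods of the model N(m, s) over all training functions.
    return np.sum(0.5 * np.log(2 * np.pi * s) + (y - m) ** 2 / (2 * s))

def ekl(m, s):
    # KL divergence from an empirical Gaussian estimate of the ground truth,
    # N(mean(y), var(y)), to the model N(m, s).
    m_hat, s_hat = y.mean(), y.var()
    return 0.5 * (np.log(s / s_hat) + (s_hat + (m_hat - m) ** 2) / s - 1.0)

# Both losses are minimized at the empirical mean and variance of the data:
print(nll(y.mean(), y.var()), ekl(y.mean(), y.var()))
print(nll(0.0, 1.0), ekl(0.0, 1.0))         # a poorly specified model scores worse
```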
Pre-training improves Bayesian optimization
In the BayesOpt algorithm, decisions on where to evaluate the black-box function are made iteratively. The decision criteria are based on the confidence levels provided by the Gaussian process, which are updated in each iteration by conditioning on the data points previously acquired by BayesOpt. Intuitively, the updated confidence levels should be just right: neither overly confident nor too uncertain, since in either of these two cases BayesOpt cannot make the decisions that would match what an expert would do.
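The following sketch (again with an arbitrary test function and a fixed, hand-specified RBF kernel, not the paper's model) shows this iterative conditioning: as BayesOpt acquires more data points, the Gaussian process's predictive uncertainty over the search space shrinks.

```python
# Sketch of how the GP's confidence levels are updated each iteration by
# conditioning on the data acquired so far (fixed, hand-specified kernel;
# the test function is a made-up stand-in).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

f = lambda x: np.sin(5 * x)                      # stand-in black-box function
grid = np.linspace(0, 1, 100).reshape(-1, 1)
acquired_x, acquired_y = [], []

for x_next in [0.2, 0.8, 0.5, 0.35]:             # points chosen over iterations
    acquired_x.append([x_next])
    acquired_y.append(f(x_next))
    # optimizer=None keeps the kernel fixed, so this is pure conditioning.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), optimizer=None)
    gp.fit(np.array(acquired_x), np.array(acquired_y))
    mean, std = gp.predict(grid, return_std=True)
    # With a fixed kernel, uncertainty can only shrink as data accumulate.
    print(f"after {len(acquired_x)} observations, mean predictive std = {std.mean():.3f}")
```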
In HyperBO, we replace the hand-specified model in traditional BayesOpt with the pre-trained Gaussian process. Under mild conditions and with enough training functions, we can mathematically verify good theoretical properties of HyperBO: (1) Alignment: the pre-trained Gaussian process is guaranteed to be close to the ground truth model when both are conditioned on observed data points; (2) Optimality: HyperBO is guaranteed to find a near-optimal solution to the black-box optimization problem for any functions distributed according to the unknown ground truth Gaussian process.
We visualize the Gaussian process (areas shaded in purple are 95% and 99% confidence intervals) conditioned on observations (black dots) from an unknown test function (orange line). Compared to traditional BayesOpt without pre-training, the predicted confidence levels in HyperBO capture the unknown test function much better, which is an important prerequisite for Bayesian optimization.
Empirically, to define the structure of pre-trained Gaussian processes, we choose to use very expressive mean functions modeled by neural networks, and apply well-defined kernel functions on inputs encoded into a higher-dimensional space with neural networks.
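A structural sketch of such a model is shown below. The layer sizes, activation, and RBF kernel are illustrative placeholders rather than the exact architecture used in the paper, and in practice these parameters would be fit by minimizing EKL or NLL rather than drawn at random.

```python
# Structural sketch of a pre-trained GP of this form: a neural-network mean
# function plus a standard kernel applied to neural-network-encoded inputs.
# Sizes and the kernel choice are illustrative, not the paper's exact setup.
import numpy as np

rng = np.random.default_rng(0)
D_in, D_hidden, D_feat = 4, 32, 8                  # input dim, hidden width, encoded dim

params = {
    "mean_w1": rng.normal(size=(D_in, D_hidden)), "mean_w2": rng.normal(size=(D_hidden, 1)),
    "enc_w1": rng.normal(size=(D_in, D_hidden)),  "enc_w2": rng.normal(size=(D_hidden, D_feat)),
    "log_lengthscale": 0.0,
}

def mean_fn(x, p):
    # Expressive mean function: a small MLP mapping inputs to a scalar mean.
    return np.tanh(x @ p["mean_w1"]) @ p["mean_w2"]

def encode(x, p):
    # Neural-network encoder mapping inputs to a higher-dimensional feature space.
    return np.tanh(x @ p["enc_w1"]) @ p["enc_w2"]

def kernel_fn(x1, x2, p):
    # Well-defined (RBF) kernel applied in the encoded feature space.
    z1, z2 = encode(x1, p), encode(x2, p)
    sq_dist = ((z1[:, None, :] - z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / np.exp(2 * p["log_lengthscale"]))

x = rng.uniform(size=(5, D_in))                    # five hyperparameter configurations
print(mean_fn(x, params).shape, kernel_fn(x, x, params).shape)   # (5, 1) and (5, 5)
```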
To evaluate HyperBO on challenging and realistic black-box optimization problems, we created the PD1 benchmark, a dataset for multi-task hyperparameter optimization of deep neural networks. PD1 was developed by training tens of thousands of configurations of near–state-of-the-art deep learning models on popular image and text datasets, as well as a protein sequence dataset. PD1 contains approximately 50,000 hyperparameter evaluations from 24 different tasks (e.g., tuning Wide ResNet on CIFAR100), representing roughly 12,000 machine-days of computation.
We demonstrate that after pre-training for only a few hours on a single CPU, HyperBO can significantly outperform BayesOpt with carefully hand-tuned models on unseen, challenging tasks, including tuning ResNet50 on ImageNet. Even with only ~100 data points per training function, HyperBO performs competitively against baselines.
Tuning validation error rates of ResNet50 on ImageNet and Wide ResNet (WRN) on the Street View House Numbers (SVHN) dataset and CIFAR100. By pre-training on only ~20 tasks with ~100 data points per task, HyperBO can significantly outperform traditional BayesOpt (with a carefully hand-tuned Gaussian process) on previously unseen tasks.
Conclusion and future work
HyperBO is a framework that pre-trains a Gaussian process and subsequently performs Bayesian optimization with the pre-trained model. With HyperBO, we no longer have to hand-specify the exact quantitative parameters of a Gaussian process. Instead, we only need to identify related tasks and their corresponding data for pre-training. This makes BayesOpt both more accessible and more effective. An important future direction is to enable HyperBO to generalize over heterogeneous search spaces, for which we are developing new algorithms by pre-training a hierarchical probabilistic model.
Acknowledgements
The following members of the Google Research Brain Team conducted this research: Zi Wang, George E. Dahl, Kevin Swersky, Chansoo Lee, Zachary Nado, Justin Gilmer, Jasper Snoek, and Zoubin Ghahramani. We'd like to thank Zelda Mariet and Matthias Feurer for help and consultation on transfer learning baselines. We'd also like to thank Rif A. Saurous for constructive feedback, and Rodolphe Jenatton and David Belanger for feedback on earlier versions of the manuscript. In addition, we thank Sharat Chikkerur, Ben Adlam, Balaji Lakshminarayanan, Fei Sha and Eytan Bakshy for comments, and Setareh Ariafar and Alexander Terenin for conversations on animation. Finally, we thank Tom Small for designing the animation for this post.