By Carlos E. Perez.
You have to wonder these days about the practice of Deep Learning. It is indeed akin to the black arts, incorporating plenty of alchemy and black magic. Knowledge of best practices, the do’s and don’ts, is spread across thousands of unverified Arxiv papers. The usual claims of state-of-the-art (SOTA) results are questionable and more likely due to cherry-picked data. A majority of papers supposedly move the field forward by a tiny incremental percentage improvement on a standard benchmark. If a researcher can’t show improvement on one benchmark, there are plenty more to choose from! Alternatively, a researcher can invent a new kind of test.
A lot of the mathematics is mostly a handwaving exercise, desperate to prove that some rational thought goes into the design process. The truth is, it’s all just a bunch of approximations all over the place. Employing an explicit function or a distribution is a shot in the dark. In many cases, it’s just best to use a neural network in place of any closed-form equation.
Brute-force practices are prevalent. In one paper, Google researchers set out to show the quickest-ever training of ImageNet. Their design took 24 minutes to train. The researchers make the following luxurious claim:
We finish the 100-epoch ImageNet training with AlexNet in 24 minutes, which is the world record. Same as Facebook’s result, we finish the 90-epoch ImageNet training with ResNet-50 in one hour. However, our hardware budget is only 1.2 million USD, which is 3.4 times lower than Facebook’s 4.1 million USD.
Another piece of research makes a claim akin to the show “Lifestyles of the Rich and Famous.” A second Google team used their vast hardware resources to perform an exploratory search for more efficient deep learning architectures. Google’s blog post “Using Machine Learning to Explore Neural Network Architecture” refers to a paper by Barret Zoph and Quoc V. Le which reads:
For the distributed training, we set the number of parameter server shards S to 20, the number of controller replicas K to 100 and the number of child replicas m to 8, which means there are 800 networks being trained on 800 GPUs concurrently at any time.
This paper is of course much better than the speed-test paper. What it does clearly show, however, is that you can discover novel and innovative architectures through brute-force computation. The researchers discovered this monstrosity of an LSTM node:
The same team has an even newer paper in which they train their search algorithm to discover new optimization methods (see: “Neural Optimizer Search with Reinforcement Learning”). Their previous research searched for new kinds of deep learning layers.
They found two curious optimizers, christened “AddSign” and “PowerSign”. This image shows the behavior of PowerSign compared to other optimization methods:
The idea is simple, grab from a collection of these mathematical artifacts:
shake them all aggressively in a bag, and presto, you have your state-of-the-art (SOTA) optimization method! (Viewer discretion: do not try this at home without professional deep learning resources.) The authors used a less lavish GPU cluster: twelve machines with 8 GPUs each. That’s 96 GPUs if you do your multiplication right. An 8-GPU machine like Nvidia’s DGX-1 is listed at $150,000. In short: cost of twelve Nvidia DGX-1s, 1.8 million dollars. Discovery of AddSign and PowerSign: priceless.
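For the curious, the discovered update rules themselves are simple to state. Below is a minimal sketch of PowerSign and AddSign as I read them from the paper: the gradient is scaled up when its sign agrees with a running average of past gradients, and scaled down when it disagrees. The function names and hyper-parameter defaults are my own illustration, not the authors’ code.

```python
import numpy as np

def powersign_step(w, g, m, lr=0.01, alpha=np.e, beta=0.9):
    """One PowerSign update (sketch of the published rule).

    w: parameters, g: current gradient, m: running gradient average.
    Scales the gradient by alpha^(+1) when its sign agrees with the
    running average and alpha^(-1) when it disagrees.
    """
    m = beta * m + (1 - beta) * g                 # update moving average
    scale = alpha ** (np.sign(g) * np.sign(m))    # agreement-based scaling
    return w - lr * scale * g, m

def addsign_step(w, g, m, lr=0.01, alpha=1.0, beta=0.9):
    """One AddSign update: additive rather than multiplicative scaling."""
    m = beta * m + (1 - beta) * g
    scale = alpha + np.sign(g) * np.sign(m)       # 2 on agreement, 0 on disagreement
    return w - lr * scale * g, m

# Minimize f(w) = w^2 with PowerSign; the gradient is 2w.
w, m = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, m = powersign_step(w, 2 * w, m, lr=0.05)
print(w)  # converges toward 0
```

The sign-agreement trick acts like a cheap, bounded momentum signal, which is presumably why such simple rules survived the search.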
Brute-force methods are, incidentally, a meta-learning technique. In fact, the entire A.I. field, in its search for learning algorithms, is basically doing meta-learning. Hyper-parameter optimization is a meta-learning technique. Randomly combining many kinds of operators is just a more sophisticated version of hyper-parameter optimization. Why stick to constants when you can use a variety of functions? If you are seeking SOTA, then the search for diversity is your ticket!
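To make the “shake a bag of operators” point concrete, here is a toy sketch of searching over randomly combined update rules. The primitive set, names, and scoring task are entirely hypothetical and far simpler than the paper’s actual RL-driven search over a large grammar; this only illustrates why operator search is hyper-parameter optimization over functions rather than constants.

```python
import math
import random

# Hypothetical primitive operands an optimizer search might combine.
OPERANDS = {
    "g":      lambda g, m: g,                                   # raw gradient
    "sign_g": lambda g, m: math.copysign(1.0, g) if g else 0.0,  # gradient sign
    "m":      lambda g, m: m,                                   # running average
}
BINARY_OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def random_update_rule(rng):
    """Sample a tiny update rule of the form op(operand1, operand2)."""
    op = rng.choice(list(BINARY_OPS))
    a, b = rng.choice(list(OPERANDS)), rng.choice(list(OPERANDS))
    def rule(g, m):
        return BINARY_OPS[op](OPERANDS[a](g, m), OPERANDS[b](g, m))
    rule.name = f"{op}({a}, {b})"
    return rule

def evaluate(rule, lr=0.05, steps=100):
    """Score a rule by the final loss when minimizing f(w) = w^2."""
    w, m = 5.0, 0.0
    for _ in range(steps):
        g = 2 * w
        m = 0.9 * m + 0.1 * g
        w -= lr * rule(g, m)
        if not math.isfinite(w):       # some random rules diverge
            return math.inf
    return w * w

rng = random.Random(0)
candidates = [random_update_rule(rng) for _ in range(20)]
best = min(candidates, key=evaluate)
print(best.name, evaluate(best))
```

Scale the candidate count to 800 GPUs training 800 child networks and you have, in caricature, the architecture and optimizer searches described above.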
This reality of using brute-force methods to achieve innovation isn’t going to go away soon. Massive hardware resources in Deep Learning are like super-colliders in High Energy Physics. Both are experimental sciences, and both kinds of machines allow us to peer deeper into how reality works. You may be surprised that I speak about virtual systems (i.e. computation) as being reality. However, it shouldn’t take a great intellectual leap to realize that our entire universe is indeed all about information processing (see: “Deep Learning is Non-equilibrium Information Dynamics”).
I have to go back to Stephen Wolfram’s “A New Kind of Science” to explain the nature of computational systems. The behavior of machines that have the property of “universality” (for example: the weather, brains, or computers) cannot be precisely predicted using mathematical shortcuts. It is not that they are entirely random and unpredictable, but rather that we can only approximate their behavior over short time horizons; the further out in time we go, the poorer the predictions become. The implication is that the designs for these systems can only be found by trying out a combinatorially large number of design configurations.
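Wolfram’s canonical example of this is the elementary cellular automaton Rule 110, which is provably universal. The sketch below (my own illustration, not Wolfram’s code) shows the point: the only way to know the automaton’s state after t steps is to run all t steps, since no closed-form shortcut exists.

```python
def rule110_step(cells):
    """One step of elementary cellular automaton Rule 110 (wrapping edges)."""
    n = len(cells)
    # Rule 110 lookup: neighborhood (left, center, right) -> next state.
    table = {
        (1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
        (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
    }
    return [table[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
            for i in range(n)]

# Start from a single live cell; to learn the state at step 20 we must
# simulate every intermediate step -- this is computational irreducibility.
state = [0] * 31
state[15] = 1
for _ in range(20):
    state = rule110_step(state)
print(sum(state))  # number of live cells after 20 steps
```

Deep learning architectures, on this view, sit in the same class: their long-run training behavior has to be run, not derived, which is exactly why brute-force search earns its keep.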
Is it entirely brute force, or are there some principles of alchemy at play here? My hunch is that it boils down to having good curricula. Just as teachers have to work with difficult and unruly students, we will have to discover the teaching methods that move us forward. Brute-force methods are an intrinsic part of this field. However, we should always strive to seek out research that gives us better intuition (see: “ICLR 2017” and “Two Phases of Gradient Descent”).