My Fellow Nerds,

These last few weeks of July have been difficult, to say the least. I managed to contract a rare and very painful form of tonsillitis. How I managed to contract such a ‘limited edition virus’, I have no clue. A high fever, coupled with what felt like unrelenting throat pain, kept me confined to my bed for a good two weeks. I am happy to report that I am on the other side of it now and healing. But the one good thing that did come out of these tumultuous weeks was a reinvigoration of my love for deep learning.

During the long hours that I stayed in bed, I had little to do except suffer through the pain and listen to my favourite tech podcast, Machine Learning Street Talk. I relistened to many episodes that I had already heard, along with a couple of new ones. It got me thinking again about deep learning from a fresh perspective. The angle that I found most interesting was how many scientists said they found the inner workings of deep learning to be deeply mysterious and its outcomes to be nothing short of magical. I can completely relate to this feeling. Deep learning can truly feel that way sometimes.

As I kept pondering on the various aspects of deep learning that make this stochastic process seem magical, an article was born in my head. What you are reading is the manifestation of it.

Time and again, deep learning has defied expectations, producing results that seemed impossible just a decade ago. When I first encountered deep learning, I was struck by its complexity. Here was a technology built on the simple premise of mimicking the human brain, yet it exhibited behaviours and capabilities far beyond what its rudimentary biological inspiration might suggest.

Neural networks consist of layers of interconnected nodes, each performing basic mathematical operations. However, when scaled to millions or even billions of parameters, these networks begin to exhibit emergent properties and abilities that were not explicitly programmed but rather developed organically through training.

And yet, for all its successes, deep learning still remains a black box. The optimisation processes that guide these models to convergence operate in high-dimensional, non-convex spaces that should, by all rights, be fraught with local minima and optimisation challenges. Even so, through techniques like *stochastic gradient descent* and *adaptive learning rates,* these models consistently find good solutions, often outperforming human-engineered algorithms.

So I wrote this article, in which I list 5 reasons why deep learning often feels like magic to the scientists who design these models and the engineers who build them. At the end, I share some short notes on why we still grapple with deciphering the fundamental concepts of deep learning, even though the field has advanced so much. Let’s get started.

# 1. Complexity & Non-Linearity

At their core, neural networks are composed of layers of nodes, or neurons, each connected to others in subsequent layers. Each neuron performs a basic mathematical operation, taking weighted inputs, applying an activation function, and passing the result to the next layer.
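To make the "weighted inputs, activation, next layer" idea concrete, here is a minimal sketch in plain Python. The function names and the toy weights are my own, chosen purely for illustration; real frameworks do the same thing with matrix operations.

```python
import math

def relu(z):
    """Rectified Linear Unit: passes positive values, zeroes out negatives."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, followed by an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def dense_layer(inputs, weight_rows, biases, activation=relu):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

# Hypothetical toy values, not taken from any trained model.
x = [1.0, 2.0]
hidden = dense_layer(x, [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0])
output = dense_layer(hidden, [[1.0, 2.0]], [0.5], activation=math.tanh)
```

Here `hidden` is the output of a two-neuron hidden layer, and `output` is what a single tanh output neuron makes of it. Everything a deep network computes at inference time is this pattern repeated at scale.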

The architecture of a neural network typically includes an *input layer,* *multiple hidden layers,* and an *output layer.* The term "deep" in deep learning refers to the number of hidden layers. While early neural networks might have had just one or two hidden layers, modern deep learning models can have dozens or even hundreds. This depth allows the network to model extremely complex functions.

Linear models are limited in their capacity to capture complex relationships because they can only represent data through straight-line functions. Stacking more linear layers does not help, since a composition of linear functions is itself linear. Non-linear activation functions, such as the *Rectified Linear Unit (ReLU),* *sigmoid,* and *tanh,* bring the necessary complexity to the fore. When applied after each neuron's weighted sum, these functions enable the network to model *non-linear relationships,* so that stacking multiple layers captures ever more complex patterns.
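A quick way to convince yourself that the non-linearity matters is to stack two linear layers and watch them collapse into one. The sketch below uses one-dimensional "layers" with weights I picked arbitrarily; it is an illustration, not a recipe.

```python
def linear(x, w, b):
    """A one-dimensional linear layer."""
    return w * x + b

def relu(z):
    return max(0.0, z)

def two_linear_layers(x):
    """Two stacked linear layers with no activation in between."""
    return linear(linear(x, 2.0, 1.0), -3.0, 0.5)

def collapsed(x):
    """The single linear layer they are equivalent to: w = -3 * 2, b = -3 * 1 + 0.5."""
    return linear(x, -6.0, -2.5)

def with_relu(x):
    """The same two layers, but with a ReLU between them."""
    return linear(relu(linear(x, 2.0, 1.0)), -3.0, 0.5)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
same = all(abs(two_linear_layers(x) - collapsed(x)) < 1e-12 for x in xs)
kinked = [with_relu(x) for x in xs]
```

`same` comes out `True`: without an activation, depth buys nothing. With the ReLU in place, `kinked` stays flat for the negative inputs and then falls away, a bend that no single straight line can reproduce.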

A neural network trained to recognize objects in images will have its initial layers detect simple features like edges and textures. As data passes through subsequent layers, the network combines these simple features into more complex shapes and patterns, eventually recognising entire objects. This hierarchical feature learning is a hallmark of deep learning and is made possible by the non-linear transformations at each layer.

Training a deep neural network involves optimising a high-dimensional, non-convex loss function. Think of it as finding the lowest point in a landscape with many hills and valleys, where each dimension represents a weight in the network. The loss function’s landscape is riddled with local minima, points that are lower than their immediate surroundings but not the lowest overall. This complexity makes finding the global minimum, or even a sufficiently good local minimum, a daunting task.

But the surprising success of deep learning in overcoming these challenges is due in large part to advancements in optimisation techniques. *Stochastic Gradient Descent (SGD)* and its variants, such as *Adam* and *RMSprop,* have become the go-to methods for training deep networks. These algorithms iteratively adjust the weights of the network in small steps, guided by the gradient of the loss function. The stochastic nature of SGD, which uses random subsets of data for each update, helps the optimiser escape local minima and explore the loss landscape more effectively.
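The "small steps on random subsets" idea can be sketched on a problem far simpler than a neural network. Below I fit a straight line with mini-batch SGD in plain Python. The function name, learning rate, batch size and data are all assumptions of mine for the sake of the example, and a toy convex problem like this one says nothing about escaping local minima; it only shows the update rule.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y = w * x + b by mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a fresh random order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error on this mini-batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w  # small step against the gradient
            b -= lr * grad_b
    return w, b

# Toy data generated from y = 3x + 2 (illustrative, noise-free).
xs = [i / 10 for i in range(-10, 11)]
ys = [3.0 * x + 2.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
```

Each update only ever sees four points, yet `w` and `b` end up very close to 3 and 2. Adam and RMSprop keep the same loop and change how the step size is chosen per parameter.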

Poorly chosen initial weights can lead to vanishing or exploding gradients, where the gradients used to update weights become too small or too large, impeding learning. Techniques like *Xavier* and *He* initialisation have been developed to address this issue, setting initial weights in a way that maintains stable gradients throughout training.
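Here is a rough experiment you can run to see why the initial scale matters. It pushes one random input through a stack of ReLU layers and measures how large the signal is at the end. The two formulas are the standard Gaussian forms of Xavier and He initialisation; the helper `final_activation_scale`, the width and the depth are my own illustrative choices.

```python
import math
import random

def he_std(fan_in):
    """He initialisation: std = sqrt(2 / fan_in), suited to ReLU layers."""
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in, fan_out):
    """Xavier (Glorot) initialisation: std = sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def final_activation_scale(std, width=64, depth=20, seed=0):
    """Push one random input through `depth` ReLU layers; return the RMS of the output."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Every layer draws a fresh width x width Gaussian weight matrix.
        h = [max(0.0, sum(rng.gauss(0.0, std) * x for x in h)) for _ in range(width)]
    return math.sqrt(sum(x * x for x in h) / width)

tiny = final_activation_scale(std=0.01)       # naive small weights
he = final_activation_scale(std=he_std(64))   # scale matched to the layer width
```

With the naive scale the signal shrinks by a large factor at every layer, so `tiny` is vanishingly small after 20 layers, while `he` stays within a sensible range. Gradients flowing backwards are affected by the weight scale in a similar way.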

The introduction of *regularisation* and *normalisation techniques,* such as *dropout* and *batch normalisation,* has also contributed to the success of deep learning. *Dropout* randomly sets a fraction of the neuron activations to zero during training, preventing the network from becoming too reliant on any single neuron. *Batch normalisation* normalises the inputs of each layer across a mini-batch, which was originally proposed as a way to mitigate internal covariate shift, and accelerates training.
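Both ideas fit in a few lines. The sketch below is a simplified version of each: "inverted" dropout (the common variant that rescales the surviving activations) and batch normalisation for a single feature, leaving out the running statistics that real implementations keep for inference. The function signatures are my own.

```python
import math
import random

def dropout(activations, p=0.5, rng=random, training=True):
    """Inverted dropout: zero each activation with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift it."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

rng = random.Random(0)
dropped = dropout([1.0] * 1000, p=0.5, rng=rng)
normed = batch_norm([2.0, 4.0, 6.0, 8.0])
```

Roughly half of `dropped` is zero and the rest is doubled, so the expected activation stays the same. `normed` has (almost exactly) zero mean and unit variance, whatever the scale of the original batch was.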

The development of advanced optimisation techniques, weight initialisation strategies, and regularisation methods has enabled these models to achieve remarkable performance, defying the expectations of many in the field.

# 2. Emergent Properties in Deep Neural Networks

Okay, we are getting deep into the magic bit now, i.e., the emergence of complex properties and capabilities that were not explicitly programmed into the network. These emergent properties arise naturally from the training process, allowing deep neural networks to learn hierarchical representations and abstract concepts that surpass traditional machine learning approaches. Do you wonder why ChatGPT can sometimes spit out a perspective that hits deep and makes you wonder about the worth of humanity? This is how it does that.

One of the most compelling examples of emergent properties in deep learning is *hierarchical feature learning,* particularly evident in *Convolutional Neural Networks (CNNs)* used for image recognition. At the outset, I was intrigued by how CNNs could transform raw pixel data into meaningful insights. The initial layers of a CNN detect simple features such as edges, corners, and textures. These low-level features are then combined in subsequent layers to form more complex patterns, like shapes and parts of objects. By the final layers, the network can recognise entire objects, such as faces, animals, or vehicles, with remarkable accuracy. This hierarchical approach mirrors the visual processing in the human brain, where simple features are progressively integrated into complex perceptions.
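To give a feel for what "detecting an edge" means in the first layer, here is a tiny sketch of the sliding-window operation a convolutional layer performs. The filter below is written by hand purely for illustration; in a trained CNN, filters like it are learned from data rather than designed.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation, the operation CNN layers call 'convolution'."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# Hand-written vertical-edge filter (an assumption for this example).
vertical_edge = [[-1, 1],
                 [-1, 1]]

# Toy image: dark on the left, bright on the right.
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]

edges = conv2d(image, vertical_edge)
```

`edges` is `[[0, 18, 0], [0, 18, 0]]`: the filter responds only where dark meets bright. Later layers apply the same operation to maps like this one, which is how simple features get combined into shapes.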

This ability to learn hierarchical features is not limited to image data. *Recurrent Neural Networks (RNNs)* and their advanced variants, like *Long Short-Term Memory (LSTM)* networks and *Gated Recurrent Units (GRUs),* demonstrate emergent properties in sequence processing tasks. When working with sequential data, such as natural language or time series, RNNs can capture dependencies and patterns across different time steps. LSTMs and GRUs address the vanishing gradient problem inherent in traditional RNNs, allowing the network to retain information over long sequences. This capability enables RNNs to track context and generate coherent text. Modern language models like GPT-4 pursue the same goal with a different architecture, the Transformer, rather than with recurrence.

This is where the ‘creativity’ of generative models originates: the so-called *‘emergent behaviour’,* i.e., the ability of deep learning models to generate entirely new content. Generative Adversarial Networks (GANs), introduced by Ian Goodfellow and his colleagues, exemplify this. GANs consist of two networks, a *generator* and a *discriminator,* that are trained simultaneously. The generator creates synthetic data, while the discriminator evaluates its authenticity. Through this adversarial process, GANs can produce highly realistic images, music, and even video.

The emergence of these properties is *‘magical’* because the underlying principles of neural networks are relatively simple. The power of deep learning lies in the interactions between large numbers of neurons and the data-driven learning process. As networks grow deeper and more complex, they begin to exhibit behaviours that are not explicitly encoded in their design.

When CNNs are trained on vast and diverse datasets, they can generalise to new images with different styles, backgrounds, and distortions. This robustness is partly due to the extensive data augmentation techniques used during training, where images are randomly transformed to mimic real-world variations. This ability to generalise to new data is a key factor in the success of deep learning models in practical applications.

However, where the ‘creativity’ originates, so do the problems. The *hallucination* that occurs in modern-day generative models is a result of this very emergent behaviour. Understanding and interpreting the internal workings of deep neural networks remains difficult, which is why they are often likened to a *"black box."* While techniques like *feature visualisation* and *activation maximisation* provide some insights, they are limited in their ability to fully explain the network's behaviour. This lack of interpretability raises concerns, particularly in applications where transparency and trust are critical.

# 3. Generalisation

One of the most astonishing and ‘awe-inspiring’ aspects of deep learning is its remarkable ability to generalise from *training data* to *unseen data.* This capability defies traditional expectations in machine learning, where models often struggle with *overfitting,* performing well on training data but failing to *generalise* to new, unseen examples. In deep learning, even with models containing millions or billions of parameters, we observe a surprising resilience against *overfitting,* achieving impressive performance across a wide range of tasks.

To understand this phenomenon, it's essential to understand the concepts of generalisation and overfitting. Generalisation refers to a model’s ability to perform well on new, unseen data that was not part of the training set. Overfitting occurs when a model learns not only the underlying patterns in the training data but also the noise and specificities that do not generalise. Traditional machine learning models, especially those with high capacity, tend to overfit when trained on small datasets.

Deep learning models, however, often manage to generalise well even when trained on datasets that are small relative to their capacity. Several factors contribute to this counterintuitive behaviour. Firstly, the sheer size and depth of these models allow them to learn a wide variety of features at different levels of abstraction. This hierarchical learning enables the models to capture both low-level and high-level patterns, making them more robust to variations in the data.

Data augmentation plays a critical role in improving generalisation. By artificially expanding the training dataset through transformations such as rotations, translations, and scaling, data augmentation exposes the model to a wider variety of examples. This practice discourages the model from memorising specific instances and encourages it to learn more generalisable features. In image recognition tasks, for instance, augmented datasets help the model recognise objects regardless of variations in viewpoint, lighting, or background.
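As a small illustration, here is what two such transformations look like on an image stored as a list of pixel rows. The function names and the flip and shift probabilities are arbitrary choices of mine; real pipelines use library routines and many more transforms.

```python
import random

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def translate_right(image, shift, fill=0):
    """Shift every row right by `shift` pixels, padding the left with `fill`."""
    width = len(image[0])
    return [([fill] * shift + row)[:width] for row in image]

def augment(image, rng):
    """Randomly apply label-preserving transforms to one image."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return translate_right(image, rng.randint(0, 1))

# Toy 2 x 3 "image" (illustrative values).
image = [[1, 2, 3],
         [4, 5, 6]]
rng = random.Random(0)
augmented = [augment(image, rng) for _ in range(4)]
```

Every entry of `augmented` is a slightly different view of the same picture with the same label, which is exactly the extra variety the model is trained on. The original `image` is left untouched.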

When trained on vast amounts of data, these models encounter a wide range of examples and variations, enabling them to learn more robust and generalisable representations. This is particularly evident in models pre-trained on large-scale datasets such as ImageNet for visual tasks or extensive text corpora for language models. The knowledge gained from these large datasets can be transferred to new tasks through fine-tuning, leveraging the generalisation capability acquired during pre-training.

Now, all of these are ‘reasons’ why deep learning models adapt so well to new data. Don’t confuse them with ‘explanations’ as to why deep learning is so good at generalisation. Despite these advancements, the theoretical understanding of why deep learning models generalise so well remains incomplete. Traditional statistical learning theory suggests that models with a high capacity *should overfit* unless strong regularisation is applied. However, deep learning models often violate these expectations, achieving low training error and low test error simultaneously.

This paradox has led to the exploration of new theoretical frameworks and hypotheses. One such hypothesis is the concept of *over-parameterisation.* In deep learning, models are often significantly over-parameterised, meaning they have far more parameters than necessary to fit the training data. Surprisingly, this over-parameterisation does not necessarily lead to overfitting. Instead, it appears to enable the model to find simpler, more generalisable solutions in the high-dimensional parameter space. This aligns with the empirical observation that larger models often perform better, even when the risk of overfitting should be higher.

The *lottery ticket hypothesis* is another intriguing explanation. Proposed by Jonathan Frankle and Michael Carbin, this hypothesis suggests that within a large, *over-parameterised* model, there exists a smaller sub-network, i.e., a *"winning ticket",* that, when trained in isolation, can achieve similar performance to the original model. Finding these winning tickets through pruning and retraining can lead to more efficient models with strong generalisation capabilities.
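The pruning step can be sketched roughly as follows. I am assuming magnitude-based pruning with the surviving weights rewound to their initial values, and I have flattened the network into a plain list of weights to keep things short. The weight values are made up, and the retraining of the pruned network is left out.

```python
def magnitude_mask(weights, keep_fraction):
    """Keep the largest-magnitude weights; mask out the rest."""
    n_keep = max(1, round(len(weights) * keep_fraction))
    threshold = sorted((abs(w) for w in weights), reverse=True)[n_keep - 1]
    return [1 if abs(w) >= threshold else 0 for w in weights]

def winning_ticket(initial_weights, trained_weights, keep_fraction):
    """Prune by trained magnitude, then rewind the survivors to their initial values."""
    mask = magnitude_mask(trained_weights, keep_fraction)
    return [w0 * m for w0, m in zip(initial_weights, mask)], mask

# Hypothetical weights before and after training.
initial = [0.1, -0.2, 0.05, 0.3]
trained = [0.9, -0.01, 0.02, -1.5]
ticket, mask = winning_ticket(initial, trained, keep_fraction=0.5)
```

`mask` is `[1, 0, 0, 1]`: only the two weights that grew large during training survive, and `ticket` holds their original starting values. The hypothesis is that retraining this sparse sub-network from those starting values can get close to the full model's accuracy.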

# 4. Transfer Learning and Fine Tuning

Transfer learning is exactly what it sounds like. This approach leverages the knowledge gained from training on one task to improve performance on a related but different task. Transfer learning has significantly accelerated progress in various domains, allowing researchers and practitioners to build powerful models with relatively small amounts of task-specific data. When I first encountered transfer learning, I was amazed by its simplicity and effectiveness, which seemed almost counterintuitive given the traditional machine learning emphasis on *task-specific training.*

Transfer learning typically involves two main steps: *pre-training* and *fine-tuning.* During *pre-training,* a deep learning model is trained on a large dataset, often encompassing a wide range of categories and examples. This extensive training allows the model to learn a rich set of features and representations that capture the underlying structure of the data. The pre-trained model can then be *fine-tuned* for a specific task by training it on a smaller, task-specific dataset. Fine-tuning adjusts the pre-trained model’s parameters to better suit the new task while retaining the general knowledge acquired during pre-training.

A prominent example of transfer learning in action is the use of convolutional neural networks (CNNs) pre-trained on the ImageNet dataset. ImageNet contains millions of labelled images across thousands of categories, providing a comprehensive training ground for visual feature extraction. Models like VGG, ResNet, and Inception have been pre-trained on ImageNet and are widely used as starting points for various computer vision tasks. By fine-tuning these pre-trained models on smaller datasets, such as medical images or specific object detection tasks, researchers can achieve high performance with significantly less data and fewer computational resources. Big tech companies use this technique with their models all the time. Transfer learning has also revolutionised the field through models like *Gemini* and *GPT* (Generative Pre-trained Transformer).

GPT, particularly GPT-4, takes transfer learning to another level. GPT-4 is pre-trained on a diverse and massive text dataset, allowing it to generate coherent and contextually appropriate text across various tasks without task-specific fine-tuning.

Other forms of transfer learning include *feature extraction* and *domain adaptation.* *Feature extraction* involves using the pre-trained model as a fixed feature extractor, where the model’s learned representations are fed into a simpler classifier for the target task. This method is useful when computational resources are limited or when the target task has a very small dataset. *Domain adaptation,* on the other hand, focuses on adapting a model trained on one domain to perform well on a related but different domain. Techniques such as *adversarial training* and *domain adversarial neural networks (DANN)* are used to bridge the gap between source and target domains.
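Feature extraction is the easiest of these to sketch. In the toy example below, `FROZEN_WEIGHTS` is a stand-in I invented for a pre-trained network: it is never updated, and only a small logistic-regression head is trained on four labelled examples. All names, values and hyperparameters are illustrative assumptions.

```python
import math

# Stand-in for a pre-trained network: fixed weights that are never updated.
FROZEN_WEIGHTS = [[1.0, 1.0], [1.0, -1.0]]

def extract_features(x):
    """Frozen feature extractor: one ReLU layer with fixed weights."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(examples, labels, lr=0.5, steps=500):
    """Fit only a logistic-regression head on top of the frozen features."""
    feats = [extract_features(x) for x in examples]  # computed once, never changed
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        for f, y in zip(feats, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b

def predict(x, w, b):
    """True if the head assigns class 1 to input x."""
    return sigmoid(sum(wi * fi for wi, fi in zip(w, extract_features(x))) + b) > 0.5

# Tiny task-specific dataset: label 1 when the first coordinate is larger.
examples = [[2.0, 0.0], [0.0, 2.0], [3.0, 1.0], [1.0, 3.0]]
labels = [1, 0, 1, 0]
w, b = train_head(examples, labels)
```

Only `w` and `b`, three numbers in total, are learned here. Fine-tuning differs in that the frozen weights would also be allowed to change, usually with a small learning rate.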

# 5. Optimisation and Convergence in High Dimensional Spaces

Saving the best for last, I would like to talk about one of the most perplexing and fascinating aspects of deep learning, i.e., its ability to converge to effective solutions despite the immense complexity of its optimisation landscapes. When I first studied the optimisation process in deep neural networks 4 years ago, I was struck by the sheer scale and difficulty of the task. These models often contain millions or even billions of parameters, and their loss functions are highly non-convex, filled with numerous local minima, saddle points, and flat regions. Yet, deep learning models not only manage to find good solutions but often do so efficiently and effectively. As a fresher in my master’s course in AI, this always used to baffle me, triggering long discussions with my AI Lab professor regarding this phenomenon.

The optimisation challenge in deep learning arises from the high-dimensional nature of neural networks. Each weight in the network represents a dimension in the parameter space, creating an optimisation problem with thousands or millions of dimensions. The loss function, which measures the discrepancy between the model’s predictions and the actual data, forms a complex landscape in this high-dimensional space. Traditional optimisation problems, especially those with convex loss functions, are relatively straightforward to solve, as any local minimum of a convex function is also a global minimum. In contrast, the non-convex nature of neural network loss functions means there are multiple local minima and saddle points, making the optimisation process much more challenging.
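The difference is easy to see in one dimension. The sketch below runs plain gradient descent on a convex function and on a non-convex one, each from two starting points. Both functions, the learning rate and the step count are examples I picked for illustration.

```python
def gradient_descent(grad, x0, lr=0.01, steps=2000):
    """Plain gradient descent on a one-dimensional function."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def convex_grad(x):
    """Gradient of f(x) = (x - 1)^2, which has a single minimum at x = 1."""
    return 2 * (x - 1)

def nonconvex_grad(x):
    """Gradient of f(x) = x^4 - 3x^2 + x, which has two separate minima."""
    return 4 * x**3 - 6 * x + 1

# Convex case: both starting points end up in the same place.
convex_left = gradient_descent(convex_grad, -2.0)
convex_right = gradient_descent(convex_grad, 2.0)

# Non-convex case: the starting point decides which valley we land in.
left = gradient_descent(nonconvex_grad, -2.0)
right = gradient_descent(nonconvex_grad, 2.0)
```

On the convex function both runs reach x = 1. On the non-convex one, `left` settles near -1.30 and `right` near 1.13, two different valleys. A neural network's loss surface has this problem in millions of dimensions at once.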

Despite these challenges, deep learning models have consistently demonstrated an ability to find effective solutions.

A critical aspect of effective optimisation in deep learning is the initialisation of network weights. Proper initialisation is essential to ensure that gradients do not vanish or explode as they propagate through the network. The *vanishing gradient problem* occurs when gradients become exceedingly small, leading to negligible updates to the weights and slow learning. Conversely, the *exploding gradient problem* happens when gradients grow exponentially, causing unstable updates and diverging weights. Techniques such as *Xavier (Glorot) initialisation* and *He initialisation* address these issues by setting the initial weights based on the number of input and output units in the network, maintaining a stable gradient flow during training.
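The arithmetic behind vanishing and exploding gradients is just the chain rule multiplying one factor per layer. Here is a minimal sketch using a chain of single-neuron layers; the two function names, the weights and the depth are assumptions chosen to make the effect obvious.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_chain_gradient(weight, depth, x=0.5):
    """d(output)/d(input) for `depth` stacked one-neuron sigmoid layers."""
    h, grad = x, 1.0
    for _ in range(depth):
        s = sigmoid(weight * h)
        grad *= weight * s * (1.0 - s)  # chain rule: this layer's local derivative
        h = s
    return grad

def linear_chain_gradient(weight, depth):
    """Same quantity for a chain of linear layers: the weight multiplied `depth` times."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight
    return grad

vanishing = sigmoid_chain_gradient(weight=1.0, depth=20)
exploding = linear_chain_gradient(weight=1.5, depth=20)
```

The sigmoid's derivative is never larger than 0.25, so with a weight of 1 the gradient shrinks by at least a factor of four per layer and `vanishing` is below one trillionth after 20 layers. In the linear chain each layer multiplies the gradient by 1.5, so `exploding` runs into the thousands.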

While these optimisation techniques have propelled the success of deep learning, the theoretical understanding of why they work so well is still evolving. Traditional optimisation theory suggests that finding the global minimum in high-dimensional, non-convex landscapes should be exceedingly difficult. However, empirical evidence shows that deep learning models often find solutions that generalise well to new data, even if they are not the global minimum.

Why this disparity? One hypothesis is that the *loss landscape* of deep neural networks contains a large number of flat or nearly flat regions, sometimes described as *"good" local minima,* that generalise well. These regions may be connected by paths of low loss, allowing the optimisation process to move between them relatively easily. This view aligns with the empirical success of SGD and its variants, which seem to navigate these landscapes effectively.

Techniques such as *stochastic gradient descent,* adaptive optimisers like *Adam,* proper *weight initialisation,* and *batch normalisation* have all played important roles. While our theoretical understanding continues to develop, the empirical effectiveness of these methods highlights the ‘magic’ of deep learning.

# Addressing Our Gap in the Theoretical Understanding of Deep Learning

At the end of this article, I would like to share a few short notes on why deep learning still feels like magic to many. If you have made it this far, you are a true lover of the field and I applaud your interest. Perhaps you can help bridge the gap between the theory and the outcomes.

Simply put, the rapid advancements in deep learning have *outpaced* our theoretical understanding of why these models work so well.

As I explored this field, I realised that while we have developed highly effective methods and models, our understanding of the underlying principles remains incomplete. This is similar to our conundrum with quantum computing. We understand enough of quantum phenomena to build a computer based on them, but we lack a full understanding of how the phenomena themselves work, or why.

The *information bottleneck theory* is another framework that has been proposed to explain deep learning’s success. This theory, introduced by Naftali Tishby and colleagues, suggests that deep neural networks learn to compress input data into a compact representation that retains only the most relevant information for the task at hand. During training, the network undergoes a process of initial fitting, where it captures both relevant and irrelevant information, followed by a compression phase, where it discards the irrelevant information. The resulting compressed representation is more robust and generalises better to new data. This perspective provides a potential explanation for the stages observed in the training dynamics of deep networks.

Despite hypotheses like this one, significant gaps remain in our theoretical understanding. For instance, the exact mechanisms by which deep networks navigate high-dimensional, non-convex loss landscapes and consistently find good solutions are not fully understood. The role of various architectural choices, such as depth, width, and activation functions, in shaping the loss landscape and influencing generalisation is an ongoing area of research. I hate to say this, but when it comes to these choices, scientists are mostly still ‘winging it’.

Recent advancements in theoretical research have started to bridge some of these gaps. For example, the *neural tangent kernel (NTK)* theory provides insights into the training dynamics of infinitely wide neural networks. According to NTK theory, as the width of a neural network increases, its behaviour during training becomes more predictable and linear, resembling kernel methods. This theory helps explain why very wide networks are easier to optimise and often generalise well, as their training dynamics become more stable and tractable.

Another promising direction is the *mean-field theory* of neural networks, which studies the behaviour of networks in the limit of large width. Mean-field theory models the collective behaviour of neurons in a layer, providing a statistical description of the network’s dynamics. This approach has yielded insights into the convergence properties of deep networks and the impact of architectural choices on their performance.

While deep learning’s empirical success continues to amaze and inspire, our theoretical understanding is still catching up. The ongoing research into the theoretical foundations of deep learning promises to deepen our understanding. However, as of yet, much of deep learning remains ‘magical’.