
How Did We Get Here? The History of Neural Nets and Deep Learning

By Eric Koyanagi

While it might seem like ChatGPT exploded into existence overnight, most engineers have at least a general understanding that neural nets are a much older technology. Let's review some of the history of this technology and learn more about how modern neural nets work.

Reviewing the Basics

The first thing to review is the subtle difference between "machine learning" and "deep learning". Both are forms of AI. Machine learning relies on algorithms that adapt and learn with little or no human intervention, while deep learning is the subset of machine learning that uses multi-layered neural nets. That's the very basic high-level idea, but the core concept behind both is the same. You have training data (sometimes it's "labelled", sometimes it isn't, sometimes it's a mix) that the model trains against. This training process could use a classic statistical algorithm, a simple neural net, or a deep network...the core pipeline is often the same.

The idea of training is to build a model that can make predictions. When you have labelled data, you can check the model's predictions against known answers and correct it, but not all algorithms depend on labelled input data. Unsupervised learning algorithms deduce patterns without having labelled data to "validate" against. An example of "labelled data" might be a bounding box identifying a type of object in a picture or a detailed text caption describing a video's content; these labels usually come from humans, although there are products that use AI to label data.
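Here's a minimal sketch of that difference using scikit-learn. The tiny arrays and the choice of LogisticRegression and KMeans are purely illustrative assumptions, not tied to any particular product or pipeline.

```python
# Supervised vs. unsupervised learning in miniature (illustrative data only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.2], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9]])

# Supervised: labelled data lets us correct the model against known answers.
y = np.array([0, 0, 1, 1])                  # the "labels"
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.1, 1.0]]))            # -> [0]

# Unsupervised: no labels; the algorithm deduces the grouping on its own.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                           # cluster ids, e.g. [1 1 0 0]
```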

You might think it's a problem to have AI labelling data that is then consumed by another model...it really comes down to how much you trust the given product's confidence rating (and how much you want to spend to validate or label data). For example, SageMaker can auto-label some data, but if its pre-trained algorithms don't have "high confidence" in a label, it will send the item to a human to label via Amazon Mechanical Turk.
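The routing logic behind that idea is simple enough to sketch. This is a hypothetical illustration of the confidence-threshold pattern, not SageMaker's actual API, and the threshold value is made up.

```python
# Hypothetical confidence-threshold routing (not a real SageMaker API).
CONFIDENCE_THRESHOLD = 0.95   # made-up cutoff; tuning it trades cost for label quality

def route_label(prediction: str, confidence: float) -> str:
    if confidence >= CONFIDENCE_THRESHOLD:
        return f"auto-label: {prediction}"
    return "send to human labelling queue"

print(route_label("cat", 0.99))   # auto-labelled
print(route_label("cat", 0.60))   # falls back to a human reviewer
```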

First, Let's Talk Vectors!

The idea of vectorizing words is important as a foundation for LLMs, so with that brief review of AI in general out of the way, let's start there.

A vector is a point in space -- in games, that usually means a 3D point. Vectors are useful because we can easily use them to "move" through space. For example, if I need to fire a projectile at a given point, I can use vector subtraction to get a vector "pointing toward" the target, then move in that direction each frame until I splash into the target.
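A quick sketch of that projectile idea, using NumPy for the vector math (the positions and speed are made-up values):

```python
# Vector subtraction gives a direction; normalize it and step toward the target.
import numpy as np

position = np.array([0.0, 0.0, 0.0])      # where the projectile starts
target = np.array([10.0, 5.0, 2.0])       # where we want it to land
speed = 1.5                               # distance moved per frame

direction = target - position             # vector "pointing toward" the target
direction /= np.linalg.norm(direction)    # normalize to length 1

while np.linalg.norm(target - position) > speed:
    position += direction * speed         # move one step per frame

print("splash at", position)
```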

For a word vector, the idea is very similar...but it uses N-dimensional space. Don't try to picture it too hard, or you'll end up more twisted around than some sci-fi time travel plot. It's the same idea as 3D, but with more Ds (think hundreds or thousands). What's so hard about that?

This allows us to build a sort of "map" of related words -- similar words are spatially closer in this high-dimensional space. It's a vast, vast cloud of words floating in multi-dimensional space that lets the algorithm build relations -- for example, by predicting which words might "fit" best after any other.
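One common way to measure that "closeness" is cosine similarity. The 3-D toy vectors below are invented for illustration; real word embeddings have hundreds or thousands of dimensions.

```python
# Toy example: closer vectors = more related words (invented 3-D embeddings).
import numpy as np

embeddings = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.75, 0.20]),
    "apple": np.array([0.10, 0.20, 0.90]),
}

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: related
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low: unrelated
```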

Word2Vec was revolutionary because it applied this idea, but used neural networks to establish word relations. It was developed by Tomáš Mikolov and his team at Google in 2013, just two years before OpenAI was founded in 2015.

The idea of representing words as vectors isn't entirely new, either; it builds on distributional semantic models (DSMs), which also describe words as points in high-dimensional space to capture their relations. Yet again, this underscores how technology and science build in layers: this isn't entirely "new" technology, but depends on ideas first popularized in the 1950s and iterated on ever since.

Transformers and Attention

In 2017, Google released a paper describing the idea of the "transformer". The transformer builds on vector space representations, but adds the idea of context and "attention". To compute the next word in a given sentence, it considers every other word in the sentence, not just the previous word. Not only that, each word is weighted with an "attention" score, telling the model how important given words are in calculating the next word. Since it doesn't process text sequentially, word by word, it can operate in a fixed number of steps and take greater advantage of parallelization in modern compute environments.

So not only does it yield more accurate results, it's more performant, too.
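The heart of that 2017 paper ("Attention Is All You Need") is scaled dot-product attention, which is compact enough to sketch in NumPy. The tiny sequence length, embedding size, and random values here are illustrative assumptions; real transformers add multiple heads, learned projections, and many stacked layers.

```python
# Minimal scaled dot-product attention: every word attends to every other word.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # compare every query with every key at once
    weights = softmax(scores, axis=-1)   # "attention" scores: how much each word matters
    return weights @ V                   # weighted mix of the value vectors

seq_len, d_model = 4, 8                  # 4 "words", 8-dimensional embeddings
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_model)) for _ in range(3))
print(attention(Q, K, V).shape)          # (4, 8): one context-aware vector per word
```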

More or less, we have the ideas in place to understand how the LLM as we understand it today might materialize.

ChatGPT

The "GPT" of ChatGPT stands for generative pre-trained transformer. The idea of transformer architecture is critical to how the LLM works. Yes, you've got this history correct...chatGPT was in fact inspired by Google's paper on transformer architecture two years after the company was formed. It's interesting to see the perspective that Google is a loser in the AI field because of some of its high profile flops (and because maybe we all want them taken down a peg or two), but being fair...they were pioneers in this field.

The first GPT model came in 2018, just one year after the transformer was coined. GPT-2 followed in 2019, but it wasn't until GPT-3 launched in 2020 that the world started to take notice (ChatGPT itself, the chat interface built on top of these models, arrived in late 2022).

Finally, in 2023, GPT-4 was launched, five years after the first version. The original GPT had roughly 117 million parameters. That's 117 million different weights in the neural network (although not necessarily 117 million neurons), a pretty vast thing to comprehend in human terms.
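For a sense of where those numbers come from, here's how the parameters of a couple of hypothetical fully connected layers would be counted; the layer widths are made up purely for illustration.

```python
# Counting parameters: one weight per connection plus one bias per output neuron.
layer_sizes = [768, 3072, 768]   # hypothetical layer widths

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out       # weight matrix for this layer
    biases = n_out               # bias vector for this layer
    total += weights + biases

print(f"{total:,}")              # 4,722,432 -- millions from just two dense layers
```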

Five years later, GPT-4 is rumored to have over a trillion parameters, reflecting the oceans of compute resources thrown into the platform since its early days. It isn't clear what sort of scale difference there will be with the next version of the generative pre-trained transformer, or if we've already hit a point of diminishing returns where there simply aren't enough data centers and compute left to meaningfully scale these models. Further, they're limited by the amount of data available on the Internet -- more data generally means more accuracy, but eventually they simply run out of things to crawl. That's the sort of scale behind OpenAI.

But Why Does It Work...?

Now we're talking about billions or trillions of weights linking neurons together into synthetic brains crunching carefully structured data designed to replicate human speech...but why does a neural network even work? Going back to the basic proof of concept, what the heck is actually happening when signals bounce from neuron to neuron, tweaked by those weights?

That's a great question, and something science doesn't yet have a full answer for. As a software engineer, I find that idea a bit mind-blowing. We're used to a certain level of determinism when we write code, and peeling back each layer of a stack like an onion always reveals some physical flow of information that has some logic to it. At a certain point, we can understand how each bit flips and why.

This isn't so simple with complex neural nets. We can see them working, we even know how to improve them...but answering why a neural net works so well...? That's, uh, something we'll have to get back to you about once science has an answer. It's plausible that delving deeper into that mystery reveals secrets about how our own cognition works...but our brains are vastly more complex than ChatGPT. Our roughly 100 billion neurons (sorry it isn't trillions) work far more efficiently, with data flowing in a more complex, asynchronous way we don't yet understand and can't yet replicate with synthetic neural nets.

Conclusion

Each layer of AI's technical development has led to its current state, and it's hard to predict where things will evolve from here. There are clearly more secrets to uncover in how and why neural nets work as they do, which means there's (terrifying or exciting as that is) still massive room for growth and change in this field. With more money than ever being poured into AI research, it's hard to predict exactly how these products will really materialize.

It isn't entirely some inevitable doom -- we don't know if AI is cost-effective as a product. We don't know if there are limits to how it scales, since bigger models and more neurons do not always produce better results. We don't know how courts will interpret copyright laws as models chew through mountains of copyrighted data.
