Algorithmic Simplicity
MAMBA from Scratch: Neural Nets Better and Faster than Transformers
Mamba is a new neural network architecture that came out this year, and it performs better than transformers at language modelling! This is probably the most exciting development in AI since 2017. In this video I explain how to derive Mamba from the perspective of linear RNNs. And don't worry, there's no state space model theory needed!
Mamba paper: openreview.net/forum?id=AL1fq05o7H
Linear RNN paper: openreview.net/forum?id=M3Yd3QyRG4
#mamba
#deeplearning
#largelanguagemodels
00:00 Intro
01:33 Recurrent Neural Networks
05:24 Linear Recurrent Neural Networks
06:57 Parallelizing Linear RNNs
15:33 Vanishing and Exploding Gradients
19:08 Stable initialization
21:53 State Space Models
24:33 Mamba
25:26 The High Performance Memory Trick
27:35 The Mamba Drama
Views: 131,563

Videos

Why Does Diffusion Work Better than Auto-Regression?
Views: 202K · 4 months ago
Have you ever wondered how generative AI actually works? Well the short answer is, in exactly the same way as regular AI! In this video I break down the state of the art in generative AI - Auto-regressors and Denoising Diffusion models - and explain how this seemingly magical technology is all the result of curve fitting, like the rest of machine learning. Come learn the differences (and sim...
Transformer Neural Networks Derived from Scratch
Views: 126K · 10 months ago
#transformers #chatgpt #SoME3 #deeplearning Join me on a deep dive to understand the most successful neural network ever invented: the transformer. Transformers, originally invented for natural language translation, are now everywhere. They have fast taken over the world of machine learning (and the world more generally) and are now used for almost every application, not the least of which is C...
Why do Convolutional Neural Networks work so well?
Views: 40K · 1 year ago
While deep learning has existed since the 1970s, it wasn't until 2010 that deep learning exploded in popularity, to the point that deep neural networks are now used ubiquitously for all machine learning tasks. The reason for this explosion is the invention of the convolutional neural network. This remarkably simple architecture allowed neural networks to be trained on new kinds of data which we...
But what is a neural network REALLY?
Views: 65K · 1 year ago
My submission for 2022 #SoME2. In this video I try to explain what a neural network is in the simplest way possible. That means no linear algebra, no calculus, and definitely no statistics. The aim is to be accessible to absolutely anyone. 00:00 Intro 00:47 Gauss & Parametric Regression 02:59 Fitting a Straight Line 06:39 Defining a 1-layer Neural Network 09:29 Defining a 2-layer Neural Network...

COMMENTS

  • @tryptamedia7375
    @tryptamedia7375 17 hours ago

    So do the recent large world model breakthroughs of Sora, Luma, Runway alpha imply that we've returned to auto regressive? Are they a combo of the two? Amazing video, would love to hear your thoughts!

    • @algorithmicsimplicity
      @algorithmicsimplicity 9 hours ago

      From what little they have released publicly, it seems that they are simply diffusion models applied to videos, i.e. they treat videos as a collection of frames, add noise to all frames, take all noisy frames as input and try to predict all clean frames. I don't think there is any auto-regression done, but maybe that will change when they start generating longer videos.

  • @dschonhaut
    @dschonhaut 1 day ago

    Coming from a background in computational neuroscience, but with no prior NN experience beyond coding a perceptron on the MNIST dataset for a class years ago, I’ve been watching all your videos lately as a first-pass before delving into these topics more rigorously. The visualizations and your concise summaries that accompany them are hugely helpful! Thanks for all your work on these educational videos, and for releasing them for free! Out of curiosity, how are you scripting your animations? Is it Manim (the @3blue1brown library)? And would you recommend it?

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 day ago

      Thanks for the feedback! I used Manim for the first 3 videos on my channel, and then decided to write my own animation library because I found Manim annoying to use. My animation library is still a work in progress, so my most recent 2 videos are made using both my own library and Manim. I would say that Manim is fine to use for simple animations/graphs, but once you start trying to animate lots of objects at once it gets annoying.

  • @looooool3145
    @looooool3145 1 day ago

    i now understand things, thanks!

  • @downloadableram2666
    @downloadableram2666 2 days ago

    State-space models are not necessarily from ML, they're used a lot in control systems actually. Not surprised by their relationship considering both are strongly based on linear algebra.

  • @downloadableram2666
    @downloadableram2666 2 days ago

    I wrote my own CUDA-based neural network implementation a while ago, but I used a sigmoid instead of RELU. Although the "training" covered in this video works, it was kinda odd (usually neural net backpropagation is done with gradient descent of the cost function). You probably cover this in another video, though, haven't gotten there yet. Good video nonetheless, never really thought about it this way.

    • @algorithmicsimplicity
      @algorithmicsimplicity 2 days ago

      The training method described in this video IS gradient descent. Backprop is just one way of computing gradients (and it is usually used because it is the fastest). It will yield the same results as this training procedure.
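      A minimal sketch of this equivalence on a toy one-weight loss (hypothetical names; finite differences stand in for perturbation-style training, and the analytic gradient is what backprop would compute):

```python
# Toy quadratic loss with its minimum at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def numeric_grad(f, w, eps=1e-6):
    # Finite differences: nudge the weight and measure how the loss changes.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def analytic_grad(w):
    # What backprop would compute for this loss: d/dw (w - 3)^2 = 2(w - 3).
    return 2.0 * (w - 3.0)

w = 0.0
for _ in range(100):
    w -= 0.1 * numeric_grad(loss, w)  # gradient descent step

print(round(w, 3))  # converges to the minimum at 3.0
```

      Both gradient estimates agree to numerical precision, so descending along either one moves the weight to the same minimum; backprop just computes the gradient far more cheaply for large networks.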

    • @downloadableram2666
      @downloadableram2666 2 days ago

      @@algorithmicsimplicity I should have figured it was doing much the same thing, just a lot slower and more 'random.' Your videos are awesome though, I'm trying to go into machine learning after doing aerospace embedded systems.

  • @Paraxite7
    @Paraxite7 3 days ago

    I finally understand MAMBA! I've been trying to get my head around it for months, but now I see that approaching it the way the original paper presents it wasn't the best way. Thank you.

  • @heyalchang
    @heyalchang 3 days ago

    Thanks! That was a really excellent explanation.

  • @chaseghoste8671
    @chaseghoste8671 3 days ago

    man, what can I say

  • @laurentkloeble5872
    @laurentkloeble5872 4 days ago

    Very original. I learned so much in a few minutes. Thank you.

  • @nothreeshoes1200
    @nothreeshoes1200 5 days ago

    Please make more videos. They’re fantastic!

  • @user-ou3ts4hl7p
    @user-ou3ts4hl7p 6 days ago

    Very good video. I got to know the straightforward reason why the diffusion idea emerged and why diffusion is intrinsically better than the auto-regression algorithm.

  • @fractalinfinity
    @fractalinfinity 7 days ago

    I get it now! 🎉 thanks!

  • @alenqquin4509
    @alenqquin4509 8 days ago

    A very good job, I have deepened my understanding of generative AI

  • @freerockneverdrop1236
    @freerockneverdrop1236 8 days ago

    Neural Network can be simple, if you are Neo :)

  • @Real-HumanBeing
    @Real-HumanBeing 8 days ago

    In exactly the same way as regular AI, by containing the dataset.

  • @yushuowang7820
    @yushuowang7820 9 days ago

    VAE: what about me?

  • @lialkalo4093
    @lialkalo4093 10 days ago

    very good explanation

  • @freerockneverdrop1236
    @freerockneverdrop1236 10 days ago

    This is very nice but I think there is one thing either wrong or not clear. At 11:40, he says that in order to minimize the steps to generate the image, we generate multiple pixels in one step that are scattered around and independent of each other. This way we will not get the blurred image. This works in the later steps, when enough pixels have been generated and they guide the generation of the next batch of pixels. E.g., we know it is a car and we just try to add more details. But in the early steps, when we don't know what it is, if multiple pixels are generated, we fall back to the averaging issue again, e.g. one pixel is for a car but another one is for a bridge. Therefore, I think in the early steps we can only generate one pixel per step. What do you think?

    • @algorithmicsimplicity
      @algorithmicsimplicity 10 days ago

      You are correct that at the early steps there is large amount of uncertainty over the value of each pixel. But what matters is how much this uncertainty is reduced by knowing the value of the previously generated pixels. Even at the first step, knowing the value of one pixel does not help that much in determining the value of far away pixels. Let's say one pixel in the center is blue, does that make you significantly more confident about the color of the top left pixel? Not really.

    • @freerockneverdrop1236
      @freerockneverdrop1236 7 days ago

      @@algorithmicsimplicity hi, glad to see your reply. Let me articulate. Let's just create the first 2 pixels in two ways and see the difference. The first approach is to create them one by one, and the second approach is to create both in one step. First approach first: at the beginning there is no pixel, so when we create one there is no problem, there will be no inconsistency. Then we create the second pixel based on the first one; there are still a lot of possibilities, but it is constrained by the first pixel. On the other hand, in approach two, both pixels are created in the same step, so they cannot constrain each other. Then there could be more combinations, and some of them include inconsistencies. To avoid this, we should generate pixels one by one until the theme of the image is set. Then we can create batch by batch. A simplified example: if the model is trained with images that include only green pixels of different strength, or red pixels of different strength, then approach one will generate a second pixel of the same color as the first one, though probably with different strength. But approach two could generate one red and one green.

  • @Levy1111
    @Levy1111 10 days ago

    I do hope you'll soon get at least a six-figure subscriber count. The quality of your videos (both in terms of education and presentation) is top notch; people need you to become popular (at least within our small tech bubble).

  • @official_noself
    @official_noself 11 days ago

    480p? 2023? Are you kidding me?

  • @LeYuzer
    @LeYuzer 12 days ago

    Tip: turn on 1.5x

  • @pshirvalkar
    @pshirvalkar 12 days ago

    A fantastic teacher!!! Thanks! Can you please cover Bayesian Markov chain Monte Carlo? Would be very helpful!

  • @alexanderterry187
    @alexanderterry187 12 days ago

    How do the models deal with having different numbers of inputs? E.g. the text label provided can be any length, or not provided at all. I'm sure this is a basic question, but whenever I've used NNs previously they've always had a constant number of inputs or been reapplied to a sequence of data that has the same dimension at each step.

    • @algorithmicsimplicity
      @algorithmicsimplicity 12 days ago

      For image input, the input is always the same size (same image size and same channels), and the output is always the same size (1 pixel). For text, you can also treat the inputs as all being the same size by padding smaller inputs up to a fixed max length, though transformers can also operate on sequences of different lengths. The output for text is always the same size (a probability distribution over tokens in the vocabulary).
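      A minimal sketch of the padding described here (hypothetical token ids; PAD is an assumed reserved id):

```python
PAD = 0  # reserved padding token id (assumed)

def pad_batch(sequences, max_len):
    # Right-pad each token-id sequence to max_len so the batch is rectangular.
    return [seq + [PAD] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[5, 2, 9], [7]], max_len=4)
print(batch)  # [[5, 2, 9, 0], [7, 0, 0, 0]]
```

      In practice the model is also told (via a mask) to ignore the padded positions, so padding changes the tensor shape but not the prediction.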

  • @karigucio
    @karigucio 12 days ago

    So the transformation applied to the weights is not purely about initialization? Instead, in the expression w=exp(-exp(a)*exp(ib)), the numbers a and b are the learned parameters and not w, right?
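    That reading matches the linear RNN paper linked in the description: a and b are the real-valued learned parameters, and w is recomputed from them. A hedged sketch of one such parameterization (written here as w = exp(-exp(a) + i·b), whose magnitude exp(-exp(a)) stays below 1 for every real a, keeping the recurrence stable):

```python
import cmath
import math

def stable_weight(a, b):
    # a and b are the real learned parameters; the complex recurrent
    # weight w is derived from them, never learned directly.
    return cmath.exp(-math.exp(a) + 1j * b)

# |w| = exp(-exp(a)) < 1 for any real a, so the recurrence cannot explode.
for a, b in [(-2.0, 0.5), (0.0, 3.0), (1.5, -1.0)]:
    print(abs(stable_weight(a, b)) < 1.0)  # True in every case
```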

  • @matthewfynn7635
    @matthewfynn7635 13 days ago

    I have been working with machine learning models for years and this is the first time I have truly understood, through visualisation, the use of ReLU activation functions! Great video.

  • @jaimeduncan6167
    @jaimeduncan6167 13 days ago

    Out of nothing? no, it grabs people's work and creates a composite with variations.

    • @algorithmicsimplicity
      @algorithmicsimplicity 13 days ago

      It isn't correct to say it creates a 'composite' with variations, models can generalize outside of their training dataset in certain ways, and generative models are capable of creating entirely new things that aren't present in the training dataset.

  • @MichaelBrown-gt4qi
    @MichaelBrown-gt4qi 13 days ago

    I've started binge watching all your videos. 😁

  • @fergalhennessy775
    @fergalhennessy775 14 days ago

    do u have a mewing routine bro love from north korea

  • @MichaelBrown-gt4qi
    @MichaelBrown-gt4qi 14 days ago

    This is a great video. I have watched videos in the past (years ago) that talk about auto-regression and, more lately, about diffusion. But it's nice to see why and how there was such a jump between the two. Amazing! However, I feel this video is a little incomplete when there was no mention of the enhancer model that "cleans up" the final generated image. This enhancing model is able to create a larger image while cleaning up the six fingers gen AI is so famous for. While not technically a part of the diffusion process (because it has no random noise), it is a valuable addition to image gen if anyone is trying to build their own model.

  • @capcadaverman
    @capcadaverman 14 days ago

    Not made from nothing. Made by training on real people’s intellectual property. 😂

    • @algorithmicsimplicity
      @algorithmicsimplicity 14 days ago

      My image generator was trained on data licensed to be used for training machine learning models.

    • @capcadaverman
      @capcadaverman 14 days ago

      @@algorithmicsimplicity not everyone is so ethical

  • @telotawa
    @telotawa 14 days ago

    could diffusion work on text generation?

    • @algorithmicsimplicity
      @algorithmicsimplicity 14 days ago

      Yes, it absolutely can! Instead of adding normally distributed noise, you randomly mask tokens with some probability, see e.g. arxiv.org/abs/2406.04329 . That said, it tends to produce a bit worse quality text than auto-regression (actually this is true for images as well, it's just that on images auto-regression takes too long to be viable.)
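      A toy sketch of that masking-based noising step (hypothetical names; the actual corruption schedule in the linked paper is more involved):

```python
import random

MASK = "<mask>"  # assumed special mask token

def add_mask_noise(tokens, mask_prob, rng):
    # Independently replace each token with the mask token with probability mask_prob.
    return [MASK if rng.random() < mask_prob else t for t in tokens]

rng = random.Random(0)
noisy = add_mask_noise(["the", "cat", "sat", "down"], mask_prob=0.5, rng=rng)
# A model would then be trained to predict the original tokens at the masked positions.
print(noisy)
```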

  • @LordDoucheBags
    @LordDoucheBags 14 days ago

    What did you mean by causal architectures? Because when I search online I get stuff about causal inference, so I’m guessing there’s a different and more popular term for what you’re referring to?

  • @julioalmeida4645
    @julioalmeida4645 14 days ago

    Damn. Amazing piece

  • @dmitrii.zyrianov
    @dmitrii.zyrianov 14 days ago

    Hey! Thanks for the video, it is very informative! I have a question. At 18:17 you say that an average of a bunch of noise is still a valid noise. I'm not sure why it is true here. I'd expect the average of a bunch of noise to be just 0.5 value (if we map rgb values to 0..1 range)

    • @algorithmicsimplicity
      @algorithmicsimplicity 14 days ago

      Right, the average is just the center of the noise distribution which, let's say the color values are mapped from -1 to 1, is 0. This average doesn't look like noise (it is just a solid grey image), but if you ask what is the probability of this image under the noise distribution, it actually has the highest probability. The noise distribution is a normal distribution centered at 0, so the input which is all 0 has the highest probability. So the average image still lies within the noise distribution, as opposed to natural images where the average moves outside the data distribution.
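      A quick numerical check of this point (a hedged sketch: 100 independent "pixels" under unit-variance Gaussian noise):

```python
import math
import random

def gauss_logpdf(x, sigma=1.0):
    # Log-density of one pixel value under the N(0, sigma^2) noise distribution.
    return -0.5 * (x / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def image_logpdf(pixels):
    # Pixels are treated as independent, so their log-densities add.
    return sum(gauss_logpdf(p) for p in pixels)

rng = random.Random(42)
noise_sample = [rng.gauss(0, 1) for _ in range(100)]  # one typical noise "image"
average_image = [0.0] * 100  # the average of many noise images: all zeros

# The all-zero average sits at the peak of the noise distribution, so its
# density is at least as high as that of any actual noise sample.
print(image_logpdf(average_image) >= image_logpdf(noise_sample))  # True
```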

    • @dmitrii.zyrianov
      @dmitrii.zyrianov 14 days ago

      Thank you for the reply, I think I got it now

  • @simonpenelle2574
    @simonpenelle2574 14 days ago

    Amazing content I now want to implement this

  • @dadogwitdabignose
    @dadogwitdabignose 15 days ago

    Great video man. Suggestion: can you create a video on how generative transformers work? This has been really bothering me, and hearing an in-depth explanation of them, like your video here, would be helpful!

    • @algorithmicsimplicity
      @algorithmicsimplicity 15 days ago

      Generative transformers work in exactly the same way as generative CNNs. It doesn't matter what backbone you use the idea is the same, you will use auto-regression or diffusion to train a transformer to undo the masking/noising process.

    • @dadogwitdabignose
      @dadogwitdabignose 15 days ago

      @@algorithmicsimplicity which is more efficient to use and how do they handle text data to map text into tensors?

    • @algorithmicsimplicity
      @algorithmicsimplicity 15 days ago

      @@dadogwitdabignose I explain how transformer classifiers work in this video: ua-cam.com/video/kWLed8o5M2Y/v-deo.html . As for which is more efficient, it depends on the data. Usually for text data transformers will be more efficient (for reasons I explain in that video), and for images CNNs will be more efficient.

  • @Kavukamari
    @Kavukamari 15 days ago

    "i can do eleventy kajillion computations every second" "okay, what's your memory throughput"

  • @deep.space.12
    @deep.space.12 15 days ago

    If there's ever a longer version of this video, it might be worth mentioning VAEs as well.

  • @wormjuice7772
    @wormjuice7772 16 days ago

    This has helped me so much wrapping my head around this whole subject! Thank you for now, and the future!

  • @codybarton2090
    @codybarton2090 16 days ago

    Crazy video

  • @gameboyplayer217
    @gameboyplayer217 16 days ago

    Nicely explained

  • @snippletrap
    @snippletrap 17 days ago

    Fantastic explanation. Very intuitive

  • @ibrahimaba8966
    @ibrahimaba8966 17 days ago

    Thank you for this beautiful work!

  • @boogati9221
    @boogati9221 17 days ago

    Crazy how two separate ideas ended up converging into one nearly identical solution.

    • @andrewy2957
      @andrewy2957 6 days ago

      Totally agree. I feel like that's pretty common in math, robotics, and computer science, but it just shows how every field in STEM is interconnected.

  • @mattshannon5111
    @mattshannon5111 17 days ago

    Wow, it requires really deep understanding and a lot of work to make videos this clear that are also so correct and insightful. Very impressive!

  • @vibaj16
    @vibaj16 18 days ago

    wait, can this be used as a ray tracing denoiser? That is, you'd plug your noisy ray traced image into one of the later steps of the diffusion model, so the model tries to make it clear?

    • @algorithmicsimplicity
      @algorithmicsimplicity 18 days ago

      Yep you could definitely do that, you would probably need to train a model on some examples of noisy ray traced images though.

    • @Maxawa0851
      @Maxawa0851 17 days ago

      Yeah, but this is very slow though.

  • @antongromek4180
    @antongromek4180 18 days ago

    Actually, there is no LLM, etc - but 500 million nerds - sitting in basements all over the world.

  • @artkuts4792
    @artkuts4792 18 days ago

    I still didn't get how the scoring model works. So before, you were labeling the important pairs by hand, giving each a score based on the semantic value the pair has for a given context, but then it's done automatically by a CNN. How does it define the score though (and it's context-free, isn't it)?

    • @algorithmicsimplicity
      @algorithmicsimplicity 18 days ago

      The entire model is trained end-to-end to minimize the training loss. To start off with, the scoring functions are completely random, but during training they will change to output scores which are useful, i.e. which cause the model's final prediction to better match the training labels. In practice it turns out that what these scoring functions learn while trying to be useful is very similar to the 'semantic scoring' that a human would do.

  • @lusayonyondo9111
    @lusayonyondo9111 19 days ago

    wow, this is such an amazing resource. I'm glad I stuck around. This is literally the first time this is all making sense to me.

  • @istoleyourfridgecall911
    @istoleyourfridgecall911 19 days ago

    Hands down the best video that explains how these models work. I love that you explain these topics in a way that resembles how the researchers created these models. Your video shows the thinking process behind these models; combined with great animated examples, it is so easy to understand. You really went all out. If only YouTube promoted these kinds of videos instead of low-quality brainrot videos made by inexperienced teenagers.