Why Machines Learn

Why Machines Learn: The Elegant Math Behind Modern AI, Anil Ananthaswamy, 2024

November/December…, 2024

Context and Reflection

I am reading this book as part of a small book club: The 26-minute Book Club. This sort of book is really not my cup of tea; it is about the math with which various learning algorithms are implemented. I am, at my deepest, interested in the algorithms. Math is neither a strong point nor a deep interest; it seems more like magic to me: you define some terms and rules, build up a bunch of elaborate tautologies, and all of a sudden you can do things you could not do before.

I had imagined I would give this a one-meeting trial, and then bow out. However, I found the group (4 other people) quite pleasant, and composed of, barring the one person I know who invited me, interesting strangers that I would not otherwise get to know. And that, especially in retirement, is actually a pretty valuable thing. Everybody else I encounter is mostly a part of my social network, even if only a friend of a friend. Another advantage is that this forces me to think about something that I would otherwise ignore (basic math including linear algebra, vectors, a little calculus, probability, Gaussian distributions, Bayes’ Theorem), and perhaps it will prove tractable enough that I will feel emboldened to try to take on fluid dynamics… Or perhaps I will figure out how to distill the ‘qualitative’ aspects that I am interested in from the mangle of symbols… We shall see.

C1: Desperately Seeking Patterns

  • Konrad Lorenz and duckling imprinting. Ducklings that imprint on two moving objects of the same color (say, two red objects) will later follow any two moving objects of the same color; ducklings imprinted on two objects of different colors will follow pairs of different colors, mutatis mutandis. How is it that an organism with a small brain and only a few instants of exposure can learn something like this?
  • This leads into a brief introduction to Frank Rosenblatt’s invention of the Perceptron in the late 1950s, one of the first ‘brain-inspired’ algorithms that can learn patterns by inspecting labeled data.
  • Then there is a foray into notation: linear equations (relationships) with weights (aka coefficients) and symbols (variables). Also sigma notation.
  • McCulloch and Pitts, 1943, the story of their meeting, and their model of a neuron (a ‘neurode’) that can implement basic logical operations.
  • The MCP Neurode. Basically, the neurode takes inputs, combines them according to a function, and outputs a 1 if the result is over a threshold theta, and a 0 otherwise. If you allow inputs to be negative and lead to inhibition, as well as allow neurodes to connect to one another, you can implement all of Boolean logic. The problem, however, is that the thresholds theta must be hand-crafted.
  • Rosenblatt’s Perceptron made a splash because it could learn its weights and theta from the data. An early application was training perceptrons to recognize hand-drawn letters; the perceptron learned simply by being ‘punished’ for its mistakes (see the sketch after this list).
  • Hebbian Learning: Neurons that fire together, wire together. Or, learning takes place by the formation of connections between firing neurons, and the loss or severing of connections between neurons that are not in sync.
  • The difference between the MCP Neurode and the Perceptron is that a perceptron’s inputs don’t have to be 1 or 0; they can be continuous. The inputs are weighted, and their weighted sum is compared to a bias.
  • The Perceptron does make one basic assumption: that there is a clear, unambiguous rule to learn, with no noise in the data.
  • It can be proven that a perceptron will always find a linear divide when there is one to be found.
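
A minimal sketch of the perceptron idea, in Python rather than anything from the book: each input is weighted, the weighted sum is compared against a bias/threshold, and the weights are nudged only when the output is wrong. The toy dataset (logical AND) and the learning rate are my own illustrative choices.

```python
# Toy perceptron: weighted sum + threshold, trained by "punishing" mistakes.
# Illustrative sketch only; dataset and learning rate are arbitrary choices.

def predict(weights, bias, x):
    # Weighted sum of the inputs, compared against a bias/threshold.
    total = sum(w * xi for w, xi in zip(weights, x))
    return 1 if total + bias > 0 else 0

def train(data, epochs=25, lr=0.1):
    # data: list of (inputs, label) pairs, labels 0 or 1.
    n = len(data[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, label in data:
            error = label - predict(weights, bias, x)   # -1, 0, or +1
            # Nudge the weights toward the correct answer only when wrong.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

if __name__ == "__main__":
    # A linearly separable toy problem: logical AND of two inputs.
    and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
    w, b = train(and_data)
    print(w, b, [predict(w, b, x) for x, _ in and_data])   # expect [0, 0, 0, 1]
```

With the weights held fixed by hand instead of learned, the same predict function is essentially an MCP neurode; the learning loop is what Rosenblatt added. Because AND is linearly separable, the convergence result mentioned above guarantees the loop settles on a working line.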

C2: We are All Just Numbers

  • Hamilton’s discovery of quaternions, and his inscription on Brougham Bridge in Dublin: i² = j² = k² = ijk = −1. Quaternions don’t concern us, but Hamilton developed concepts for manipulating them that are quite important: vectors and scalars.
  • Scalar/Vector math: computing length; sum; stretching a vector by scalar multiplication;
  • Dot products: a · b = a₁b₁ + a₂b₂
  • Something about dot products being similar to weighted sums, which can be used to represent perceptrons??? Didn’t understand this bit. [p. 36-42]
    ➔ I think the point is that the perceptron’s weighted sum w₁x₁ + w₂x₂ + … is exactly the dot product w · x of a weight vector with an input vector, so the perceptron’s rule can be written compactly: output 1 if w · x + b > 0, and 0 otherwise.
  • A perceptron is essentially an algorithm for finding a line/plane/hyperplane that accurately divides values into appropriately labeled regions.
  • Using matrices to represent vectors. Matrix math. Multiplying matrix A with the Transpose of Matrix B
  • So the point of all this is to re-express Rosenblatt’s Perceptron in formal notation: a linear transformation that maps an input vector to an output.
  • “Lower bounds tell us about whether something is impossible.” — Manuel Sabin
  • Minsky and Papert’s book, Perceptrons, poured cold water on the field by proving that a single-layer Perceptron cannot cope with XOR. XOR can only be solved with multiple layers of Perceptrons, but at the time nobody knew how to train anything but the top layer (see the sketch after this list).
  • …I am not clear on why failure to cope with XOR was such cold water...
    ➔ It is because XOR is a simple logical operation; the inability of Perceptrons to handle it suggested that they would not work for even moderately complex problems. Some also generalized the failure to all neural networks, rather than just single-layer ones.
  • Training multiple layers requires backpropagation…
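
A small sketch of my own, not from the book, tying two of these notes together: the perceptron’s decision is just a dot product plus a bias, and a brute-force search over candidate weights shows that AND is linearly separable while XOR is not. The grid of candidate weights is an arbitrary illustrative choice.

```python
# Perceptron decision rule as a dot product: output 1 if w·x + b > 0.
# A coarse grid search over (w1, w2, b) finds a separating line for AND
# but not for XOR, because no such line exists for XOR.
import itertools

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def separable(dataset):
    steps = [i / 4 for i in range(-8, 9)]   # candidate values from -2.0 to 2.0
    for w1, w2, b in itertools.product(steps, repeat=3):
        if all((dot((w1, w2), x) + b > 0) == (label == 1) for x, label in dataset):
            return True
    return False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND separable:", separable(AND))   # True, e.g. w = (1, 1), b = -1.5
print("XOR separable:", separable(XOR))   # False: no line splits the four points
```

A caveat on the design: the grid search can only prove separability when it finds a working line; for XOR the negative answer happens to be the true one, since no weights and bias at all can satisfy all four constraints.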

C3: The Bottom of the Bowl

  • McCarthy, Minsky, Shannon, and Rochester: their 1955 proposal led to the Dartmouth summer workshop on Artificial Intelligence, held in 1956.
  • Widrow worked on filtering noise out of signals: he worked on continuous signals; others applied his approach to filtering digital signals. Widrow and Hoff, working on adaptive filtering, invented the Least Mean Squares (LMS) algorithm.
  • Least Mean Squares minimizes the mean squared error, which quantifies how wrong the filter is. What Widrow wanted to do was to create an adaptive filter that would learn in response to errors; this required a method for adjusting the parameters of the filter so as to minimize the error. That method is the method of steepest descent, discovered by the French mathematician Cauchy. (A small LMS sketch follows this list.)
  • Much of the rest of the chapter introduces math for ‘descending slopes.’ The derivative dy/dx gives the slope at a point, and the minimum has a slope of zero. When the function has multiple variables (a surface rather than a curve), we need partial derivatives, one per variable.
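
A minimal sketch of the LMS (Widrow-Hoff) idea, my own illustration rather than Widrow’s code: for each sample, compute the error between the target and the prediction and nudge each weight down the slope of that sample’s squared error. The “true” relationship and the step size are assumptions made for the example.

```python
# Least Mean Squares in miniature: for each sample, step the weights
# opposite the gradient of that sample's squared error.
import random

random.seed(0)

# Hypothetical target relationship: y = 2*x1 - 3*x2 plus a little noise.
data = []
for _ in range(200):
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    y = 2 * x1 - 3 * x2 + random.gauss(0, 0.05)
    data.append(((x1, x2), y))

w = [0.0, 0.0]
lr = 0.1                      # step size down the slope
for epoch in range(50):
    for x, y in data:
        pred = w[0] * x[0] + w[1] * x[1]
        err = y - pred        # squared error is err**2; d(err**2)/dw[i] = -2*err*x[i]
        w[0] += lr * err * x[0]
        w[1] += lr * err * x[1]

print(w)                      # approaches [2, -3]
```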

“If there’s one thing to take away from this discussion, it’s this: For a multi-dimensional or high-dimensional function (meaning, a function of many variables), the gradient is given by a vector. The components of the vector are partial derivatives of that function with respect to each of the variables.

What we have just seen is extraordinarily powerful. If we know how to take the partial derivative of a function with respect to each of its variables, no matter how many variables or how complex the function, we can always express the gradient as a row vector or column vector.

Our analysis has also connected the dots between two important concepts: functions on the one hand and vectors on the other. Keep this in mind. These seemingly disparate fields of mathematics – vectors, matrices, linear algebra, calculus, probability and statistics, and optimization theory (we have yet to touch upon the latter two) – will all come together as we make sense of why machines learn.”
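
To make the quoted passage concrete, here is a small sketch of my own, assuming the simplest bowl-shaped function f(x, y) = x² + y²: the gradient is the vector of partial derivatives (∂f/∂x, ∂f/∂y) = (2x, 2y), and repeatedly stepping against it walks us to the bottom of the bowl, where the slope is zero.

```python
# Steepest descent on a bowl: f(x, y) = x**2 + y**2.
# The gradient is the vector of partial derivatives (2x, 2y);
# stepping against it moves us toward the minimum at (0, 0).

def f(x, y):
    return x**2 + y**2

def gradient(x, y):
    return (2 * x, 2 * y)    # (df/dx, df/dy)

x, y = 3.0, -4.0             # arbitrary starting point on the bowl's wall
lr = 0.1                     # step size
for step in range(100):
    gx, gy = gradient(x, y)
    x -= lr * gx             # move opposite the gradient
    y -= lr * gy

print(x, y, f(x, y))         # all approximately 0: the bottom of the bowl
```

In practice one stops when the gradient’s length falls below a small tolerance rather than after a fixed number of steps.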

To be continued…
