Gökçe Aydos
income ($) | house-age (years) | … | house-value ($) |
---|---|---|---|
83252 | 41 | … | 452600 |
83014 | 21 | … | 358500 |
… | … | … | … |
. . .
Goal: learn weights \(w_1, w_2, \ldots\) such that:
\[ w_1 \cdot 83252 + w_2 \cdot 41 + \ldots \approx 452600 \]
\[ w_1 \cdot 83014 + w_2 \cdot 21 + \ldots \approx 358500 \]
This is an excerpt from a real-life dataset from the 1990 US census that
contains ~20,000 rows. Each row represents aggregated data about an
area in California. Each column except house-value represents a
feature. The house values represent the labels (called the label
vector; in case of many labels per row, a label matrix).
❓ Why are \(w_i\)s the same across all rows?
❗ Our goal is to predict the house value(s) from the features. We will use the \(w_i\)s to make predictions on new feature data in the future.
The data can be conveniently fetched using Scikit-learn, which also includes the data description. Note that the actual data measures the income in tens of thousands and the house value in hundreds of thousands of dollars.
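For reference, a minimal sketch of fetching the dataset with Scikit-learn:

```python
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
print(housing.DESCR)          # full dataset description
print(housing.feature_names)  # e.g., 'MedInc', 'HouseAge', ...
print(housing.data[:2])       # first two rows of features
print(housing.target[:2])     # house values, in units of $100k
```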
Image: User:Wiso, public domain, via Wikimedia Commons (edited).
Neural networks make predictions
Goal: Implementation using object-oriented programming (OOP)
Class ideas?
. . .
- Operation
- ParameterOperation
- Layer
- NeuralNetwork
- Optimizer
- Trainer
- …
- Loss
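One possible skeleton for the first two classes (a sketch; apart from `backward()`, which these notes mention, the method names and signatures are assumptions):

```python
import numpy as np


class Operation:
    """A single computation step in the network, e.g., a matrix product."""

    def forward(self, input_: np.ndarray) -> np.ndarray:
        # Compute the output and remember what the backward pass needs
        raise NotImplementedError

    def backward(self, output_grad: np.ndarray) -> np.ndarray:
        # Receive the gradient from the following operation and return
        # the gradient with respect to this operation's input
        raise NotImplementedError


class ParameterOperation(Operation):
    """An Operation that additionally owns trainable parameters."""

    def __init__(self, param: np.ndarray):
        self.param = param      # e.g., the weights w_i
        self.param_grad = None  # filled in during backward()
```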
❓ What differentiates `Loss` from `Operation`?

❗ The `backward()` method of `Loss` can be called with any value, which differentiates `Loss` from `Operation`. Python does not allow creating two versions of a method (no method overloading).

`breakpoint()` is useful for debugging while interacting with the program in IPython.

Another pitfall: combining a `(3, 1)` array with a `(3,)` array, which NumPy silently broadcasts to shape `(3, 3)`. Assertions help.
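A minimal sketch of the pitfall and an assertion guard (the helper name is made up):

```python
import numpy as np

a = np.ones((3, 1))
b = np.ones(3)
print((a + b).shape)  # (3, 3) -- silent broadcasting instead of an error

def add_checked(x, y):
    # Guard against silent broadcasting before combining arrays
    assert x.shape == y.shape, f"shape mismatch: {x.shape} vs {y.shape}"
    # breakpoint()  # uncomment to inspect x and y interactively
    return x + y
```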
\(x\) | \(y\) |
---|---|
1 | 2 |
2 | 4 |
3 | 6 |
4 | ? |
-1 | ? |
How did we do that? We noticed \(y_i = x_i \cdot 2\) and applied this observation to \(x = 4\) and \(x = -1\).
Example: You are a child sitting near a saleswoman who sells figs. If she sells one fig, the customer gives 2 coins; if she sells 2 figs, the customer gives 4 coins, and so on. You would learn that the number of figs should be multiplied by 2.
\(x\) | \(y\) |
---|---|
1 | 1.99 |
2 | 4.02 |
3 | 5.98 |
4 | ? |
-1 | ? |
This time the numbers are a bit off; however, we can still roughly say \(y_i = x_i \cdot 2\) and accept some error in exchange for a simple model.
SHOW: Draw the values on a graph and draw a line.
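For instance, a least-squares line fit recovers roughly this slope (a sketch using NumPy):

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([1.99, 4.02, 5.98])

coeffs = np.polyfit(x, y, deg=1)    # least-squares fit of a line
print(coeffs)                       # slope ≈ 2, intercept ≈ 0
print(np.polyval(coeffs, [4, -1]))  # predictions for the unknown rows
```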
Using the same principle, machines can learn from observations (by finding patterns) and make predictions. This is called machine learning.
\(x_1\) | \(x_2\) | \(y\) |
---|---|---|
1 | 1 | 1.5 |
2 | 1 | 2.5 |
2 | 2 | 3 |
3 | 1 | ? |
-1 | 2 | ? |
The same idea also works with more input variables: here \(y_i = x_{1i} + 0.5\, x_{2i}\).

In general: \(y_i = w_1 x_{1i} + w_2 x_{2i}\), i.e., \(y_i\) is a linear combination of the \(x_i\)s. We can have many input variables as well as many predictions.
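As a sketch, these predictions are a matrix-vector product (NumPy, using the weights found above):

```python
import numpy as np

# Rows of the table above, including the two rows we want to predict
X = np.array([[1, 1],
              [2, 1],
              [2, 2],
              [3, 1],
              [-1, 2]], dtype=float)
w = np.array([1.0, 0.5])  # w_1 and w_2 found by inspection

print(X @ w)  # [1.5 2.5 3.  3.5 0. ] -- the last two are the predictions
```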
❓ Assume the machine learned some \(w_i\)s. Are the predictions correct? How do we test?
. . .
❗ We can compare each prediction \(p_i\) (computed using the \(w_i\)s) with the actual house value \(y_i\).
❓ How do we test the prediction quality using math?
. . .
❗ For example by using mean squared error (MSE).
\[ \frac{(y_1 - p_1)^2 + (y_2 - p_2)^2 + \ldots + (y_n - p_n)^2}{n} \]
MSE

- is a loss function
- measures how erroneous the prediction is
Another example is mean absolute error (MAE). Compared to MAE, MSE emphasizes large errors by squaring them, while small errors, i.e., \(<1\), get even smaller.
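A minimal sketch of both loss functions in NumPy (using the values from the noisy table above):

```python
import numpy as np

def mse(y, p):
    # Mean squared error: squaring emphasizes large errors
    return np.mean((y - p) ** 2)

def mae(y, p):
    # Mean absolute error: every error counts proportionally
    return np.mean(np.abs(y - p))

y = np.array([2.0, 4.0, 6.0])     # actual values
p = np.array([1.99, 4.02, 5.98])  # predictions
print(mse(y, p))  # 0.0003
print(mae(y, p))  # ≈ 0.0167
```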
Goal: minimum loss by training.
❓ How can we find these parameters?
. . .
❗ By taking the partial derivative with respect to each parameter (the gradient) and looking for where it becomes zero. However, we typically cannot find the exact minimum this way, because the loss function can get very complex.
❓ What do we do now?
An alternative perspective:
❓ You are stuck on a mountain. How would you get down?
❗ Look around, move in the direction of the descending path, repeat.
Approach: Find out how much we should change each parameter so that the loss decreases.
❓ How can we find out how much the loss \(L\) changes if we, e.g., increase the parameter \(w_1\) by 1?
. . .
❗ By computing \(\frac{\partial L}{\partial w_1}\big|_{w_1=1}\). If the result is positive, then we decrease \(w_1\), and vice versa. This is called back-propagation.
Procedure: pick a random point, move in the direction of the descending path, repeat. ❓ Where would these three points end up?
Note that the rightmost point will end up in a local minimum.
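A minimal sketch of this procedure in one dimension (the loss function and starting points are made up for illustration; note that this toy loss is convex, so unlike the figure no point gets trapped in a local minimum):

```python
def gradient_descent(grad, w, lr=0.1, steps=100):
    # Repeatedly step against the gradient ("walk down the mountain")
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Hypothetical loss L(w) = (w - 3)^2 with gradient dL/dw = 2 * (w - 3)
grad = lambda w: 2 * (w - 3)
for start in (-5.0, 0.0, 10.0):
    print(start, "->", gradient_descent(grad, start))  # all end near w = 3
```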
Goal: minimum loss by training.