# [TOPIC 2] LINEAR REGRESSION (2) – NEURAL NETWORK APPROACH

## Introduction

Hi everybody,

In the last post, I showed how to implement linear regression with the inverse matrix approach (the normal equation) using both the Scikit-learn and TensorFlow libraries. Although that approach gives us a concise answer, it is costly in memory and computation: the larger the dataset is, the more has to be calculated.

So today I would like to talk about another way to do it, based on the learning algorithm of a neural network, so that you can use it on very large datasets.

Here is some background on neural networks that will help you grasp the concepts: https://en.wikipedia.org/wiki/Artificial_neural_network.

## Linear neuron

##### Fig. 1 Linear neuron

Fig. 1 shows a diagram of a linear neuron (also called a linear predictor). Its input-output process has 3 steps: multiplying the inputs ($x_1$, $x_2$) by the corresponding weights ($\omega_1$, $\omega_2$), summing the products ($\sum_{i=1}^{2} {x_i \omega_i}$), and finally adding a bias ($b$) to the sum:

$$y = \sum_{i=1}^{2} {x_i \omega_i} + b = x_1 \omega_1 + x_2 \omega_2 + b \qquad (1)$$
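The three steps of Eq. (1) can be sketched in a few lines of plain Python. This is only an illustration: the function name `linear_neuron` and the input values are hypothetical and do not come from the TensorFlow code used later in this post.

```python
def linear_neuron(inputs, weights, bias):
    """Return sum_i(x_i * w_i) + b, i.e. Eq. (1) for any number of inputs."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Step 1: multiply each input by its weight. Step 2: sum the products.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Step 3: add the bias.
    return weighted_sum + bias


# Hypothetical values: x = (2, 3), w = (0.5, -1), b = 4
print(linear_neuron([2.0, 3.0], [0.5, -1.0], 4.0))  # 2*0.5 + 3*(-1) + 4 = 2.0
```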

Looking at Eq. (1), if we treat the bias ($b$) as the intercept ($\omega_0$) and the weights as the slopes, the model of the linear neuron is exactly the model used in regression tasks. Thus, linear regression tasks can also be solved with a linear neuron.

Now, let’s take a closer look at how to train the neuron.

## Learning algorithm

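As with the inverse matrix approach, we first need a way to evaluate the linear model. To do so, I again use the least squares function, defined in Eq. (2), as the loss function:

$$\mathcal{L} (\omega) = \frac{1} {2} \sum_{i=1}^{N} {(y_i - (\mathbf{x}_i^\top \boldsymbol{\omega} + b))^2} \qquad (2)$$

where $\mathbf{x}_i$ is the input vector of the $i$-th sample, $y_i$ is its target, and $N$ is the number of samples.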

Afterwards, I use gradient descent as the learning algorithm to train the linear neuron to fit the data. In this topic I mainly focus on the concepts; the details of learning algorithms are left for later, in the topics on neural networks.
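To make the idea concrete before moving to TensorFlow, here is a minimal sketch of batch gradient descent on the least squares loss for a single input attribute. It is written in plain Python under stated assumptions: the toy data (generated from $y = 2x + 1$), the function names and the number of steps are hypothetical, while the initial values of 0.5 and the learning rate of 0.05 mirror the ones used later in this post.

```python
def least_squares_loss(xs, ys, w, b):
    """Half the sum of squared residuals for one input attribute."""
    return 0.5 * sum((y - (x * w + b)) ** 2 for x, y in zip(xs, ys))


def gradient_descent(xs, ys, w=0.5, b=0.5, learning_rate=0.05, steps=1000):
    """Plain batch gradient descent on the least squares loss."""
    for _ in range(steps):
        residuals = [y - (x * w + b) for x, y in zip(xs, ys)]
        grad_w = -sum(x * r for x, r in zip(xs, residuals))  # dL/dw
        grad_b = -sum(residuals)                             # dL/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data generated from y = 2x + 1 (not the post's dataset).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent(xs, ys)
print(w, b, least_squares_loss(xs, ys, w, b))
```

On this small, well-scaled toy problem the weight and bias should end up very close to 2 and 1.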

## Implementing Linear neuron using TensorFlow

As usual, we first need to import the libraries.

import csv
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

Next, initialize weights and biases.

''' Initializing weight W and bias b '''
W = tf.Variable([[0.5]], dtype=tf.float32)
b = tf.Variable([[0.5]], dtype=tf.float32)
# Initializing variables
init = tf.global_variables_initializer()

Afterwards, declare hyperparameters such as learning rate, etc.

''' Initializing hyperparameters '''
learning_rate = 0.05

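Then define the model and calculate the L2 loss function.

# Placeholders for the inputs and targets fed through feed_dict
# (shapes are assumed: one attribute per sample)
X_data = tf.placeholder(tf.float32, shape=[None, 1])
y_data = tf.placeholder(tf.float32, shape=[None, 1])
# Model
model = tf.add(tf.matmul(X_data, W), b)
# Calculating loss function
loss = tf.reduce_sum(tf.square(y_data - model))/2.0
# Calling gradient descent
train = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)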

Finally, begin the training loop.

''' Training loop '''
with tf.Session() as sess:
    sess.run(init)
    for i in range(100):
        sess.run(train, feed_dict={X_data: X, y_data: y})
    intercept = sess.run(b)
    slope = sess.run(W)

However, the results are unexpected.

##### Fig. 2 The result before feature scaling

Why did we get only ‘nan’ instead of numbers? To understand this, let’s take a closer look at our dataset and learning rate.

As you know, the training process optimizes Eq. (2). In our case, the input vector has only 1 attribute, so Eq. (2) simplifies to:

$$\mathcal{L} (\omega) = \frac{1} {2} \sum_{i=1}^{N} {(y_i - (x_i \omega + b))^2} \qquad (3)$$

Then I plot the least squares curve with respect to the weight.

##### Fig. 3 The relation of the loss function and the weights

As illustrated in Fig. 3, the curve is very narrow, which means that even a small change in the weight causes a significant change in the loss function. Moreover, look at Eq. (4), which describes how the weight is updated:

$$\omega_{new} = \omega_{old} - \eta \frac{\partial \mathcal{L} (\omega)} {\partial \omega} \qquad (4)$$

where $\eta$ is the learning rate and the derivative of the least squares function is:

$$\frac{\partial \mathcal{L} (\omega)} {\partial \omega} = - \sum_{i=1}^{N} {x_i (y_i - (x_i \omega + b))}$$
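A quick way to convince yourself that this derivative is right is to compare it with a finite-difference estimate of the one-attribute least squares loss. The sketch below does that in plain Python; the data points, the value of `eps` and the function names are hypothetical, chosen only for the check.

```python
def loss(xs, ys, w, b):
    """One-attribute least squares loss: half the sum of squared residuals."""
    return 0.5 * sum((y - (x * w + b)) ** 2 for x, y in zip(xs, ys))


def analytic_grad_w(xs, ys, w, b):
    """The derivative of the loss with respect to the weight."""
    return -sum(x * (y - (x * w + b)) for x, y in zip(xs, ys))


def numeric_grad_w(xs, ys, w, b, eps=1e-6):
    """Central finite difference of the loss with respect to the weight."""
    return (loss(xs, ys, w + eps, b) - loss(xs, ys, w - eps, b)) / (2 * eps)


# Hypothetical data points, used only to compare the two gradients.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 2.5, 4.0]
print(analytic_grad_w(xs, ys, 0.5, 0.5))  # -9.0
print(numeric_grad_w(xs, ys, 0.5, 0.5))   # should agree up to rounding error
```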

Notice that our inputs lie in a range that is dramatically larger than that of the outputs. So, if the learning rate and the weights are large enough, the magnitude of $\omega$ will soar at every iteration. As a result, the function $\mathcal{L} (\omega)$ will not converge (see the way the weight is updated in Fig. 3 and the training process line in Fig. 4). This is exactly what we are facing.

##### Fig. 4 Divergence of training algorithm

On top of that, the value of the loss function becomes far too large: as you can see in Fig. 4, it rises above $4 \times 10^{32}$ in only 1 epoch. The loss therefore soon overflows, which causes the ‘nan’ values in Fig. 2.
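The same chain of events (exploding updates, overflow, then ‘nan’) can be reproduced without TensorFlow. The following is a minimal sketch in plain Python, assuming hypothetical unscaled data with inputs in the thousands and outputs in single digits; it is not the dataset of this post, and the `train` function is only an illustration of batch gradient descent with the same initial values (0.5), learning rate (0.05) and 100 iterations.

```python
import math


def train(xs, ys, w=0.5, b=0.5, learning_rate=0.05, steps=100):
    """Batch gradient descent on the one-attribute least squares loss."""
    losses = []
    for _ in range(steps):
        residuals = [y - (x * w + b) for x, y in zip(xs, ys)]
        # r * r instead of r ** 2: float ** raises OverflowError, * yields inf
        losses.append(0.5 * sum(r * r for r in residuals))
        w += learning_rate * sum(x * r for x, r in zip(xs, residuals))
        b += learning_rate * sum(residuals)
    return w, b, losses


# Hypothetical unscaled data: large inputs, small outputs.
xs = [500.0, 1000.0, 2000.0, 3000.0]
ys = [0.0, 1.0, 2.0, 3.0]
w, b, losses = train(xs, ys)
print(losses[0], losses[1])          # the loss explodes after one update
print(math.isnan(w), math.isnan(b))  # True True
```

Once an update has to compute `inf - inf`, the result is ‘nan’, and every later iteration keeps it that way.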

To tackle this problem, you can reduce the initial weights and the learning rate. However, it is difficult to find out how small is small enough. So I want to talk about a second way: feature scaling.

The code for this section is available on my github: https://github.com/tenga061/machine_learning/blob/master/linear_regression/7_lr_learning_tensorflow_withou_prescale.py.

## Feature scaling

Our dataset seems to be a good example for feature scaling. You do not always need it, but you should know that gradient descent works well when the features lie in a small range; when it is applied to a large range, it either converges slowly or diverges, as in our problem.

There are many ways to scale features before feeding them to the linear neuron. In this post, I use the formula defined in Eq. (5):

$$x_{i}^{*} = \frac{x_i - x_{min}} {x_{max} - x_{min}} \qquad (5)$$

The code for feature scaling:

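# Scaling data (min and max are taken once, before any value is overwritten)
income_min = min(mans_monthly_income)
income_max = max(mans_monthly_income)
for i in range(len(mans_monthly_income)):
    mans_monthly_income[i] = (mans_monthly_income[i]-income_min)/(income_max-income_min)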
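As a variation, Eq. (5) can also be wrapped in a small function that returns a new list instead of overwriting the original one. This is a minimal sketch; the name `min_max_scale` and the sample incomes are hypothetical.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] following Eq. (5)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("cannot scale a constant feature (max equals min)")
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical monthly incomes, not the post's dataset.
incomes = [500.0, 1000.0, 2000.0, 3000.0]
print(min_max_scale(incomes))  # [0.0, 0.2, 0.6, 1.0]
print(incomes)                 # the original list is left untouched
```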

The results are approximately the same as those of the normal equation.

>>> intercept = 0.44028864
>>> slope = 0.00086274

##### Fig. 5 The training line after feature scaling

##### Fig. 6 The linear model using linear neuron

The code for this section is available on my github: https://github.com/tenga061/machine_learning/blob/master/linear_regression/5_lr_learning_tensorflow.py.

Now you know how to use the learning algorithm to train a linear neuron so that it fits the data. So, what are the pros and cons of using it instead of the normal equation? Let’s compare them.

## Comparison between the normal equation approach and the linear neuron one

As mentioned before, training a linear neuron is the better approach for larger datasets. However, you can’t have your cake and eat it too: each approach has its own benefits and drawbacks. The table below compares Gradient Descent and the Normal Equation to illustrate this.

##### Table 1. The comparison of Gradient Descent and Normal Equation

On the other hand, as you can see from the previous post and this one, you don’t need to scale features if you use the normal equation.

So, the most suitable approach depends on your particular situation.
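To see both approaches side by side, here is a minimal sketch in plain Python that fits the same line twice: once with the closed-form least squares solution for a single attribute (no scaling needed), and once with gradient descent on min-max scaled inputs, mapping the result back to the original units. The data, function names, learning rate and number of steps are hypothetical, chosen only for illustration.

```python
def normal_equation_1d(xs, ys):
    """Closed-form least squares fit for a single input attribute."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def gradient_descent_scaled(xs, ys, learning_rate=0.05, steps=2000):
    """Min-max scale xs, run batch gradient descent, then undo the scaling."""
    x_min, x_range = min(xs), max(xs) - min(xs)
    scaled = [(x - x_min) / x_range for x in xs]
    w, b = 0.5, 0.5
    for _ in range(steps):
        residuals = [y - (x * w + b) for x, y in zip(scaled, ys)]
        w += learning_rate * sum(x * r for x, r in zip(scaled, residuals))
        b += learning_rate * sum(residuals)
    # y = w * (x - x_min) / x_range + b, rewritten in terms of the raw x
    slope = w / x_range
    return slope, b - slope * x_min


# Hypothetical data, not the post's dataset.
xs = [500.0, 1000.0, 2000.0, 3000.0]
ys = [0.0, 1.0, 2.0, 3.0]
ne_slope, ne_intercept = normal_equation_1d(xs, ys)
gd_slope, gd_intercept = gradient_descent_scaled(xs, ys)
print(ne_slope, ne_intercept)
print(gd_slope, gd_intercept)
```

With enough iterations the two pairs of numbers should agree closely. The trade-off is that gradient descent needed scaling, a learning rate and many passes over the data, while the closed form needed none of them.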

## Disadvantages of linear regression

OK, now you know that linear regression is very useful and easy to use. Nonetheless, there are some drawbacks when using it.

Firstly, because it is a linear model, it cannot fit complicated datasets, though you can apply it to some simple non-linear problems (this will be discussed in a later post). For those datasets, you need to use a non-linear model instead.

Secondly, it is sensitive to outliers or noise in the dataset. For example, let’s add a noisy sample to our dataset. Assume there was a guy who earns no more than \$300 a month but has as many as 9 girlfriends. Let’s see what changes in our model (see Fig. 7).

##### Fig. 7 Linear model after being exposed to noise

To deal with this, the dataset needs to be pre-processed before being fed to the linear model. Each discipline has different methods to tackle this.

Another way is to change the loss function. As you may know, one of the reasons for this extreme sensitivity is the L2 loss function. For example, four deviations of 4 give a total penalty of 64, while a single larger deviation of 10 gives a larger penalty of 100. So, for datasets with outliers, using the absolute value function, also called the L1 loss function (defined in Eq. (6)), is more suitable than using L2:

$$\mathcal{L}_1 (\omega) = \sum_{i=1}^{N} {\mid y_i - (x_i \omega + b) \mid} \qquad (6)$$
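The arithmetic behind this example is easy to check. The sketch below computes both penalties for the two cases; the function names are hypothetical, and the constant factor of 1/2 from the least squares loss is left out, as in the example.

```python
def l2_penalty(deviations):
    """Sum of squared deviations (without the 1/2 factor)."""
    return sum(d ** 2 for d in deviations)


def l1_penalty(deviations):
    """Sum of absolute deviations."""
    return sum(abs(d) for d in deviations)


several_small = [4, 4, 4, 4]  # four deviations of 4
one_large = [10]              # one deviation of 10

print(l2_penalty(several_small), l2_penalty(one_large))  # 64 100
print(l1_penalty(several_small), l1_penalty(one_large))  # 16 10
```

Under L2 the single outlier outweighs the four ordinary deviations together, while under L1 it counts for less than they do, which is why the L1 fit is pulled less strongly towards outliers.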

Now, replacing L2 with L1 in our program, we get the results illustrated in Fig. 8.

##### Fig. 8 Linear model with L1 loss function

So far, we have explored how to implement a linear model for regression using both the normal equation and the linear neuron, and we have seen the advantages and disadvantages of each. In the next post, I will talk about how to apply linear regression to some simple non-linear problems.

See ya,

Curious Chick
