# [TOPIC 3] LINEAR REGRESSION (3) – SOLVING NON-LINEAR PROBLEMS

Hi everyone! It has been a long time no see. I got extremely busy preparing my application for the deadline of Spring 2018 semester admission at universities to study at a higher level of education and do my further researches on the field of Machine Learning and Human-Computer Interface. It really was a difficult process, occupied most my time for travelling around and around the city for preparing necessary paperworks, writing statments like statment of purpose or study plan and sending emails to my professors to ask for recommendation letters.

## Introduction:

OK, just a little talk, to begin this topic, let’s remember, I introduced you how to implement a Linear Neuron Model to solve a linear problem with the aid of famous library Tensorflow. However, remember that how frequently do you face with the linear problem? Honestly, it rarely happens, even the “Materialism” problem related to the number of girl friend and monthly salary we dealt with in the first and second topic of “Linear Regression” is also not a pure linear problem. Let me show you what is a problem you get when using linear model for this. You might say that it was very good at trying to guess the number of my may-be-have girl friends (GF) in the first topic. But how about trying to use this linear model to predict the number of girl friends of the richest man in the world – Bill Gates (OK, this is just a joke, I mean no offence when I said that). From , we know Bill’s monthly income is $\216,666,666.00$, infer from the model, we get the number of GF he may have is $0.45753154 + 0.00087198*216,666,666.00 = 188,929.457$. How could it be real? Theoretically, may be. Practically, it’s infeasible, even Casanova couldn’t reach this number. Now, you see the problem here, if the monthly income of a man increase to infinite, the number of GF he has will increase to infinte too while the limit on the number of population worldwide or others such as morality, religion, etc. make it impossible. To deal with this, we need a restriction on the upper bound.

## Polynomial regression model

### Polynomial function

For our dataset, the polynomial model has form: $\hat{y} = f(x, \omega) = \omega_{0} + x_{1} \omega_{1} + x_{1}^{2} \omega_{2} + ... + x_{1}^{n} \omega_{n} \indent (1)$

Or we can rewrite it into: $\hat{y} = f(x, \omega) = \sum_{i=0}^{n}{x_{1}^{i} \omega_{i}}$

Or by matrix form: $\hat{y} = f(x, \omega) = \bold{X} \omega \indent \indent \indent \indent (2)$

Let’s look at the equation (1), $f(x, \omega)$ is a nonlinear function of x though, it is a linear function of the coefficient $\omega$. Because of that, we can still implement the polynomial model as the same way as we do it with the linear one in the first topic. But now, we have the form of the matrix $\bold{X}$ is: $\bold{X} = \begin{bmatrix} 1&x_{1}&x_{1}^{2}&...&x_{1}^{n}\\ 1&x_{2}&x_{2}^{2}&...&x_{2}^{n}\\..\\1&x_{k}&x_{k}^{2}&...&x_{k}^{n} \end{bmatrix}$

### Implementing polynomial model using TensorFlow

To form input vector after feeding it into the model, I wrote the function below.

Forming X data:

# Creating the input matrix X
def create_x(order):
global mans_monthly_income

# Creating element vector
X = np.ones((len(mans_monthly_income)), 1)
for i in range(1, order+1):
temp = np.power(mans_monthly_income, i)
x_temp = np.array(temp, dtype=float)
x_temp = np.transpose(np.matrix(x_temp))

X = np.column_stack((X, x_temp))

return X

Forming y data:

# Creating the target vector y
def create_y():
global total_number_of_gfs

y = np.array(total_number_of_gfs, dtype=float)
y = np.transpose(np.matrix(y))

return y

The results are showed below when using polynomial function with order of 3, 7 and 9, respectively: ##### Fig. 2 The model with order of 3 ##### Fig. 3 The mode with order of 7 ##### Fig. 4 The model with order of 9

The code can be downloaded from my github: https://github.com/tenga061/machine_learning/blob/master/linear_regression/episode_3/code/12_lr_nonlinear_polynomial.py.

### Testing the model

The question now is that how could we know which order is the best? To answer this, I made another dataset (red points in Fig. 5) for the purpose of testing our model by evaluating the root-mean-square (RMS) error (showed as Eq. (3)) each order from 0 to 10. $E_{RMS} = \sqrt{\frac{1} {N} \sum_{i=1}^{N} {(\hat{y} - y)^{2}}} \indent \indent \indent \indent (3)$ ##### Fig. 5 Training (blue points) and testing (red points) dataset

The testing dataset can be downloaded here: https://github.com/tenga061/machine_learning/blob/master/linear_regression/episode_3/dataset/funny_dataset_test.csv. ##### Fig. 6 The relation between the value of RMS error and the order of polynomial function

The code can be downloaded from my github: https://github.com/tenga061/machine_learning/blob/master/linear_regression/episode_3/code/13_lr_nonlinear_polynomial_rms.py.

From the Fig. 6, we can see that the RMS value of models with order higher than 8 begin increasing. In that case, we say our model was overfitted. By which, the model couldn’t work well with other test datasets (even it may work wrong), though it have done well with the training. To deal with this problem, there is some techniques can be used.

## Overfitting

### Increasing the size of the dataset

The first is to collect more and more data. This depends on the data acquisition step, sometimes we can, but sometimes not. There is a ‘lex non scripta’ supported by many is that the number of data points should be larger than some multiple (from 5 to 10 is OK) of the number of parameters in the model.

### Regularization

However, what happens if you are in the situation that you cannot acquire more data. In the case, I would like to introduce you a simple technique of regularization by adding $\frac{1} {2} \lambda \omega^{2}$ to the loss function $\mathcal{L} (\omega)$.  Now, the loss function become: $\mathcal{L}(\omega) = \frac{1} {2} \sum_{i=1}^{N} {(y_{i} - \bold{x}_{i})^{2} \omega} + \frac{1} {2} \lambda \omega^{2} \indent \indent \indent \indent (4)$

The lambda in the adding element help us penalize the loss when the model become overfitting and keep the parameters small.

Turn Eq. (4) to the matrix form: $\mathcal{L}(\omega) = \frac{1} {2} (\bold{y} - \bold{X} \omega)^{2} + \frac{1} {2} \lambda \omega^{2} \indent \indent \indent \indent (5)$

Calculating the derivation of Eq. (5) and set it equal 0 to find roots of the model: $\frac{\partial{\mathcal{L} (\omega)}} {\partial{\omega}} = -\bold{X}^{T} (\bold{y} - \bold{X} \omega) + \lambda \omega = 0$

Similarly, $(\bold{X}^{T} \bold{X} + \lambda \bold{I}) \omega = \bold{X}^{T} \bold{y}$

Assume that $(\bold{X}^{T} \bold{X} + \lambda \bold{I})$ is a nonsingular matrix, we have parameters of model is the root of the equation: $\omega = (\bold{X}^{T} \bold{X} + \lambda \bold{I})^{-1} \bold{X}^{T} \bold{y}$

#### Implementing polynomial with regularization using Scikit-Learn lib

The part of code to implement the model with regularization using Scikit-Learn is showed below:

# Creating the model and feeding data to it
model = linear_model.Ridge(alpha=lamda)
model.fit(X, y)

# Getting the coefficients
coef = model.coef_
params = coef
print params

The code can also be downloaded from my github: https://github.com/tenga061/machine_learning/blob/master/linear_regression/episode_3/code/16_lr_regularization_sklearn.py.

#### Implementting polynomial with regularization using TensorFlow lib

The part of code to implement the model with regularization using TensorFlow is showed below:

# Creating constant opt to store data
X_tensor = tf.constant(X)
y_tensor = tf.constant(y)

''' Finding the root of the model '''
# Calculating Xt*X
first_step_1 = tf.matmul(tf.transpose(X_tensor), X_tensor)
# Calculating lamda.I
temp_step = (order+1)*lamda*np.identity(order+1)
first_step_2 = tf.constant(temp_step)
# Calculating Xt*X + lamda.I
first_step = tf.add(first_step_1, first_step_2)
# Calculating (Xt*X + lamda.I)^(-1)
second_step = tf.matrix_inverse(first_step)
# Calculating (Xt*X + lamda.I)^(-1)*Xt
third_step = tf.matmul(second_step, tf.transpose(X_tensor))
# Calculating (Xt*X + 2.lamda.I)^(-1)*Xt*y
root = tf.matmul(third_step, y_tensor)

''' Getting parameters '''
with tf.Session() as sess:
params = sess.run(root)

print params

The code can also be downloaded from my github: https://github.com/tenga061/machine_learning/blob/master/linear_regression/episode_3/code/15_lr_regularization.py.

## Sigmoid function

Another way to confront with nonlinear problems is to use functions having features suiting dataset features. This is based on your experience much.

Look at our dataset, on the one hand, we can see points distributing in a curve that is partly look like a function called “sigmoid” (as showed as Fig. 7). ##### Fig. 7 The graph of sigmoid function

On the other hand, we have a restriction on the upper bound, so it can be easily recognised that these features suit sigmoid’s features well. For the reason, now, we will use the model showed in Eq. (6) instead of the polynomial model as before. $\hat{y} = f(x, \omega) = \omega_{0} + \frac{1} {1 + e^{-x}} \omega_{1} \indent \indent \indent \indent (6)$

However, note that the input x is very large in our dataset (up to 20,000), so $e^{-x}$ is very small and can be set as 0. For that reason, we need to add a trick to the Eq. (6) by dividing x by a constant number k. In this case, I use k = 4000, the Eq. (6) turn into: $\hat{y} = f(x, \omega) = \omega_{0} + \frac{1} {1 + e^{-x/4000}} \omega_{1} \indent \indent \indent \indent (7)$

The new model is illustrated in Fig. 8. ##### Fig. 8 The sigmoid model

The code can be downloaded from my github: https://github.com/tenga061/machine_learning/blob/master/linear_regression/episode_3/code/9_lr_nonlinear.py.

Apart from polynomial or sigmoid function, we can also apply others like trigonometic or logaric function for our regression model.

So, that’s all I want to talk about how to use linear model for regression tasks. In the next topic, I would introduce the way to use linear model, but for classification tasks.

Hope you enjoy it, goodbye, and see ya,

Curious Chick

## References

 Simon Rogers, Mark Girolami – A first course in Machine Learning

 Christopher M. Bishop – Pattern Recognition and Machine Learning 