## [TOPIC 3] LINEAR REGRESSION (3) – SOLVING NON-LINEAR PROBLEMS

Hi everyone! Long time no see. I have been extremely busy preparing my application for the Spring 2018 university admission deadline, so that I can study at a higher level and do further research in the fields of Machine Learning and Human-Computer Interfaces. It really was a difficult process that took up most of my time: travelling around and around the city to prepare the necessary paperwork, writing statements such as the statement of purpose and the study plan, and sending emails to my professors to ask for recommendation letters.

## Introduction:

OK, enough small talk. To begin this topic, recall that I showed you how to implement a Linear Neuron Model to solve a linear problem with the help of the well-known TensorFlow library. But how often do you actually face a linear problem? Honestly, it rarely happens. Even the “Materialism” problem relating the number of girlfriends to monthly salary, which we dealt with in the first and second topics of “Linear Regression”, is not a purely linear problem.

Let me show you what goes wrong when we use a linear model for it. You might say the model did a very good job of guessing the number of girlfriends (GF) I might have in the first topic. But what about using this linear model to predict the number of girlfriends of the richest man in the world, Bill Gates? (OK, this is just a joke; I mean no offence.) From [1], we know Bill’s monthly income is $\$216,666,666.00$. Plugging this into the model, the number of GF he may have is $0.45753154 + 0.00087198 \times 216,666,666.00 \approx 188,929.457$. How could that be real? Theoretically, maybe. Practically, it is infeasible; even Casanova couldn’t reach this number. Now you see the problem: if a man’s monthly income grows to infinity, the predicted number of GF grows to infinity too, while limits such as the world population, morality, religion, etc. make that impossible. To deal with this, we need a restriction on the upper bound.
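As a quick check of the arithmetic above, here is a small sketch that plugs the income into the linear model. The two coefficients and the income are the numbers quoted in the text; the variable names are just mine for illustration.

```python
# Coefficients of the linear model quoted in the text
w0 = 0.45753154   # intercept
w1 = 0.00087198   # slope: GF per dollar of monthly income

monthly_income = 216666666.00  # Bill's monthly income, taken from [1]

# The linear model has no upper bound: the prediction grows with income forever
predicted_gf = w0 + w1 * monthly_income
print(round(predicted_gf, 3))
```

Doubling the income roughly doubles the prediction, which is exactly the behaviour we want to get rid of.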

## Polynomial regression model

### Polynomial function

For our dataset, the polynomial model has the form:

$\hat{y} = f(x, \omega) = \omega_{0} + x_{1} \omega_{1} + x_{1}^{2} \omega_{2} + ... + x_{1}^{n} \omega_{n} \indent (1)$

Or we can rewrite it into:

$\hat{y} = f(x, \omega) = \sum_{i=0}^{n}{x_{1}^{i} \omega_{i}}$

Or, in matrix form:

$\hat{y} = f(x, \omega) = \bold{X} \omega \indent \indent \indent \indent (2)$

Looking at Eq. (1), although $f(x, \omega)$ is a nonlinear function of x, it is a linear function of the coefficients $\omega$. Because of that, we can still implement the polynomial model in the same way as the linear one in the first topic. The only change is the matrix $\bold{X}$, which now has one row per data point (here $x_{1}, x_{2}, ..., x_{k}$ denote the k data points) and one column per power of x:

$\bold{X} = \begin{bmatrix} 1&x_{1}&x_{1}^{2}&...&x_{1}^{n}\\ 1&x_{2}&x_{2}^{2}&...&x_{2}^{n}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 1&x_{k}&x_{k}^{2}&...&x_{k}^{n} \end{bmatrix}$
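To see what this matrix looks like with actual numbers, here is a minimal sketch using NumPy's `np.vander`. The three data points are made up and are not from the dataset in this post.

```python
import numpy as np

# Three made-up data points (not from the dataset in this post)
x = np.array([1.0, 2.0, 3.0])
order = 2

# increasing=True orders the columns as x^0, x^1, ..., x^order,
# which matches the layout of the matrix X above
X = np.vander(x, N=order + 1, increasing=True)
print(X)
```

Each row holds the powers of one data point, and the first column is always 1.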

### Implementing polynomial model using TensorFlow

To form the input matrix and the target vector before feeding them into the model, I wrote the functions below.

Forming X data:

    import numpy as np

    # Creating the input matrix X
    def create_x(order):
        global mans_monthly_income

        # Start with a column of ones (the x^0 term)
        X = np.ones((len(mans_monthly_income), 1))
        for i in range(1, order+1):
            # Convert to float first so that high powers do not overflow integers
            x_temp = np.power(np.array(mans_monthly_income, dtype=float), i)

            # Append the column of incomes raised to the power i
            X = np.column_stack((X, x_temp))

        return X

Forming y data:

    # Creating the target vector y
    def create_y():
        global total_number_of_gfs

        y = np.array(total_number_of_gfs, dtype=float)
        y = np.transpose(np.matrix(y))

        return y
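Once X and y are formed, the coefficients can be found exactly as in the linear case. The sketch below is only an illustration under my own assumptions: it uses made-up, noise-free data generated from a known polynomial, and `np.linalg.lstsq` as the solver instead of the code from the first topic.

```python
import numpy as np

# Made-up, noise-free data generated from y = 1 + 2x + 3x^2
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x**2

order = 2
X = np.vander(x, N=order + 1, increasing=True)   # columns: 1, x, x^2

# Least-squares solution of X w = y; for a full-rank X this is the same w
# that the normal equation gives
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 6))
```

Note that I kept x in the range [0, 1] here. Raising incomes in the thousands to the 7th or 9th power gives enormous numbers, so rescaling the input first may be worth trying if the solver struggles.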

The results of using polynomial functions of order 3, 7 and 9, respectively, are shown below:

### Testing the model

The question now is: how do we know which order is the best? To answer this, I made another dataset (red points in Fig. 5) for testing our model, and evaluated the root-mean-square (RMS) error (shown in Eq. (3)) for each order from 0 to 10.

$E_{RMS} = \sqrt{\frac{1} {N} \sum_{i=1}^{N} {(\hat{y}_{i} - y_{i})^{2}}} \indent \indent \indent \indent (3)$
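Eq. (3) translates almost directly into NumPy. This is a minimal sketch; the function name `rms_error` and the sample numbers are mine, not from the original code.

```python
import numpy as np

def rms_error(y_pred, y_true):
    """Root-mean-square error between predictions and targets, as in Eq. (3)."""
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    y_true = np.asarray(y_true, dtype=float).ravel()
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Made-up numbers: the three errors are 1, 0 and -1
print(rms_error([2.0, 3.0, 4.0], [1.0, 3.0, 5.0]))
```

Because the squared errors are averaged before taking the square root, the result is in the same units as y, which makes it easy to interpret.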

##### Fig. 6 The relation between the value of RMS error and the order of polynomial function

From Fig. 6, we can see that the RMS error of models with order higher than 8 begins to increase. In that case, we say the model is overfitted: it does not work well on other test datasets (it may even give badly wrong predictions), although it did well on the training data. To deal with this problem, there are several techniques we can use.
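The experiment described above can be sketched as follows. Everything here is an assumption for illustration: the data is made up (a noisy sine curve with x already scaled to [0, 1]), and the fit uses `np.linalg.lstsq`, so the numbers will not match Fig. 6.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Made-up data: a smooth curve plus noise, with x already scaled to [0, 1]
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

def rms_error(y_pred, y_true):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(100)     # separate test set

train_rms, test_rms = [], []
for order in range(0, 11):
    X_train = np.vander(x_train, N=order + 1, increasing=True)
    X_test = np.vander(x_test, N=order + 1, increasing=True)
    # Fit on the training set only
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    # Evaluate on both sets
    train_rms.append(rms_error(X_train @ w, y_train))
    test_rms.append(rms_error(X_test @ w, y_test))

for order, (tr, te) in enumerate(zip(train_rms, test_rms)):
    print(order, round(tr, 3), round(te, 3))
```

The training error keeps shrinking as the order grows, so it cannot tell us when to stop. Only the error on the separate test set reveals the point where a higher order stops helping.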

## Overfitting

### Increasing the size of the dataset

The first is to collect more data. This depends on the data acquisition step; sometimes it is possible, sometimes not. There is an unwritten rule (‘lex non scripta’), supported by many, that the number of data points should be larger than some multiple (5 to 10 is OK) of the number of parameters in the model.

### Regularization

However, what happens if you are in a situation where you cannot acquire more data? For that case, I would like to introduce a simple regularization technique: adding $\frac{1} {2} \lambda \omega^{T} \omega$ to the loss function $\mathcal{L} (\omega)$. The loss function now becomes:

$\mathcal{L}(\omega) = \frac{1} {2} \sum_{i=1}^{N} {(y_{i} - \bold{x}_{i}^{T} \omega)^{2}} + \frac{1} {2} \lambda \omega^{T} \omega \indent \indent \indent \indent (4)$

The added term, weighted by lambda, penalizes large parameter values. This keeps the parameters small and discourages the model from overfitting.

Turning Eq. (4) into matrix form:

$\mathcal{L}(\omega) = \frac{1} {2} (\bold{y} - \bold{X} \omega)^{T} (\bold{y} - \bold{X} \omega) + \frac{1} {2} \lambda \omega^{T} \omega \indent \indent \indent \indent (5)$

Taking the derivative of Eq. (5) with respect to $\omega$ and setting it equal to 0 to find the minimizer:

$\frac{\partial{\mathcal{L} (\omega)}} {\partial{\omega}} = -\bold{X}^{T} (\bold{y} - \bold{X} \omega) + \lambda \omega = 0$

Equivalently,

$(\bold{X}^{T} \bold{X} + \lambda \bold{I}) \omega = \bold{X}^{T} \bold{y}$

Assuming that $(\bold{X}^{T} \bold{X} + \lambda \bold{I})$ is a nonsingular matrix, the parameters of the model are the solution of this equation:

$\omega = (\bold{X}^{T} \bold{X} + \lambda \bold{I})^{-1} \bold{X}^{T} \bold{y}$
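The closed-form solution above can be written in a few lines of NumPy. This is a minimal sketch under my own assumptions: the helper name `ridge_fit` and the data are made up, and I solve the linear system with `np.linalg.solve` rather than forming the inverse explicitly, which gives the same $\omega$.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w."""
    n_params = X.shape[1]
    A = X.T @ X + lam * np.identity(n_params)
    # Solving the linear system avoids computing the inverse explicitly
    return np.linalg.solve(A, X.T @ y)

# Made-up data: 10 points, polynomial of order 3
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2.0 * np.pi * x)
X = np.vander(x, N=4, increasing=True)

w_plain = ridge_fit(X, y, lam=0.0)   # no regularization
w_ridge = ridge_fit(X, y, lam=1.0)   # with regularization

# The regularized parameters are smaller in norm
print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```

Setting `lam=0.0` gives back the ordinary least-squares solution, so the same function covers both cases.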

#### Implementing polynomial with regularization using Scikit-Learn lib

The part of the code that implements the model with regularization using Scikit-Learn is shown below:

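    from sklearn import linear_model

    # Creating the model and feeding data to it
    # (Ridge also fits its own intercept by default; it is stored in model.intercept_)
    model = linear_model.Ridge(alpha=lamda)
    model.fit(X, y)

    # Getting the coefficients (y is a column vector, so coef_ is a 2-D array)
    coef = model.coef_
    params = coef[0]
    print(params)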

#### Implementing polynomial with regularization using TensorFlow lib

The part of the code that implements the model with regularization using TensorFlow is shown below:

    import numpy as np
    import tensorflow as tf  # TensorFlow 1.x API (tf.Session)

    # Creating constant ops to store data
    X_tensor = tf.constant(X)
    y_tensor = tf.constant(y)

    ''' Finding the root of the model '''
    # Calculating Xt*X
    first_step_1 = tf.matmul(tf.transpose(X_tensor), X_tensor)
    # Calculating lamda.I
    temp_step = lamda*np.identity(order+1)
    first_step_2 = tf.constant(temp_step)
    # Calculating Xt*X + lamda.I
    first_step = tf.add(first_step_1, first_step_2)
    # Calculating (Xt*X + lamda.I)^(-1)
    second_step = tf.matrix_inverse(first_step)
    # Calculating (Xt*X + lamda.I)^(-1)*Xt
    third_step = tf.matmul(second_step, tf.transpose(X_tensor))
    # Calculating (Xt*X + lamda.I)^(-1)*Xt*y
    root = tf.matmul(third_step, y_tensor)

    ''' Getting parameters '''
    with tf.Session() as sess:
        params = sess.run(root)

    print(params)

## Sigmoid function

Another way to tackle nonlinear problems is to use a function whose shape matches the shape of the dataset. Choosing it relies heavily on your experience.

Looking at our dataset, on the one hand, we can see the points are distributed along a curve that partly looks like a function called the “sigmoid” (shown in Fig. 7).

##### Fig. 7 The graph of sigmoid function

On the other hand, we need a restriction on the upper bound, and the sigmoid, which is bounded above, provides exactly that. For this reason, we will now use the model shown in Eq. (6) instead of the polynomial model.

$\hat{y} = f(x, \omega) = \omega_{0} + \frac{1} {1 + e^{-x}} \omega_{1} \indent \indent \indent \indent (6)$

However, note that the input x is very large in our dataset (up to 20,000), so $e^{-x}$ is vanishingly small and is effectively 0, which makes the sigmoid term equal to 1 for every data point. To avoid this, we apply a trick to Eq. (6): dividing x by a constant k. In this case, I use k = 4000, and Eq. (6) turns into:

$\hat{y} = f(x, \omega) = \omega_{0} + \frac{1} {1 + e^{-x/4000}} \omega_{1} \indent \indent \indent \indent (7)$

The new model is illustrated in Fig. 8.

##### Fig. 8 The sigmoid model
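The sigmoid model is still linear in $\omega$, so it can be fitted like the polynomial one. The sketch below is only an illustration under my own assumptions: the data is made up from known coefficients, the helper names are mine, and the fit uses `np.linalg.lstsq`. Only k = 4000 comes from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

k = 4000.0  # scaling constant from the text

# Without scaling, the sigmoid saturates: both values are exactly 1.0 in float64
print(sigmoid(np.array([5000.0, 20000.0])))
# With scaling, different incomes give distinguishable features
print(sigmoid(np.array([5000.0, 20000.0]) / k))

def sigmoid_design_matrix(x):
    # Columns: 1 and sigmoid(x / k), so the model stays linear in w
    x = np.asarray(x, dtype=float)
    return np.column_stack((np.ones_like(x), sigmoid(x / k)))

# Made-up data generated from w0 = 0.5 and w1 = 4.0
x = np.array([1000.0, 4000.0, 8000.0, 12000.0, 20000.0])
y = 0.5 + 4.0 * sigmoid(x / k)

w, *_ = np.linalg.lstsq(sigmoid_design_matrix(x), y, rcond=None)
print(np.round(w, 6))

# The prediction can never exceed w0 + w1, however large the income is
bounded = sigmoid_design_matrix([216666666.0]) @ w
print(bounded)
```

This is the restriction on the upper bound we were looking for: even for Bill's income, the prediction stays at w0 + w1 instead of growing without limit.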

Apart from polynomial or sigmoid functions, we can also use others, such as trigonometric or logarithmic functions, in our regression model.

So, that’s all I want to say about using linear models for regression tasks. In the next topic, I will introduce how to use linear models for classification tasks.

Hope you enjoy it, goodbye, and see ya,

Curious Chick

## References

[2] Simon Rogers, Mark Girolami – A first course in Machine Learning

[3] Christopher M. Bishop – Pattern Recognition and Machine Learning

## [TOPIC 2] LINEAR REGRESSION (2) – NEURAL NETWORK APPROACH

## Introduction

Hi everybody,

In the last post, I described how to implement linear regression based on the matrix inverse approach, or normal equation, using both the Scikit-learn and TensorFlow libraries.

## [TOPIC 1] LINEAR REGRESSION (1) – MATRIX INVERSE APPROACH

Hi everybody,

Today, I would like to talk about the linear model for regression tasks. While you have separate labels (which may be (0 and 1) or (-1 and 1)) in classification tasks, the outputs in regression are continuous (i.e. they are arbitrary numbers lying in particular ranges).

## MACHINE LEARNING – WHAT IS THE DATA TELLING YOU?

Hi everyone,

Whenever I mention the concept of Machine Learning (ML) to friends who don’t work in technology or data science, at least one of them thinks it is related to sci-fi characters like The Terminator, Wall-E, or, more recently, Ultron. In fact, that is partly true: the ultimate goal of ML is to make a robot that can think and act like a human being. For now, however, it is mainly used for analyzing data and extracting information from it.

Nowadays, lots of tech companies are bringing research to life (Natural Language Processing (NLP), Pattern Recognition, etc.), many businesses use ML to analyze their customers and stock, and it can be said that ML has become one of the sexiest fields in the world. Basically, there are two types of ML: supervised and unsupervised learning (see Fig. 1). Supervised learning needs a dataset that includes both inputs and outputs to train the algorithm; it is used for regression and classification tasks. By contrast, unsupervised learning doesn’t need outputs for training; it is used for dimensionality reduction and clustering tasks. For more details about ML, you can refer to [1], [2].

##### Fig.1 Two types of Machine Learning

Actually, learning ML is not as tough as you might think, and if you only stay at the level of using it, it is no more than a tool. The hardest work is collecting and preprocessing the data before you feed it into ML algorithms, and each discipline has its own way of doing that. For example, speaking from personal experience in building Brain-Computer Interface systems (BCIs), the data in this case is the EEG signal acquired from the scalp.

Firstly, for the data collection stage, before collecting any signal you need to ask yourself some questions about the purpose of the data. What sampling rate do you want, so that you can design suitable hardware? Which positions on the scalp will you record from, so that you can choose the number of electrodes? And so on.

Secondly, for the preprocessing stage, the raw data often contains a lot of noise, so you need to filter it first, then extract features into a feature vector and feed that to the ML algorithms. I would say that feature extraction is one of the most creative parts of analyzing data, and it requires some knowledge of the field you work in. Coming back to BCIs, you may use common descriptive statistics such as the mean, median, standard deviation, etc. as features. But if you know a little about how the brain works, for instance that the magnitude of alpha increases and that of beta decreases when you are relaxed or meditating, you can instead use the energy of alpha and beta from every electrode as the feature vector. This is more efficient: it not only decreases the size of the feature vector but also improves the classification results.

OK, you can see there is a lot to do before using ML. However, this work is beyond the scope of this topic, so I will not discuss it further.
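Just to make the alpha and beta idea a bit more concrete before moving on, here is a rough sketch of computing band energies with an FFT. All the specifics are my assumptions for illustration: the signal is synthetic rather than real EEG, the sampling rate is 256 Hz, and I take alpha as 8 to 13 Hz and beta as 13 to 30 Hz, which are commonly used ranges.

```python
import numpy as np

def band_energy(signal, fs, f_low, f_high):
    """Sum of spectral power of `signal` in the band [f_low, f_high) in Hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(spectrum) ** 2 / len(signal)
    in_band = (freqs >= f_low) & (freqs < f_high)
    return power[in_band].sum()

fs = 256                           # assumed sampling rate in Hz
t = np.arange(0, 4.0, 1.0 / fs)    # 4 seconds of signal

# Synthetic "relaxed" signal: strong 10 Hz (alpha) part, weak 20 Hz (beta) part
signal = 2.0 * np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

alpha = band_energy(signal, fs, 8.0, 13.0)    # assumed alpha band
beta = band_energy(signal, fs, 13.0, 30.0)    # assumed beta band

# Two numbers per electrode instead of the whole raw signal
features = np.array([alpha, beta])
print(features)
```

A real pipeline would filter the noise first and repeat this for every electrode, but the idea of replacing a long raw signal with a couple of meaningful numbers is the same.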

In this series, I focus on some popular algorithms in both supervised and unsupervised learning to help us solve some basic problems. And as I mentioned before, even if your work is outside the field of technology, if you are interested in ML, you can learn and discuss it too.

On the other hand, all code in this series will be implemented using Scikit-learn and TensorFlow. A little bit about TensorFlow: it is a low-level library that is more complicated to use than Scikit-learn for implementing ML algorithms. However, I like to use it in some of my real projects because it is more efficient. One reason is that TensorFlow’s core is built in C++. Moreover, a computational graph has to be defined before the ML algorithm runs. These can help TensorFlow run faster than Scikit-learn. Actually, there are other libraries, like Theano, that use the same ideas as TensorFlow, but I prefer TensorFlow. It is a matter of taste.

Another thing is how to install TensorFlow. If you use Ubuntu or macOS, it shouldn’t be a problem; refer to [3] and [4] for installation instructions. However, there is a problem when you try to install TensorFlow on a Windows computer: TensorFlow only supports Python 3.5.* on Windows. I tried it and it did not seem to work as expected. For example, I wrote a simple program that prints ‘Hello World’ to the screen using TensorFlow, but the line printed on the screen was a bytes value instead of a string.

##### Fig. 2 TensorFlow on Python 3.5
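My guess is that this comes from Python 3 treating text and bytes as different types, so a string value that comes back as bytes is printed with a `b'...'` prefix. The sketch below reproduces that effect in plain Python, without TensorFlow; the `message` value is just a stand-in for what the program returned.

```python
# In Python 3, text (str) and bytes are different types
message = b'Hello World'           # stand-in for a value returned as bytes

print(message)                     # shown with the b'...' prefix
print(message.decode('utf-8'))     # decoded to an ordinary string
```

If that is indeed what happened, decoding the result before printing should be enough to get the expected output.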

On the other hand, as I mentioned in the first tutorial of the Basic Python Tutorial series, some important libraries, such as SciPy, cannot be used on Python 3.5.*. Thus, to solve this problem, I use Ubuntu on a virtual machine. The instructions for installing the virtual machine are described in [5].

##### How to install Ubuntu on VirtualBox in Windows [5]

OK, so have you set up all the software you need for ML on your computer? In the next tutorial, we will get familiar with the simplest ML algorithm, called Linear Regression.

Hope you enjoy it,

Curious Chick