Machine learning is one of my favourite subjects (though I am not very knowledgeable about it). It impressed me greatly the first time I attended a lecture of the course. I wondered: how can a machine learn? I know how searching and optimisation work; they are essentially trial and error over almost all the possibilities. But how does a machine learn?

### How does it work?

The simplest example is the **perceptron**. It is a very basic **Artificial Neural Network (ANN)** concept, and it can be used to learn a very simple **pattern**. (All learning is related to patterns.) The perceptron simulates the neurons of the brain. In humans, when we learn something, the **connections** between neurons become stronger. In the perceptron, the connections between neurons are represented as “**weights**”. The stronger the weight, the stronger the connection.

In a nerve cell (neuron), the electrical signal fires only when a certain threshold is reached. Similarly, the perceptron uses an **activation function**, so that a value (representing the signal) is passed on only when it is triggered.
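
As a rough sketch of the idea, a perceptron with a step activation might look like this in Python. The names `step` and `perceptron_output`, the bias term, and the threshold of zero are my own assumptions for illustration; they do not come from any particular library.

```python
def step(z, threshold=0.0):
    """Step activation: fire (1) only when the signal reaches the threshold."""
    return 1 if z >= threshold else 0


def perceptron_output(weights, inputs, bias=0.0):
    """Weighted sum of the inputs plus a bias, passed through the step activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)


# Hypothetical weights chosen by hand, not learned.
both_on = perceptron_output([0.5, 0.5], [1, 1], bias=-0.7)
one_on = perceptron_output([0.5, 0.5], [1, 0], bias=-0.7)
```

With these hand-picked weights the neuron fires only when both inputs are 1, so it behaves like a logical AND.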

The perceptron is a **supervised learning** method. This means the machine learns only when someone or something supervises it; that is, the machine must be told whether its output is correct or incorrect during learning. The difference between the **desired output** (from the supervisor) and the **actual output** (from the perceptron) is called the **error**. The purpose of learning is therefore to reduce the error, so that the actual output is as close as possible to the desired output. This is just like a child who starts to learn something under the supervision of the parents.

For a machine to learn, it must be given a set of **training samples**. The machine **updates the weights** of the neurons each time it is given a training sample. The training sample set should be large and diverse so that the machine “experiences” various conditions. Each sample can be used for training several times, and the order of the samples can be randomised. One round of training over all the samples is called an **epoch**. In theory, the more epochs are run, the lower the **mean squared error** becomes.
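
To make the error measure concrete, here is a minimal sketch of the mean squared error over one epoch. The function name and the sample values are made up for illustration.

```python
def mean_squared_error(desired_outputs, actual_outputs):
    """Average of the squared differences between desired and actual outputs."""
    squared = [(d - a) ** 2 for d, a in zip(desired_outputs, actual_outputs)]
    return sum(squared) / len(squared)


desired = [1, 0, 1, 1]  # what the supervisor says
actual = [1, 1, 1, 0]   # what the perceptron produced (invented values)
mse = mean_squared_error(desired, actual)  # two of the four outputs are wrong
```

Here two of the four outputs differ by 1, so the mean squared error is 2 / 4 = 0.5.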

Training can end when the mean squared error is **close to zero**. This indicates that the machine produces actual outputs that are very close to the desired outputs. Another condition for ending the training is when the mean squared error has not decreased for many consecutive epochs. This indicates that the weights are no longer improving, even if more epochs are run. (Please bear in mind that not all information has a pattern. A sequence of random numbers, for example, has no pattern, so there is nothing to learn from it.)

The last condition for ending the training is when the number of epochs reaches a user-defined limit, such as 1000 or more. In that case, even after a long time of learning, the weights have not stabilised and the mean squared error has not dropped to the desired level. The training simply ends because further training is not expected to improve the machine.
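
The three stopping conditions can be put together in one training loop. The following is only a sketch under my own assumptions: the step activation, the bias term, the `patience` parameter, and all function names are hypothetical, and the weight update is the rule described in the Learning rate section of this post.

```python
import random


def predict(weights, bias, inputs):
    """Step-activated perceptron output (threshold at zero is an assumption)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0


def train(samples, alpha=0.1, max_epochs=1000, target_mse=0.0, patience=100, seed=0):
    """Train on (inputs, desired) pairs; return weights, bias, epochs used, stop reason."""
    rng = random.Random(seed)
    samples = list(samples)            # copy, so the caller's list is not reordered
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    best_mse = float("inf")
    stale_epochs = 0
    for epoch in range(1, max_epochs + 1):
        rng.shuffle(samples)           # randomise the order in every epoch
        squared_errors = []
        for inputs, desired in samples:
            error = desired - predict(weights, bias, inputs)
            weights = [w + alpha * error * x for w, x in zip(weights, inputs)]
            bias += alpha * error
            squared_errors.append(error ** 2)
        mse = sum(squared_errors) / len(samples)
        if mse <= target_mse:          # condition 1: error is (close to) zero
            return weights, bias, epoch, "converged"
        if mse < best_mse:
            best_mse = mse
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:   # condition 2: no improvement for many epochs
                return weights, bias, epoch, "no improvement"
    return weights, bias, max_epochs, "epoch limit"   # condition 3: epoch budget used up


# Logical AND as a tiny training set with 2 inputs and 1 output.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias, epochs_used, reason = train(and_samples, alpha=0.5)
```

Logical AND is a pattern a single perceptron can learn, so this run should stop through the first condition. For data without a learnable pattern, one of the other two conditions would end the loop instead.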

After the machine is trained, it can be given a set of **testing samples**, which are used to evaluate its performance. The testing samples should not be derived from the training samples. **Zero error on the training samples does not indicate zero error on the testing samples**. Normally, the testing samples are real-world data, the kind of input we actually want the machine to handle. Only when the testing samples produce satisfactory results can the machine be considered a success.
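
One simple way to keep the two sets apart is to hold back part of the available data before training starts. This is a sketch of that idea only; `split_samples`, `error_rate`, and the 25% test fraction are assumptions of mine, not something prescribed by the perceptron itself.

```python
import random


def split_samples(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and hold back a fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (training set, testing set)


def error_rate(predict, samples):
    """Fraction of samples for which predict(inputs) differs from the desired output."""
    wrong = sum(1 for inputs, desired in samples if predict(inputs) != desired)
    return wrong / len(samples)


# Invented data: one input per sample, desired output is whether it is odd.
data = [((i,), i % 2) for i in range(20)]
training_set, testing_set = split_samples(data)
```

The machine would then be trained on `training_set` only, and `error_rate` would be measured on `testing_set`, which it has never seen.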

### Learning rate

There is a very interesting mathematical formula behind machine learning, especially for the perceptron. To update a weight, a value is added to it (or subtracted from it). The new weight is calculated as

w_{i}(t+1) = w_{i}(t) + α ⋅ *e* ⋅ x_{i}(t)

where w_{i}(t+1) is the new weight, w_{i}(t) is the current weight, α is the **learning rate**, *e* is the error (desired output minus actual output), and x_{i}(t) is the current input value of the neuron.
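
A direct translation of this formula into Python could look like the sketch below. The function name and the numbers are invented for the example; the values were picked so the arithmetic is easy to follow.

```python
def update_weights(weights, inputs, error, alpha):
    """Apply w_i(t+1) = w_i(t) + alpha * e * x_i(t) to every weight."""
    return [w + alpha * error * x for w, x in zip(weights, inputs)]


current = [0.5, -0.25]
inputs = [1.0, 0.0]
error = 1            # desired output 1, actual output 0
new = update_weights(current, inputs, error, alpha=0.25)
```

The first weight moves from 0.5 to 0.75 because its input was active. The second weight stays at -0.25 because its input was 0, so it did not contribute to the error.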

The interesting part is the **learning rate**. It is a value within the interval [0,1]. The higher the learning rate, the more the weights change at each update (training step); the lower the learning rate, the less they change. A higher learning rate results in faster learning but does not guarantee an optimal result (because the changes may be too big). A lower learning rate means slower learning, yet it may produce a better result than a higher one.

Our own learning is similar. Some children are fast learners, but their learning performance drops as they get older. Other children are slow learners, yet their performance gets better and better as they grow older.
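
The effect of step size is easy to see on a toy problem. Note that this is not the perceptron itself: as an illustration I use plain gradient descent on the made-up error function w², where the best weight is 0. The function name `descend` is hypothetical.

```python
def descend(start, alpha, steps):
    """Repeatedly apply w <- w - alpha * 2w, the gradient step for error = w**2."""
    path = [start]
    w = start
    for _ in range(steps):
        w = w - alpha * 2 * w
        path.append(w)
    return path


slow = descend(1.0, 0.1, 10)   # small steps: creeps steadily towards 0
jumpy = descend(1.0, 1.0, 3)   # big steps: jumps over 0 to the other side each time
```

With the small rate the weight shrinks a little at every step. With a rate of 1.0 the weight bounces between 1.0 and -1.0 and never settles, which is the "changes too big" case in miniature.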

### Over-train

If the training samples given to the learning machine contain only a limited range of patterns, while the real-world problem contains a wider range, and the machine is trained thoroughly on these samples, the result is **over-training** (often called overfitting). That is, the machine's weights are completely adapted to the given patterns. When it is given another set of training samples (containing other patterns), the machine will need a longer time to train, or may even fail. That is why randomising the training samples is important during training.

This is very similar to us. If we are “over-trained” in something, we stick to it and find it difficult to change. Examples are our language, our habits, and our handwriting.

### Momentum

In an ANN, the training process sometimes stalls for a long time, that is, there is no more improvement after many epochs. This is sometimes a local optimum, which we may mistake for the optimal result. In fact, if training continues for more epochs, this local optimum may be passed and a better result may be produced (possibly the global optimum).

Metaphorically, this is like a ball rolling down a mountain to reach the lowest point: because of the rugged surface, the ball can get stuck somewhere along the way. With **momentum** (extra force) to move the ball, it may get past a small peak and reach the bottom. (A peak indicates a larger error; the bottom indicates the minimal error.)

Interestingly, when we learn something, we often reach a bottleneck. This bottleneck makes us feel bored and as if we are not improving. With further persistence, like momentum, we can possibly get past the bottleneck and reach a higher level in what we are learning.
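
One common way to add momentum to a weight update is to keep a "velocity" for each weight that remembers part of the previous change. The sketch below assumes that formulation; the momentum factor `mu`, the function name, and the numbers are my own choices for illustration.

```python
def momentum_update(weights, velocity, inputs, error, alpha, mu):
    """Blend the previous change (scaled by mu) with the new change alpha * e * x."""
    new_velocity = [mu * v + alpha * error * x for v, x in zip(velocity, inputs)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


weights, velocity = [0.0], [0.0]
# Two updates in a row that push the weight in the same direction.
weights, velocity = momentum_update(weights, velocity, [1.0], 1, alpha=0.5, mu=0.5)
first = weights[0]    # 0.5: same as the plain update, since velocity started at 0
weights, velocity = momentum_update(weights, velocity, [1.0], 1, alpha=0.5, mu=0.5)
second = weights[0]   # 1.25: this step moved 0.75, helped by the earlier step
```

When consecutive updates point the same way, the steps grow, which is what lets the ball carry on over a small bump instead of stopping on it.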

### Implications

Studying machine learning also taught me something about my own way of learning.

- A machine can learn: just 2 input neurons and 1 output neuron are enough to demonstrate simple learning. Our brain is far more complex, so why can’t we learn?
- A machine can learn with supervision. This is why supervision of a subordinate by a superior is sometimes important, especially when the subordinate is a beginner.
- When we learn, we are trying to figure out a pattern, whether we are playing games, doing maths, learning a language, or learning a skill. In maths, for example, the more questions we do, the better our mathematical skills become, because we have experienced various types of questions. This is just like a large set of training samples improving the machine's learning.
- When we learn a skill from a book or other learning materials and can solve the exercises perfectly, this does not mean we can solve real-world problems equally well. This is just like the training sample set and the testing sample set.
- A slow learner is not necessarily inferior. Stable and smooth learning seems to work better.
- If a person is over-trained on certain patterns, he or she will stick to those patterns.
- To get past a learning bottleneck, we need perseverance.

### Unsupervised learning

Interestingly, the perceptron algorithm was developed in 1957. Later, **unsupervised learning** methods were developed. In my opinion, the development of these algorithms is just like a human life. When we are children, we need parents and teachers to supervise our learning and to tell us what is right and wrong. Once we have reached the age of reason, we can learn by ourselves, little by little. This is unsupervised learning.

P/S: Parents can tell their twin children apart because they are “over-trained” to differentiate them. People who are not familiar with the twins see them as identical, yet the parents can tell them apart because they “learn” them every day for years.