Machine learning


Machine learning is one of my favourite subjects (though I am not very knowledgeable about it). It impressed me very much the first time I attended a lecture of the course. I wondered: how can a machine learn? I know how searching and optimisation work; they are essentially trial and error over almost all the possibilities. But how does a machine learn?

How does it work?

The simplest example is the perceptron, a very basic Artificial Neural Network (ANN) concept. It can be used to learn a very simple pattern. (All learning is related to patterns.) The perceptron simulates the neurons of the brain. In humans, when we learn something, the connections between the neurons become stronger. In the perceptron, the connections between the neurons are represented as “weights”. The larger the weight, the stronger the connection.

In a nerve cell (neuron), the electrical signal is fired only when a certain threshold is reached. Similarly, the perceptron uses an activation function, so that the value (which represents the signal) is passed on only when it is triggered.
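To make the threshold idea concrete, here is a minimal sketch of a perceptron output with a step activation function. The function names, the weights, and the threshold value are my own assumptions for illustration only.

```python
def step(value, threshold=0.0):
    """Step activation: fire (1) only when the value reaches the threshold."""
    return 1 if value >= threshold else 0


def perceptron_output(inputs, weights, threshold=0.0):
    """Weighted sum of the inputs, passed through the step activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return step(weighted_sum, threshold)


# Hypothetical weights: only both inputs together reach the threshold of 1.0.
print(perceptron_output([1, 1], [0.6, 0.6], threshold=1.0))  # fires
print(perceptron_output([1, 0], [0.6, 0.6], threshold=1.0))  # does not fire
```

With these made-up numbers the neuron behaves like a logical AND: one active input alone is too weak to trigger it.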

The perceptron is a supervised learning method. This means that the machine only learns when someone or something is supervising it; that is, the machine must be told whether its output is correct or incorrect during the process of learning. The difference between the desired output (from the supervisor) and the actual output (from the perceptron) is called the error. The purpose of learning is to reduce the error, so that the actual output is as close as possible to the desired output. This is just like a child who starts to learn something while supervised by the parents.

In order for a machine to learn, it must be given a set of training samples. The machine updates the weights of the neurons each time it is trained on a sample. The training sample set should be large and diverse so that the machine will “experience” various conditions. Each sample can be used for training several times, and the order of the training samples can be randomised. Each round of training with all the samples is called an epoch. In theory, the more epochs are run, the more the mean squared error is reduced.

The training of the machine can be ended when the mean squared error is close to zero. This indicates that the machine can produce actual outputs that are very close to the desired outputs. Another condition for ending the training is when the mean squared error has not been reduced for many consecutive epochs. This indicates that the weights are no longer improving, even if more epochs are run. (Please bear in mind that not all information has a pattern. A sequence of random numbers, for example, has no pattern, so there is nothing to learn from it.)

The last condition for ending the training is that the number of epochs reaches a user-defined limit, such as 1000 or more. That means that, after a long time of learning, the weights have still not stabilised and the mean squared error has not been reduced to the desired level. So the training simply ends, because it is not improving the machine any more.
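The three stopping conditions above can be sketched as a small training loop. This is only an illustration under my own assumptions: the names train, predict, target_mse and patience are made up, a bias weight is added (an input fixed at 1) in place of an explicit threshold, and the samples are the logical AND function.

```python
import random


def predict(weights, inputs):
    """Step activation over the weighted sum; weights[0] is a bias weight."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total >= 0 else 0


def train(samples, learning_rate=0.1, max_epochs=1000,
          target_mse=0.0, patience=100, seed=0):
    """Train until one of three stopping conditions is met.

    Returns (weights, epochs_run, reason).
    """
    rng = random.Random(seed)
    weights = [0.0] * (len(samples[0][0]) + 1)
    best_mse = float("inf")
    epochs_without_improvement = 0

    for epoch in range(1, max_epochs + 1):
        order = samples[:]
        rng.shuffle(order)  # randomise the sample order in every epoch
        squared_errors = []
        for inputs, desired in order:
            error = desired - predict(weights, inputs)
            weights[0] += learning_rate * error  # bias input is fixed at 1
            for i, x in enumerate(inputs):
                weights[i + 1] += learning_rate * error * x
            squared_errors.append(error ** 2)
        mse = sum(squared_errors) / len(squared_errors)

        # Condition 1: the mean squared error is low enough.
        if mse <= target_mse:
            return weights, epoch, "mse reached target"
        # Condition 2: no improvement for many consecutive epochs.
        if mse < best_mse:
            best_mse = mse
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return weights, epoch, "no improvement"
    # Condition 3: the user-defined number of epochs is reached.
    return weights, max_epochs, "max epochs reached"


# Training samples for logical AND: ([input1, input2], desired output).
and_samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, epochs_run, reason = train(and_samples)
print(epochs_run, reason)
```

AND is a pattern that a single perceptron can learn, so the loop should stop through the first condition. With a pattern it cannot learn, it would stop through the second or third condition instead.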

After the machine is trained, it can be given a set of testing samples. The testing samples should not be derived from the training samples. They are used to evaluate the performance of the machine. Zero error on the training samples does not imply zero error on the testing samples. Normally, the testing samples are real-world data, which is also our actual input. Only after the testing samples produce a satisfying result can the machine be considered a success.

Learning rate

There is a very interesting mathematical formula that lets the machine learn, especially the perceptron. In order to update the weights, a value needs to be added to (or subtracted from) each weight. The new weight is calculated as

w_i(t+1) = w_i(t) + α ⋅ e ⋅ x_i(t)

where w_i(t+1) is the new weight, w_i(t) is the current weight, α is the learning rate, e is the error, and x_i(t) is the current input value of the neuron.
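As a small worked example of this formula, here is a sketch of a single weight update. The function name update_weights and all the numbers are assumptions chosen only for illustration.

```python
def update_weights(weights, inputs, error, learning_rate):
    """Apply new weight = current weight + alpha * e * x to every weight."""
    return [w + learning_rate * error * x for w, x in zip(weights, inputs)]


# Hypothetical case: desired output 1, actual output 0, so the error e is 1.
current = [0.2, -0.4]
inputs = [1, 0]
print(update_weights(current, inputs, error=1, learning_rate=0.5))
print(update_weights(current, inputs, error=1, learning_rate=0.1))
```

Only the first weight changes, because the second input is 0. The first weight moves to about 0.7 with a learning rate of 0.5, but only to about 0.3 with a learning rate of 0.1.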

The interesting part is the learning rate. It is a value within the interval [0,1]. The higher the learning rate, the larger the change in the weights at each update (training step); the lower the learning rate, the smaller the change. A higher learning rate results in faster learning, but does not guarantee an optimal result (because the changes are too big). A lower learning rate means slower learning, yet it may produce a better result than a higher learning rate.

Our own learning is the same. Some children are fast learners, but their learning performance drops as they get older. Other children are slow learners, yet their performance gets better and better as they grow older.

Over-train

If the training samples given to the learning machine contain only limited patterns, while the real-world problem contains a wider range of patterns, and the machine is trained thoroughly with these samples, the result is over-training (more commonly called overfitting). That is, the machine's weights are completely adapted to the given patterns. When given another set of training samples (which contains other patterns), the machine will need a longer time to be trained, or may even fail. That is why randomising the training samples is important during training.

This is exactly like us. If we are “over-trained” in something, we stick to it and find it difficult to change, as with our language, our habits, our handwriting, etc.

Momentum

In an ANN, the training process sometimes goes through a long static stage, that is, no improvement after many epochs. This is sometimes a local optimum, which we may mistake for the optimal result. In fact, if we continue with more epochs, this local optimum may be passed and a better result produced (the global optimum).

Metaphorically, this is just like a ball rolling down a mountain to reach the lowest point; because of the rugged surface, the ball gets stuck at some point. With momentum (more power) moving the ball, it may get past the small peak and reach the bottom. (The peak represents a larger error; the bottom represents the minimal error.)
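The ball metaphor can be sketched in code. One common way to add momentum (an assumption here, since I only describe momentum in words) is to carry over a fraction of the previous weight change into the current one. The function name momentum_update and the momentum coefficient are hypothetical.

```python
def momentum_update(weight, previous_change, error, x, learning_rate, momentum):
    """Return (new_weight, change), where part of the previous change carries over."""
    change = learning_rate * error * x + momentum * previous_change
    return weight + change, change


# On a flat stretch the error term contributes nothing (error = 0 here),
# yet the weight keeps moving because of the change carried over.
weight, change = momentum_update(
    weight=0.5, previous_change=0.2, error=0, x=1,
    learning_rate=0.1, momentum=0.9,
)
print(weight, change)
```

Without the momentum term the change would be exactly zero in this situation, like the ball stuck on the rugged surface.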

Interestingly, when we learn something, we often reach a bottleneck. This bottleneck makes us feel bored and as if we are not improving. With further persistence, like momentum, we can possibly get past this bottleneck and reach a higher level in what we are learning.

Implications

Because of machine learning, I learnt a few things about my own way of learning.

  • A machine can learn; just 2 input neurons and 1 output neuron can demonstrate simple learning. Our brain is far more complex, so why can’t we learn?
  • A machine can learn with supervision. This is why supervision of a subordinate by a superior is sometimes important, especially when the subordinate is a beginner.
  • When we learn, we are trying to figure out the pattern, just as in playing games, doing maths, learning a language, learning a skill, etc. In the example of doing maths, the more questions we do, the better our mathematical skills become, because we experience various types of questions. This is just like a large set of training samples improving the machine’s learning.
  • When we have learnt a skill from a book or from other learning materials, and we can solve the problems there perfectly, this doesn’t mean that we can solve real-world problems as well. This is just like the training sample set and the testing sample set.
  • A slow learner is not necessarily inferior. Stable and smooth learning seems to work better.
  • If a person is over-trained on certain patterns, he/she will stick to those patterns.
  • In order to get past our learning bottleneck, we need perseverance.

Unsupervised learning

Interestingly, the perceptron algorithm was developed in 1957, and unsupervised learning methods were developed afterwards. In my opinion, the development of these algorithms is just like a human life. When we are children, we need parents and teachers to supervise our learning, to tell us what is right and wrong. When we have reached the age of reason, we can learn by ourselves little by little. And this is unsupervised learning.

 

P/S: Parents can tell their twin children apart because they are over-trained to differentiate them. People who are not familiar with the twins will see them as identical. Yet the parents can tell them apart, because they “learn” them every day for years.

Gambler’s fallacy


Referring to my previous post about the gambler’s fallacy, I realised I was totally wrong after pondering it more.

In the example of tossing a coin, we know that the probability of getting “tail” is 0.5 and the probability of getting “head” is 0.5. That means the two results are equally likely. In an experiment, if we toss the coin 1000 times, we expect “tail” to appear around 500 times and “head” around 500 times.

In my previous post, I mentioned that if I tossed the coin 10 times and all the results were “tail”, then, following the gambler’s fallacy, I would feel that the next toss or the next 10 tosses should probably be “head”, so that the proportions would come out as 0.5 and 0.5.

However, the problem is that the “time to start tossing” restricted my thinking, which is why I had the feeling described above.

In experimental probability, the more times we toss the coin and collect the results, the more accurate our estimate becomes. For example, calculating the probability from 1000 tosses is better than calculating it from 100 tosses. Thus, it is not valid to toss the coin ONCE and conclude that “tossing the coin will ALWAYS give head (or tail)”.

Therefore, in the situation where I toss the coin 10 times and all the results are “tail”, the result cannot be considered reliable data. This is because “someone” may have tossed the same coin 10,000,000 times before me and obtained proportions of 0.5 and 0.5. Thus my 10 tosses that all gave “tail” don’t mean anything.

Besides that, experiments are done in order to calculate the probability, not the reverse, where we presume a probability and then test it by experiment as in the situation above. If I am the first person to toss a specific coin 100 times, and all the results are “tail”, then I can say that the probability of getting “head” with that specific coin is less than 0.5 and the probability of “tail” is more than 0.5. I cannot simply assume that the next 100 tosses have a high probability of giving “head”. There are two reasons: i) the coin may be poorly made, so that it ALWAYS produces “tail”, and ii) the coin tosses are independent events, that is, tossing the coin now does not affect the next toss.
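The independence of the tosses can be checked with a small simulation. This is only a sketch under my own assumptions: a fair coin simulated with Python's random module, a made-up function name, and a streak length of 5 so that enough streaks occur.

```python
import random


def heads_rate_after_tail_streak(num_tosses, streak_length, seed=0):
    """Fraction of heads among tosses that directly follow a run of tails."""
    rng = random.Random(seed)
    tails_run = 0      # length of the current run of consecutive tails
    heads_after = 0    # heads seen right after a long enough run
    total_after = 0    # all tosses seen right after a long enough run
    for _ in range(num_tosses):
        toss = rng.choice("HT")
        if tails_run >= streak_length:
            total_after += 1
            if toss == "H":
                heads_after += 1
        tails_run = tails_run + 1 if toss == "T" else 0
    return heads_after / total_after


print(heads_rate_after_tail_streak(200_000, streak_length=5))
```

If the gambler's feeling were right, the printed rate would be clearly above 0.5. For a fair, independent coin it should stay close to 0.5.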

So, my commenter’s statement is very convincing.

Prayer life


There are three ways of prayer according to the Catechism of the Catholic Church.

  1. Vocal prayer 口祷
  2. Meditation 默想 (engaging thought, imagination, emotion, and desire)
  3. Contemplation 心祷

What is the difference between meditation and contemplation? The following part is a summary from here.

Contemplative prayer is a more passive or sublime experience of God. Meditative prayer comes more from our own work of seeking God (though with God's aid). Contemplative prayer can be distinguished as purely the work of God.

However, meditation can be further divided into discursive meditation and affective meditation. Discursive meditation is more about logical analysis to discover insight or a deeper understanding of God. This discovery leads to conversation with God, such as thanksgiving, praise, contrition, and petition.

Affective meditation is more about conversation (not necessarily emotional) from the soul.

After a period of spiritual maturing, a person can enter, without much discursive effort, into the “prayer of simplicity” (or prayer of quiet). This is contemplative prayer.

A deeper contemplative prayer is “infused contemplation”, in which God submerges us in himself and we feel a union with him. This is actually another level of prayer.

Therefore, the most active mental prayer is discursive meditation, which leads to affective meditation, followed by contemplation.

However, sometimes we feel that our prayer life is not growing, or is even getting worse. So, to understand this, I summarised another post.

The faculties of the soul are the intellect and the will. The intellect allows us to know abstract things (without knowing exactly what or how); the will allows us to freely choose good things. These contrast with our sense faculties: sight, hearing, touch, taste, and smell. Emotional and imaginative “consolation” belongs more to the sense faculties, but “consolation” can also be experienced by the intellect and the will. Without these consolations, we are in the dryness of prayer. The dryness may come from ourselves, or from God, so that we seek not the “consolation” but God himself, and so grow in our faith.

Consolation is a sense of the presence of God in our souls and hearts, or a new insight about God, about the world, or about ourselves, during meditation or prayer (see here).

The dryness from God can also be understood as “passive purification”, which burns away the impurities that are beyond our reach, while “active purification” consists of our own acts, such as mortification.

A long period of dryness on the level of emotions and imagination is also known as the “dark night of the senses” (as described by St John of the Cross). If it is on the level of intellect and will, it is the “dark night of the soul”. That is how the holy souls suffer in purgatory (see here).

Knowledge Tree


This is an interesting topic. I love trees so much. The Knowledge Tree I mean here does not refer to the Tree of Knowledge of Good and Bad.

A tree is a concept from graph theory. The power of the tree is that almost any data structure can be represented as a tree. To store a single piece of data, we can use a single variable. To store a list of data, we can use an array, an associative array (or hash, map, dictionary), a list, a vector, and others. Another powerful data structure which I like is the 2D array, used especially in matrix calculation. The 2D array represents data in table form. It is very useful in databases. An image can also be treated as a 2D array. Comparing the tree to the matrix, a tree can represent a matrix (though it is not the preferred way).

Trees are used a lot in programming, especially in data representations such as JSON, XML, HTML, YAML, and scene graphs. Even an object (in object-oriented programming) can be regarded as a tree, with the attributes and methods as the leaves and inheritance as a parent-child relationship in the tree. Moreover, programs can be parsed into a parse tree and an abstract syntax tree. The Python language is wonderful with its indentation syntax, where the indentation can also be thought of as a tree.

In decision support tools, the decision tree is used. In pathfinding, searches such as depth-first search, breadth-first search, and A* search are also represented with graphs and trees. In the file system, we have files and directories, which are also a tree representation; that is why “/” is called the root directory. Furthermore, version control systems such as CVS, SVN, Git, Mercurial and others are also trees. That is why there are a trunk and branches in versioning systems. In a document, we have heading 1, 2, 3, and so on, with the body text. This makes our documents a tree structure too.
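As a small illustration of two of these ideas together, here is a sketch of depth-first and breadth-first traversal over a tiny directory tree. The directory names and the nested-dictionary representation are made up for the example.

```python
from collections import deque

# Hypothetical directory tree: each key is a name, each value its children.
root_children = {"home": {"alice": {}, "bob": {}}, "etc": {}, "usr": {"bin": {}}}


def depth_first(name, children):
    """Visit a node, then go as deep as possible into each child in turn."""
    yield name
    for child_name, grandchildren in children.items():
        yield from depth_first(child_name, grandchildren)


def breadth_first(name, children):
    """Visit the tree level by level, using a queue."""
    queue = deque([(name, children)])
    while queue:
        current, kids = queue.popleft()
        yield current
        queue.extend(kids.items())


print(list(depth_first("/", root_children)))
# ['/', 'home', 'alice', 'bob', 'etc', 'usr', 'bin']
print(list(breadth_first("/", root_children)))
# ['/', 'home', 'etc', 'usr', 'alice', 'bob', 'bin']
```

The same tree gives two different visiting orders: depth-first finishes one branch before starting the next, while breadth-first finishes one level before going deeper.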

As a result, by combining these ideas, especially Python’s indentation syntax, I am creating my knowledge base in tree form. This is easy, whereas creating something like a concept map needs image editing software and is time consuming. Using just a plain text editor with indentation helps me to manage my knowledge more easily.

For example,

Text editor
    File
        quit, open, save
        buffers (open multiple files)
    Edit
        insert, delete, select (or highlight)
        cut, copy, paste (or yank)
        find, find next, find previous
        search and replace (with regular expression)
        undo, redo
    Others
        macro
        windows (split)
        mark
        folding
        syntax highlighting
        autoindent

The above is an example of one of my knowledge nodes: a text editor with its generic features. The next one is about the data types of a database.

Database
    Null
    Numeric
        Integer
            bool
            int, tinyint, smallint, mediumint, bigint
            bit
        Real
            float
            double
    String
        Text
            datetime
                date
                time
                timestamp
            varchar, char
            enum, set
        Blob

Is gambler’s fallacy really a fallacy?


Probability is a very difficult subject for me. This is because it involves estimating all the possible events, and therefore involves combinations and permutations. There is no single formula that fits every situation. It also involves statistics.

The gambler’s fallacy is a very good notion. To put it simply, the gambler’s fallacy is the belief that the next outcome will be different when the same outcome has been observed repeatedly in a row, even though the events are actually independent. The best example is tossing a coin, which has a probability of 0.5 for head and 0.5 for tail. Because the first toss does not affect the second, the probability of getting head or tail is always the same.

For example, the probability of getting a head on the first toss is 0.5, of getting two heads in a row is 0.5 * 0.5 = 0.25, and of getting three heads in a row is 0.125. Under the gambler’s fallacy, the person thinks that the probability of getting yet another head is 0.0625, which is a very small chance, and thus assumes that the next toss will be a tail. However, because the events are independent, the probability of three heads followed by a tail on the 4th toss is also 0.125 * 0.5 = 0.0625. That means whenever we toss the coin, the probability of getting head or tail is always 0.5.
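The arithmetic above can be checked by listing every possible sequence of four tosses of a fair coin. This is a small sketch using exact fractions; the variable names are my own.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# All 16 sequences of four tosses, each with probability (1/2)^4.
sequences = ["".join(s) for s in product("HT", repeat=4)]
probability = {s: half ** 4 for s in sequences}

p_three_heads = sum(p for s, p in probability.items() if s.startswith("HHH"))
p_hhhh = probability["HHHH"]
p_hhht = probability["HHHT"]

print(p_hhhh, p_hhht)          # 1/16 1/16
print(p_hhhh / p_three_heads)  # 1/2
print(p_hhht / p_three_heads)  # 1/2
```

Four heads in a row and three heads followed by a tail are equally rare (1/16 = 0.0625), so once three heads have already happened, the 4th toss is still 1/2 for each side.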

However, recently I thought about probability again in an empirical way. Firstly, we need to know that a probability of 0.5 means that if we toss the coin 1000 times, the number of heads is approximately 500. The greater the number of tosses, the closer the result will be to 0.5. Conversely, the smaller the total number of tosses, the larger the deviation of the empirical result. For example, if we toss the coin only 2 times, we might get 2 tails, in which case the empirical proportion of heads is 0.
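This empirical view can be tried out with a simulated fair coin. The sketch below is only an illustration; the function name and the list of toss counts are my own choices, and the exact numbers printed depend on the random seed.

```python
import random


def empirical_head_rate(num_tosses, seed=0):
    """Toss a simulated fair coin and return the observed proportion of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses


for n in (2, 10, 100, 1000, 100_000):
    print(n, empirical_head_rate(n))
```

With only 2 tosses the proportion can only be 0, 0.5 or 1, so it can be far from 0.5. With 100,000 tosses it should be very close to 0.5.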

So, that is why the gambler’s false assumption arises. If a person is going to toss the coin 500 times, and the first 250 tosses are all tails, the person believes that the next 250 tosses must be heads, so that the empirical result will be 0.5. This is the interesting part. This kind of belief is normally connected to fate or luck. That is why some people believe that if we are lucky too many times in a row, we might use up the good luck of our whole life, and then only bad luck will be left until the end of our days.

In conclusion, the gambler’s fallacy really is a fallacy (refer to the 2nd and 3rd paragraphs). But sometimes we cannot accept it. For example, after tossing the coin and getting heads 10 times in a row, we normally believe that the next 10 tosses will probably be tails, so that the proportion comes out equal to 0.5. Therefore, if I were asked to guess the next outcome after 10 successive heads, I would also guess tail, even though I know that it is a fallacy.

Learn to speed read


I want to learn the skill of speed reading. So I looked for resources, and I keep practising. The following is a summary of what I have learnt about speed reading, based especially on the following video.

The average reading speed is 150-250 wpm (words per minute).

The most common reading problem is regression, that is, going back to read again. To avoid this problem, if we don’t understand one sentence, we just continue. We only go back if the whole paragraph is not understandable. One way to keep the reading continuous is to use a finger or any pointer to move along the words. There are several reasons for this. One is that our eyes are always attracted by a moving object. Another is that it helps us to focus. Also, when using a finger, we will not go back to read again. This is the best method to use in the speed drill (see below).

Next, to reduce the fixation problem, we need to learn to use peripheral vision to read the words. (In contrast, foveal vision is when the image falls on the fovea centralis.) To train peripheral vision, when we read each line of text, we do not need to “look at” the first word. Our peripheral vision will help us to see the first word. This can be practised until we read each line from the 3rd word to the 3rd-last word. This reduces the saccades (eye movements).

Auditory reassurance (aka subvocalisation) is when we read the words and sound them out in our mind; that is, we hear a voice in our mind while reading. To suppress the auditory reassurance, we can repeat something such as ABCD or 1234 in our mind as a replacement subvocal. But the best way is to practise the speed drill to reduce all subvocalisation, including the 1234 or ABCD. This is because, in my case, I am affected by my new subvocal, “1234”, and my eye movements start to follow the rhythm of that subvocal.

Reading has three important areas: 1) speed, 2) comprehension, and 3) retention. Retention is how much we remember, which is different from comprehension. It is impossible to train all three at the same time, so focus on training one thing only: first speed, then comprehension, and only then retention. So, when we train speed, our comprehension and retention will drop sharply. Never mind, just keep practising. Good results will come with more and more practice.

In order to learn speed reading, we can first practise the speed drill with a finger. Go through the text faster and faster, without focusing on comprehension. Once we can see (not read) the words faster, going a little slower lets us understand the words better, which increases our comprehension.

To increase comprehension, scanning the headlines or titles first will let us grasp some of the meaning of the text.

(Update: 2012-09-23)

I missed one point. Besides practising the speed drill, another practice is eye movement, that is, exercising our eye muscles. Keep moving the eyes left and right, or up and down, to exercise the muscles so that our eyes can move fast. It is useful for improving the saccades and fixation when reading.

When we lose faith


As a Christian, I always think about the meaning of life. The meaning of life means the reason for living and the purpose of living. Some Christians even relate everything to a meaning. That means that whatever incident happens, there is a reason behind it; the incident happens for a purpose. If a person breaks a vase, there is also a meaning: maybe the vase had to be broken. If a person wakes up late, maybe this is the reason he avoids an accident.

Then, when do we lose faith? For a person who believes that every incident happens for a purpose, if a great tragedy happens unexpectedly, without any understandable reason or purpose, and the outcome is only difficulty, despair, and sorrow, then this is the time when he might lose his faith. Joseph the Patriarch might have lost faith when he was betrayed by his own brothers and sold into Egypt. Job might have lost faith when all his children died.

But those who endure and wait until the end of these trials will surely gain their faith.