These few days I was spending my whole time to understand this ROC (receiver operating characteristic) curve. In machine learning, ROC is a very common way to evaluate the prediction performance. The AUC (area under curve) of ROC indicates the accuracy of prediction of a classifier.
I searched through tutorial and Q and A sites on how to do the plot of ROC and calculating the AUC. The answers were telling me about “cut-off”, “threshold”, or some weird terms. And some answers were telling me to use R package to plot the graph. WTF? I know all of these things. My question was, “How can I plot ROC curve with my classifier?!”
So, there was a real gap between what I had known and the problem I faced. The gap was the classifier that I created cannot be used to plot the ROC curve directly. Because my classifier is a discrete classifier. That is, it just predicts with a label, then checks with actual label whether the prediction is true or false. Though, I can improve this result to true positive, false positive, true negative, and false negative, it can only produce one point in the ROC space.
In order to plot the curve, I need a theshold that can be manipulated to produce a sequence of TPRs (true positive rates) and FPRs (false positive rates). If I have a discrete classifier, I can never produce the sequence of TPRs and FPRs. That is my real problem.
In order to solve this, the only solution is to transform my classifier to ranking or scoring classifier. This is the crucial part that solves my problem. Now, the question is, what is the score here? I believe all classifiers involve some calculations to get some values, then only do the prediction whether the input should be labelled as true or false, even it is a discrete classifier. So, that calculated value is the score! Even if your classifier does a random guess, the random value can be used as the score.
As a result, I need to discard the discrete classification when plotting ROC curve. But the discrete classification is stilled being used during the training. So, by adjusting the theshold, for any input that produces the score greater than the theshold will be labelled as the positive. And from here, I just need to check whether it is true positive or false positive. Continuously adjusting the threshold, I can get a sequence of TPRs and FPRs. Yay. No more discrete classification, and the ROC curve produced.
P/S: Maybe this concept is just as simple as ABC, that is why nobody cares and mention it online.