Naive Bayes

  • I will begin this note with a plain explanation. If anything is wrong, please remind me of it in the guestbook or by sending me an e-mail.
  • Naive Bayes here refers to a machine learning method for classifying data.
  • First, you need to know Bayes' theorem. For Bayes' theorem you just need to remember the following equation $$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$ which means:
    • $$P(A|B)P(B) = P(B|A)P(A) = P(A \cap B)$$
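The identity above can be checked numerically. A minimal sketch, using made-up probabilities (the values are illustrative, not from this note):

```python
# Quick numeric check of Bayes' theorem with illustrative probabilities.
p_a = 0.3          # P(A)
p_b_given_a = 0.8  # P(B|A)
p_b = 0.5          # P(B)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

# Both sides of the identity equal the joint probability P(A ∩ B).
assert abs(p_a_given_b * p_b - p_b_given_a * p_a) < 1e-12
```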
  • Second, you need to know the independence assumption (more details: statistical assumption), which can be denoted by the following equation:
    $$P(X=X_n) = P(X^1=X_n^1, X^2=X_n^2, \cdots, X^n=X_n^n) = P(X^1=X_n^1)P(X^2=X_n^2)\cdots P(X^n=X_n^n)$$

  • ( X_n^j ) is the jth dimension of ( X_n ), which is a vector of n dimensions. The equation above illustrates that ( X_n^j ) is independent of the other dimensions like (X_n^{j-1}), (X_n^{j+1}) and so on. That is to say, whether (X_n^j) occurs has nothing to do with whether (X_n^i) occurs, for any i ≠ j. Under this assumption, the joint probability factorizes into the product of the per-dimension probabilities, as written above.
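A toy check of this factorization for two independent binary dimensions; the marginal probabilities below are illustrative assumptions:

```python
# Under independence, the joint distribution of two binary features is
# exactly the product of their marginals. (Marginals here are assumed.)
from itertools import product

p_x1 = {0: 0.6, 1: 0.4}  # marginal distribution of dimension 1
p_x2 = {0: 0.3, 1: 0.7}  # marginal distribution of dimension 2

# Build the joint distribution as the product of marginals.
joint = {(a, b): p_x1[a] * p_x2[b] for a, b in product(p_x1, p_x2)}

# Sanity checks: the joint sums to 1, and each entry factorizes.
assert abs(sum(joint.values()) - 1.0) < 1e-12
assert joint[(1, 0)] == p_x1[1] * p_x2[0]
```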

  • You can understand the equation above very quickly, believe me!

  • Now, we can begin. Given a feature vector X of n dimensions, what we want to do is figure out the probability of (Y=C_k) given that X is (X_n), which is denoted by (P(Y=C_k|X=X_n)).

  • Tips: The equation above means that when you give an input like (X_n), the machine can figure out the probability distribution of its corresponding output. From this probability distribution, you choose the class with the biggest probability as the final result (this is maximum a posteriori estimation), with which you can start classifying.
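The decision rule in the tip above is just an argmax over the posterior distribution. A minimal sketch with an assumed, illustrative posterior:

```python
# Given a posterior distribution over classes (numbers are illustrative),
# pick the class with the largest probability.
posterior = {"C1": 0.2, "C2": 0.7, "C3": 0.1}  # assumed P(Y=C_k | X=X_n)

prediction = max(posterior, key=posterior.get)
print(prediction)  # C2
```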

  • Next, we will try to figure this out with Bayes' theorem. The computing equation is:
    $$P(Y=C_k|X=X_n) = \frac{P(X=X_n|Y=C_k)P(Y=C_k)}{P(X = X_n)}$$
  • Let us analyze the equation above.
    • (P(X=X_n)) is the same constant for every class (C_k), because the input (X_n) is fixed; it therefore does not affect which class maximizes the posterior and can be ignored when comparing classes.
    • (P(Y=C_k)) is a prior distribution, which can be computed from the data set. E.g., if five samples with (Y=C_k) exist in a data set of 10 samples in total, then (P(Y=C_k) = \frac{5}{10} = 0.5).
    • (P(X=X_n|Y=C_k)) needs to be transformed later; the class that maximizes it (together with the prior) is the most likely category of the given input (X_n).
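The prior computation in the example above can be sketched by simple counting; the labels here are illustrative:

```python
# Estimating the prior P(Y=C_k) from class counts, as in the example
# above: 5 samples of one class out of 10 total gives 0.5.
from collections import Counter

labels = ["a"] * 5 + ["b"] * 5  # illustrative 10-sample data set

counts = Counter(labels)
priors = {c: n / len(labels) for c, n in counts.items()}
print(priors["a"])  # 0.5
```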
  • Now we will look at (P(X=X_n|Y=C_k)). According to the independence assumption, it can be transformed like this:
    $$P(X=X_n|Y=C_k) = P(X^1=X_n^1, X^2=X_n^2, \cdots, X^n=X_n^n|Y=C_k) = P(X^1=X_n^1|Y=C_k)P(X^2=X_n^2|Y=C_k)\cdots P(X^n=X_n^n|Y=C_k)$$

  • Finally, what we want to do is maximize $$P(Y=C_k)\,P(X^1=X_n^1|Y=C_k)P(X^2=X_n^2|Y=C_k)\cdots P(X^n=X_n^n|Y=C_k)$$ over (C_k). Each factor is easy to estimate from the data set by counting. The (C_k) that makes this product the maximum is just what we want.
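The whole procedure above can be sketched as a count-based Naive Bayes classifier for categorical features. All function names, variable names, and data below are my own illustrative assumptions, not from this note:

```python
# Minimal count-based Naive Bayes for categorical features:
# estimate priors and per-dimension conditionals by counting, then
# predict by maximizing prior * product of conditionals.
from collections import Counter, defaultdict

def train(samples, labels):
    """Estimate P(Y=C_k) and P(X^j = v | Y=C_k) by counting."""
    class_counts = Counter(labels)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    # cond[c][j][v] = count of feature j taking value v within class c
    cond = defaultdict(lambda: defaultdict(Counter))
    for x, y in zip(samples, labels):
        for j, v in enumerate(x):
            cond[y][j][v] += 1
    return priors, cond, class_counts

def predict(x, priors, cond, class_counts):
    """Pick the class maximizing P(Y=C_k) * prod_j P(X^j = x^j | Y=C_k)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for j, v in enumerate(x):
            score *= cond[c][j][v] / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Tiny illustrative data set: feature 1 in {0, 1}, feature 2 in {'s', 'm'}.
X = [(0, "s"), (0, "m"), (1, "m"), (1, "s"), (1, "m"), (0, "s")]
y = ["no", "no", "yes", "yes", "yes", "no"]

priors, cond, counts = train(X, y)
print(predict((1, "m"), priors, cond, counts))  # yes
```

Note that this sketch has no smoothing: an unseen feature value zeroes out the whole product, which is why practical implementations add Laplace smoothing to the counts.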