
Naive Bayes is a supervised classification algorithm. Let's use the well-known Titanic dataset to study the Naive Bayes model. The Titanic dataset records features for each passenger on the Titanic and whether they survived. Here, \(y_i\) is binary and indicates whether passenger \(i\) survived, while \(x_i\) contains all the other features (Pclass, Name, Sex, Age, etc.). Notice that some features are discrete (Pclass, Sex) while others are continuous (Fare). The key assumption made in Naive Bayes is that features are conditionally independent given the class label. Mathematically, this yields

$$p(x\mid y=c) = \prod_{j=1}^D p(x_j\mid y=c)$$

where \(D\) is the number of features. For our Titanic example, the goal is to predict whether passenger \(i\) will survive given features Pclass, Name, Sex, etc. Naive Bayes uses the dataset to learn a model that takes a passenger's features as input and predicts whether he or she will survive on the Titanic. Therefore, according to Bayes' theorem, the goal is to compute

$$p(y=c\mid x, D) \propto p(y=c\mid D) \prod_{j=1}^D p(x_j\mid y=c, D)$$

and choose the class with the highest probability. For example, to get the probability that a male passenger in class 1 survives, we compute

$$p(y=1\mid x, D) \propto p(y=1\mid D)\, p(\text{Sex}=\text{Male} \mid y=1, D)\, p(\text{Pclass}=1\mid y=1, D)$$

If we use MLE to train our model, \(p(y=1\mid D)\) is simply the fraction of passengers in the dataset who survived, \(p(\text{Sex}=\text{Male} \mid y=1, D)\) is the fraction of survivors who were male, and \(p(\text{Pclass}=1\mid y=1, D)\) is the fraction of survivors in class 1. We can also train the model using MAP estimation or by computing the full posterior in a Bayesian setting.
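To make the MLE recipe concrete, here is a minimal sketch in Python that estimates these fractions from a few toy rows and scores the "male passenger in class 1" example above. The toy data and column names are assumptions standing in for the real Titanic dataset, not the actual records.

```python
import pandas as pd

# Toy rows standing in for the Titanic data; the real dataset has many more
# passengers and features (Name, Age, Fare, ...). Column names are assumed.
df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1],
    "Sex":      ["male", "male", "male", "female", "female", "male", "female", "male"],
    "Pclass":   [1, 1, 3, 2, 2, 3, 3, 1],
})

# MLE estimates, exactly as described above:
# p(y=1 | D): fraction of passengers who survived
p_survived = (df["Survived"] == 1).mean()

survivors = df[df["Survived"] == 1]
# p(Sex=Male | y=1, D): fraction of survivors who were male
p_male_given_survived = (survivors["Sex"] == "male").mean()
# p(Pclass=1 | y=1, D): fraction of survivors who were in class 1
p_class1_given_survived = (survivors["Pclass"] == 1).mean()

# Unnormalized posterior for a male passenger in class 1 surviving
score_survive = p_survived * p_male_given_survived * p_class1_given_survived

# Same quantities for the y=0 class, so the two scores can be compared
non_survivors = df[df["Survived"] == 0]
score_die = ((1 - p_survived)
             * (non_survivors["Sex"] == "male").mean()
             * (non_survivors["Pclass"] == 1).mean())

# Normalize to turn the proportionality into an actual probability
p_survive_given_x = score_survive / (score_survive + score_die)
print(f"p(survive | Sex=Male, Pclass=1) = {p_survive_given_x:.2f}")
```

We would predict the class with the larger score; normalizing is only needed if we want an actual probability. In practice, a library implementation such as scikit-learn's `CategoricalNB` or `GaussianNB` handles these counts (and smoothing) for us.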

To close, let's discuss some advantages of using Naive Bayes. Since Naive Bayes assumes conditional independence among features, it sidesteps the curse of dimensionality: the number of parameters grows linearly with the number of features rather than exponentially. As a result, it requires relatively little data to train and is extremely fast. Moreover, it has proven to work well in document classification and spam filtering.