Naive Bayes Classifier from Scratch
A manual implementation of the Gaussian Naive Bayes algorithm in Python. Built without ML libraries to demonstrate deep understanding of the underlying probability theory.
The Problem
In modern Data Science, it is easy to import sklearn and fit a model without understanding the underlying mechanics. I wanted to break out of the "black box" mentality and understand exactly how a machine learns to classify data.
I didn't want to simply use a pre-built library. I wanted to implement the mathematical engine myself, translating raw statistical theorems into executable Python code.
The Challenge: Writing an ML algorithm from scratch means handling the math manually. My goal was to build a classifier that could compete with standard libraries in accuracy, relying solely on my own implementation of probability theory.
The Goal
My goal was to build a Gaussian Naive Bayes classifier that demonstrated:
- First Principles Implementation: Coding the Gaussian Probability Density Function (PDF) manually (the formula is given just after this list).
- Statistical Training: Calculating priors, means, and standard deviations for multi-dimensional datasets without using model wrappers.
- Verification: Achieving high classification accuracy (>95%) on the standard Iris Flower dataset to prove the math holds up.
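For reference, the Gaussian PDF at the heart of the classifier, evaluated per feature with the class-specific mean $\mu_c$ and variance $\sigma_c^2$ estimated from the training data, is:

$$
P(x \mid c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \, \exp\!\left(-\frac{(x - \mu_c)^2}{2\sigma_c^2}\right)
$$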
The Solution
Technical Decisions
I chose Python and Pandas for data manipulation, but I strictly forbade the use of scikit-learn for the modeling itself.
The "Math" Challenge:
Instead of calling .fit(), I had to architect the training phase myself. I treated the training data not as a black box, but as a statistical distribution. By grouping samples by class and feature, I reduced the entire dataset to a "Summary Model" consisting only of the class prior plus the Mean and Variance for each attribute, as sketched below.
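A minimal sketch of that training step, assuming the features live in a pandas DataFrame X and the class labels in a Series y (the names and structure here are illustrative, not the project's exact code):

```python
import pandas as pd

def summarize_by_class(X: pd.DataFrame, y: pd.Series) -> dict:
    """Reduce the training data to per-class priors, means, and variances."""
    model = {}
    n_total = len(X)
    for label, group in X.groupby(y):
        model[label] = {
            "prior": len(group) / n_total,   # P(class) estimated from class frequency
            "means": group.mean(),           # per-feature mean for this class
            "variances": group.var(ddof=1),  # per-feature sample variance for this class
        }
    return model
```

Everything the classifier needs at prediction time lives in this small dictionary; the raw training rows can be discarded.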
Algorithms & Challenges
The core engineering challenge was translating Bayes' Theorem into code.
- Gaussian PDF: To determine the likelihood of a data point belonging to a class, I implemented the Gaussian Probability Density Function formula manually. This allows the model to handle continuous data (like petal lengths) by assuming a normal distribution; a sketch follows this list.
- Posterior Calculation: I wrote a custom prediction engine that calculates the posterior probability for every class and selects the winner (argmax).
- Handling Precision: Multiplying many small probabilities together often leads to floating-point underflow, so I learned to structure these minute calculations carefully; one common safeguard is shown in the prediction sketch below.
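The density calculation itself is only a few lines. A sketch of a manual Gaussian PDF (the project's exact function may differ in naming and structure):

```python
import math

def gaussian_pdf(x: float, mean: float, variance: float) -> float:
    """Gaussian density: (1 / sqrt(2*pi*var)) * exp(-(x - mean)^2 / (2*var))."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)
```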
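For the prediction engine, summing log-probabilities instead of multiplying raw probabilities is one common way to sidestep the underflow mentioned above; the sketch below takes that route and reuses the summarize_by_class and gaussian_pdf sketches, so treat it as an assumption rather than the project's exact code:

```python
import math
import pandas as pd

def predict(model: dict, row: pd.Series):
    """Return the class whose (log) posterior is largest for a single sample."""
    best_label, best_log_posterior = None, -math.inf
    for label, stats in model.items():
        # log P(class) + sum of log P(feature | class)
        log_posterior = math.log(stats["prior"])
        for feature in row.index:
            density = gaussian_pdf(row[feature],
                                   stats["means"][feature],
                                   stats["variances"][feature])
            log_posterior += math.log(density + 1e-300)  # tiny floor guards against log(0)
        if log_posterior > best_log_posterior:
            best_label, best_log_posterior = label, log_posterior
    return best_label
```

Applied row by row (for example, X_test.apply(lambda row: predict(model, row), axis=1)), this yields the predicted labels for a held-out set.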
The Result
The final algorithm successfully classified the Iris dataset with 96.6% accuracy, matching the performance of production-grade libraries.
It serves as a transparent proof-of-concept, showing that "Machine Learning" isn't magic—it's just statistics at scale.
Lessons Learned
This project gave me a massive appreciation for Bayesian Statistics. I learned that the "Naive" assumption (that all features are conditionally independent given the class) rarely holds exactly for real datasets, yet it creates a model that is surprisingly robust and efficient.
I also learned the value of vectorized thinking. While my initial approach used loops, understanding how to apply statistical operations across entire columns of data is crucial for writing performant data science code.
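As a hypothetical illustration of that shift, the per-class statistics from the training step can be produced in a single vectorized pass rather than with nested loops; the names here are assumptions, not the project's code:

```python
import pandas as pd

def summarize_vectorized(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Per-class mean and variance for every feature, computed column-wise with no explicit loops."""
    return X.groupby(y).agg(["mean", "var"])
```

On the Iris features this returns one row per species with a (feature, statistic) column for each attribute, replacing several nested loops with a single expression.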