SIFT (Bag of features) + SVM for classification

8 min readApr 16, 2020

Okay, I am gonna try to document what I have learned about this subject (computer vision) from my CS3244 project on Machine Learning.

Lots of thanks to my friend Tiffany who helped me to understand the concepts :)

Big picture

I will broadly classify the overall process into the main steps below:

  1. Identifying keypoints from an image: For each keypoint, we need to extract their features, using a certain feature extraction model such as SIFT, to create a 128-dimensional feature vector that describes it.
  2. Selecting feature descriptors from all the identified feature descriptors for each image (Keeping necessary ones and removing the rest).
  3. Vocabulary construction: From the features we obtain, we have to construct our own set of features (bag of features) to simplify and narrow down the features that we will check future image data against. We use KMeans and histogram for this.
  4. Classifying images using euclidean distance and identifying the key features present in the images in the form of a histogram.
  5. SVM: We use SVM for the final classification of images.

Before I go into details into each of the steps, let’s understand what are feature descriptors.

(Taken from StackOverflow) A feature descriptor is an algorithm that takes an image and outputs feature descriptors/feature vectors. Feature descriptors encode interesting information into a series of numbers and act as a sort of numerical “fingerprint” that can be used to differentiate one feature from another. Ideally, this information would be invariant under image transformation, so we can find the feature again even if the image is transformed in some way such as image rotation, illumination or noise. An example would be SIFT, which encodes information about the local neighbourhood image gradients the numbers of the feature vector.

Step 1: Identifying keypoints from an image (using SIFT)

A SIFT will take in an image and output a descriptor specific to the image that can be used to compare this image with other images.

  • Given an image, it will identify keypoints in the image (areas of varying sizes in the image) that it thinks are interesting. Besides identifying these keypoints, it will determine how big is the interest area.
  • As for how keypoints are being identified, maybe SIFT compares a pixel with its’ neighbours. The implementation will not be gone through here.
  • We can think of SIFT as a function because the same image (regardless of rotation as seen in the image below), will lead to the same keypoints being produced as output.
Image is taken from
  • For each keypoint (inside red square), SIFT may scale and resize the region to a 16 x 16 pixels.
Image is taken from
  • The 16 x 16 pixels will be divided into 16 4x4 pixel squares as seen below. In each of these squares, SIFT will produce a gradient vector (in 8 directions) as seen in the right image below.
  • For each 4x4 squares, SIFT will compute what is called gradient direction histogram over the 8 directions. Each 4x4 squares will have a histogram each. In the example, we will have a total of 16 histograms detailing the feature vectors of each 4x4 squares.
  • Remember that the 16x16 pixel represents the entire keypoint that was identified earlier. And we now have divided the keypoint into further smaller squares (4x4) and have 16 of them. We have 16 histograms. Now, we need to get a summary description of the entire keypoint. We need to have all the histograms combined together and we do so by concatenating the histograms to obtain a 128-dimensional feature vector(16*8) where 16 is the total number of histograms and 8 represents the number of vector directions.

To get a summary of step 1:

Whatever that has been explained in step 1 applies only to a single keypoint inside an image that may comprise more than one keypoint. We have to repeat the whole process above for all the other keypoints identified for a single image.

Let's say an image has 3000 keypoints associated with it. The second line of code below will take in the training data image to produce 128-dimensional feature vectors for each keypoint (descriptors)and frames. The variable descriptors will contain all the descriptors for all the 3000 keypoints in the image being fed.

Step 2: Selecting feature descriptors from all the identified feature descriptors (processed keypoints)

Continuing from above, after we obtain all the feature descriptors from all keypoints in an image, we will need to select which feature descriptors we want. Let’s say I only want 20 feature descriptors, I will pick and keep 20 of them. desc_samples will hence contain 20 feature descriptors. Note that n_sample has been initialised to 20 below.

Then, the next part of the code simply adds desc_samples to a list called total_SIFT_features which is supposed to contain all 20 feature descriptors for all the images. See that we use a for loop and the process is repeated for all training images.

Keypoints and feature descriptors are used interchangeably here…a feature descriptor vector is a processed form of a keypoint.

An image illustrating the general idea

Step 3: Vocabulary construction

After obtaining all 20 feature descriptors (processed keypoints) for ALL training images we will need to start building the vocabulary. The vocabulary is simply a set of feature descriptors that we check future images against to classify them correctly.

Lets’ say we have 100 training images and 20 feature descriptors for each training image. This means we have a total of 2000 feature descriptors (keypoints). We can define the vocabulary size to be of any number, 200 in this case (variable vocab_size).

The last line of code shows how we use clustering through k-means method to obtain the vocabulary of the required size 200.

Since we are going to refer to this vocabulary for future runs of image data, we need to store it. We can do so using pickle.

Here is an image summarising what was mentioned above.

Step 4: Classifying images using histograms (training/test data images)

Here, you still want to construct SIFT features in the same way we did when building the vocabulary.

We check the feature descriptors (keypoints) generated from each image against the vocabulary to determine how many of the 200 feature descriptors are present in this image. In other words, we need to assign each local feature to its nearest cluster center and build a histogram indicating how many times each cluster was used.

Say, we have 14 000 keypoints

Below is a pictorial representation of the histogram, assigned to the variable feats in the code, being produced. Each bin represents each feature descriptor. Following above, there are 200 of them (200 selected from 2000 of them generated from the 1000 training images with 20 feature descriptors each). The histogram simply showcases how many times each feature descriptor was found in the particular image we are trying to classify.

Then, we normalise and convert feats into np array which is to be fed into the SVM.

Step 5: SVM

Now, after obtaining the processed images we need to combine the images with their labels. The labels are simply 1 or 0s. Each image is associated with a label to indicate the presence of something. For eg, if the image is supposed to be an X-ray of someone with Pneumonia, the label should be 1. Otherwise, it will be 0.

Below is a great article explaning how SVM works:

And more about kernels (Changing the hyperplane to find one that best classifies the data):

There are actually many types of SVM kernels (linear, Gaussian, sigmoid etc)and we can choose whichever we want.

We then fit both images and labels.

Refer to the above link for more code examples on how to feed data images and labels into SVM.

Once we have trained the algorithm, the next step is to make predictions on the test data.

Then, we need to evaluate the algorithm. We can do so with the help of a confusion_matrix.

Here, we can focus on the columns titled precision and recall.

Recall: The ability of a model to find all the relevant cases within a data sets. Number of true positives divided by the number of true positive plus false negatives.

Eg, Out of those with pneumonia, how many did you call out?

Precision: The ability of a classification model to identify only the relevant data points. Number of true positives divided by the number of true positives plus false positives.

Eg, How correct are the classification? How many of those that were identified to have pneumonia actually does have pneumonia. The proportion of data that our model says is relevant is actually relevant.

For more about precision and recall:

More helpful links that were referenced:

The end! :)




𓆉︎ 𝙳𝚛𝚎𝚊𝚖𝚎𝚛 🪴𝙲𝚛𝚎𝚊𝚝𝚘𝚛 👩‍💻𝚂𝚘𝚏𝚝𝚠𝚊𝚛𝚎 𝚎𝚗𝚐𝚒𝚗𝚎𝚎𝚛 ☻ I write & reflect weekly about software engineering, my life and books. Ŧ๏ɭɭ๏ฬ ๓є!