Nearest neighbors
This tutorial corresponds to the same set of slides as the previous one, following the introductory developments performed there. Based on the features computed in the previous tutorial, we will implement a simple querying and classification system based on Nearest Neighbors.
Reference slides
Download the slides
- Introduction to artificial intelligence
- Properties of machine learning
- Nearest-neighbors
Tutorial
In this tutorial, we will cover the simplest querying and classification algorithm, namely the \(k\)-Nearest Neighbors (\(k\)-NN) method. The idea is, for a given (usually unknown) element, to find its closest neighbor(s) based on the distances between this element and the known dataset for a given set of features. Formally, given a set of elements \(e_{i}\), \(i\in\left[1,N\right]\), and their corresponding features \(\mathbf{f_{i,m}}\in\mathbb{R}^{D}\) (denoting the \(m^{th}\) feature of the \(i^{th}\) element, which can be \(D\)-dimensional), we need to compute a distance measure \(\mathcal{D}\left(\mathbf{f_{i,m}},\mathbf{f_{j,m}}\right)\) between the features of the elements of the dataset. This distance expresses the dissimilarity between two features. For the first two questions of the tutorial, we will simply consider that the dissimilarity between features is expressed by their Euclidean \((l_{2})\) distance.
\[\begin{equation} \mathcal{D}\left(\mathbf{f_{i,m}},\mathbf{f_{j,m}}\right)=\sqrt{\sum_{d=1}^{D}\left(f_{i,m}^{d}-f_{j,m}^{d}\right)^{2}} \end{equation}\]Given distances for each feature, we need to merge these various dissimilarities and then select the nearest neighbor of a particular element by computing
\[\begin{equation} kNN\left(e_{i}\right)=\overset{k}{\underset{j\neq i}{argmin}}\left(\frac{1}{M}\sum_{m=1}^{M}\left(\mathcal{D}\left(\mathbf{f_{i,m}},\mathbf{f_{j,m}}\right)\right)\right) \end{equation}\]Of course, based on the types of features, different distance measures can be used, such as any of the \(l_{p}\) distances (a generalization of the Euclidean distance)
\[l_{p}\left(\mathbf{f_{i,m}},\mathbf{f_{j,m}}\right)=\sqrt[p]{\sum_{d=1}^{D}\left|f_{i,m}^{d}-f_{j,m}^{d}\right|^{p}}\]the Cosine distance
\[cosine\left(\mathbf{f_{i,m}},\mathbf{f_{j,m}}\right)=1-\frac{\mathbf{f_{i,m}}\cdot\mathbf{f_{j,m}}}{\left\Vert \mathbf{f_{i,m}}\right\Vert \left\Vert \mathbf{f_{j,m}}\right\Vert }=1-\frac{\sum_{d=1}^{D}f_{i,m}^{d}f_{j,m}^{d}}{\sqrt{\sum_{d=1}^{D}\left(f_{i,m}^{d}\right)^{2}}\sqrt{\sum_{d=1}^{D}\left(f_{j,m}^{d}\right)^{2}}}\]or the correlation distance.
\[correlation\left(\mathbf{f_{i,m}},\mathbf{f_{j,m}}\right)=1-\frac{\left(\mathbf{f_{i,m}}-\mu_{\mathbf{f_{i,m}}}\right)\cdot\left(\mathbf{f_{j,m}}-\mu_{\mathbf{f_{j,m}}}\right)}{\left\Vert \mathbf{f_{i,m}}-\mu_{\mathbf{f_{i,m}}}\right\Vert \left\Vert \mathbf{f_{j,m}}-\mu_{\mathbf{f_{j,m}}}\right\Vert }\]where \(\mu_{\mathbf{f_{i,m}}}\) denotes the mean of \(\mathbf{f_{i,m}}\). The same observation holds for the way we decided to “merge” the different distances. Looking at these definitions, start by thinking about the following questions.
Questions
- What problems can arise from \(n\)-dimensional audio features?
- Based on the selection equation, what constraints are implicitly made on the distances?
- Does the Euclidean distance seem like a sound way to measure the similarity between temporal features?
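Before moving on, here is a minimal MATLAB sketch of the distance measures defined above. It is purely illustrative: the handle names lpDist, cosDist and corrDist are our own, and each feature is assumed to be stored as a \(1 \times D\) row vector.

```matlab
% Minimal sketch of the distance measures defined above (illustrative only).
% fi and fj are assumed to be 1 x D feature (row) vectors.
lpDist   = @(fi, fj, p) sum(abs(fi - fj) .^ p) .^ (1 / p);    % l_p distance (p = 2: Euclidean)
cosDist  = @(fi, fj) 1 - (fi * fj') / (norm(fi) * norm(fj));  % cosine distance
corrDist = @(fi, fj) cosDist(fi - mean(fi), fj - mean(fj));   % correlation distance

% Example on two random 12-dimensional features
f1 = rand(1, 12); f2 = rand(1, 12);
[lpDist(f1, f2, 2), cosDist(f1, f2), corrDist(f1, f2)]
```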
1.1 - Querying
First, we can use the nearest-neighbor idea to devise a very simple querying system. This type of method is typically used in systems such as Query By Humming (QBH) software (similar to Shazam). As previously, we provide a baseline code in the main script. We start by creating an \(N \times M\) data matrix dataMatrix corresponding to the \(M\) features of the \(N\) elements of the dataset. We selected here only the SpectralCentroidMean, SpectralFlatnessMean and SpectralSkewnessMean features. Then, once your code is filled in, the dist matrix should contain the mean distances (possibly for various types of distances), which will then be sorted to obtain the nnIDs vector used to select the nearest neighbors.
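As an indication, the distance computation for this querying part could look like the following sketch. This is only one possible version, restricted to the Euclidean distance on one-dimensional features; the names nbElements and queryID are ours and may differ from the actual script.

```matlab
% Illustrative sketch: querying by nearest neighbors with the Euclidean distance.
% dataMatrix is assumed to be the N x M matrix of features described above;
% nbElements and queryID are our own (hypothetical) variable names.
nbElements = size(dataMatrix, 1);
queryID    = randi(nbElements);              % pick a random query element
dist       = zeros(nbElements, 1);
for j = 1:nbElements
    % Mean of the per-feature (here one-dimensional) Euclidean distances
    dist(j) = mean(abs(dataMatrix(queryID, :) - dataMatrix(j, :)));
end
dist(queryID) = Inf;                         % exclude the query itself (j ~= i)
[~, nnIDs] = sort(dist, 'ascend');           % nnIDs(1:10) gives the 10 nearest neighbors
```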
Exercise
- Compute the set of distances between a random element and the rest of the dataset.
- Complete the plotting code in order to plot the element and its 10 nearest neighbors.
- Check that you obtain plots similar to those displayed below.
- Implement the \(l_{p}\), Cosine and correlation distances
- Try the same piece of code by varying the distances and the usedFeatures.
- What can you tell about the discriminative power of features?
- What other steps should we perform on the features?
- (Optional) Extend your code to include temporal features
- (Optional) Extend your code to a fully functional Query-By-Example (QBE) system.
Expected output
1.2 - Classification
For the second part of this tutorial, we will rely on the same technique (computing the distance of a selected point to the rest of the dataset) in a classification framework. The overarching idea behind \(k\)-NN classification is that elements from the same class should have similar properties in the feature space. Hence, the features closest to those of an element should come from elements of its own class. These types of approaches are usually termed distance-based classification methods. The skeleton for this algorithm is provided in the 01_Nearest_Neighbors/knnClassify.m function.
The algorithm will look broadly similar to the previous one. However, this time we will compute the \(k\) nearest neighbors for each of the classes separately, which will allow us to consider the resulting distance vectors as probabilities. Hence, for each class \(\mathcal{C}_{t}\), we will compute the vector of distances and select the \(k\) closest elements of that class.
\[\begin{equation} kNN_{\mathcal{C}_{t}}\left(e_{i}\right)=\overset{k}{\underset{j\in\mathcal{C}_{t} \wedge j\neq i}{argmin}}\left(\frac{1}{M}\sum_{m=1}^{M}\left(\mathcal{D}\left(\mathbf{f_{i,m}},\mathbf{f_{j,m}}\right)\right)\right) \end{equation}\]Then, in order to consider the distances as probabilities, we compute for each class the mean distance of its \(k\) nearest neighbors and normalize these distances across classes
\[p_{\mathcal{C}_{t}}\left(e_{i}\right)=\frac{1}{k}\sum_{j\in kNN_{\mathcal{C}_{t}}\left(e_{i}\right)}\left(\frac{1}{M}\sum_{m=1}^{M}\mathcal{D}\left(\mathbf{f_{i,m}},\mathbf{f_{j,m}}\right)\right)\]In the knnClassify function, we store in testFeatures the vector of features of the element we are trying to classify, and construct a cell of features for each class in the classFeats cell.
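As an indication, the core of the per-class computation could be sketched as follows. This is purely illustrative: it only handles the Euclidean case (the distType and normalize arguments are ignored), assumes testFeatures is a \(1 \times M\) vector and each classFeats{t} an \(N_{t} \times M\) matrix, and uses intermediate variable names of our own choosing.

```matlab
% Illustrative sketch of the per-class k-NN distances (Euclidean case only).
% testFeatures: 1 x M vector, classFeats{t}: N_t x M matrix of class t, k: number of neighbors.
nbClasses  = length(classFeats);
classDists = zeros(nbClasses, 1);
for t = 1:nbClasses
    feats = classFeats{t};
    % Mean feature distance between the test element and every element of class t
    dists = mean(abs(feats - repmat(testFeatures, size(feats, 1), 1)), 2);
    % Keep the k closest elements of the class and average their distances
    sortedDists   = sort(dists, 'ascend');
    classDists(t) = mean(sortedDists(1:min(k, length(sortedDists))));
end
% Normalize the mean distances across classes (lowest value = most likely class)
probas = classDists / sum(classDists);
[~, assignedClass] = min(probas);
```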
Exercise
- Update the knnClassify code to perform the basic \(k\)-NN classification.
- Run the algorithm for both 1-NN and 5-NN evaluation.
- Plot the different confusion matrices to visually check the accuracies (you should obtain the values displayed in the following figure).
- Extend the code to take various distances into account (argument distType).
- What is the use of “class weighting” (argument normalize)?
- Implement the class weighting system and evaluate its effect.
- Perform the same classification with various values of \(k\) and various features to evaluate the properties and qualities of different parametrizations.
- (Optional) Automate the evaluation of various configurations.
Expected output
1.3 - Evaluation
When proposing machine learning algorithms, one fundamental aspect lies in correctly evaluating their performance. Depending on the application, method, dataset and even the nature of the corresponding data, a plethora of evaluation measures can be used. For those interested in machine learning, we highly recommend the following articles, so that you develop your critical mind, do not limit yourself to narrow evaluations (notably by relying on statistical tests), and avoid cherry picking.
- Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1-30. PDF
- Sturm, B. L. (2013). Classification accuracy is not enough. Journal of Intelligent Information Systems, 41(3), 371-406. PDF
- Keogh, E., & Kasetty, S. (2003). On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Mining and Knowledge Discovery, 7(4), 349-371. PDF
However, for the scope of this tutorial, we will limit ourselves to the typical measures minimally required to evaluate your classifier. Overall, the most important aspect of evaluation lies in the different ways of comparing the real labels (ground truth) to the assigned labels (picked by the classifier).
- The confusion matrix is computed simply by counting the occurrences in which a particular instance of a real label (row) is classified as a given assigned label (column). This code is already provided in the starter code, and all the following measures can be derived directly from it.
- The overall accuracy is computed as the number of correctly classified examples divided by the total number of examples.
- The (per-class) precision is the number of examples correctly assigned to a class divided by the number of instances assigned to that class by the classifier.
- The (per-class) recall is the number of examples correctly assigned to a class divided by the number of instances really belonging to that class.
- The F1 measure is defined as the harmonic mean of the precision and recall measures.
You can implement these measures by simply completing the starter code. If you have doubts about the implementation of these measures, you can check the corresponding Wikipedia article.
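For reference, all of these measures can be derived from the confusion matrix along the following lines (a minimal sketch, where confusion is a hypothetical name for the matrix computed by the starter code, with real labels as rows and assigned labels as columns):

```matlab
% Illustrative sketch: deriving the measures from a confusion matrix, where
% confusion(r, a) counts the elements of real class r assigned to class a.
accuracy  = sum(diag(confusion)) / sum(confusion(:));
precision = diag(confusion) ./ sum(confusion, 1)';  % per class: correct / assigned to the class
recall    = diag(confusion) ./ sum(confusion, 2);   % per class: correct / really in the class
f1        = 2 * (precision .* recall) ./ (precision + recall);
```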
Exercise
- Implement the accuracy, recall, precision and F1 measures.
- Evaluate the previous algorithms with your new measures.
- Automate the previous evaluations.
- Run the automatic evaluation for different values of \(k\), distances and features.
- Plot the corresponding results for comparison.