LET’S IMPLEMENT “LAZY LEARNING” ALGORITHM USING XL8ML — PART I

May 20, 2021

XL8ML (https://xl8ml.com) is a powerful parsing and data manipulation language that gets me data in the format I need, and its state-of-the-art machine learning functions let me perform advanced predictive analytics without leaving the Excel work environment.

Have you ever wondered how companies like Netflix and Amazon recommend movies to watch and things to buy? You may not be surprised to learn that these companies apply machine learning algorithms to data gathered about the movies you have watched or the things you have bought on their websites. In this blog, we will explore the implementation of the K nearest neighbors (KNN) algorithm, also known as a lazy learning algorithm, using XL8ML software.

Before we discuss KNN, let’s first understand what supervised learning is. Here the algorithm learns patterns from labelled data and uses them to determine the label for new, unlabelled data. Supervised learning can be divided into:

· Regression: A numerical value is predicted by observing previous values. You can check out https://poojadurai1997.medium.com/a-sneak-peek-into-linear-regression-using-xl8ml-software-e558d3fc316b to understand regression better.

· Classification: The category that the data belongs to is predicted.

The K nearest neighbors algorithm is a supervised learner that is used for both classification and regression problems. In Part I, we will look at KNN as a classifier.

KNN stores the entire dataset in its training phase. Whenever a prediction is required for an unseen data point, it searches through the entire training dataset for the k most similar instances, and the most common label among those instances is returned as the prediction. It is often used in applications where you are looking for similar items. For example, if an apple looks more similar to a peach, a pear, and a cherry (fruits) than to a dog, a cat, or a rat (animals), then the apple is most likely a fruit.

KNN is considered both non-parametric and an example of lazy learning.

· Non-parametric means that it makes no assumptions about the underlying distribution of the data. The model is built entirely from the data given to it rather than assuming that the data follows a particular structure, such as a normal distribution.

· Lazy learning means that the algorithm does not build a generalized model ahead of time. There is little training involved when using this method: the algorithm simply stores the training data. Because of this, all of the training data is needed at prediction time when using KNN.

It uses a very simple approach to perform classification. When given a new example, it looks through the training data and finds the k training examples that are closest to the new example. It then assigns the most common class label (among those k training examples) to the new example. The k in the KNN algorithm represents the number of nearest neighbor points that vote for the new example’s class. If k=1, the new example is given the same label as the closest example in the training set. If k=3, the labels of the three closest examples are checked and the most common label (one occurring at least twice) is assigned.
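To make the voting idea concrete, here is a minimal sketch of a KNN classifier in plain Python. It is not how XL8ML implements it internally; the function name knn_predict, the use of Euclidean distance, and the toy (age, salary) data are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    # Euclidean distance from the query to every training point (assumed metric)
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    ]
    # Keep the k closest points and count how often each label appears
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up toy data: (age, salary in thousands) -> purchased (1) or not (0)
train_X = [(22, 20), (25, 30), (47, 90), (52, 110), (46, 80)]
train_y = [0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (24, 25), k=1))  # nearest point has label 0
print(knn_predict(train_X, train_y, (45, 85), k=3))  # all three neighbors have label 1
```

Note that nothing is "learned" up front: every call scans the whole training set, which is exactly why KNN is called a lazy learner.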

Let’s get started on the implementation! We are using the Social Network Ads dataset. The dataset contains details of users of a social networking site, and the task is to predict whether a user buys a product after clicking an ad on the site, based on their salary, age, and gender.

Now let us load the dataset into Excel. This is how the dataset looks after loading into Excel.

We need to separate the individual columns in order to analyze the data. This can be done using the xl_s_split(input range,delimiter,trim) function from XL8ML software. Once we do that, we get the desired format.
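For readers who think in Python, the splitting step is roughly equivalent to the sketch below. The helper split_row, the comma delimiter, and the sample rows are assumptions for illustration only; they are not taken from the actual dataset or from XL8ML.

```python
def split_row(text, delimiter=",", trim=True):
    """Split one delimited line into fields, optionally stripping whitespace."""
    fields = text.split(delimiter)
    return [field.strip() for field in fields] if trim else fields

# Hypothetical raw rows, one per spreadsheet cell: gender, age, salary, purchased
raw_rows = ["Male, 19, 19000, 0", "Female, 35, 60000, 1"]

# Each raw string becomes a list of separate column values
table = [split_row(row) for row in raw_rows]
print(table)
```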

In order to build the classifier model, we need to use two functions from XL8ML software.

1. xl_ml_knn_classifier_train()

2. xl_ml_knn_classifier_predict()

The dataset is split randomly into training and testing datasets in an 80:20 ratio. Let us give the trained model a name; let it be ‘TRAIN’. We then apply the function xl_ml_knn_classifier_train(model name,input range,output range,weight,k) over the range of cells. ‘k’ is the number of nearest training samples (neighbors) that vote on each prediction, and ‘weight’ can take two values: uniform and distance. When weight is ‘uniform’, all points in the neighborhood are weighted equally. When weight is ‘distance’, closer neighbors of a query point have a greater influence than neighbors that are further away.
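The two ideas in this step, a random 80:20 split and the ‘uniform’ versus ‘distance’ weighting, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names train_test_split and knn_vote are hypothetical, and inverse distance (1/d) is one common way to weight neighbors, not necessarily the formula XL8ML uses.

```python
import math
import random
from collections import defaultdict

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out test_fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def knn_vote(train, query, k, weight="uniform"):
    """Predict a label from the k nearest (features, label) pairs."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    scores = defaultdict(float)
    for features, label in nearest:
        if weight == "uniform":
            scores[label] += 1.0  # every neighbor gets one vote
        else:
            # "distance": closer neighbors count more (assumed 1/d weighting)
            scores[label] += 1.0 / (math.dist(features, query) + 1e-9)
    return max(scores, key=scores.get)

# Made-up one-dimensional points: one close "A", two farther "B"s
points = [((0,), "A"), ((4,), "B"), ((5,), "B")]
print(knn_vote(points, (1,), k=3, weight="uniform"))   # B: two votes beat one
print(knn_vote(points, (1,), k=3, weight="distance"))  # A: the close point dominates

train_rows, test_rows = train_test_split(list(range(10)))
print(len(train_rows), len(test_rows))  # 8 2
```

The small example shows why the weight setting matters: with the same neighbors, the two schemes can disagree.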

We get the trained model in cell F3. We will be considering this moving forward for other calculations.

Now let’s predict whether each person in the test data will make a purchase, using xl_ml_knn_classifier_predict(trained model,input range).

Now it’s time to evaluate the model we have built. We can use metrics like accuracy score, precision and recall to evaluate the performance. First we need to understand what these metrics are.

Accuracy

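Accuracy is the ratio of correctly predicted observations to the total number of observations. True Positive is the number of correct predictions that the occurrence is positive. True Negative is the number of correct predictions that the occurrence is negative. Accuracy is therefore the sum of true positives and true negatives divided by the total number of observations.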
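As a quick illustration of the definition, the sketch below computes accuracy for a small set of made-up labels. The function name accuracy and the sample lists are assumptions for illustration; they are not XL8ML output.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Made-up labels: 1 = purchased, 0 = did not purchase
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Correct positive and correct negative predictions
true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
true_negatives = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

print(true_positives, true_negatives)  # 3 3
print(accuracy(actual, predicted))     # 0.75, i.e. (3 + 3) / 8
```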

Precision and Recall

Precision is the fraction of relevant instances among the retrieved instances, while recall is the fraction of all relevant instances that have been retrieved. Together they are used as measures of relevance.
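In classification terms, "retrieved" means "predicted positive" and "relevant" means "actually positive". A minimal sketch, assuming binary labels where 1 is the positive class (the function name and sample data are made up for illustration):

```python
def precision_recall(actual, predicted, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    # Guard against empty denominators
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels: 1 = purchased, 0 = did not purchase
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall(actual, predicted)
print(precision)  # 2 of the 3 predicted buyers really bought
print(recall)     # 2 of the 4 real buyers were found
```

Here the model is right two times out of three when it predicts a purchase, but it finds only half of the actual purchasers.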

F1 Score

It is the harmonic mean of precision and recall.

XL8ML software offers these metrics for evaluation. The accuracy score can be calculated using xl_ml_accuracy(LHS,RHS). We get an accuracy score of 0.78, so about 78% of the test examples were labelled correctly. Our model has performed reasonably well!

The precision and recall can be calculated using xl_ml_precision(LHS,RHS) and xl_ml_recall(LHS,RHS) respectively. We get a precision of 0.60 and a recall of 0.68. The question that precision answers is: of all the customers who are labelled as purchased, how many actually purchased? High precision corresponds to a low false positive rate. The question that recall answers is: of all the customers who truly purchased, how many are labelled by the model as purchased? Both metrics look acceptable for this model, as they are above 0.5.

The F1 score can be calculated using xl_ml_f1(LHS,RHS,beta), where beta is a positive real factor. We get 0.63. The F1 score gives us a single number that represents both precision and recall. We have a good score for the model built.
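For reference, the standard F-beta formula combines precision P and recall R as (1 + beta²)·P·R / (beta²·P + R), and F1 is the special case beta = 1, the harmonic mean. The sketch below applies that textbook formula; the function name f_beta is hypothetical, and whether xl_ml_f1 uses beta in exactly this way is an assumption.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator

# Precision and recall as reported above (rounded to two decimals)
score = f_beta(0.60, 0.68, beta=1.0)
print(round(score, 4))
```

With the rounded precision and recall this comes out at roughly 0.64, close to the 0.63 reported; the small gap is presumably due to rounding of the inputs.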

We have covered the basic concepts of the KNN algorithm and built a classifier model using XL8ML software. It is a good algorithm to use when beginning to explore the world of machine learning. You can try different datasets to get familiar with this algorithm. XL8ML is very user friendly, as its functions are very similar to Python’s. Please do check it out at https://xl8ml.com. Stay tuned for Part II, where we will build a KNN regressor!

Written by Poojadurai