Decrypting the world of Cryptocurrency with Random Forest-Part II

Poojadurai
6 min read · Jun 1, 2021

In the previous blog, we created a dataset using historical data from Quandl and technical indicators built with Tulip Indicators, with the help of the XL8ML software in Excel. In this blog, we will use this dataset to build a Random Forest classifier with XL8ML to predict whether the Cryptocurrency's value will increase or decrease over the next 7 days!

Let us first understand the intuition behind the Random Forest algorithm. It is good to start with a real-life analogy to understand it better.

Edward is a movie buff. He wants to decide which movie to watch during the weekend, so he asks the people who know him best for suggestions. The first friend he seeks out asks him about his likes and dislikes in movies and, based on the answers, gives Edward some advice; in other words, the friend uses Edward's answers to create rules that guide his recommendation. Afterwards, Edward asks more and more of his friends, who again ask him different questions from which they derive their own recommendations. Finally, Edward chooses the movies that are recommended to him the most, which is the typical Random Forest approach.

Random Forest is a supervised learning algorithm. The "forest" it builds is an ensemble of decision trees: it builds multiple decision trees and merges them together to get a more accurate and stable prediction. One big advantage of Random Forest is that it can be used for both classification and regression problems.

Random Forest adds randomness to the model while growing the trees. Instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features. This results in a wide diversity that generally results in a better model.

Therefore, in Random Forest, only a random subset of the features is taken into consideration by the algorithm for splitting a node.
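To make this concrete, here is a minimal sketch of the same idea in scikit-learn (my own illustration, not part of the article and not how XL8ML is implemented): each tree is grown on a bootstrap sample of the rows, and only a random subset of the features is considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data, just to show the mechanics
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the "forest"
    max_features="sqrt",  # random subset of features considered at each split
    bootstrap=True,       # each tree is grown on a random sample of the rows
    random_state=42,
)
forest.fit(X, y)
print(forest.predict(X[:5]))
```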

Since Random Forest combines multiple trees to predict the class of an observation, some decision trees may predict the correct output while others may not; taken together, though, the majority vote of the trees tends to predict the correct output. Therefore, below are two assumptions for a better Random Forest classifier:

  • The feature variables of the dataset should contain some actual signal, so that the classifier can predict accurate results rather than guesses
  • The predictions from the individual trees must have very low correlation with each other

Let’s start building our own Random Forest classifier. Below is the dataset we created, which will be fed into the model.

Dataset

It’s time to create our target variable. Since we need to predict the direction of the Cryptocurrency price over the next 7 days, let’s calculate the 7-day change in the weighted price using xl_to_log_returns(Input range, Period), where Period is 7.

xl_to_log_returns()

Using this column, we create our target variable with SIGN(Number), which gives +1 for a positive change and -1 for a negative one.

SIGN()
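For readers who want to sanity-check these two steps outside Excel, a rough pandas/NumPy sketch might look as follows; the DataFrame df and the "Weighted Price" column name are assumptions on my part, based on the Quandl data from Part I.

```python
import numpy as np

# Assumed: df is a pandas DataFrame holding the Part I dataset,
# with Quandl's "Weighted Price" column (the column name is an assumption)
prices = df["Weighted Price"]

# 7-day log return, looking 7 days ahead: ln(P_{t+7} / P_t)
log_ret_7d = np.log(prices.shift(-7) / prices)

# Target: +1 if the weighted price rises over the next 7 days, -1 if it falls
target = np.sign(log_ret_7d).dropna()
```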

The dataset is randomly split into training and testing sets in an 80:20 ratio. Let us give the trained model a name, say ‘TRAIN’. We then apply the function xl_ml_random_forest_classifier_train(model name, input range, output range, number of levels, min_split, min_leaf, number of forests) over the range of cells.

xl_ml_random_forest_classifier_train()
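As a rough analogue of this step in Python (a sketch only, not how XL8ML works internally), the hyperparameters map onto scikit-learn's arguments roughly as shown below; both the mapping and the values are assumptions, not the article's settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X = the technical-indicator features, y = the +1/-1 target built above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # random 80:20 split
)

model = RandomForestClassifier(
    max_depth=5,           # "number of levels" (placeholder value)
    min_samples_split=10,  # "min_split" (placeholder value)
    min_samples_leaf=5,    # "min_leaf" (placeholder value)
    n_estimators=100,      # "number of forests" (placeholder value)
    random_state=42,
)
model.fit(X_train, y_train)
```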

We get the trained model, which we will use for the remaining calculations. Let’s predict the direction of the Cryptocurrency price using xl_ml_random_forest_classifier_predict(trained model, input range).

xl_ml_random_forest_classifier_predict()
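Continuing the Python sketch above, prediction is a single call on the held-out rows.

```python
import numpy as np

# +1/-1 direction predictions for the 20% test split
pred = model.predict(X_test)
print(np.unique(pred, return_counts=True))  # how many "up" vs "down" calls
```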

Evaluating the model is a critical step for analyzing how well it performs. We can use metrics like accuracy, precision and recall to evaluate the performance of a classifier. First, we need to understand what these metrics are.

Accuracy

Accuracy is the ratio of correctly predicted observations to the total number of observations. A true positive (TP) is a correct prediction that the occurrence is positive, and a true negative (TN) is a correct prediction that the occurrence is negative; false positives (FP) and false negatives (FN) are the corresponding incorrect predictions. In these terms, Accuracy = (TP + TN) / (TP + TN + FP + FN).

Precision and Recall

Precision is the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that were retrieved out of all relevant instances. In classification terms, Precision = TP / (TP + FP) and Recall = TP / (TP + FN). They are basically used as measures of relevance.

F1- Score

It is the (weighted) harmonic mean of precision and recall, combining both into a single score.

The XL8ML software offers these metrics for evaluation purposes. The accuracy score can be calculated using xl_ml_accuracy(LHS,RHS). We get an accuracy score of 0.54. Our model has performed well!

xl_ml_accuracy()
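If you are following along with the Python sketch, the corresponding step is simply accuracy_score on the actual and predicted labels.

```python
from sklearn.metrics import accuracy_score

# Fraction of test rows where the predicted direction matches the actual one
print(accuracy_score(y_test, pred))
```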

You might be surprised to see Random Forest performing well on such heterogeneous data. In a binary classification problem, a Random Forest classifier generally works better than a single decision tree classifier. This can be explained by the binomial distribution. The binomial distribution B(n, p) is the discrete probability distribution of the number of successes in a sequence of n independent Bernoulli trials, each with probability p of success.

Assuming each tree in the forest is independent of one another and the probability of making the correct prediction from each tree is constant and greater than 50%, then the probability of more than half of the trees giving the correct predictions is greater than the probability of any individual tree making the correct prediction.

Let us calculate the binomial probability for our model using a binomial calculator.

Binomial Probability

You can see there is only a 10.3% probability of this result happening by chance. That’s a very interesting insight!
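The same check can be done in code. The sketch below uses scipy; the test-set size n is a placeholder I made up, so it will not reproduce the 10.3% figure exactly, but it shows the calculation.

```python
from scipy.stats import binom

n = 200            # number of test predictions (placeholder, not the article's value)
p = 0.5            # chance of a correct call from a no-skill model
k = int(0.54 * n)  # correct calls implied by a 0.54 accuracy score

# Probability of doing at least this well by pure chance
print(binom.sf(k - 1, n, p))
```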

Precision and recall can be calculated using xl_ml_precision(LHS,RHS) and xl_ml_recall(LHS,RHS) respectively. We get a precision of 0.77 and a recall of 0.48. The question precision answers is: of all the instances where the Cryptocurrency price direction is predicted as increasing, how many times does it actually increase? High precision corresponds to a low false positive rate, and we got a precision of 0.77. The question recall answers is: of all the instances where the price truly increased, how many did the model label as increasing? We got a recall of 0.48. Both metrics are looking good for this model.

xl_ml_precision()
xl_ml_recall()
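In the Python sketch, these correspond to precision_score and recall_score, treating +1 (a price increase) as the positive class.

```python
from sklearn.metrics import precision_score, recall_score

print(precision_score(y_test, pred, pos_label=1))
print(recall_score(y_test, pred, pos_label=1))
```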

The F1 score can be calculated using xl_ml_f1(LHS,RHS,beta), where beta is a positive real factor that weights recall against precision. We get 0.60. The F1 score gives us a single number that reflects both precision and recall, and this is a good score for the model we built.

xl_ml_f1()
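The Python counterpart is fbeta_score, where beta plays the same weighting role (beta = 1 gives the usual F1 score).

```python
from sklearn.metrics import fbeta_score

print(fbeta_score(y_test, pred, beta=1, pos_label=1))
```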

Random Forest is a great algorithm to train early in the model development process to see how it performs, and it is also a great choice for anyone who needs to develop a model quickly. On top of that, it provides a pretty good indicator of the importance it assigns to your features. We have successfully predicted the direction of the Cryptocurrency price using XL8ML, which is very user friendly since its functions feel very similar to Python. Please do check it out at https://xl8ml.com. Stay tuned for more blogs!

