A Sneak Peek into Linear Regression using XL8ML software

Poojadurai
6 min read · May 18, 2021

XL8ML (https://xl8ml.com) is to Excel what Pandas and scikit-learn are to Python. I came across it as I was trying to prototype some machine learning ideas before investing a huge amount of time in coding. I often find that 90% of the Python code is about data extraction, transformation, and loading (ETL), the preparation, rather than the actual application of ML concepts. To that end, XL8ML offers a way to skip that first step, or at least cut the effort by around 80%, allowing one to concentrate on the problem at hand.

Artificial Intelligence is driving remarkable growth of advanced, intelligent machines in day-to-day enterprise all over the world. Machine learning is the idea of programming a machine so that it learns from its experience and from examples, without being explicitly programmed. It is an application of AI that allows machines to learn on their own. Machine learning algorithms are a combination of math and logic that adjust themselves to perform better as the input data varies. In this blog, we will build a linear regression prediction model, a supervised learning technique, using XL8ML software to predict housing prices from the famous Boston Housing dataset.

Linear regression is used to predict a continuous dependent variable with the help of independent variables. The goal of linear regression is to find the best-fit line that can accurately predict the output for the continuous dependent variable. By finding the best-fit line, the algorithm establishes the relationship between the dependent variable and the independent variables. The output of linear regression should only be continuous values such as price, age, salary, etc.

Linear Regression is generally classified into two types:

· Simple Linear Regression

· Multiple Linear Regression

In simple linear regression there is one dependent variable and one independent variable, for example the relationship between height and weight. In multiple linear regression there is one dependent variable and many independent variables, for example predicting house prices from multiple inputs such as area, locality, and amenities.
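Formally, simple linear regression fits a line of the form y = b0 + b1·x, while multiple linear regression extends this to y = b0 + b1·x1 + b2·x2 + … + bn·xn, where b0 is the intercept and b1 through bn are the coefficients of the independent variables.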

There are four assumptions associated with a linear regression model:

1. Linearity: The relationship between X and the mean of Y is linear.

2. Homoscedasticity: The variance of the residuals is the same for any value of X.

3. Independence: Observations are independent of each other.

4. Normality: For any fixed value of X, Y is normally distributed.

Now let us understand the data we are using to build the linear regression model. The Boston Housing dataset is a small dataset of 506 observations containing information about houses in the Boston area. The dataset is clean, with no missing values. A regression model is trained to predict the selling price of a house based on the input features. Below are the features present in the dataset:

· CRIM : per capita crime rate by town

· INDUS : the proportion of non-retail business acres per town

· CHAS : Charles River dummy variable (1 if the tract bounds river otherwise 0)

· NOX : nitric oxides concentration (parts per 10 million)

· RM : the average number of rooms per dwelling

· AGE : the proportion of owner-occupied units built prior to 1940

· DIS : weighted distances to five Boston employment centers

· RAD : index of accessibility to radial highways

· TAX : full-value property-tax rate per $10,000

· PTRATIO : pupil-teacher ratio by town

· B : 1000(Bk — 0.63)² where Bk is the proportion of blacks by town

· LSTAT : % lower status of the population

· ZN : the proportion of residential land zoned for lots over 25,000 sq.ft.

Let’s get started! The first and foremost thing is to load the dataset into Excel. This is how the dataset looks after loading into Excel.

Data when loaded into Excel

We need to separate out the individual columns in order to analyze the data. This can be done with the xl_s_split(input range, delimiter, trim) function from XL8ML. Once we do that, we get the desired format.
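As a rough illustration (the cell references here are hypothetical and depend on where the raw data lands), splitting the comma-separated text in cell A2 would look something like:

=xl_s_split(A2, ",", TRUE)

where the last argument is assumed to control whether surrounding whitespace is trimmed, following the signature above.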

Data after applying xl_s_split

In order to find the relationship between the features and target variable, we need to find the correlation coefficients. The correlation coefficient ranges from -1 to 1. If the value is close to 1, it means that there is a strong positive correlation between the two variables. When it is close to -1, the variables have a strong negative correlation.

Correlation coefficients of X variables with PRICE variable
Scatter plots of the most highly correlated variables against the PRICE variable

The correlation coefficients are found using the CORREL() function in Excel. We can see that the RM and LSTAT variables are most strongly correlated with PRICE, so we will use these two features for the model.
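For instance, assuming RM ends up in column F and PRICE in column N after the split (hypothetical column positions, with 506 data rows below a header), the correlation between RM and PRICE could be computed as:

=CORREL(F2:F507, N2:N507)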

In order to build the regression model, we are going to use four functions from the XL8ML software:

1. xl_ml_linear_regression_train()

2. xl_ml_linear_regression_coefficient()

3. xl_ml_linear_regression_intercept()

4. xl_ml_linear_regression_predict()

The dataset is split randomly into training and testing sets in a 70:30 ratio. Let us give the training model a name, say ‘TRAIN’. We then apply the function xl_ml_linear_regression_train(model name, input range, output range) over the range of cells.
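As a sketch, if the roughly 354 training rows have RM and LSTAT in columns A:B and PRICE in column C (hypothetical ranges; the exact layout depends on how the split rows were arranged), the call would look something like:

=xl_ml_linear_regression_train("TRAIN", A2:B355, C2:C355)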

xl_ml_linear_regression_train()

We get the trained model in cell F3. We will refer to this cell in the remaining calculations.

Trained model

Next we calculate the regression coefficients using xl_ml_linear_regression_coefficient(trained model). Coefficients are the values that multiply the predictor values. The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable.

o A positive sign indicates that as the predictor variable increases, the response variable also increases

o A negative sign indicates that as the predictor variable increases, the response variable decreases
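Since the trained model handle lives in cell F3, as noted above, the coefficients for RM and LSTAT can be pulled with a formula of the form:

=xl_ml_linear_regression_coefficient(F3)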

xl_ml_linear_regression_coefficient()

Then we calculate the intercept using xl_ml_linear_regression_intercept(trained model). The intercept is the constant term at which the regression line crosses the y-axis.
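Again referring to the trained model in cell F3, the intercept comes from:

=xl_ml_linear_regression_intercept(F3)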

xl_ml_linear_regression_intercept()

Now let's predict the prices for the test data using xl_ml_linear_regression_predict(trained model, input range).
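For example, with the RM and LSTAT values of one test observation in cells A2:B2 of the test sheet (a hypothetical layout), the predicted price would be:

=xl_ml_linear_regression_predict(F3, A2:B2)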

xl_ml_linear_regression_predict()

The R score and Mean Squared Error (MSE) values are then calculated. We get an R score of 0.61 and an MSE of 36.4.
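Both metrics can be computed with built-in Excel functions. Assuming the roughly 152 actual test prices sit in A2:A153 and the predicted prices in B2:B153 (hypothetical ranges), one way to get them is:

=1 - SUMXMY2(A2:A153, B2:B153) / DEVSQ(A2:A153)
=SUMXMY2(A2:A153, B2:B153) / COUNT(A2:A153)

where SUMXMY2 returns the sum of squared differences between the actual and predicted values, so the first formula gives the R score and the second gives the MSE.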

Actual PRICE vs Predicted PRICE

We have successfully built a linear regression model using XL8ML software! I found this exercise quite pleasant with XL8ML (https://xl8ml.com). I was able to quickly test my hypotheses in an interactive environment and still achieve results very similar to Python. Please do check it out at https://xl8ml.com. Stay tuned for more blogs!


Poojadurai

I am a curious learner who loves to explore trending technologies and experiment with them. I like to share my thoughts on the latest topics.