In this article we will learn, step by step, how to do machine learning through predictive analysis, using linear regression in the R language, with an example.
Introduction
We will start with Tom M. Mitchell's definition of machine learning:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Machine learning is broadly classified into supervised learning and unsupervised learning. In supervised learning, the training data set contains both the inputs and their known outputs, which means a direct relationship exists between the input and the output. In other words, we predict the result for a future element based on an analysis of past data set(s).
Supervised learning is further categorized into regression and classification problems.
In a regression model we predict a continuous output, while in a classification model we predict a discrete output.
Linear regression can be defined as follows:
In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X.
Bear in mind that simple linear regression has only one independent variable. What this means will become clear below.
We will use RStudio for this purpose.
Let's start with an example
Say we have been presented with the following set of data:
Salary data of Employees
| X - (Years of Exp.) | Y - (Salary in INR) |
| 3 | 30 |
| 8 | 57 |
| 9 | 64 |
| 12 | 72 |
| 3 | 36 |
| 6 | 43 |
| 11 | 59 |
| 21 | 90 |
| 1 | 12 |
| 16 | 83 |
In this data set, X denotes the number of years an employee has worked, while Y denotes his or her salary. We model this as a linear relationship between the two variables X and Y of the form
Y = a + X * b
where,
Y - dependent/response variable
X - independent/predictor variable
a - intercept
b - slope of the line
a and b are constants, called the coefficients.
What are we going to solve?
The data above is our set of training data (historical data). Using it, we have to train our predictive model with the simple linear regression algorithm. Once the model is trained, i.e. the machine has learnt what to do, we will predict Y for a new value of X.
Straight to experiment
Open RStudio. First we will establish a relationship model between X (predictor variable) and Y (response variable) and obtain the coefficients (a and b). For this we will use R's lm() function, which fits a linear model relating the response variable to the predictor variable.
#Training data for predictor variable
x <- c(3,8,9,12,3,6,11,21,1,16)
#Training data for response variable
y <- c(30,57,64,72,36,43,59,90,12,83)
# Using lm() function to establish the Relationship between predictor and response variable.
relation <- lm(y~x)
# Print the fitted model (intercept and slope)
print(relation)
The printed intercept and slope give us the mathematical equation for our simple linear regression model:
Y = 20.927 + X * 3.741 [ Y = a + X * b ]
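As a cross-check, the same coefficients can be derived by hand from the ordinary least-squares formulas: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept follows from the fact that the fitted line passes through (mean(x), mean(y)). The sketch below assumes the same x and y vectors as above; the names a_manual and b_manual are chosen here for illustration only.

```r
# Training data (same values as in the example above)
x <- c(3,8,9,12,3,6,11,21,1,16)
y <- c(30,57,64,72,36,43,59,90,12,83)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
b_manual <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: the fitted line passes through the point (mean(x), mean(y))
a_manual <- mean(y) - b_manual * mean(x)

# Compare the hand-computed values with the coefficients estimated by lm()
relation <- lm(y ~ x)
print(c(a = a_manual, b = b_manual))
print(coef(relation))
```

Both print statements should show an intercept of about 20.927 and a slope of about 3.741.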
Let us make a scatter plot of our data set, together with the fitted line, as follows:
#Training data for predictor variable
x <- c(3,8,9,12,3,6,11,21,1,16)
#Training data for response variable
y <- c(30,57,64,72,36,43,59,90,12,83)
# Using lm() function to establish the Relationship between predictor and response variable.
relation <- lm(y~x)
# Give the chart file a name.
png(file = "D:\\SalaryDataSimpleLinearRegression.png")
# Plot the chart: years of experience on the horizontal axis, salary on the vertical axis.
plot(x,y,col="blue",main="Salary and Year Of Experience Of Employee Records",
cex = 1.8,pch=20,xlab="Years of Exp",ylab="(Salary in INR)")
# Add the fitted regression line to the chart.
abline(relation)
# Save the file.
dev.off()
The resulting scatter plot shows that although the data points do not fall exactly on a straight line, the pattern suggests that there is indeed a linear relationship between X (Years of Exp.) and Y (Salary in INR).
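The visual impression can also be checked numerically. As a rough sketch, cor() gives the Pearson correlation between x and y, and summary() of the fitted model reports R-squared, the share of the variance in y that the line explains. The names r and r_squared are illustrative and not part of the original example.

```r
# Training data (same values as in the example above)
x <- c(3,8,9,12,3,6,11,21,1,16)
y <- c(30,57,64,72,36,43,59,90,12,83)

# Pearson correlation: values close to 1 indicate a strong positive linear relationship
r <- cor(x, y)
print(r)

# R-squared of the fitted model; for simple linear regression it equals r^2
relation <- lm(y ~ x)
r_squared <- summary(relation)$r.squared
print(r_squared)
```

For this data set the correlation comes out at about 0.96, which supports fitting a straight line.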
So far we have trained our model using the training data set, which means our machine has learnt the relationship. The next step is to predict. Say we would like to predict the salary of an employee with 17 years of experience. In this case we use the predict() function, as shown below.
#Training data for predictor variable
x <- c(3,8,9,12,3,6,11,21,1,16)
#Training data for response variable
y <- c(30,57,64,72,36,43,59,90,12,83)
# Using lm() function to establish the Relationship between predictor and response variable.
relation <- lm(y~x)
# New data: an employee having 17 years of experience(x), whose salary we want to predict
SalarY <- data.frame(x=17)
# Display the predicted value
print( predict(relation,SalarY) )
So, we find that the predicted salary of an employee with 17 years of experience is around 84.5K. And that's machine learning.
We can cross-verify our result by putting X = 17 into the earlier mathematical equation for the simple linear regression model
Y = 20.927 + X * 3.741
When X = 17, then
Y = 20.927 + 17 * 3.741 = 84.524
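predict() also accepts several new values at once, and its results can be reproduced from the unrounded coefficients. The sketch below assumes the same training data; new_employees, predicted and manual are hypothetical names, and the experience values 5 and 10 are arbitrary illustrations.

```r
# Training data (same values as in the example above)
x <- c(3,8,9,12,3,6,11,21,1,16)
y <- c(30,57,64,72,36,43,59,90,12,83)
relation <- lm(y ~ x)

# New data: the column name must match the predictor name used in lm()
new_employees <- data.frame(x = c(5, 10, 17))
predicted <- predict(relation, new_employees)
print(predicted)

# Same predictions computed by hand from the unrounded coefficients: Y = a + X * b
a <- coef(relation)[["(Intercept)"]]
b <- coef(relation)[["x"]]
manual <- a + new_employees$x * b
print(manual)
```

For X = 17 both approaches give about 84.53. The small gap from 84.524 comes from rounding the coefficients to three decimal places in the equation.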
Conclusion
In this article we have learnt machine learning through predictive analysis using simple linear regression in the R language, with a simple example. At a bare minimum, the article taught us:
- What is Machine Learning
- What is Predictive Analysis
- How to do Machine Learning through Predictive Analysis
- How to perform machine learning through simple Linear Regression - a Supervised Modeling technique.
- How to use R language for performing Machine Learning through Predictive Analysis via RStudio.
- Data visualization (Scatter Plot) using R language via RStudio.
- Some R functions, such as lm(), predict() and plot().
Hope this helps. Thanks for reading.