Salary Prediction with ML - Linear Regression
Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.
How Does Machine Learning Work?
Similar to how the human brain gains knowledge and understanding, machine learning relies on input, such as training data or knowledge graphs, to understand entities, domains and the connections between them. With entities defined, deep learning can begin.
Data is the Key: The algorithms that drive machine learning are critical to success. ML algorithms build a mathematical model based on sample data, known as “training data,” to make predictions or decisions without being explicitly programmed to do so. This can reveal trends within data that businesses can use to improve decision making, optimize efficiency and capture actionable data at scale.
AI is the Goal: Machine Learning provides the foundation for AI systems that automate processes and solve data-based business problems autonomously. It enables companies to replace or augment certain human capabilities. Common machine learning applications you may find in the real world include chatbots, self-driving cars and speech recognition.
Machine Learning Methods
Machine learning classifiers fall into three primary categories:
Supervised learning, also known as supervised machine learning, is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately. As input data is fed into the model, it adjusts its weights until the model has been fitted appropriately. This occurs as part of the cross-validation process to ensure that the model avoids overfitting or underfitting. Supervised learning helps organizations solve a variety of real-world problems at scale, such as sorting spam into a separate folder from your inbox. Some methods used in supervised learning include neural networks, naïve Bayes, linear regression, logistic regression, random forest, support vector machine (SVM), and more.
Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention. Its ability to discover similarities and differences in information makes it well suited to exploratory data analysis, cross-selling strategies, customer segmentation, and image and pattern recognition. It’s also used to reduce the number of features in a model through the process of dimensionality reduction; principal component analysis (PCA) and singular value decomposition (SVD) are two common approaches for this. Other algorithms used in unsupervised learning include neural networks, k-means clustering, probabilistic clustering methods, and more.
Semi-supervised learning offers a happy medium between supervised and unsupervised learning. During training, it uses a smaller labeled data set to guide classification and feature extraction from a larger, unlabeled data set. Semi-supervised learning can solve the problem of not having enough labeled data (or not being able to afford to label enough data) to train a supervised learning algorithm.
Linear Regression
Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression is to examine two things:
Does a set of predictor variables do a good job in predicting an outcome (dependent) variable?
Which variables in particular are significant predictors of the outcome variable, and in what way do they, as indicated by the magnitude and sign of the beta estimates, impact the outcome variable? These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables.
The simplest form of the regression equation, with one dependent and one independent variable, is defined by the formula:

$$y = c + b \cdot x$$

Where:
- y : estimated dependent variable score
- c : constant
- b : regression coefficient
- x : score on the independent variable
There are many names for a regression’s dependent variable. It may be called an outcome variable, criterion variable, endogenous variable, or regressand. The independent variables can be called exogenous variables, predictor variables, or regressors.
Three major uses for regression analysis are determining the strength of predictors, forecasting an effect, and trend forecasting.
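
As a minimal sketch of the one-variable formula y = c + b*x, the snippet below estimates c and b by ordinary least squares and then uses them to estimate a new value. The four data points are made up for illustration (they are not from the salary dataset), and the names x, y and predict are illustrative only.

```python
import numpy as np

# Toy data, invented for this sketch: y is exactly 1 + 2*x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Ordinary least squares for a single predictor:
#   b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)**2)
#   c = mean_y - b * mean_x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - b * x.mean()


def predict(x_new):
    """Estimate y for a new x using the fitted line y = c + b*x."""
    return c + b * x_new


print("b =", b, "c =", c, "estimate at x=4:", predict(4.0))
```

Because the toy points lie exactly on a line, the fit recovers that line; real data such as salaries will scatter around it.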
Salary Prediction Model
First of all, we need a dataset. It can be an Excel file, a CSV file, or another tabular format. You can use my example dataset here.
Import the pandas library for building the data frames.
Then load the dataset, like below.
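
A minimal sketch of these two steps follows. The original file name is not given, so salary_dataset.csv in the comment is hypothetical; to stay runnable, the sketch reads a few rows copied from the table that follows from an in-memory string instead. It also assumes the salary column is stored as text such as Rp 2,500,000.00 and converts it to a number, a step that may not be needed with the original file.

```python
import io

import pandas as pd

# With a real file this would be something like:
#   df = pd.read_csv("salary_dataset.csv")   # hypothetical file name
# Here rows 0, 4, 8, 19 and 20 of the table are embedded as CSV text.
csv_text = """knowledge,technical,logical,year_experience,salary
50,60,50,0,"Rp 2,500,000.00"
70,70,70,1.1,"Rp 4,000,000.00"
65,,70,1,"Rp 3,400,000.00"
90,90,90,3.7,"Rp 10,000,000.00"
,,,3,"Rp 8,000,000.00"
"""
df = pd.read_csv(io.StringIO(csv_text))

# Assumption: salary arrives as text, so keep only digits and the
# decimal point, then convert to float.
df["salary"] = (
    df["salary"].str.replace(r"[^\d.]", "", regex=True).astype(float)
)

print(df)
```

Empty fields in the CSV text are read as NaN, which is why some cells in the table show NaN.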
index | knowledge | technical | logical | year_experience | salary |
---|---|---|---|---|---|
0 | 50 | 60 | 50 | 0 | Rp 2,500,000.00 |
1 | 60 | 50 | 50 | 0 | Rp 2,500,000.00 |
2 | 50 | 70 | 70 | 0 | Rp 3,000,000.00 |
3 | 40 | 50 | 60 | 0 | Rp 2,800,000.00 |
4 | 70 | 70 | 70 | 1.1 | Rp 4,000,000.00 |
5 | 75 | 70 | 65 | 1.2 | Rp 4,000,000.00 |
6 | 65 | 65 | 60 | 1.1 | Rp 3,800,000.00 |
7 | 70 | 70 | 70 | 1.5 | Rp 4,500,000.00 |
8 | 65 | NaN | 70 | 1 | Rp 3,400,000.00 |
9 | 70 | 80 | 80 | 2 | Rp 6,000,000.00 |
10 | 75 | 75 | 85 | 1.8 | Rp 6,000,000.00 |
11 | 80 | 80 | 80 | 2 | Rp 7,000,000.00 |
12 | 80 | 80 | 80 | 2.2 | Rp 7,500,000.00 |
13 | 75 | 70 | 80 | 2.9 | Rp 7,800,000.00 |
14 | 80 | 85 | 80 | 3 | Rp 8,400,000.00 |
15 | 75 | 80 | 75 | 2.4 | Rp 7,500,000.00 |
16 | 85 | 80 | 90 | 3.2 | Rp 8,200,000.00 |
17 | 85 | 80 | 85 | 3.2 | Rp 8,000,000.00 |
18 | 85 | 90 | 90 | 2.7 | Rp 8,000,000.00 |
19 | 90 | 90 | 90 | 3.7 | Rp 10,000,000.00 |
20 | NaN | NaN | NaN | 3 | Rp 8,000,000.00 |
Cleaning Dataset
Remove rows that contain null (NaN) values from the data frame using dropna().
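
A small sketch of what dropna() does, using rows 7, 8, 9 and 20 from the table above with salary written as plain numbers. The variable names df and clean are illustrative.

```python
import numpy as np
import pandas as pd

# Rows 7, 8, 9 and 20 from the table; two of them contain NaN.
df = pd.DataFrame({
    "knowledge": [70, 65, 70, np.nan],
    "technical": [70, np.nan, 80, np.nan],
    "logical": [70, 70, 80, np.nan],
    "year_experience": [1.5, 1.0, 2.0, 3.0],
    "salary": [4_500_000, 3_400_000, 6_000_000, 8_000_000],
})

# By default dropna() removes every row that has at least one NaN.
clean = df.dropna()

print(clean)
```

Note that dropna() returns a new data frame and keeps the original index labels of the surviving rows; the result must be assigned to a variable to take effect.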
index | knowledge | technical | logical | year_experience | salary |
---|---|---|---|---|---|
0 | 50 | 60 | 50 | 0.0 | Rp 2,500,000.00 |
1 | 60 | 50 | 50 | 0.0 | Rp 2,500,000.00 |
2 | 50 | 70 | 70 | 0.0 | Rp 3,000,000.00 |
3 | 40 | 50 | 60 | 0.0 | Rp 2,800,000.00 |
4 | 70 | 70 | 70 | 1.1 | Rp 4,000,000.00 |
Building Model
Import train_test_split from scikit-learn to split arrays or matrices into random train and test subsets. It is a quick utility that wraps input validation, next(ShuffleSplit().split(X, y)), and application to the input data into a single call, so the data can be split (and optionally subsampled) in one line. The feature columns, which are everything except salary, look like this:
index | knowledge | technical | logical | year_experience |
---|---|---|---|---|
0 | 50 | 60 | 50 | 0.0 |
1 | 60 | 50 | 50 | 0.0 |
2 | 50 | 70 | 70 | 0.0 |
3 | 40 | 50 | 60 | 0.0 |
4 | 70 | 70 | 70 | 1.1 |
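
The split can be sketched as follows, using the 19 complete rows from the dataset with salary written as plain numbers. The test_size and random_state values are assumptions, since the original arguments are not shown. With 19 rows, a 20% test split leaves 4 test rows, which is consistent with the four predictions shown later.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The 19 rows of the dataset that have no missing values.
df = pd.DataFrame({
    "knowledge": [50, 60, 50, 40, 70, 75, 65, 70, 70, 75,
                  80, 80, 75, 80, 75, 85, 85, 85, 90],
    "technical": [60, 50, 70, 50, 70, 70, 65, 70, 80, 75,
                  80, 80, 70, 85, 80, 80, 80, 90, 90],
    "logical": [50, 50, 70, 60, 70, 65, 60, 70, 80, 85,
                80, 80, 80, 80, 75, 90, 85, 90, 90],
    "year_experience": [0, 0, 0, 0, 1.1, 1.2, 1.1, 1.5, 2, 1.8,
                        2, 2.2, 2.9, 3, 2.4, 3.2, 3.2, 2.7, 3.7],
    "salary": [2.5e6, 2.5e6, 3.0e6, 2.8e6, 4.0e6, 4.0e6, 3.8e6, 4.5e6,
               6.0e6, 6.0e6, 7.0e6, 7.5e6, 7.8e6, 8.4e6, 7.5e6, 8.2e6,
               8.0e6, 8.0e6, 10.0e6],
})

x = df.drop(columns="salary")  # predictor columns
y = df["salary"]               # target column

# Assumed arguments: hold out 20% of the rows, with a fixed seed so the
# split is reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)
```

A different random_state gives a different split, so the numbers that follow in this article will not be reproduced exactly.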
Prediction
Import LinearRegression from scikit-learn, create a LinearRegression object, and fit it to the training data.
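
A sketch of this step, again on the 19 complete rows and with an assumed split (test_size=0.2, random_state=42). The variable name model is illustrative. After fitting, intercept_ plays the role of the constant c and coef_ holds one regression coefficient b per predictor.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "knowledge": [50, 60, 50, 40, 70, 75, 65, 70, 70, 75,
                  80, 80, 75, 80, 75, 85, 85, 85, 90],
    "technical": [60, 50, 70, 50, 70, 70, 65, 70, 80, 75,
                  80, 80, 70, 85, 80, 80, 80, 90, 90],
    "logical": [50, 50, 70, 60, 70, 65, 60, 70, 80, 85,
                80, 80, 80, 80, 75, 90, 85, 90, 90],
    "year_experience": [0, 0, 0, 0, 1.1, 1.2, 1.1, 1.5, 2, 1.8,
                        2, 2.2, 2.9, 3, 2.4, 3.2, 3.2, 2.7, 3.7],
    "salary": [2.5e6, 2.5e6, 3.0e6, 2.8e6, 4.0e6, 4.0e6, 3.8e6, 4.5e6,
               6.0e6, 6.0e6, 7.0e6, 7.5e6, 7.8e6, 8.4e6, 7.5e6, 8.2e6,
               8.0e6, 8.0e6, 10.0e6],
})
x, y = df.drop(columns="salary"), df["salary"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42  # assumed values
)

# Create the estimator and fit it on the training rows only.
model = LinearRegression()
model.fit(x_train, y_train)

# One coefficient per predictor column, plus a single intercept.
print("intercept:", model.intercept_)
print("coefficients:", dict(zip(x.columns, model.coef_)))
```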
Predict on the test data x_test by calling predict() and store the result in the variable y_pred. The result is the predicted salary for each row of the test data.
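
A sketch of the prediction step under the same assumptions as before (19 complete rows, test_size=0.2, random_state=42). Because the split is an assumption, the predicted values will differ from the array shown next.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "knowledge": [50, 60, 50, 40, 70, 75, 65, 70, 70, 75,
                  80, 80, 75, 80, 75, 85, 85, 85, 90],
    "technical": [60, 50, 70, 50, 70, 70, 65, 70, 80, 75,
                  80, 80, 70, 85, 80, 80, 80, 90, 90],
    "logical": [50, 50, 70, 60, 70, 65, 60, 70, 80, 85,
                80, 80, 80, 80, 75, 90, 85, 90, 90],
    "year_experience": [0, 0, 0, 0, 1.1, 1.2, 1.1, 1.5, 2, 1.8,
                        2, 2.2, 2.9, 3, 2.4, 3.2, 3.2, 2.7, 3.7],
    "salary": [2.5e6, 2.5e6, 3.0e6, 2.8e6, 4.0e6, 4.0e6, 3.8e6, 4.5e6,
               6.0e6, 6.0e6, 7.0e6, 7.5e6, 7.8e6, 8.4e6, 7.5e6, 8.2e6,
               8.0e6, 8.0e6, 10.0e6],
})
x, y = df.drop(columns="salary"), df["salary"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42  # assumed values
)
model = LinearRegression().fit(x_train, y_train)

# predict() returns one estimated salary per row of x_test.
y_pred = model.predict(x_test)

# Put predictions next to the true salaries for a quick comparison.
print(pd.DataFrame({"actual": y_test.to_numpy(), "predicted": y_pred}))
```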
array([2122535.96880463, 3980638.07697809, 6537626.01871658, 1078550.87649938])
Accuracy
Machine learning model accuracy is the measurement used to determine which model is best at identifying relationships and patterns between variables in a dataset based on the input, or training, data. The better a model can generalize to ‘unseen’ data, the better the predictions and insights it can produce, which in turn deliver more business value.
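
For a regression model, scikit-learn's score() method returns the coefficient of determination (R²) on the data passed to it. The sketch below assumes that is how the score was computed, and uses the same assumed split as before, so its value will differ from the one shown next.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "knowledge": [50, 60, 50, 40, 70, 75, 65, 70, 70, 75,
                  80, 80, 75, 80, 75, 85, 85, 85, 90],
    "technical": [60, 50, 70, 50, 70, 70, 65, 70, 80, 75,
                  80, 80, 70, 85, 80, 80, 80, 90, 90],
    "logical": [50, 50, 70, 60, 70, 65, 60, 70, 80, 85,
                80, 80, 80, 80, 75, 90, 85, 90, 90],
    "year_experience": [0, 0, 0, 0, 1.1, 1.2, 1.1, 1.5, 2, 1.8,
                        2, 2.2, 2.9, 3, 2.4, 3.2, 3.2, 2.7, 3.7],
    "salary": [2.5e6, 2.5e6, 3.0e6, 2.8e6, 4.0e6, 4.0e6, 3.8e6, 4.5e6,
               6.0e6, 6.0e6, 7.0e6, 7.5e6, 7.8e6, 8.4e6, 7.5e6, 8.2e6,
               8.0e6, 8.0e6, 10.0e6],
})
x, y = df.drop(columns="salary"), df["salary"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42  # assumed values
)
model = LinearRegression().fit(x_train, y_train)

# R² on the held-out rows: 1.0 is a perfect fit, 0.0 is no better than
# always predicting the mean salary.
score = model.score(x_test, y_test)

# The same number computed explicitly from the predictions.
score_check = r2_score(y_test, model.predict(x_test))

print(score, score_check)
```

With only four test rows, this score is sensitive to which rows land in the test set, so it is best read as a rough indication.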
0.8148593096952005
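Our linear regression model scores about 0.815 on the test data, or roughly 81.5%. For a regression model, this kind of score is typically the coefficient of determination (R²) rather than classification accuracy.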
Implementation
Now apply the linear regression model to input from a user, consisting of knowledge, technical, logical and years of experience values. Assume you are a fresh graduate with a knowledge score of 50, a technical score of 50 and a logical score of 60. In this case we use a dictionary and convert it into a DataFrame like below.
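
A sketch of the conversion, with the fresh-graduate scores from the paragraph above and zero years of experience. The names candidate and new_data are illustrative.

```python
import pandas as pd

# One value per feature, each wrapped in a list so that the dictionary
# describes a single-row table.
candidate = {
    "knowledge": [50],
    "technical": [50],
    "logical": [60],
    "year_experience": [0.0],
}

# Keys become column names; they should match the training columns.
new_data = pd.DataFrame(candidate)

print(new_data)
```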
index | knowledge | technical | logical | year_experience |
---|---|---|---|---|
0 | 50 | 50 | 60 | 0.0 |
Then predict using the LinearRegression model like below.
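
An end-to-end sketch of this prediction, under the same assumptions as before (19 complete rows, test_size=0.2, random_state=42). The helper format_idr is hypothetical and only mirrors the output format shown next; because the split is assumed, the predicted amount will not match the article's figure exactly.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "knowledge": [50, 60, 50, 40, 70, 75, 65, 70, 70, 75,
                  80, 80, 75, 80, 75, 85, 85, 85, 90],
    "technical": [60, 50, 70, 50, 70, 70, 65, 70, 80, 75,
                  80, 80, 70, 85, 80, 80, 80, 90, 90],
    "logical": [50, 50, 70, 60, 70, 65, 60, 70, 80, 85,
                80, 80, 80, 80, 75, 90, 85, 90, 90],
    "year_experience": [0, 0, 0, 0, 1.1, 1.2, 1.1, 1.5, 2, 1.8,
                        2, 2.2, 2.9, 3, 2.4, 3.2, 3.2, 2.7, 3.7],
    "salary": [2.5e6, 2.5e6, 3.0e6, 2.8e6, 4.0e6, 4.0e6, 3.8e6, 4.5e6,
               6.0e6, 6.0e6, 7.0e6, 7.5e6, 7.8e6, 8.4e6, 7.5e6, 8.2e6,
               8.0e6, 8.0e6, 10.0e6],
})
x, y = df.drop(columns="salary"), df["salary"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42  # assumed values
)
model = LinearRegression().fit(x_train, y_train)


def format_idr(value):
    """Format a number as rupiah with thousands separators."""
    return f"IDR {value:,.2f}"


# Same column names and order as the training features.
new_data = pd.DataFrame({
    "knowledge": [50],
    "technical": [50],
    "logical": [60],
    "year_experience": [0.0],
})

# predict() returns an array with one value per input row.
predicted = float(model.predict(new_data)[0])

print(format_idr(predicted))
```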
IDR 1,864,514.00
The predicted salary for this case is IDR 1,864,514.00.
Conclusion
Linear regression helps us predict a dependent variable, in this case salary, from a set of predictor variables. It can estimate the response variable for people whose predictor values fall within the range covered by the training data. You can download my Jupyter notebook Predict Salary - Linear Regression.ipynb.