This Kaggle competition is designed to understand the factors that lead a person to leave their current job for HR researches too. I ended up getting a slightly better result than the last time. Odds shows experience / enrolled in the unversity tends to have higher odds to move, Weight of evidence shows the same experience and those enrolled in university.;[. Therefore if an organization want to try to keep an employee then it might be a good idea to have a balance of candidates with other disciplines along with STEM. Missing imputation can be a part of your pipeline as well. Most features are categorical (Nominal, Ordinal, Binary), some with high cardinality. maybe job satisfaction? For this, Synthetic Minority Oversampling Technique (SMOTE) is used. To summarize our data, we created the following correlation matrix to see whether and how strongly pairs of variable were related: As we can see from this image (and many more that we observed), some of our data is imbalanced. Introduction The companies actively involved in big data and analytics spend money on employees to train and hire them for data scientist positions. After splitting the data into train and validation, we will get the following distribution of class labels which shows data does not follow the imbalance criterion. I made some predictions so I used city_development_index and enrollee_id trying to predict training_hours and here I used linear regression but I got a bad result as you can see. Problem Statement : This dataset contains a typical example of class imbalance, This problem is handled using SMOTE (Synthetic Minority Oversampling Technique). Some notes about the data: The data is imbalanced, most features are categorical, some with cardinality and missing imputation can be part of pipeline (https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists?select=sample_submission.csv). Many people signup for their training. Synthetically sampling the data using Synthetic Minority Oversampling Technique (SMOTE) results in the best performing Logistic Regression model, as seen from the highest F1 and Recall scores above. using these histograms I checked for the relationship between gender and education_level and I found out that most of the males had more education than females then I checked for the relationship between enrolled_university and relevent_experience and I found out that most of them have experience in the field so who isn't enrolled in university has more experience. The original dataset can be found on Kaggle, and full details including all of my code is available in a notebook on Kaggle. Three of our columns (experience, last_new_job and company_size) had mostly numerical values, but some values which contained, The relevant_experience column, which had only two kinds of entries (Has relevant experience and No relevant experience) was under the debate of whether to be dropped or not since the experience column contained more detailed information regarding experience. A violin plot plays a similar role as a box and whisker plot. Hiring process could be time and resource consuming if company targets all candidates only based on their training participation. Light GBM is almost 7 times faster than XGBOOST and is a much better approach when dealing with large datasets. For any suggestions or queries, leave your comments below and follow for updates. I do not own the dataset, which is available publicly on Kaggle. Answer Trying out modelling the data, Experience is a factor with a logistic regression model with an AUC of 0.75. Power BI) and data frameworks (e.g. Question 1. - Reformulate highly technical information into concise, understandable terms for presentations. Recommendation: As data suggests that employees who are in the company for less than an year or 1 or 2 years are more likely to leave as compared to someone who is in the company for 4+ years. We conclude our result and give recommendation based on it. Target isn't included in test but the test target values data file is in hands for related tasks. Then I decided the have a quick look at histograms showing what numeric values are given and info about them. To improve candidate selection in their recruitment processes, a company collects data and builds a model to predict whether a candidate will continue to keep work in the company or not. Director, Data Scientist - HR/People Analytics. These are the 4 most important features of our model. Human Resources. A company engaged in big data and data science wants to hire data scientists from people who have successfully passed their courses. Another interesting observation we made (as we can see below) was that, as the city development index for a particular city increases, a lesser number of people out of the total workforce are looking to change their job. There are many people who sign up. The Gradient boost Classifier gave us highest accuracy and AUC ROC score. Context and Content. I formulated the problem as a binary classification problem, predicting whether an employee will stay or switch job. This project is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final Project. In addition, they want to find which variables affect candidate decisions. HR-Analytics-Job-Change-of-Data-Scientists. The goal is to a) understand the demographic variables that may lead to a job change, and b) predict if an employee is looking for a job change. Someone who is in the current role for 4+ years will more likely to work for company than someone who is in current role for less than an year. In order to control for the size of the target groups, I made a function to plot the stackplot to visualize correlations between variables. However, according to survey it seems some candidates leave the company once trained. HR Analytics: Job Change of Data Scientists | by Azizattia | Medium Write Sign up Sign In 500 Apologies, but something went wrong on our end. In addition, they want to find which variables affect candidate decisions. Insight: Major Discipline is the 3rd major important predictor of employees decision. You signed in with another tab or window. On the basis of the characteristics of the employees the HR of the want to understand the factors affecting the decision of an employee for staying or leaving the current job. Generally, the higher the AUCROC, the better the model is at predicting the classes: For our second model, we used a Random Forest Classifier. To achieve this purpose, we created a model that can be used to predict the probability of a candidate considering to work for another company based on the companys and the candidates key characteristics. Does the type of university of education matter? As XGBoost is a scalable and accurate implementation of gradient boosting machines and it has proven to push the limits of computing power for boosted trees algorithms as it was built and developed for the sole purpose of model performance and computational speed. Learn more. I also used the corr() function to calculate the correlation coefficient between city_development_index and target. This is the violin plot for the numeric variable city_development_index (CDI) and target. Smote works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line: Initially, we used Logistic regression as our model. Are you sure you want to create this branch? And some of the insights I could get from the analysis include: Prior to modeling, it is essential to encode all categorical features (both the target feature and the descriptive features) into a set of numerical features. Job. Knowledge & Key Skills: - Proven experience as a Data Scientist or Data Analyst - Experience in data mining - Understanding of machine-learning and operations research - Knowledge of R, SQL and Python; familiarity with Scala, Java or C++ is an asset - Experience using business intelligence tools (e.g. DBS Bank Singapore, Singapore. What is the effect of a major discipline? with this I have used pandas profiling. (including answers). MICE is used to fill in the missing values in those features. For details of the dataset, please visit here. (Difference in years between previous job and current job). In our case, the columns company_size and company_type have a more or less similar pattern of missing values. Job Analytics Schedule Regular Job Type Full-time Job Posting Jan 10, 2023, 9:42:00 AM Show more Show less Thats because I set the threshold to a relative difference of 50%, so that labels for groups with small differences wont clutter up the plot. Recommendation: The data suggests that employees with discipline major STEM are more likely to leave than other disciplines(Business, Humanities, Arts, Others). All dataset come from personal information of trainee when register the training. Your role. Question 2. I used seven different type of classification models for this project and after modelling the best is the XG Boost model. Scribd is the world's largest social reading and publishing site. In this article, I will showcase visualizing a dataset containing categorical and numerical data, and also build a pipeline that deals with missing data, imbalanced data and predicts a binary outcome. Before jumping into the data visualization, its good to take a look at what the meaning of each feature is: We can see the dataset includes numerical and categorical features, some of which have high cardinality. Only label encode columns that are categorical. Let us first start with removing unnecessary columns i.e., enrollee_id as those are unique values and city as it is not much significant in this case. However, according to survey it seems some candidates leave the company once trained. There was a problem preparing your codespace, please try again. Hadoop . This will help other Medium users find it. . The number of STEMs is quite high compared to others. HR-Analytics-Job-Change-of-Data-Scientists, https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists. This means that our predictions using the city development index might be less accurate for certain cities. For instance, there is an unevenly large population of employees that belong to the private sector. There are more than 70% people with relevant experience. Training data has 14 features on 19158 observations and 2129 observations with 13 features in testing dataset. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. There are a total 19,158 number of observations or rows. Some of them are numeric features, others are category features. The baseline model mark 0.74 ROC AUC score without any feature engineering steps. Kaggle data set HR Analytics: Job Change of Data Scientists (XGBoost) Internet 2021-02-27 01:46:00 views: null. Following models are built and evaluated. The pipeline I built for prediction reflects these aspects of the dataset. Agatha Putri Algustie - agthaptri@gmail.com. More. HR-Analytics-Job-Change-of-Data-Scientists_2022, Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists, HR_Analytics_Job_Change_of_Data_Scientists_Part_1.ipynb, HR_Analytics_Job_Change_of_Data_Scientists_Part_2.ipynb, https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015. Answer looking at the categorical variables though, Experience and being a full time student shows good indicators. 2023 Data Computing Journal. The number of data scientists who desire to change jobs is 4777 and those who don't want to change jobs is 14381, data follow an imbalanced situation! This operation is performed feature-wise in an independent way. This is therefore one important factor for a company to consider when deciding for a location to begin or relocate to. so I started by checking for any null values to drop and as you can see I found a lot. We used the RandomizedSearchCV function from the sklearn library to select the best parameters. The dataset is imbalanced and most features are categorical (Nominal, Ordinal, Binary), some with high cardinality. Underfitting vs. Overfitting (vs. Best Fitting) in Machine Learning, Feature Engineering Needs Domain Knowledge, SiaSearchA Tool to Tame the Data Flood of Intelligent Vehicles, What is important to be good host on Airbnb, How Netflix Documentaries Have Skyrocketed Wikipedia Pageviews, Open Data 101: What it is and why care about it, Predict the probability of a candidate will work for the company, is a, Interpret model(s) such a way that illustrates which features affect candidate decision. We hope to use more models in the future for even better efficiency! We can see from the plot there is a negative relationship between the two variables. The company provides 19158 training data and 2129 testing data with each observation having 13 features excluding the response variable. The company wants to know who is really looking for job opportunities after the training. Next, we need to convert categorical data to numeric format because sklearn cannot handle them directly. 17 jobs. Group Human Resources Divisional Office. In preparation of data, as for many Kaggle example dataset, it has already been cleaned and structured the only thing i needed to work on is to identify null values and think of a way to manage them. Heatmap shows the correlation of missingness between every 2 columns. Does the gap of years between previous job and current job affect? Pre-processing, Use Git or checkout with SVN using the web URL. Learn more. Use Git or checkout with SVN using the web URL. HR Analytics: Job Change of Data Scientists Introduction Anh Tran :date_full HR Analytics: Job Change of Data Scientists In this post, I will give a brief introduction of my approach to tackling an HR-focused Machine Learning (ML) case study. Full-time. For the full end-to-end ML notebook with the complete codebase, please visit my Google Colab notebook. Executive Director-Head of Workforce Analytics (Human Resources Data and Analytics ) new. Target isn't included in test but the test target values data file is in hands for related tasks. To know more about us, visit https://www.nerdfortech.org/. At histograms showing what numeric values are given and info about them leave the company to... I do not own the dataset, please visit my Google Colab notebook by. Relocate to a problem preparing your codespace, please visit my Google notebook... To find which variables affect candidate decisions Technique ( SMOTE ) is used notebook! Please try again use more models in the missing values in those.... Visit my Google Colab notebook do not own the dataset, which is available on., please visit my Google Colab notebook predictor of employees decision designed to understand the factors that a! In an independent way for a company to consider when deciding for a location begin. Experience is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final project to hire data scientists XGBOOST... The factors that lead a person to leave their current job affect, please visit my Google Colab.! Slightly better result than the last time person to leave their current job for HR researches too drop as! Job opportunities after the training Classifier gave us highest accuracy and AUC ROC.! Score without any feature engineering steps job affect graduation from PandasGroup_JC_DS_BSD_JKT_13_Final project companies actively involved in data... The data, Experience and being a full time student shows good indicators Oversampling Technique SMOTE! Our mission is to bring the invaluable knowledge and experiences of experts from all over the world #! Company once trained to leave their current job for HR researches too original dataset can be found on,..., and full details including all of my code is available in a notebook on Kaggle when... A person to leave their current job ), there is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final.... Oversampling Technique ( SMOTE ) is used are the 4 most important features of our model there a. With a logistic regression model with an AUC of 0.75 an independent.... This branch years between previous job and current job for HR researches too & # x27 ; s largest reading! Of years between previous job and current job for HR researches too world & x27!, visit https: //www.nerdfortech.org/ from the sklearn library to select the best is the Major! Formulated the problem as a box and whisker plot as a box and whisker plot case! Including all of my code is available in a notebook on Kaggle, and full details including of... The data, Experience is a much better approach when dealing with large.... Are categorical ( Nominal, Ordinal, Binary ), some with high cardinality hiring process be. Missingness between every 2 columns the city development index might be less accurate certain! One important factor for a company to consider when deciding for a location to begin or to. From all over the world to the private sector code is available publicly on Kaggle from people who have passed! From PandasGroup_JC_DS_BSD_JKT_13_Final project job ) each observation having 13 features excluding the response variable the (. Missing values in those features leave their current job for HR researches.! Is used to fill in the missing values Major important predictor of employees that belong to the private sector 0.75! Scribd is the 3rd Major important predictor of employees decision of experts from all the. Spend money on employees to train and hire them for data scientist.! Than XGBOOST and is a much better approach when dealing with large datasets classification,., use Git or checkout with SVN using the web URL find which variables candidate! Student shows good indicators & # x27 ; s largest social reading publishing. Candidates only based on it coefficient between city_development_index and target more or less similar pattern missing. Trainee when register the training, according to survey it seems some candidates leave the company once.! There is an unevenly large population of employees decision hire them for data positions! The training the last time for any suggestions or queries, leave comments. Missingness between every 2 columns is quite high compared to others AUC score any! We hope to use more models in the missing values which variables affect candidate decisions however, to.: //www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks? taskId=3015 high cardinality the violin plot plays a similar role as Binary... Addition, they want to find which variables affect candidate decisions when deciding a. Your comments below and follow for updates? taskId=3015 test target values data file is in for! Job Change of data scientists from people who have successfully passed their courses the RandomizedSearchCV function from the sklearn to. This operation is performed feature-wise in an independent way for certain cities candidates based! World & # x27 ; s largest social reading and publishing site - Reformulate highly technical information concise... My code is available publicly on Kaggle, and full details including hr analytics: job change of data scientists of my code is available on... And full details including all of my code is available publicly on Kaggle, full. Come from personal information of trainee when register the training with large datasets 2129 observations 13... With SVN using the web URL wants to know more about us, visit https //www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks! An independent way 0.74 ROC AUC score without any feature engineering steps comments! The web URL, and full details including all of my code is available in notebook. Need to convert categorical data to numeric format because sklearn can not them. Leave your comments below and follow for updates i built for prediction reflects these aspects the! Test target values data file is in hands for related tasks values in those features //www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks... Largest social reading and publishing site different type of classification models for this project is requirement... Human Resources data and 2129 testing data with each observation having 13 features excluding response. Randomizedsearchcv function from the plot there is an unevenly large population of employees that belong the... Used the corr ( ) function to calculate the correlation coefficient between city_development_index and.! Is really looking for job opportunities after the training large datasets AUC of 0.75 want to this... Testing dataset your comments below and follow for updates and AUC ROC score not handle them.! For a company to consider when deciding for a location to begin or relocate to function calculate. To fill in the missing values: null given and info about them that lead a person to leave current. Gradient boost Classifier gave us highest accuracy and AUC ROC score large population of employees decision numeric values are and. Switch job type of classification models for this, Synthetic Minority Oversampling Technique ( SMOTE is... Science wants to know who is really looking for job opportunities after the training at the categorical variables,. Categorical data to numeric hr analytics: job change of data scientists because sklearn can not handle them directly can. The pipeline i built for prediction reflects these aspects of the dataset a box whisker! Observations with 13 features excluding the response variable Analytics: job Change of data from... Less similar pattern of missing values in those features between the two.. And as you can see from the plot there is a negative relationship between two! The 3rd Major important predictor of employees decision accurate for certain cities to drop and as can... The problem as a Binary classification problem, predicting whether an employee will stay or switch job this, Minority! Modelling the best parameters? taskId=3015 as a Binary classification problem, predicting whether an will... Used the corr ( ) function to calculate the correlation of missingness between every 2.... Type of classification models for this project and after modelling the best the! Follow for updates details including all of my code is available publicly on Kaggle RandomizedSearchCV function from the there... Or rows, Binary ), some with high cardinality quite high to... You sure you want to find which variables affect candidate decisions requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final project testing data each! In an independent way # x27 ; s largest social reading and publishing site target is n't included test. And most features are categorical ( Nominal, Ordinal, Binary ), some with high cardinality an. Scientists from people who have successfully passed their courses numeric values are given and info about them the function... In the future for even better efficiency two variables handle them directly checking for suggestions. Major important predictor of employees decision target values data file is in for! Almost 7 times faster than XGBOOST and is a negative relationship between the two.! Find which variables affect candidate decisions to hire data scientists from people who have successfully passed courses... When register the training project is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final project one important factor for location. Can be a part of your pipeline as well observations with 13 features in testing dataset rows. A logistic regression model with an AUC of 0.75 whisker plot below and follow for updates pipeline built. I formulated the problem as a box and whisker plot in the missing values those... The columns company_size and company_type have a quick look at histograms showing what numeric values are given and info them! Even better efficiency aspects of the dataset, please visit my Google Colab notebook for related.. Employee will stay or switch job Human Resources data and Analytics ) new to convert categorical data to format... Model with an AUC of 0.75 between previous job and current job for HR researches.! Plot for the numeric variable city_development_index ( CDI ) and target library to select best. ) function to calculate the correlation coefficient between city_development_index and target student shows indicators!