
Recruitment Prediction Analysis of Undergraduate Engineering Students of Hyderabad Using Data Mining Techniques


Introduction

Engineering students mainly look for recruitment through their educational institute's campus. Most undergraduate engineering students join industrial sectors, which offer them good salaries. It is safe to say that undergraduate engineering students are paid more than undergraduates from any other educational stream; for their age and level of education, they are paid the most at this stage. Information technology based companies in particular offer the highest salaries even before the undergraduate students receive their degrees. Therefore, it can be said that engineering students at the undergraduate level receive offers across a very wide range.

Recruitment from the educational institute campus is a very important issue in undergraduate engineering study. Engineering courses are quite incomplete without the recruitment process from the institutional campus. The reputation of an educational institute depends on the statistics of its recent campus recruitment drives: the institute with the highest number of students placed in industrial companies is the most renowned engineering institute.

Mainly the IIT students get the highest offers. After IITs, students of NITs and IIITs get the best offers in terms of salary. After that, all national and state level government institutes offer recruitment. Private institutes also provide good recruitment (Henry, 2017).

The recruitment of undergraduate engineering students has an equal impact on the educational institute and on the student. The student needs recruitment for their career and future, and the educational institute needs it for its reputation. Because of the needs at both ends, recruitment of undergraduate engineering students becomes a most important issue from today's point of view. According to the recruitment status of previous years, engineering industries also plan their recruitment drives. Mainly three entities are attached to the recruitment process of undergraduate engineering students: the career of the undergraduate engineering student, the reputation of the institute, and the company's need for employees.

In India, undergraduate engineering students are also seeking jobs. India is a very good market for information technology based industries, which are clustered in a few cities such as Bangalore, Hyderabad, Chennai, Kolkata, and Gurgaon. Engineering institutes in these cities have the highest possibility of recruitment, as industries in those cities hire undergraduate engineering students from the nearest engineering institutes. Hyderabad is such a city, where the number of engineering institutes is quite sufficient and students of these institutes get job offers from these industries. In recent years, institutes of Hyderabad have provided sufficient recruitment offers to their undergraduate engineering students (KUMAR, 2016).

Placement of students is quite possibly the main objective of an educational institution. The reputation and yearly admissions of an institution invariably depend on the placements it provides its students. That is why all institutions work strenuously to strengthen their placement cells so as to improve their institution as a whole. Any help in this specific area will positively affect an institution's ability to place its students, which is beneficial to both the students and the institution. In this study, the goal is to analyse prior years' student data and use it to predict the placement chances of current students. A model is proposed with an algorithm to predict the same. Data relating to the study was gathered from the same organisation for which the placement prediction is done, and suitable data pre-processing techniques were applied. The proposed model is also compared with other conventional classification algorithms, such as decision tree and random forest, with respect to accuracy, precision, and recall. From the results obtained, it is found that the proposed algorithm performs significantly better in comparison with the other algorithms mentioned.

In an era of a continuously changing competitive world, everyone tries to reach the top position and wishes to have a good job. To achieve this, every young person needs a good earning source and a secure life. Engineering is a good profession, and to enter it each student needs to score high and spend considerable money to get admission to a reputed, well-equipped institution; even students with low scores spend more money to get admission to a top establishment. To address this kind of issue, this study will be extremely helpful. The study focuses particularly on the main attributes a student must have to get recruited. Data is collected from final year Bachelor of Technology students. After data collection, the ID3 decision tree data mining algorithm is applied to the students' dataset on nine attributes: academic grade, practical knowledge, skill certificates, project accomplishment, subject knowledge in written test and interview, fear in written test and interview, communication aptitude in a group, confidence in the group, and practical interest, for recruitment prediction (Pessach, 2020). After performing the decision tree computation on the students' dataset, this study finds that the information gain for the 'skilled certificate' attribute is the highest among all the attributes, making it the root node of the decision tree.

Current Recruitment Status of Hyderabad

In 2020, the placement percentage of IIIT Hyderabad for undergraduate engineering reached the mark of 100%. In that year, the highest offer for an IIIT Hyderabad undergraduate engineering student was around Rs. 85 LPA (lakhs per annum). The top recruiters of IIIT Hyderabad include Samsung, Adobe and Walmart. The highest CTC rose by 85% during IIIT Hyderabad placements 2020 as compared to 2019. The placement percentage during IIIT Hyderabad placements 2020 and 2019 for the Bachelor of Technology course stood unchanged at 100%. The number of students placed during IIIT Hyderabad placements 2020 rose by 35 as compared to 2019.

Importance of the study

This study is quite important to three entities: undergraduate engineering students, the management of engineering institutes, and engineering industries. The students get information about the recruitment status of engineering. The management of engineering institutes maintains the reputation of the institutes. Engineering industries understand the market status of Hyderabad. Data mining is the process of analysing data and extracting conclusions from it. The current study helps all the related entities. It helps undergraduate engineering students to know the actual situation of recent recruitment of engineering students and to identify the companies with the highest recruitment. The management of engineering institutes gets the status of recruiter companies and the previous recruitment history of each company; the study also helps the management to be aware of fake engineering companies. From the point of view of the engineering industries, the current study helps them to identify the good engineering institutes from which they can recruit undergraduate engineering students as per their requirements (Agarwal, 2016).

This article studies a topic that has a direct impact on students, engineering institutes, and engineering companies. The study uses data mining techniques to extract conclusions from a recruitment dataset of Hyderabad, applying the ID3 data mining algorithm for the analysis. According to the analysis, the data mining based system can predict the recruitment status of undergraduate engineering students in Hyderabad.

Literature Review

We're living in an age of an ever-shifting competitive environment, in which everyone aims for the top position and desires a pleasant life. To succeed, every young person desires a respectable career and a protected existence. Engineering is a great career, and all students want to get in and score well on the entrance exams so they may enter a reputable university with state-of-the-art facilities. Even low-score students pay extra to get into a highly selective university. The findings of this study concentrate on the qualities a student must have to get hired. Data is gathered from final year engineering students in Bachelor of Technology programs in Hyderabad. This study concludes that the 'skilled certificate' attribute yields the greatest information gain, making it the topmost attribute and the root node in the decision tree.

From these findings, we learn that to get recruited, every student should have a certificate demonstrating competence. This will be helpful to students and academic institutions that do engineering research. Additionally, this study may assist in reducing unemployment in our nation. This study proposes a thorough investigation of different recruiting prediction techniques and student achievement (He et al. 2021). One of the most essential jobs in machine learning projects is data collection; on the other hand, the quality, completeness, and accuracy of the data chosen have a direct impact on the algorithm. To make this prediction, a variety of input factors are utilized, including the proportion of students enrolled in academics and the number of hours they work each day, as well as their communication and logical skills and coding abilities. They all help determine the student's professional path. To gather the data, a variety of sources were used: a small quantity of data is gathered at random from several institutions, a portion of the data is gathered from the company's organization database, and a portion is gathered from workers in various organizations.

Throughout a student's academic career, performance is one of the most important elements. Performance is a crucial component of learning from elementary school through college.

Continuous monitoring of learning and teaching activities in an educational setting is required to provide students with a quality education; because of the huge quantity of data in educational databases, this may be a challenge. All of these issues can be addressed with the emerging technologies of big data and machine learning. To discover huge hidden value in different complicated datasets on a massive scale, big data requires a new type of combination. Students' performance in the education sector is now being evaluated using a variety of techniques suggested by scholars (Casuat and Festijo, 2019). Data mining has emerged as one of the most widely used methods to forecast student performance in recent years. Because it is so simple to obtain students' log data and do automated data analysis, data mining methods may be utilized in online learning. Educational data mining is a term used in big data and machine learning technologies to describe the use of data mining in the education sector. Education data will be converted into highly valuable information that will have a beneficial effect on the education industry and encourage research methods such as forecasting student performance.

Plentiful evidence of this may be found in earlier studies on educational data mining. Diverse data mining methods are used in the research to assist diverse educational activities. The ability to anticipate a student's performance is one of the most significant and helpful ideas in educational data mining. Predicting the student's performance requires predicting an unknown value, namely the student's grade or outcome. Because student performance prediction is a critical component of educational data mining, it is the focus of this review. As a further step, the survey proceeds to identify the outcomes that will achieve the study's goals as well as answer the research questions that will make the study valid.

One Hot Encoding is a procedure that transforms categorical values in the acquired data into numerical or other ordinal values. To make predictions, these translated values are fed into machine learning algorithms, because a large number of machine learning algorithms require both the input and output variables to be numeric. Since we use SVM, logistic regression, random forest, and decision trees, categorical variables must be converted to numerical values. Following is an example of how One Hot Encoding works (Tarmizi et al. 2019). The encoder assigns values like 1 and 0 if the data contains good and poor values; this kind of encoding is known as integer encoding. A for-loop or vector technique may be used if there are more than two values, in which case the values will be assigned as 0, 1, 2, and 3. To illustrate, the encoder will assign values of [0, 1, 0, 2] to the encoded data.
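As a minimal sketch of the encoding step described above (assuming pandas and scikit-learn are available; the column name "communication" and its values are illustrative, not taken from the study's dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative categorical column (not the study's actual data)
df = pd.DataFrame({"communication": ["good", "poor", "good", "average"]})

# Integer (label) encoding: each category becomes 0, 1, 2, ...
label_enc = LabelEncoder()
df["communication_int"] = label_enc.fit_transform(df["communication"])
print(df)

# One hot encoding: one binary indicator column per category
one_hot = pd.get_dummies(df["communication"], prefix="communication")
print(one_hot)
```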

Algorithms Used for Machine Learning Purposes

According to what was said previously, there are two kinds of machine learning: supervised learning is one of them, and unsupervised learning is the other. To train a computer with supervised learning, we use well-labelled data and supervise the learning process. The computer is then fed fresh data so that the supervised-learning algorithm can analyse the training data and generate the right output from the labelled examples. The two supervised tasks are called Classification and Regression. In unsupervised algorithms, the data are grouped based on their similarity and their patterns, without any labelled training data. Unsupervised learning likewise has two types, namely clustering and association.
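A minimal illustration of the two settings, using scikit-learn on a tiny synthetic dataset (the data values and model choices here are illustrative only, not the study's configuration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Toy feature matrix: [CGPA, number of certificates] (illustrative values)
X = np.array([[9.1, 3], [6.2, 0], [8.5, 2], [5.9, 1], [7.8, 2], [5.5, 0]])

# Supervised learning: labels (1 = placed, 0 = not placed) guide training
y = np.array([1, 0, 1, 0, 1, 0])
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[8.0, 2]]))  # predicted placement label for a new student

# Unsupervised learning: no labels, students grouped purely by similarity
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment for each student
```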

Nowadays, the level of competition is rising. Students at every institution need to compete in the technology sector and achieve their objectives. There is also a strong desire amongst all the students of an institution to pursue careers. This means that all institutions should cooperate with students from the beginning. As a result, every institution should continuously assess how well its students are doing, to decide whether or not the students are on the correct track to achieve their goals and to improve their weak areas in the process. This information should be gathered via a pre-evaluation before the student begins his or her professional path through a campus placement (Bonaccorso, 2017).

As part of the recruitment process, recruiters don't concentrate on a single area of expertise, since there are so many different types of jobs to fill, such as software engineer, technical support, network engineer, and business analyst. An employer will study and evaluate each student in all areas, including their preferences in a specific field, before assigning him or her to the most suitable position. Many third-party web portals analyse the student's performance and assign a position based on it.

AMCAT is a platform that assesses the student on only a few criteria, such as aptitude, linguistic skills and reasoning. To be considered for the position of Data Scientist, however, a student's performance should also be assessed on other factors such as technical questions on Data Science and Machine Learning, as well as Python and Python-related interests (Yadav et al. 2018). For analysis, classification and prediction, conventional algorithms cannot provide the best results due to the large number of input factors, so Data Science is employed for preparing and analysing the data. Several machine learning techniques may be used to classify and predict class labels, including SVM, Logistic Regression, Random Forest, and Decision Tree. Data Science is the study of algorithms and scientific techniques for extracting insights from massive amounts of data; information may be extracted from structured and unstructured data, and it is possible to deal with huge quantities of data, analyse it, and visualize it. Artificial Intelligence is a science that allows computers to learn on their own. To anticipate the class label, machine learning combines statistical techniques with the data.

A study by Rahman et al. (2017) suggests the use of an advanced machine learning algorithm to help predict the long-term success of students. Three methods were used: SVMs, XGBoost, and Decision Trees. To find out which method was most accurate, they studied the metrics from all three algorithms and discovered that the SVM algorithm offered the greatest accuracy at 90.3%, with XGBoost next at 88.33%. To properly forecast the career path of students, it is necessary to gather numerous data points, including student performance in different academic areas, focus on certain skill sets, general programming abilities, analysis skills, memory, and personal information like relationships, interests, sports, competitions, hackathons, workshops, certifications, and other reading preferences. The outcome of the forecast was shown through a web application the students created. There will be more sophisticated, improved versions of the existing software in the future.

A machine learning method for the prediction of campus placement was suggested by Sugiharti et al. (2017). The study applied the decision tree and random forest methods to a dataset that included the students' results. When the algorithm is applied to prior-year data sets of the students, it uses parameters such as the proportion of students in the freshman class who completed all of their required credits to create the model and to choose the parameters for the current study. In the decision tree test, the accuracy after the investigation was 84%, and with random forest it was 86%. In terms of accuracy, the productivity of the two approaches is similar; on this basis, the random forest method has been shown to provide better placement prediction results.

Machine learning methods proposed by Saouabi and Abdellah (2019) aim to determine whether or not a student is placed. A machine learning classification method known as the Naive Bayes Classifier is utilized in this study, together with the K Nearest Neighbours algorithm, for the prediction of a student's placement rank. In addition to USN, Tenth, and PUC/Diploma results, the algorithms consider variables like CGPA and technical and aptitude skills. The various algorithms use predictive models to anticipate outcomes, and their performance is then compared on the dataset. Future studies will aim to include additional features to provide better placement prediction.

Students' academic performance and employability skills have a bearing on the employability of engineering graduates, and Namounn A. and Alshanqiti (2021) suggest various machine learning algorithms for predicting employability based on these factors. To construct a model for predicting the employability of engineering graduate students, an ANN was used.

This study demonstrates that previous academic achievement has a significant impact on future results. To forecast the student's work, the authors examine different datasets to assess the student's academic performance and character, such as past grades, time spent on research, parent status, GPA, school assistance, further education, internet use, travel time, etc. In this study, several machine learning techniques (including linear regression, K-means clustering, and neural networks) were applied to the data sets to analyse students.

Azure's machine learning cloud studio has been shown to be particularly effective for real-time applications, and it may turn out to be a very important tool in academics. K-means and neural networks provide impressive results.

Alghamlas and Alabduljabbar (2019) use Logistic Regression to forecast the future employability of a group of job applicants. The overall aim of this research is to build an automated employment process that assesses employability based on the logistic regression approach. This research finds a way to use machine learning techniques for predicting employability. The methodology, in this case, used four aptitude measures: the factor for Aptitude (β1), the factor for Communication (β2), the factor for Technical proficiency (β3), and the factor for Personality (β4). This research attempts to show how machine prediction may be applied to the likelihood of being hired in the job application process. Conversion to another area of prediction will provide different results, since the self-regulating factors will vary.
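For reference, a four-predictor logistic model of the kind described above takes the standard form below (the intercept $\beta_0$ is assumed, and $x_1, \dots, x_4$ stand for the Aptitude, Communication, Technical proficiency, and Personality scores):

$$P(\text{employable}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4)}}$$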

A six-predictor logistic regression model developed by D. Satish Kumar et al. (2015) [7] predicts the likelihood of MBA student campus placement using classification. In this model, six different factors were utilized to estimate the placement possibilities: CGPA in undergraduate and postgraduate programs, specialization in undergraduate and postgraduate programs, soft skill score, and gender. The data in this study was examined using the R program. The study concludes that four variables influence the campus placement probability: CGPA, specialization in both PG and UG, and gender.

Rojanavasu (2019) suggests the use of various data mining methods to study the results of students in a computer course. The study combines various data mining approaches, through which it shows the predictive power of classification methods. This study showed that the Multilayer Perceptron algorithm was the most appropriate method for classifying the student data, offering a better-than-average 87% prediction accuracy. The study examines student data parameters that include English, Mathematics, Programming language, and practical marks. The quest for a suitable classification method led to assessing the performance of Naïve Bayes Simple, Multilayer Perceptron, SMO, J48, and REP Tree in predicting the students' academic achievement.

Engineering students' academic performance data is mined by Hasan et al. (2020) to assist with the recruiting process. This study applies six classification methods (BayesNet, Naïve Bayes, Multilayer Perceptron, IB1, Decision Table, and PART Classification) to the student data utilized in this research. The findings of the pilot study reveal that the IB1 classifier is the most suitable method for classifying student data of this kind. Six parameters, including name, branch, passing percentage of 10th class, passing percentage of 12th class, and final grade, are used in this study for analyzing the student data set. This study may help future academic and business organizations and industries.

When it comes to studying student data, additional types of data mining, including clustering, prediction, and association rules, may be used in the future.

The paper by Alban and Mauricio (2019) advocates the use of data mining methods to investigate the student recruiting process, with outcomes like Selected, Waiting, and Not Selected. The pupils are classified according to their skills using a classification system. This study will assist lecturers in deciding which students to choose for recruiting and in getting them prepared for the procedure. It will benefit students by enhancing their abilities and decreasing their failure rate, and in the future it will help instructors enhance their students' learning abilities.

Casuat and Festijo (2019) presented two classification model construction algorithms, Random Tree and J48, which use the decision tree concept. The models utilize academic achievement to predict the level of recruiting. Academic success is evaluated by various characteristics, including test scores, communication competence, and placement preparation hours. A side-by-side comparison of the Random Tree and J48 classification models demonstrates that the Random Tree model works better than the J48 model: the Random Tree classifier is accurate 85% of the time, with a precision of 74%. It is predicted that by using a variety of data mining methods, such as K-Nearest Neighbour classification, Naive Bayesian classification, etc., the accuracy of future predictions will improve.

Five classification methods proposed by Tarmizi et al. (2019) were tested on an IT employability dataset to see which algorithm was the most accurate in predicting the employability of IT graduate students. According to this study, the employability of IT graduates may be predicted using nine variables, including location, gender, core qualifications, skills, languages, mathematics, and humanities. The researchers investigate the different classification algorithms and determine which works best on the IT employability dataset. The analysis demonstrates that logistic regression performs best, with an accuracy of 78.4%. The three academic variables IT Core, IT Professional, and Gender were found to be important predictors of employability, according to the logistic regression analysis.

Data collected from a new study may be used to create additional rules and provide more accurate predictions about an individual's potential for finding employment inside the IT industry. In addition, categorization algorithms that create new data models for improved prediction may be further researched.

According to Yadav et al. (2018), students who have completed an MCA programme should be evaluated with various classifiers, with an employability model built on the best classifier to forecast their employability. For this prediction of employability, several classification approaches were employed, including Bayesian methods, Multilayer Perceptron, and Sequential Minimal Optimization (SMO), as well as ensemble methods and decision trees. This study involves measuring academic achievement, socioeconomic status, work skills, and emotional traits, among other factors. In WEKA, the J48 (a pruned C4.5 decision tree) algorithm is the most suitable algorithm for predicting student employability. In the next study, students who received B.Sc. and B.E. degrees will be included.

Rahman et al. (2017) have devised six classifiers for forecasting graduates' employment status at Higher Education Institutes: K-Nearest Neighbour, Naïve Bayes, Decision Tree, Neural Network, Logistic Regression, and Support Vector Machine. Both supervised and unsupervised machine learning algorithms were employed in this study. The study utilized variables such as gender and academic program. Looking forward, it is important to consider additional factors such as course topics studied, exam scores, and job status.

Sugiharti et al. (2017) suggest the use of decision trees as a way to classify engineering students by mining data on their employability. Collections of conventional criteria (e.g., socioeconomic circumstances, academic achievement, and certain additional emotional skill parameters) are created to make it easier to identify gifted students. This profiles a student with regard to his or her academic performance, together with other facets such as extracurricular activities.

Overview of Data Mining

Data mining is the process of extracting and finding patterns in big data sets using techniques at the confluence of machine learning, statistics, and database systems. Information extraction from data sets is the aim of data mining, an interdisciplinary area of computer science and statistics. As part of the knowledge discovery in databases (KDD) process, data mining is used to analyse data. Along with the basic data analysis phase, it also includes database and data management elements, data pre-processing, model and inference concerns, interestingness metrics, and complexity considerations. It examines big datasets semi-automatically or automatically to uncover previously undiscovered, interesting patterns such as groupings of records (cluster analysis), anomalies (anomaly detection) and relationships (association rule mining, sequential pattern mining). Database methods such as spatial indices are often used. Patterns that emerge may be utilised for additional research, such as machine learning or predictive analytics. As the last stage in the process of gaining information from data, it is necessary to confirm that the patterns identified by algorithms are present in the larger data set. However, not all patterns discovered by data mining algorithms are reliable: data mining algorithms often discover patterns in the training set that aren't present in the broader data set. This is referred to as overfitting.

Data mining is the process of analysing data to extract meaningful summaries. In recent years, therefore, data mining has become a process useful to different sectors; it is now being used in all sectors, including business, healthcare, education, etc.

Data mining is a machine learning based process that can extract information from raw data by finding data patterns within it. According to the extracted data patterns, conclusions can be drawn, and these conclusions are very useful for data analysts. Therefore, data mining is a very useful process for data analysis (Shehu, 2016).

Data mining is a process that can be divided into five steps. They are stated below.

  1. The first step is to organize the collected raw data and save it in the related data warehouse.
  2. The raw data is then loaded onto a local machine or into the cloud in such a way that data analysts can access it very easily.
  3. Information technology professionals process the data to extract valuable conclusions; the data mining professionals also determine the organization strategy for the data.
  4. After that, data mining application software is used to sort the data as per the requirements of the users; the processing of the data depends completely on these requirements.
  5. The final conclusions from the raw data are presented in interactive ways, with different graphical processes used to represent the results of the analysis.

Data mining is similar to data science carried out by a person, in a specific situation, on a particular dataset, with an objective. The process incorporates various types of services, for example text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that is simple or highly specialised. By outsourcing data mining, almost all the work can be done faster with low operating costs. Specialised firms can also use new technologies to collect data that is difficult to find manually. There are huge amounts of information available on various platforms, but very little knowledge is accessible. The biggest challenge is to analyse the data to extract significant information that can be used to solve a problem or for company improvement. There are many powerful tools and techniques available to mine data and find better insight from it.

Various Types of Data Mining

There are different types of data mining depending on the databases used. The different types of data used in the data mining process are discussed as follows (Moro, 2016).

  • Social Database:

Social media has been used widely in recent years by the common people, who open up their thoughts on these platforms. The activity of people on social media platforms is a very important kind of data to which data mining is applied.

  • Data Warehouse:

A data warehouse assembles the data of different sectors related to a business. The colossal proportion of data comes from various places, such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organisation. The data warehouse is designed for the analysis of data rather than for transaction processing.

  • Data Repositories:

Data repositories are databases where information related to particular datasets is saved, including common terminologies. The main reason for maintaining a data repository is to create a common platform for data analysis professionals: they use the same terminologies all over the world, so the exchange of information becomes very easy (Rossi, 2016).

  • Object-Relational Database:

A blend of an object-oriented database model and a relational database model is called an object-relational model. It supports classes, objects, inheritance, etc.

One of the fundamental objectives of the object-relational data model is to close the gap between the relational database and the object-oriented modelling practices commonly used in many programming languages, for example, C++, Java, C#, and so on.

  • Transactional Database:

A transactional database refers to a database management system (DBMS) that can potentially undo (roll back) a database transaction if it is not performed appropriately. Although this was a remarkable capability a very long time ago, today the vast majority of relational database systems support transactional database activities.


Data mining, an industry-wide practice of turning raw data into usable information, is a technique employed by businesses. Businesses may learn more about their consumers by employing software to search for trends in huge amounts of data; this allows them to create more successful marketing campaigns, boost sales, and save expenses. Both on-the-shelf and off-the-shelf solutions are required to make good use of the data mining process. Data mining is also utilised to construct machine learning models (Van Der Aalst, 2016).

Mining huge amounts of information to find significant patterns and trends is known as data mining. Uses range from database marketing to credit risk management to spam email screening to determining the attitude or opinion behind a user's online behaviour. It is possible to break the data mining process down into five stages. An organization's first step is to gather information and then put it into a data warehouse. Afterwards, the data is stored and managed, either on the organization's own servers or in the cloud. Business analysts, management teams, and information technology experts examine the data and decide how they want to arrange it based on their preferences and requirements. As a consequence of this sorting, end-users may share their data in a manner that is simple to understand, such as a graph or table. According to what users want, data mining algorithms look for connections and patterns in the data. Companies, for example, may employ data mining tools to categorise information and generate categories. Numerous supermarkets provide their customers with free loyalty cards, which entitle them to discounted rates that are not accessible to non-members. When shops have cards in hand, it is simple for them to keep track of who buys what, when, and at what price. Data analysis allows retailers to give consumers coupons based on their purchasing patterns and to choose whether to put goods on sale or sell them at full price (Stančin, 2019).

Data Mining Tools and Techniques

A variety of data mining tools and techniques are nowadays implemented in various machine learning processes. These are stated below.

  1. Classification:

Classification analyses data and metadata to extract essential information. It is a data mining technique that helps categorise data into several categories.

  2. Clustering:

As a data mining method, clustering analyses data to find data that are similar to each other. This method aids in the understanding of the differences and similarities between the data sets that have been analysed.

  3. Regression:

Regression analysis is used to analyse the connection between variables. In the presence of other factors, it is used to determine the probability of a particular variable occurring (Chen, 2019).

  4. Association Rules:

This data mining method is used to discover the relationship between two or more items; it reveals hidden patterns in the data set.

  5. Outer Detection:

An unexpected pattern or unexpected behaviour in a dataset is detected using this method. There are several applications for this method, such as intrusion detection, fraud and defect detection systems, etc. Other terms for outer detection include outlier detection and outlier mining.

  6. Prediction:

Prediction uses the other data mining methods in combination to anticipate a future event: it examines previous events or occurrences in the correct order and makes a forecast based on the analysis.

Advantages of Data Mining:

To convert raw data into valuable information, an organisation will employ data mining. To understand more about their consumers and create better business strategies that increase sales and decrease expenses, companies may use software that searches for patterns in big data sets. Data mining has several benefits, including the ability to gather, store, and analyse user data, and the technique is used for the development of machine learning models. Researchers may be able to work faster when using data mining methods to analyse data, leaving more time to devote to other initiatives. Shopper behaviour can also be tracked: new issues often arise while identifying particular shopping patterns, and data mining is utilised to find solutions to these challenges. These purchasing trends may be discovered using mining techniques, which create an area where all of the unexpected shopping patterns may be computed; data extraction may therefore be useful for identifying purchasing trends (Guruvayur, 2017). When running marketing campaigns, data mining is used to determine the best course of action. It may also be used to identify consumer groupings, and surveys, which are a kind of information gathering, can be used to create these new consumer categories. Mining methods are used in marketing efforts to better understand consumers' requirements and behaviour. In any case, data mining may offer useful information when making choices.

    • The data mining system enables organisations to obtain knowledge-based information.
    • Data mining enables organisations to make profitable improvements in operations and production.
    • Compared with other statistical data applications, data mining is cost-efficient.
    • Data mining helps the decision-making process of an organisation.
    • It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviours.
    • It can be introduced in new systems as well as existing platforms.
    • It is a quick process that makes it easy for new users to analyse enormous amounts of data in a short period of time.

Disadvantages of Data Mining:

Though there are lots of advantages of using data mining processes, some limitations and drawbacks also exist. These are stated below (Fernandes, 2017).

    • There is a probability that organisations may sell useful customer data to other organisations for money. According to reports, American Express sold its customers' credit card purchase data to other organisations.
    • Much data mining analytics software is difficult to operate and needs advanced training to work with.
    • Different data mining tools work in different ways owing to the different algorithms used in their design. Therefore, the selection of the right data mining tool is a very difficult task.
    • Data mining techniques are not 100% accurate, so they may lead to serious consequences in certain conditions.

Decision Trees

Before discussing any other machine learning techniques, it is important to discuss decision trees first. In simple words, a decision tree is a structure that contains nodes (rectangular boxes) and edges (arrows) and is built from a dataset (a table of columns representing features/attributes and rows corresponding to records). Each node is either used to make a decision (known as a decision node) or represents an outcome (known as a leaf node) (Sharma, 2016).

ID3 Algorithm

ID3 (Iterative Dichotomiser 3) is a decision tree learning method developed by Ross Quinlan. Machine learning and natural language processing utilise ID3, which is a predecessor of the C4.5 method. Each iteration employs a greedy approach to divide the dataset, choosing the locally best attribute to do so; employing backtracking during the search for an optimum decision tree may increase efficiency, but the greedy approach may lead to overfitting of the training data. The resulting trees are typically short, although not necessarily the smallest decision trees imaginable. ID3 is more difficult to utilise with continuous data than with factored data: it may take a lot of effort to find the optimum value at which to divide the data if the values of an attribute are continuous. Information gain, a measure of how effectively a feature separates or classifies the target classes, increases as the entropy decreases. In binary classification (when the target column has just two kinds of classes), the entropy is zero when the target column is homogeneous and one when the target column has the same number of values for both classes (Wang, 2017).

    • ID3 stands for Iterative Dichotomiser 3 and is named so because the algorithm iteratively (repeatedly) dichotomizes (divides) features into two or more groups at each step.
    • Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree. In simple words, the top-down approach means that we start building the tree from the top, and the greedy approach means that at each iteration we select the best feature at the present moment to create a node.
    • Most generally, ID3 is only used for classification problems with nominal features.

Calculations in ID3

As mentioned earlier, the ID3 algorithm picks the best feature at every step while building a decision tree.

A question may arise: how does ID3 choose the appropriate feature? ID3 uses Information Gain for choosing the best feature at each split (Rajeshkanna, 2020).

Information gain measures the decline in entropy obtained by splitting on a feature; at the same time, it measures how well a given feature separates or classifies the target classes. The feature with the highest information gain is picked as the best one.

In simple words, entropy is the measure of disorder, and the entropy of a dataset is the measure of disorder in the target attribute of the dataset. If the target attribute is completely homogeneous, its entropy is 0; for a binary target, the entropy is 1 when the two classes appear in equal proportions.
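In standard notation (these are the usual textbook definitions rather than formulas reproduced from the study), the entropy of a dataset $S$ with $k$ target classes and the information gain of an attribute $A$ are:

$$H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i$$

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$

where $p_i$ is the proportion of records in $S$ belonging to class $i$, and $S_v$ is the subset of $S$ for which attribute $A$ takes the value $v$.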

Working Procedure of ID3 Algorithm

Dichotomization implies separating into two entities that are opposed to each other.

To build a tree, the algorithm repeatedly separates characteristics into two groups: the dominant attribute and the others. It then estimates each attribute's entropy and information gain, and the most prominent attribute is placed as a decision node on the tree's branches. For each branch, a new set of entropies and gains is computed, and this process is repeated until a decision is reached for that branch. ID3 cannot guarantee an optimal solution, since it may become trapped in local optima. If the values of an attribute are continuous, ID3 is more difficult to utilise, since there are many more places to divide the data on this property, and finding the optimal value to split by may be time-consuming (Rojanavasu, 2019).

There are nodes (features), branches (rules), and leaves (outcomes) in a decision tree algorithm, covering both discrete and continuous attributes. A node may be divided into two or more sub-nodes using a variety of methods, improving the homogeneity of the sub-nodes that are created. After considering splits on all relevant factors, the decision tree chooses the split that results in the most homogeneous sub-nodes possible. In a decision tree, data is partitioned into subsets based on similar values, starting at the root node. The ID3 algorithm calculates the homogeneity of a sample by using entropy as a measure. If entropy E(s) = 0, the sample is completely homogeneous and cannot be divided further; if E(s) = 1, the sample is equally divided between the classes, so ID3 chooses the split with the lowest resulting entropy. Based on the reduction in entropy that occurs when a data set is divided on an attribute, the information gain may be computed, and a decision tree is constructed by identifying the characteristic that yields the greatest information gain (i.e., the most homogeneous branches).
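A minimal sketch of the attribute-selection step just described (pure Python; the toy records and the "certificate"/"placed" fields are illustrative, not the study's data):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    labels = [r[target] for r in records]
    remainder = 0.0
    for v in set(r[attribute] for r in records):
        subset = [r[target] for r in records if r[attribute] == v]
        remainder += (len(subset) / len(records)) * entropy(subset)
    return entropy(labels) - remainder

# Toy dataset (illustrative): does holding a certificate predict placement?
data = [
    {"certificate": "yes", "grade": "high", "placed": "yes"},
    {"certificate": "yes", "grade": "low",  "placed": "yes"},
    {"certificate": "no",  "grade": "high", "placed": "no"},
    {"certificate": "no",  "grade": "low",  "placed": "no"},
]

# ID3 picks the attribute with the highest information gain as the root node
gains = {a: information_gain(data, a, "placed") for a in ("certificate", "grade")}
print(gains)  # here 'certificate' has gain 1.0 and would become the root
```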

Data Analysis and Interpretation

There are several ways that Data Mining may be utilised in education to improve our knowledge of the learning process. As a result of this study, a variety of data mining techniques were implemented on student databases. In this study, there are three main sections. In the first phase, clustering will be used to profile pupils based on their grades.

Depending on the kind of grades that the school has assigned to a student, several distinct segments are created, and groups of pupils are formed based on their grades and business ratings. Recruitment refers to a comprehensive approach to recruiting appropriate individuals for one or more positions inside an organisation; whether these recruitments are permanent or temporary, they are followed by attracting, choosing and appointing the candidates. According to one definition, the word includes both pre-hire activities such as job criteria and personal specifications and post-hire activities like orientation and on-boarding. To deliver tailored promos based on an individual's expertise, student recruitment intelligence systems use data mining to identify students' points of interest (Matsumoto, 2017). Knowledge discovery in databases is the process of obtaining usable information from raw data. A data mining project's main objectives are usually prediction and description. In unsupervised learning, clustering is a common method for grouping related data: based on the information discovered in the data characterising the objects or their connections, cluster analysis clusters the objects.

K-Nearest Neighbours Algorithm or kNN Algorithm

One of the simplest machine learning algorithms, K-Nearest Neighbours uses the supervised learning method to find the training examples nearest to a new point. The K-NN method assumes a resemblance between a new case and existing instances and places it in the category that is most similar to the existing cases. Using the K-NN method, all the available data is stored and a new data point is classified based on its resemblance to existing data points. As a result, fresh data may be readily categorised using the K-NN method.

However, the K-NN method is mainly utilised for classification issues. As a nonparametric method, K-NN makes no assumptions about the data. Because the dataset is stored instead of being learned from immediately, K-NN is sometimes referred to as a lazy learner algorithm: during the training phase it simply saves the information, and it classifies fresh data into a comparable category only when it arrives (Zhang, 2017).

Working Procedure of K-Nearest Neighbours Algorithm

When it comes to classification and regression issues, the k-nearest neighbour's method (KNN) is a basic supervised machine learning technique that may be utilised for both. Despite its simplicity, it suffers from a fundamental flaw: it becomes considerably slower as the amount of data in use increases.

KNN works by calculating distances between a query and all the instances in the data, choosing a given number of examples (K) that are closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression). Selecting the appropriate K for our data is done by testing many different Ks and selecting the one that performs best for our data set (Wang, 2017).

Supervised learning algorithm KNN (K-nearest neighbour) classifies data based on how its neighbours are categorised. A similarity metric is used to categorise new instances in KNN.

To incorporate the closest neighbours in the voting process, K in KNN is a parameter that refers to the number of nearest neighbours considered. Here we use d = √((x₂ − x₁)² + (y₂ − y₁)²) to find the distance between any two points.
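A minimal sketch of this procedure with scikit-learn (the toy points and the choice k = 3 are illustrative, not parameters from the study):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points and binary labels (illustrative only)
X = np.array([[1, 1], [1, 2], [2, 2], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# k = 3: the query's label is the majority vote of its 3 nearest neighbours,
# with distances computed as d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)

query = np.array([[2, 1]])
print(knn.predict(query))     # -> [0]: the nearest cluster's label wins
print(knn.kneighbors(query))  # distances and indices of the 3 neighbours
```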

Random Forest Algorithm

A random forest is a machine learning method that may be used to solve regression and classification problems. It does this by combining a large number of classifiers: in a random forest method, a large number of decision trees are used. Using bagging, or bootstrap aggregation, the random forest method creates a "forest"; bagging is a meta-algorithm used to enhance machine learning algorithms' accuracy. Random forest (RF) algorithms determine the final result by using prediction trees: by taking the average or mean of the output from different trees, the method forecasts the outcome. As the number of trees grows, so does the accuracy of the result. A random forest algorithm overcomes the drawbacks of a decision tree method: overfitting of datasets is reduced, resulting in higher levels of accuracy (Belgiu, 2016).

Working Procedure of Random Forest Algorithm

Random forest algorithms are built on the foundation of decision trees, tree-like structures that are used to aid with decision-making. Understanding how random forest algorithms operate is easier with a basic understanding of decision trees. Each node in the tree has a different function, and there is a root node that connects all the nodes. Any training dataset is divided into branches using the decision tree method; the sequence repeats itself until a leaf node is reached, which cannot be further separated. Each of the nodes in the decision tree represents a characteristic that is used to forecast the result, and there is a connection between the leaves and the decision nodes. A decision tree has three kinds of nodes: decision nodes, leaf nodes, and a root node. Random forests are classified using an ensemble approach: decision trees are trained using the training data, and when the nodes are divided, random observations and features are chosen from this data set. In a random forest system, several decision trees are used, each with its decision nodes, leaf nodes, and root node. Ultimately, each tree's leaf node represents the final result of that particular decision tree, and the final product is determined by a majority-voting method.

This means that the output selected by the majority of decision trees will be used as the system's output (Golino, 2016).

Training involves the construction of numerous individual decision trees using random forests (RF). Each tree's predictions are aggregated to provide a final forecast, such as a class's mode for classification, or a regression model's mean prediction. Ensemble methods are used to make a final judgement based on a set of findings. The reduction in node impurity is multiplied by the likelihood of accessing that node to determine the feature significance.

This may be determined by taking a sample count and dividing it by a total number of samples to arrive at a node probability value. The more essential a characteristic is, the higher the value.

Here, the importance of node j is computed as

$$ni_j = w_j C_j - w_{\text{left}(j)}\, C_{\text{left}(j)} - w_{\text{right}(j)}\, C_{\text{right}(j)}$$

where:

  • nij is the importance of node j
  • wj is the weighted number of samples reaching node j
  • Cj is the impurity value of node j
  • left(j) is the child node from the left split on node j
  • right(j) is the child node from the right split on node j
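A minimal sketch of fitting a random forest and reading off the resulting feature importances with scikit-learn (the synthetic data and feature names are illustrative only; scikit-learn's feature_importances_ aggregates the impurity-based node importances described above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic student features (illustrative): CGPA, certificates, project score
X = rng.uniform(0, 10, size=(200, 3))
# Synthetic placement label driven mostly by the first feature (CGPA)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 200) > 7).astype(int)

# Each tree is trained on a bootstrap sample; the forest votes on the label
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Normalised, impurity-based importance of each feature
for name, imp in zip(["cgpa", "certificates", "project"], rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```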

Support Vector Machine Algorithm or SVM Algorithm

This algorithm's goal is to find a hyperplane in an N-dimensional space that classifies the data points. It is possible to choose from a wide variety of hyperplanes to divide the two classes of data points (Tao, 2018). What we are looking for is the plane with the greatest margin, i.e., the greatest distance between data points of the two classes. Maximizing the margin distance gives subsequent data points greater confidence in their classification. Hyperplanes act as decision boundaries for classifying data points: data points falling on either side of the hyperplane may be assigned to different classes. The hyperplane's dimension is determined by the number of features. Support vectors near the hyperplane affect its location and orientation, and the hyperplane's location will be altered if the support vectors are removed. They are the key elements that guide us in creating our SVM model.

Working Procedure of Support Vector Machine Algorithm

Support-vector machines (SVMs, also known as support-vector networks) are supervised learning models that evaluate data for classification and regression analysis. Machine learning tasks often include data categorisation: assume that a set of data points belongs to one of two classes and that the objective is to determine which class a new data point will belong to. Data points are regarded as p-dimensional vectors in support vector machines, and we wish to know whether such points can be separated using a (p-1)-dimensional hyperplane; the result is a linear classifier. It is possible to categorise the data in a variety of ways using hyperplanes. Choosing the optimal hyperplane may be as simple as choosing the one that indicates the greatest difference, or margin, between the two classes: our hyperplane is chosen in such a way as to maximise the distance on each side from the hyperplane to the closest data point. If such a hyperplane exists, it is called the maximum-margin hyperplane, and its linear classifier is called a maximum-margin classifier, or equivalently the perceptron of optimal stability (Anitha, 2021).

In an SVM, the output of a linear function is used to identify the classes: outputs greater than 1 are identified with one class and outputs less than −1 with the other, so the range of values [−1, 1] acts as the margin. The SVM method tries to maximise the distance between the data points and the hyperplane, and this margin maximisation can be achieved by minimising the hinge loss. To determine the gradients, we calculate the partial derivatives of the loss function with respect to the weights; the weights are then updated using these gradients (Farooq, 2017).

Cost Function Formula

J(w) = λ * ||w||^2 + (1/n) * Σ_i max(0, 1 − y_i * (w · x_i + b))

Loss Function Formula

L(x_i, y_i) = max(0, 1 − y_i * (w · x_i + b))

Gradient Formula

∂J/∂w = 2λw, if y_i * (w · x_i + b) ≥ 1
∂J/∂w = 2λw − y_i * x_i, otherwise
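
To make these formulas concrete, here is a small NumPy sketch of hinge-loss training by sub-gradient descent. It illustrates the update rules above and is not the scikit-learn implementation used later in this work; the learning rate and regularisation constant are arbitrary choices.

import numpy as np

def fit_linear_svm(X, y, lam=0.01, lr=0.001, epochs=1000):
    # y must contain labels -1 or +1
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) >= 1:
                # outside the margin: only the regulariser contributes
                w -= lr * (2 * lam * w)
            else:
                # inside the margin or misclassified: hinge term contributes
                w -= lr * (2 * lam * w - yi * xi)
                b += lr * yi
    return w, b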

Data Collection

Many engineering colleges are established in various parts of Hyderabad. A dataset was prepared from information collected from Bachelor of Technology students in their final year of study. Every student included in the dataset cleared all previous semesters in the first attempt; in other words, the dataset excludes students with backlogs or year gaps. To collect this information, a set of questionnaires was prepared in English, and the dataset was built from the responses the students provided. The dataset records each final-year student's aggregate percentage and practical knowledge. It also records whether a student completed any certification or course during the programme and, if so, how many. Other collected attributes include the projects done by a student during the course, their subject knowledge when attending a recruitment interview, their fear of communication while attending an interview, their confidence in front of the interviewer, and their overall interest in the practical implementation of their knowledge in various corporate fields.

Methods

For any machine learning task, the first concern is to prepare the system for processing. We implemented the proposed work in Python version 3.6. Some libraries are essential for processing in the Python environment and are used in the implementation of the proposed work. These libraries are listed below, with a minimal environment sketch after the list:

  • NumPy: used for numerical processing. With NumPy we can work with multi-dimensional arrays and matrices and compute linear algebra and Fourier transforms; the numerical functions are defined by the library itself.
  • Scikit-learn: a library compatible with NumPy. With its help we can execute machine learning processes such as data classification, clustering, and regression.
  • Pyplot: a Python plotting interface (part of Matplotlib) for plotting matrices, data, results, and so on.
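
A minimal environment sketch, assuming the libraries have been installed (for example with pip):

# pip install numpy scikit-learn matplotlib
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors, svm, ensemble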

To start the proposed work, we need to install these libraries along with the internal libraries provided by Python. As mentioned, the proposed work is developed on the dataset of Bachelor of Technology students in the final year of their course, collected from students who cleared their exams without backlog subjects or year gaps. A machine learning approach is implemented on this dataset. From the variety of available modelling approaches, several are chosen and implemented to find the one best fitted to the proposed dataset. The best-fitted approach is determined from model performance: each model produces a prediction accuracy, and by comparing the models against one another a proposal can be made for the best-fitted machine learning approach for the dataset used in this work.

The modelling work proceeds in phases. The first concern of any machine learning task is to prepare the dataset so that a modelling approach can run over it; once the dataset is ready, the work continues with pre-processing. In the pre-processing phase the information is cleaned, structured, and converted from alphabetical terms to numerical terms; we propose specific numerical values to substitute for the letters and letter sequences. The dataset is loaded into the Python environment and appended row by row so that pre-processing can begin. Across the whole dataset the alphabetical terms ‘X’, ‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’ are replaced with the digits 1 to 7 respectively. Sequences of characters such as ‘None’, ‘One’, ‘Two’ are likewise replaced with the digits 0 to 2 respectively, and in the same way values such as ‘Yes’, ‘No’, ‘Maybe’, ‘Average’, ‘Good’, ‘High’ are converted into numerical form. A Label Encoder is then applied: since machine learning is a computerised operation, the information supplied to it must be in a machine-readable format, so the Label Encoder converts the remaining labels into numerical form at the pre-processing stage. This is how the information set is made understandable to the machine learning models. After this transformation the pre-processed information is ready for the machine learning approaches, apart from one final stage: the data is divided into two parts, with 90 per cent used for training and the remaining 10 per cent held out for testing. Each machine learning model is then trained on the training split and run over the test split to obtain its final output, the prediction accuracy. A sketch of this pre-processing appears below.
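
The following sketch illustrates the pre-processing pipeline just described. The file name students.csv and the target column placed are assumptions made for illustration; the actual questionnaire columns differ.

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = pd.read_csv("students.csv")  # hypothetical file name

# Replace letter grades and spelled-out counts with digits, as described above
df = df.replace({"X": 1, "A": 2, "B": 3, "C": 4, "D": 5, "E": 6, "F": 7})
df = df.replace({"None": 0, "One": 1, "Two": 2})

# Label-encode any remaining text values ("Yes"/"No"/"Maybe", "Average"/"Good"/"High", ...)
encoder = LabelEncoder()
for col in df.select_dtypes(include="object").columns:
    df[col] = encoder.fit_transform(df[col])

# 90 per cent of rows for training, 10 per cent held out for testing
X = df.drop(columns=["placed"])  # "placed" is an assumed target column
y = df["placed"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)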

Later we implemented an Argument Parser in this work for selecting among the machine learning models. KNN, or K-Nearest Neighbour, is defined as the default model for the whole task, and Random Forest and Support Vector Machine (SVM) are implemented on the same dataset to complete the desired comparison. One possible shape of that switch is sketched below.
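
A minimal sketch of such a switch; the flag name --model and its choices are assumptions, not taken verbatim from the project code.

import argparse

parser = argparse.ArgumentParser(description="Placement prediction")
parser.add_argument("--model", choices=["knn", "rf", "svm"], default="knn",
                    help="model to run; KNN is the default")
args = parser.parse_args()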

The k-Nearest Neighbour (KNN) approach, introduced by Thomas Cover, is a supervised algorithm for pattern recognition commonly used for classification and regression. With this kind of model we can determine whether two or more objects lie in the same class: a new point is assigned by a plurality vote of its neighbours, and those neighbours are its k nearest neighbours. In the proposed task we use KNN, one of the simplest supervised learning models. Like related techniques, it works by finding the similarity between new information and existing information: new items most similar to existing ones are put into the same category, so new information is easily classified into the most similar category. In this approach the value of k, the number of neighbours, must be defined first; in our case it is set to 4. The model then finds the k nearest neighbours by calculating the Euclidean distance to each point. A larger value of k generally gives more stable results than very small values such as 1 or 2. KNN is an effective choice here because it is robust to noise in the training data and performs well when a large amount of information is available. A minimal version of this model is sketched below.
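
A minimal sketch of the model, reusing the train/test split from the pre-processing sketch above:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=4)  # k = 4, as chosen in this work
knn.fit(X_train, y_train)                  # Euclidean distance is the default metric
knn_pred = knn.predict(X_test)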

Our next machine learning approach is the Support Vector Machine (SVM), a supervised model used here for classification. In the working principle of SVM, labelled information, which in our case is produced by the Label Encoder, is classified by the model. We implemented the SVM with a linear kernel: since we want linear classification on the proposed dataset, the linear kernel is appropriate because it produces a non-probabilistic binary linear classifier. It separates the classes by a clearly mapped margin, and when a new object is inserted the same margin is maintained for it. SVMs can also perform classification that is non-linear in nature, and they are efficient when dealing with large-scale or sparse datasets, provided the data we want to classify is fully labelled. The corresponding model definition is sketched below.
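
A minimal sketch of the linear-kernel SVM, again reusing the earlier split:

from sklearn.svm import SVC

svm_clf = SVC(kernel="linear")  # non-probabilistic binary linear classifier
svm_clf.fit(X_train, y_train)
svm_pred = svm_clf.predict(X_test)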

In our proposed work we implemented another popular machine learning approach, Random Forest, a well-known supervised classification algorithm. In this algorithm, decision trees are created from the data, and these trees together form the forest used to classify and fulfil our purpose. A larger number of decision trees in the forest generally gives a more precise result: using many trees not only helps prevent the overfitting problem but also yields higher accuracy. Here we implement the approach with 300 estimators, that is, 300 trees in the forest, as sketched below.
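
A minimal sketch of the model with 300 trees:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, random_state=42)  # 300 trees in the forest
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)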

Each implemented model is first trained on the training split and then run over the test split, where it produces its individual prediction score in terms of accuracy.

Result

As mentioned earlier, each trained model produces an individual classification report when run over the test set of information. This report tells us about the quality of each model's classification of the dataset. The implemented models, Random Forest, K-Nearest Neighbour, and Support Vector Machine, each make a number of predictions; of these, some are predicted correctly and some incorrectly. In the classification result, each model generates not only an accuracy score but also several classification metrics: precision, recall, and f1-score, all calculated from the true and false predictions.

These results derive from the Confusion Matrix. In our case we produced a Confusion Matrix for each implemented algorithm to visualise the correctly predicted information in the proposed dataset. A Confusion Matrix provides four pieces of information (a sketch for extracting them follows this list):

  • True Negative or TN: an occurrence is a TN when the actual case from the dataset is negative and the prediction is also negative.
  • True Positive or TP: a TP occurs when the actual case and the prediction are both positive.
  • False Positive or FP: the actual case is negative but the model predicted it as positive.
  • False Negative or FN: the actual case is positive but the model predicted it as negative.
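
For a binary problem, the four counts can be read directly from scikit-learn's confusion matrix; the sketch below uses the KNN predictions from the earlier sketches.

from sklearn.metrics import confusion_matrix

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, knn_pred).ravel()
print("TN =", tn, "FP =", fp, "FN =", fn, "TP =", tp)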


Based on these four members of the Confusion Matrix, the classification metrics are generated. Precision, for example, is defined by,

Precision = True Positive / (True Positive + False Positive)

That is, Precision is the ratio of True Positives to the sum of True Positives and False Positives.

Another metric, recall, measures the ability of the model to recognise all the positive instances. It is defined as follows:

Recall = True Positive / (True Positive + False Negative)

That is, Recall is the ratio of True Positives to the sum of True Positives and False Negatives.

Another metric, the f1-score, is the harmonic mean of precision and recall. It can be defined as follows:

F1-Score = 2*(Precision * Recall) / (Precision + Recall)

An f1-score of 1 indicates the best possible prediction quality from the model, whereas an f1-score of 0 indicates the worst.
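
These metrics can be computed directly from the confusion-matrix counts obtained above, and scikit-learn's classification_report prints the same values per class:

from sklearn.metrics import classification_report

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)
print("precision =", round(precision, 2), "recall =", round(recall, 2), "f1 =", round(f1, 2))

print(classification_report(y_test, knn_pred))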

After implementing the K-Nearest Neighbour algorithm, we print its classification report to see how much accuracy the trained KNN model provides in this work.

When the trained KNN model is run over the test dataset, it achieves an accuracy score of 50%. To investigate the classification report further we also plotted the Confusion Matrix, presenting the predictions graphically so that the result is easier to understand. Here the True Positive value is 2, the True Negative value is 3, the False Positive value is 1, and the False Negative value is 4.

After the K-Nearest Neighbour algorithm, we implemented the Random Forest modelling approach and generated its classification report to see how much accuracy the trained Random Forest model provides in this work.

When the trained Random Forest model is run over the test dataset, it achieves an accuracy score of 90%. To investigate the classification report we again plotted the Confusion Matrix. Here the True Positive value is 5, the True Negative value is 4, the False Positive value is 0, and the False Negative value is 1.

Finally, we implemented the Support Vector Machine algorithm and generated its classification report to see how much accuracy the trained SVM model provides in this work.

When the trained SVM model is run over the test dataset, it achieves an accuracy score of 80%. To investigate the classification report we again plotted the Confusion Matrix. Here the True Positive value is 1, the True Negative value is 7, the False Positive value is 1, and the False Negative value is 1.

After evaluating each implemented model, we found that Random Forest performs better than the K-Nearest Neighbour and Support Vector Machine approaches in terms of accuracy. It secures 90% accuracy, the best accuracy found among the three models when predicting on the dataset of final-year Bachelor of Technology students. We can therefore conclude that Random Forest is the best-fitted machine learning approach compared with K-Nearest Neighbour and Support Vector Machine in our proposed work.

Conclusion

In this proposed work we collected a dataset of Bachelor of Technology students who are in their final year of study and appearing for recruitment. We implemented a comparative machine learning study to find the algorithm best fitted to this task in terms of prediction, implementing the K-Nearest Neighbour, Random Forest, and Support Vector Machine approaches.

After running the models over the proposed dataset, we found that Random Forest performs better than the other two modelling approaches, K-Nearest Neighbour and Support Vector Machine. We can therefore conclude that the Random Forest modelling approach is the most precise and best fitted of the three implemented models.

As future work, we can extend this study to more advanced approaches using deep learning. For this we will collect a larger amount of information for the same task and predict on it with the help of these advanced approaches.

References

  • Alban, M. and Mauricio, D., 2019. Predicting university dropout through data mining: A Systematic Literature. Indian Journal of Science and Technology, 12(4), pp.1-12.
  • Alghamlas, M. and Alabduljabbar, R., 2019, May. Predicting the Suitability of IT Students' Skills for the Recruitment in Saudi Labor Market. In 2019 2nd International Conference on Computer Applications & Information Security (ICCAIS) (pp. 1-5). IEEE.
  • Casuat, C.D. and Festijo, E.D., 2019, December. Predicting students' employability using machine learning approach. In 2019 IEEE 6th International Conference on Engineering Technologies and Applied Sciences (ICETAS) (pp. 1-5). IEEE.
  • Hasan, R., Palaniappan, S., Mahmood, S., Abbas, A., Sarker, K.U. and Sattar, M.U., 2020. Predicting student performance in higher educational institutions using video learning analytics and data mining techniques. Applied Sciences, 10(11), p.3894.
  • He, S., Li, X. and Chen, J., 2021, May. Application of Data Mining in Predicting College Graduates Employment. In 2021 4th International Conference on Artificial Intelligence and Big Data (ICAIBD) (pp. 65-69). IEEE.
  • Namoun, A. and Alshanqiti, A., 2021. Predicting student performance using data mining and learning analytics techniques: a systematic literature review. Applied Sciences, 11(1), p.237.
  • Rahman, N.A.A., Tan, K.L. and Lim, C.K., 2017, October. Predictive analysis and data mining among the employment of fresh graduate students in HEI. In AIP Conference Proceedings (Vol. 1891, No. 1, p. 020007). AIP Publishing LLC.
  • Rojanavasu, P., 2019. Educational data analytics using association rule mining and classification. In 2019 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI DAMT-NCON) (pp. 142-145). IEEE.
  • Saouabi, M. and Abdellah, E., 2019. Proposition of an employability prediction system using data mining techniques in a big data environment. International Journal of Mathematics & Computer Science, 14(2), pp.411-424.
  • Sugiharti, E., Firmansyah, S. and Devi, F.R., 2017. Predictive evaluation of performance of computer science students of UNNES using data mining based on Naïve Bayes Classifier (NBC) algorithm. Journal of Theoretical and Applied Information Technology, 95(4), p.902.
  • Tarmizi, S.S.A., Mutalib, S., Hamid, N.H.A. and Rahman, S.A., 2019. A review on student attrition in higher education using big data analytics and data mining techniques. International Journal of Modern Education and Computer Science, 11(8), pp.1-14.
  • Yadav, S., Jain, A. and Singh, D., 2018, December. Early prediction of employee attrition using data mining techniques. In 2018 IEEE 8th International Advance Computing Conference (IACC) (pp. 349-354). IEEE.
  • Henry, O., & Ferry, M. (2017). When cracking the JEE is not enough: Processes of elimination and differentiation, from entry to placement, in the Indian Institutes of Technology (IITs). South Asia Multidisciplinary Academic Journal, (15).
  • Kumar, P. M., Veeranagaiah, C., & Chandana, K. (2016). Implementation of Online Placement System.
  • Pessach, D., Singer, G., Avrahami, D., Ben-Gal, H. C., Shmueli, E., & Ben-Gal, I. (2020). Employees recruitment: A prescriptive analytics approach via machine learning and mathematical programming. Decision Support Systems, 134, 113290.
  • Agarwal, N., Grottke, M., Mishra, S., & Brem, A. (2016). A systematic literature review of constraint-based innovations: State of the art and future perspectives. IEEE Transactions on Engineering Management, 64(1), 3-15.
  • Shehu, M. A., & Saeed, F. (2016). An adaptive personnel selection model for recruitment using domain-driven data mining. Journal of Theoretical and Applied Information Technology, 91(1), 117.
  • Moro, S., Rita, P., & Vala, B. (2016). Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, 69(9), 3341-3351.
  • Rajeshkanna, A., & Arunesh, K. (2020, July). ID3 decision tree classification: An algorithmic perspective based on error rate. In 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC) (pp. 787-790). IEEE.
  • Bonaccorso, G. (2017). Machine learning algorithms. Packt Publishing Ltd.
  • Rossi, R. A., & Ahmed, N. K. (2016). An interactive data repository with visual analytics. ACM SIGKDD Explorations Newsletter, 17(2), 37-41.
  • Van Der Aalst, W. (2016). Data mining. In Process Mining (pp. 89-121). Springer, Berlin, Heidelberg.
  • Stančin, I., & Jović, A. (2019, May). An overview and comparison of free Python libraries for data mining and big data analysis. In 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) (pp. 977-982). IEEE.
  • Chen, W., Yan, X., Zhao, Z., Hong, H., Bui, D. T., & Pradhan, B. (2019). Spatial prediction of landslide susceptibility using data mining-based kernel logistic regression, naive Bayes and RBFNetwork models for the Long County area (China). Bulletin of Engineering Geology and the Environment, 78(1), 247-266.
  • Guruvayur, S. R., & Suchithra, R. (2017, May). A detailed study on machine learning techniques for data mining. In 2017 International Conference on Trends in Electronics and Informatics (ICEI) (pp. 1187-1192). IEEE.
  • Fernandes, M. (2017). Data Mining: A Comparative Study of its Various Techniques and its Process. International Journal of Scientific Research in Computer Science and Engineering, 5(1), 19-23.
  • Sharma, H., & Kumar, S. (2016). A survey on decision tree algorithms of classification in data mining. International Journal of Science and Research (IJSR), 5(4), 2094-2097.
  • Wang, Y., Li, Y., Song, Y., Rong, X., & Zhang, S. (2017). Improvement of ID3 algorithm based on simplified information entropy and coordination degree. Algorithms, 10(4), 124.
  • Matsumoto, T., Sunayama, W., Hatanaka, Y., & Ogohara, K. (2017, July). Data analysis support by combining data mining and text mining. In 2017 6th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI) (pp. 313-318). IEEE.
  • Zhang, S., Li, X., Zong, M., Zhu, X., & Cheng, D. (2017). Learning k for knn classification. ACM Transactions on Intelligent Systems and Technology (TIST), 8(3), 1-19.
  • Wang, X., Zhang, Y., Yu, S., Liu, X., Yuan, Y., & Wang, F. Y. (2017, October). E-learning recommendation framework based on deep learning. In 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (pp. 455-460). IEEE.
  • Belgiu, M., & Drăguţ, L. (2016). Random forest in remote sensing: A review of applications and future directions. ISPRS journal of photogrammetry and remote sensing, 114, 24-31.
  • Golino, H. F., & Gomes, C. M. (2016). Random forest as an imputation method for education and psychology research: its impact on item fit and difficulty of the Rasch model. International Journal of Research & Method in Education, 39(4), 401-421.
  • Tao, P., Sun, Z., & Sun, Z. (2018). An improved intrusion detection algorithm based on GA and SVM. IEEE Access, 6, 13624-13631.
  • Anitha, P., & Kaarthick, B. (2021). Oppositional based Laplacian grey wolf optimization algorithm with SVM for data mining in intrusion detection system. Journal of Ambient Intelligence and Humanized Computing, 12(3), 3589-3600.
  • Farooq, M., & Steinwart, I. (2017). An SVM-like approach for expectile regression. Computational Statistics & Data Analysis, 109, 159-181.
