Data Analytics

CRISP Framework for Data Analytics

What is CRISP-DM?

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is one of the most widely used process models for data mining in the data analytics industry. The model was initially developed in 1996 as a project led by five companies (Integral Solutions Ltd, Teradata, Daimler AG, NCR Corporation, and OHRA) in the European Union.

Six Phases in the CRISP-DM Model

There are six phases in the CRISP-DM Model:

Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

Phase 1: Business Understanding

The first phase is “Business Understanding.” To solve problems with an analytics approach, you must understand the business domain, the business objectives, and the current business problems that a data-driven approach can address. This understanding helps you make good decisions during the data understanding, data preparation, modeling, and evaluation phases.

The Business Understanding phase focuses on the following objectives of the data mining project:

· Determine the business objectives: It is essential to know what your customer wants and to define the success criteria. For example, if you are in the retail banking business, your business objective could be to reduce NPAs (Non-Performing Assets) by 50%. A business success criterion for credit card defaults could be to keep the rate of defaulters below 0.5% for the current financial year.
· Assess the current situation: During the business understanding phase, you have to consider the availability of human resources (data scientists, data analysts, business analysts, data engineers, project managers), the data sources and the permissions needed to use them, risks and contingency plans, a glossary of business and data mining terms, and a cost-benefit analysis that justifies the project.
· Identify the data mining goals: The data mining goals are derived from the business objectives and are stated in data mining terms. For retail banking, the data mining goals could be ‘to predict the credit card customers who are going to be delinquent’ and ‘to identify the factors that are strong characteristics of delinquent customers’.
· Create a Project Plan: A project plan is the blueprint of the different phases of the project, along with a detailed work breakdown structure. The project plan consists of the schedule, tasks, resources required, dependencies between the tasks, risks, and mitigation actions. You should plan for several iterations of data cleaning, preparation, model building, and model evaluation to achieve the desired model performance.
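
A success criterion like the one in the banking example can be written down as a simple, testable check. The sketch below is only an illustration: the account records and the `defaulted` field are hypothetical, and only the 0.5% threshold comes from the example above.

```python
# Hypothetical account records; real ones would come from the bank's data sources.
accounts = [{"account_id": i, "defaulted": i % 250 == 0} for i in range(1, 1001)]

MAX_DEFAULT_RATE = 0.005  # business success criterion: defaulters below 0.5%


def default_rate(records):
    """Return the fraction of records flagged as defaulted."""
    if not records:
        raise ValueError("no records to evaluate")
    return sum(1 for r in records if r["defaulted"]) / len(records)


rate = default_rate(accounts)
meets_criterion = rate < MAX_DEFAULT_RATE
print(f"default rate = {rate:.2%}, criterion met: {meets_criterion}")
```

Stating the criterion as a number early on makes it easier to judge the models against it in the later phases.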

Phase 2: Data Understanding

The second phase focuses on collecting data from various data sources and exploring it. Once you are clear about the business objectives and the data mining goals, you can proceed with data collection and exploration. The tasks in the Data Understanding phase are described below:

· Collect Data: Identify the list of data sources required for analysis, document the data collection method, issues encountered in the data collection, and resolutions provided.

· Describe Data: The next task is to describe the data (the number of rows and columns and a description of each column) and to verify that the data satisfies your analysis objectives.

· Explore Data: This task focuses on Exploratory Data Analysis (EDA). The primary goals in this step are to search for missing values and duplicate records, get a statistical summary, and examine the relationships between numeric variables (for example, with scatter plots) and the distribution of the data.

· Verify Data Quality: It is important to verify the quality of the data: check for missing or incorrect values, and decide whether the available quality is sufficient to proceed with the analysis.
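
The describe, explore, and verify tasks above can be sketched with a few lines of code. This is a minimal illustration using only the Python standard library; the customer records and column names are made up, and `None` is assumed to mark a missing value.

```python
import statistics

# Hypothetical sample of credit card customer records; None marks a missing value.
rows = [
    {"customer_id": 1, "age": 34, "balance": 1200.0},
    {"customer_id": 2, "age": None, "balance": 560.5},
    {"customer_id": 3, "age": 51, "balance": None},
    {"customer_id": 4, "age": 29, "balance": 300.0},
    {"customer_id": 4, "age": 29, "balance": 300.0},  # exact duplicate
]

# Describe: number of rows and columns.
columns = list(rows[0].keys())
print(f"{len(rows)} rows, {len(columns)} columns: {columns}")

# Verify quality: count missing values per column.
missing = {c: sum(1 for r in rows if r[c] is None) for c in columns}

# Verify quality: count rows whose full set of values was already seen.
seen, duplicates = set(), 0
for r in rows:
    key = tuple(r[c] for c in columns)
    if key in seen:
        duplicates += 1
    seen.add(key)


def summarize(column):
    """Explore: statistical summary of a numeric column, ignoring missing values."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


balance_summary = summarize("balance")
print("missing:", missing)
print("duplicates:", duplicates)
print("balance summary:", balance_summary)
```

On real projects a data-frame library would usually do this work; the point here is only what each task checks.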

Phase 3: Data Preparation

The third phase focuses on preparing the data set for modeling. In this phase, you define the attributes required for modeling, assemble the data if it comes from several sources, and clean the data so that the final data set is ready. Most of a project's time is often spent in this phase before model building begins. The tasks performed in this phase are described below:

· Select Data: Identify the data sets that will be used and the reasons for including or excluding data. Typically, data sets of poor quality may be dropped.

· Clean Data: When you inspect the data, you will usually discover missing or incorrect values. Missing and incorrect values are imputed, and/or erroneous observations are deleted.

· Construct Data: This step is also called feature engineering. You create new features by combining existing ones to derive meaningful attributes that are used in the industry (for example, financial ratios such as the P/E ratio for shares).

· Integrate Data: Sometimes, you need to combine data from different data sources to create a final data set. This is called the integration of data.

· Format Data: Very often, data needs to be converted from its existing format to a new one. For example, string data may need to be converted to a numeric or date format. In this step, data formatting ensures that all attributes are in a suitable format.
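
The preparation tasks above can be sketched on a toy example built around the P/E ratio mentioned earlier. Everything in this sketch is an assumption for illustration: the tickers, prices, and earnings figures are invented, and median imputation is just one possible cleaning strategy.

```python
import statistics
from datetime import datetime

# Hypothetical raw records from two sources, keyed by ticker.
prices = [
    {"ticker": "AAA", "price": "120.0", "as_of": "2022-06-01"},
    {"ticker": "BBB", "price": "45.5", "as_of": "2022-06-01"},
    {"ticker": "CCC", "price": "", "as_of": "2022-06-01"},  # missing price
]
earnings = {"AAA": 8.0, "BBB": 3.5, "CCC": 5.0}  # earnings per share

# Format: convert string fields to numeric and date types.
for row in prices:
    row["price"] = float(row["price"]) if row["price"] else None
    row["as_of"] = datetime.strptime(row["as_of"], "%Y-%m-%d").date()

# Clean: impute missing prices with the median of the observed prices.
median_price = statistics.median(
    r["price"] for r in prices if r["price"] is not None
)
for row in prices:
    if row["price"] is None:
        row["price"] = median_price

# Integrate: join the earnings source onto the price records.
# Construct: derive the P/E ratio as a new feature.
dataset = []
for row in prices:
    eps = earnings[row["ticker"]]
    dataset.append({**row, "eps": eps, "pe_ratio": row["price"] / eps})

for record in dataset:
    print(record["ticker"], record["price"], round(record["pe_ratio"], 2))
```

Whether to impute a value or drop the observation depends on the data and the business context, so treat the choice here as a placeholder.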

Phase 4: Modeling

The fourth phase focuses on building machine learning models. Here you build and assess several machine learning models and interpret their output. The tasks performed in this phase are described below:

· Select Modeling Techniques: For a given data mining objective, you need to select the modeling techniques that are appropriate for it. For example, if you are building a loan delinquency model, you might consider various classifiers such as Decision Tree, Random Forest, Logistic Regression, and Linear Discriminant Analysis.

· Generate Test Design: As a recommended practice, split the data into training, validation, and test sets: the training set is used to fit the models, the validation set to tune and compare them, and the test set to check the final model on unseen data.

· Build Model: The models are built using the algorithms available in the tool you have selected (for example, Python or R libraries). A model is often tuned through a set of hyperparameters, and choosing appropriate values for them helps you get a good model.

· Assess Model: The primary goal of this task is to compare the performance of the various models and select the one that gives the best results and meets the success criteria defined in the business understanding phase.

Sometimes you might have to revisit data collection and data preparation to add more meaningful data and prepare it again to further improve the accuracy of the model. Data collection, data preparation, and model building are iterative and can occur multiple times in a project.
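
As an illustration of the test design task, a data split can be sketched as follows. The function name `split_dataset`, the 60/20/20 proportions, and the seed are assumptions made for this example, not values prescribed by CRISP-DM.

```python
import random


def split_dataset(records, train=0.6, validation=0.2, seed=42):
    """Shuffle records and split them into train, validation, and test lists.

    The test share is whatever remains after train and validation.
    """
    if not 0 < train + validation < 1:
        raise ValueError("train + validation must leave room for a test set")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


records = list(range(100))  # stand-in for 100 prepared observations
train_set, validation_set, test_set = split_dataset(records)
print(len(train_set), len(validation_set), len(test_set))
```

Fixing the seed keeps the split identical across iterations, so that models built in different rounds are compared on the same data.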

Phase 5: Evaluation

The fifth phase focuses on the evaluation of the models for business purposes. The tasks performed in this phase are described below:

· Evaluate Results: In this step, you evaluate whether the models meet the business success criteria.

· Review Process: It is essential to review the entire process followed to build the model and look for any omissions or important aspects that were missed before implementing the solution.

· Determine Next Steps: This step focuses on identifying the immediate next steps: whether to iterate over the previous steps or continue with the deployment of the model.
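
Evaluating results against success criteria can be sketched as below for the delinquency example. The labels and the minimum precision and recall values are hypothetical; in a real project the thresholds would be derived from the business success criteria agreed in the first phase.

```python
# Hypothetical labels for ten customers: 1 = delinquent, 0 = not delinquent.
actual = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # delinquent, flagged
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # not delinquent, flagged
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # delinquent, missed
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # not delinquent, not flagged

# These ratios assume at least one predicted and one actual positive.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(pairs)

# Hypothetical thresholds standing in for the business success criteria.
MIN_PRECISION = 0.7
MIN_RECALL = 0.7

ready_for_deployment = precision >= MIN_PRECISION and recall >= MIN_RECALL
next_step = "deploy" if ready_for_deployment else "iterate"
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
print("next step:", next_step)
```

Recall matters here because a missed delinquent customer is usually the more costly error, but which metric to prioritize is a business decision.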

Phase 6: Deployment

The last phase is deployment, where the solution offered by the model is put into use. In this phase, customers can view the results of the model and take further actions. The tasks performed in this phase are described below:

· Plan Deployment: A deployment plan is created that describes how the model will be deployed in production. It should cover the technique that will be used to deploy the model (embedded model, batch prediction, on-demand prediction, or a web service).

· Plan Monitoring and Maintenance: It is important to prepare and adopt a monitoring plan for the model in production, and to plan for retraining the model if issues are observed in production.

· Produce Final Report: A project report that documents the summary of the project, results of the data mining, and the benefits realized is prepared in this step.

· Review Project: The project report also documents the lessons learned, the challenges and issues faced, and recommendations for future reference.
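
One possible form of the monitoring plan is a check that flags the model for retraining when its recent accuracy drops. The sketch below is an assumption about how such a check might look: the class `AccuracyMonitor`, the window size, and the accuracy threshold are all hypothetical.

```python
from collections import deque


class AccuracyMonitor:
    """Track prediction accuracy over a sliding window of recent outcomes."""

    def __init__(self, window_size=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window_size)  # oldest outcomes drop off
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether a production prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Return accuracy over the window, or None if nothing was recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Flag retraining only once the window is full and accuracy is too low."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


# Usage with a small window and five hypothetical (predicted, actual) pairs.
monitor = AccuracyMonitor(window_size=5, min_accuracy=0.8)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1), (1, 1)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_retraining())
```

Waiting for a full window avoids raising the flag on the first few predictions, at the cost of reacting more slowly to a sudden drop.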

Isha Mistry
Jun, 21 2022
