- Acquisition: Acquiring the data is an important task. After the data sources have been identified (in the Business Understanding and Problem Discovery stage), the data sources must be combined into one data source. In some cases, aligning and combining the data sources may require in-depth domain knowledge and expertise.
- Pre-Processing: Acquired data may not be in a form that the tools and libraries used to build machine learning models can read. Pre-processing usually involves two steps. The first is to translate the data into a form relevant to the ML problem domain. For example, in text processing, if your original data is a set of scanned images, the first step is to convert those images of text documents into text that text algorithms can use. The second step is to transform the data either to support specific algorithms (for example, converting categorical variables to numerical ones) or to apply other techniques (for example, standardization or scaling) that help improve results.
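The two transformations named above, converting categorical variables to numbers and standardizing a numeric column, can be sketched in a few lines of plain Python (the colour values and numbers are made up for illustration):

```python
def one_hot(values):
    """Map each category to a binary indicator vector (one-hot encoding)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def standardize(values):
    """Rescale a numeric column to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

colours = ["red", "green", "red"]
encoded = one_hot(colours)            # categories sorted: ["green", "red"]
scaled = standardize([1.0, 2.0, 3.0])  # mean 2.0 maps to 0.0
```

Libraries such as scikit-learn provide production-grade versions of both transforms; the sketch only shows the idea.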
- Cleaning: In the real world, data is often corrupted for various reasons. Inaccurate sensor readings, inconsistencies across readings and invalid values are some of the issues that can be found. A thorough analysis of how to fix these values should be carried out with the help of a domain expert and a data expert.
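As a minimal sketch of one cleaning strategy, readings outside a physically valid range (the bounds and the error code below are illustrative) can be treated as corrupt and replaced with the median of the valid readings. In practice, the right fix should be chosen with a domain expert:

```python
def clean_readings(readings, lo, hi):
    """Replace out-of-range readings with the median of the valid ones."""
    valid = sorted(r for r in readings if lo <= r <= hi)
    if not valid:
        raise ValueError("no valid readings to impute from")
    mid = len(valid) // 2
    median = valid[mid] if len(valid) % 2 else (valid[mid - 1] + valid[mid]) / 2
    return [r if lo <= r <= hi else median for r in readings]

# A temperature sensor that occasionally reports an error code of -999:
cleaned = clean_readings([21.5, -999.0, 22.0, 23.5, -999.0], lo=-50, hi=60)
```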
- Pipeline: A pipeline is a sequence of tasks used to automate repeated work. These tasks may include extracting data from different sources into a single place, pre-processing the data into a form that can be stored and retrieved efficiently, and loading the data into a format that the machine learning algorithms can use.
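In its simplest form, a pipeline is just an ordered list of steps applied in sequence. The step functions below are hypothetical stand-ins for real extraction and pre-processing logic:

```python
def extract(raw):
    """Pull records from a source (here, just split CSV-like strings)."""
    return [line.split(",") for line in raw]

def preprocess(rows):
    """Convert fields to the types the model expects."""
    return [(label, float(value)) for label, value in rows]

def run_pipeline(raw, steps):
    """Feed the output of each step into the next one."""
    data = raw
    for step in steps:
        data = step(data)
    return data

result = run_pipeline(["a,1.5", "b,2.0"], [extract, preprocess])
```

Real pipelines add scheduling, error handling and persistence between steps, but the composition idea is the same.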
- Exploratory Data Analysis: Engage in exploratory data analysis to gain understanding of the data. Understanding the data is very important and can lead to better design and selection of the ML process. It also gives an in-depth understanding of what information could be useful in later steps.
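A first exploratory pass often starts with per-feature summary statistics. This toy sketch computes min, max and mean for each column of a numeric dataset; real EDA would also include plots, distributions and correlation checks:

```python
def summarize(rows):
    """Per-column min/max/mean for rows of numeric data."""
    cols = list(zip(*rows))  # transpose rows into columns
    return [
        {"min": min(c), "max": max(c), "mean": sum(c) / len(c)}
        for c in cols
    ]

stats = summarize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```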
- Feature Engineering: Feature engineering is a continuous process and would occur in various stages of an ML process. In the Data Acquisition and Understanding phase, feature engineering might be to identify the features that are irrelevant and do not add any information. For example, in high dimensional data, feature engineering may look to remove features whose variance is close to 0.
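The near-zero-variance filter mentioned above can be sketched directly: compute the variance of each feature (column) and flag those below a threshold, since they carry almost no information. The threshold value here is illustrative:

```python
def low_variance_columns(rows, threshold=1e-8):
    """Return indices of columns whose variance is below the threshold."""
    cols = list(zip(*rows))
    drop = []
    for i, col in enumerate(cols):
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        if var < threshold:
            drop.append(i)
    return drop

# The middle column is constant, so its variance is 0:
to_drop = low_variance_columns([[1.0, 5.0, 0.2],
                                [2.0, 5.0, 0.9],
                                [3.0, 5.0, 0.4]])
```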
- Data Split: The data should be split into a portion called ‘training data’, which is used to train the QuAM, and a separate portion called ‘test data’, which is used to evaluate how good the QuAM is.
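A minimal train/test split shuffles the data with a fixed seed (for reproducibility) and holds out a fraction for testing. The fraction and seed below are illustrative choices:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle a copy of the data and hold out a test fraction."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(10)), test_fraction=0.3)
```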
ML Modelling & Evaluation
- Algorithm Selection: Algorithm selection is the process of narrowing down a suite of algorithms that suit the problem and data. With so many algorithms across the various domains of ML, narrowing down helps us focus on a few selected algorithms and work with them to arrive at a solution.
- Feature engineering: This part of feature engineering focuses on preparing the dataset to be compatible with the ML algorithm. Data transformation, dimensionality reduction, handling of outliers and handling of categorical variables are some examples of feature engineering techniques.
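Outlier handling, for instance, can be as simple as clipping values to within a few standard deviations of the mean. This is only one of several techniques (winsorizing, IQR-based filtering, removal), and the cut-off of three standard deviations is a common but arbitrary convention:

```python
def clip_outliers(values, n_std=3.0):
    """Clip values to within n_std standard deviations of the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    lo, hi = mean - n_std * std, mean + n_std * std
    return [min(max(v, lo), hi) for v in values]

# Values already near the mean are untouched; extremes get pulled in:
clipped = clip_outliers([1.0, 2.0, 3.0, 2.0, 1.0, 100.0], n_std=1.0)
```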
- QuAM/Model Training: Once an algorithm has been selected and data is prepared for the algorithm, we need to build the Question and Answer Machine (QuAM) — a combination of an algorithm and data. In the ML world, a QuAM is also referred to as a model. QuAM training includes using the training data to learn a QuAM that can generalize well.
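A 1-nearest-neighbour classifier makes the "algorithm + data" definition of a QuAM concrete: "training" literally stores the data, and answering a question means finding the closest stored example. The class and labels below are purely illustrative:

```python
class OneNearestNeighbour:
    """A toy QuAM: the model is the algorithm plus the stored data."""

    def fit(self, points, labels):
        self.points, self.labels = points, labels
        return self

    def predict(self, query):
        # Squared Euclidean distance to every stored point
        distances = [
            sum((q - p) ** 2 for q, p in zip(query, point))
            for point in self.points
        ]
        return self.labels[distances.index(min(distances))]

quam = OneNearestNeighbour().fit([(0.0, 0.0), (5.0, 5.0)], ["cold", "hot"])
answer = quam.predict((4.0, 4.5))  # closest stored point is (5.0, 5.0)
```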
- Evaluation: Identifying the evaluation criteria is an important task. If your task is classification and the success of a model is defined by the number of correctly identified instances, then you can use accuracy as your evaluation metric. If there is a cost associated with false positives or false negatives, then other measures such as precision and recall can be used.
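The three metrics just mentioned can be computed directly from predicted and true labels for a binary task (1 = positive class); the example labels are made up:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    """Of everything predicted positive, how much really was positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / sum(p == 1 for p in y_pred)

def recall(y_true, y_pred):
    """Of everything truly positive, how much was found."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / sum(t == 1 for t in y_true)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
```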
- Refinement: Refine the model by identifying the best parameters for each of the algorithms you have trained the QuAM with. This step, called hyperparameter tuning, is used to find the optimal parameters for a model.
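Hyperparameter tuning in miniature: try each candidate value, score it on validation data, and keep the best. Here the "model" is a simple threshold classifier and the hyperparameter is the threshold; real tuning uses grid/random search or smarter methods over many parameters:

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def tune_threshold(xs, ys, candidates):
    """Pick the threshold with the best validation accuracy."""
    best_t, best_score = None, -1.0
    for t in candidates:
        preds = [1 if x > t else 0 for x in xs]
        score = accuracy(ys, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 0, 1, 1]
best_t, best_score = tune_threshold(xs, ys, [0.2, 0.5, 0.8])
```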
Delivery and Acceptance
This is the stage where we confirm if the ML problem is addressing the business problem. Having a conversation with the employer or client is vital to understanding if the business problem is addressed.
- ML Solution: From the delivery perspective, an ML solution is to be delivered to the client. The solution could take one, two or all three of the forms below:
- Prototype: Source code of the prototype is provided, along with readme and dependency files describing how to use it. The prototype need not be production-level code, but it should be clean, well commented and relatively stable, so that engineering teams can use it to build a product.
- Documentation: Good documentation always accompanies a prototype. Some of the technical details should be listed and explained.
- Project Report: This is a complete list of methodologies used and decisions taken along the lifetime of the project, as well as the reason(s) behind those decisions. This gives a high-level idea of what was achieved in the project.
- Knowledge Transfer: Identify the in-house training required for understanding the ML solution and present it to the client. This is the appropriate time to clarify questions regarding the ML solution, and it acts as a feedback checkpoint prior to incorporating the ML solution into full operation.
- Handoff: Turn over all materials to the client so that they can execute the solution.
Total MLPL Framework
Viewed together, the entire framework looks like this:
In the third and final part of the MLPL Series, we take a look at lifecycle switches and what you can realistically expect on your journey to an ML solution.
Amii’s MLPL Framework leverages already-existing knowledge from the teams at organizations like Microsoft, Uber, Google, Databricks and Facebook. The MLPL has been adapted by Amii teams to be a technology-independent framework that is abstract enough to be flexible across problem types and concrete enough for implementation. To fit our clients’ needs, we’ve also decoupled the deployment and exploration phases, provided process modules within each stage and defined key artifacts that result from each stage. The MLPL also ensures we’re able to capture any learnings that come about throughout the overall process but that aren’t used in the final model.
If you want to learn more about this and other interesting ML topics, we highly recommend Amii’s recently launched online course Machine Learning: Algorithms in the Real World Specialization, taught by our Managing Director of Applied Science, Anna Koop. Visit the Training page to learn about all of our educational offerings.