Machine Learning Infrastructure and Best
Practices for Software Engineers
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author nor Packt
Publishing or its dealers and distributors, will be held liable for any damages
caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK
ISBN 978-1-83763-406-4
www.packtpub.com
Writing a book with a lot of practical examples requires a lot of extra time,
which is often taken from family and friends. I dedicate this book to my family
– Alexander, Cornelia, Viktoria, and Sylwia – who always supported and
encouraged me, and to my parents and parents-in-law, who shaped me to be
who I am.
– Miroslaw Staron
Contributors
I would like to thank my family for their support in writing this book. I would also
like to thank my colleagues from the Software Center program who provided me
with the ability to develop my ideas and knowledge in this area – in particular,
Wilhelm Meding, Jan Bosch, Ola Söder, Gert Frost, Martin Kitchen, Niels Jørgen
Strøm, and several other colleagues. One person who really ignited my interest in
this area is of course Mirosław “Mirek” Ochodek, to whom I am extremely grateful.
I would also like to thank the funders of my research, who supported my studies
throughout the years. I would like to thank my Ph.D. students, who challenged me
and encouraged me to always dig deeper into the topics. I’m also very grateful to
the reviewers of this book – Hongyi Zhang and Sushant K. Pandey, who provided
invaluable comments and feedback for the book. Finally, I would like to extend my
gratitude to my publishing team – Hemangi Lotlikar, Sushma Reddy, and Anant
Jaint – this book would not have materialized without you!
Preface
Part 1: Machine Learning Landscape in Software
Engineering
Feature engineering
Feature engineering for numerical data
PCA
t-SNE
ICA
Locally linear embedding
Linear discriminant analysis
Autoencoders
Feature engineering for image data
Summary
References
ML is not alone
The UI of an ML model
Data storage
Deploying an ML model for numerical data
Deploying a generative ML model for images
Deploying a code completion model as an
extension
Summary
References
Part 4: Ethical Aspects of Data Management and ML
System Development
Index
In this book, my goal is to show how machine learning models can be trained,
evaluated, and tested – both in the context of a small prototype and in the context
of a fully-fledged software product. The primary objective of this book is to bridge
the gap between theoretical knowledge and practical implementation of machine
learning in software engineering. It aims to equip you with the skills necessary to
not only understand but also effectively implement and innovate with AI and
machine learning technologies in your professional pursuits.
A significant portion of the book is dedicated to best practices. These practices are
not just theoretical guidelines but are derived from real-life experiences and case
studies that my research team discovered during our work in this field. These best
practices offer invaluable insights into handling common pitfalls and ensuring the
scalability, reliability, and efficiency of machine learning systems.
Furthermore, we delve into the ethics of data and machine learning algorithms.
We explore the theories behind ethics in machine learning, look closer into the
licensing of data and models, and finally, explore the practical frameworks that can
quantify bias in data and models in machine learning.
This book is not just a technical guide; it is a journey through the evolving
landscape of machine learning in software engineering. Whether you are a novice
eager to learn, or a seasoned professional seeking to enhance your skills, this
book aims to be a valuable resource, providing clarity and direction in the exciting
and ever-changing world of machine learning.
Who this book is for
This book is meticulously crafted for software engineers, computer scientists, and
programmers who seek practical applications of artificial intelligence and machine
learning in their field. The content is tailored to impart foundational knowledge on
working with machine learning models, viewed through the lens of a programmer
and system architect.
The book presupposes familiarity with programming principles, but it does not
demand expertise in mathematics or statistics. This approach ensures accessibility
to a broader range of professionals and enthusiasts in the software development
domain. If you have no prior experience in Python, you will need to acquire a basic understanding of the language. However, the
material is structured to facilitate a rapid and comprehensive grasp of Python
essentials. Conversely, for those proficient in Python but not yet seasoned in
professional programming, this book serves as a valuable resource for transitioning
into the realm of software engineering with a focus on AI and ML applications.
What this book covers
Chapter 1, Machine Learning Compared to Traditional Software, explores where
these two types of software systems are most appropriate. We learn about the
software development processes that programmers use to create both types of
software and we also learn about the classical four types of machine learning
software – rule-based, supervised, unsupervised, and reinforcement learning.
Finally, we also learn about the different roles of data in traditional and machine
learning software.
Chapter 4, Data Acquisition, Data Quality, and Noise, dives deeper into topics
related to data quality. We go through a theoretical model for assessing data
quality and we provide methods and tools to operationalize it. We also look into
the concept of noise in machine learning and how to reduce it by using different
tokenization methods.
Chapter 5, Quantifying and Improving Data Properties, dives deeper into the
properties of data and how to improve them. In contrast to the previous chapter,
we work on feature vectors rather than raw data. The feature vectors are already
a transformation of the data; therefore, we can change such properties as noise or
even change how the data is perceived. We focus on the processing of text, which
is an important part of many machine learning algorithms nowadays. We start by
understanding how to transform data into feature vectors using simple algorithms,
such as bag of words, so that we can work on feature vectors.
Chapter 6, Processing Data in Machine Learning Systems, dives deeper into the
ways in which data and algorithms are entangled. We talk a lot about data in
generic terms, but in this chapter, we explain what kind of data is needed in
machine learning systems. We explain the fact that all kinds of data are used in
numerical form – either as a feature vector or as more complex feature matrices.
Then, we will explain the need to transform unstructured data (e.g., text) into
structured data. This chapter will lay the foundations for going deeper into each
type of data, which is the content of the next few chapters.
Chapter 7, Feature Engineering for Numerical and Image Data, focuses on the
feature engineering process for numerical and image data. We start by going
through the typical methods such as Principal Component Analysis (PCA),
which we used previously for visualization. We then move on to more advanced
methods such as t-distributed Stochastic Neighbor Embedding (t-SNE) and Independent Component Analysis (ICA). What
we end up with is the use of autoencoders as a dimensionality reduction technique
for both numerical and image data.
Chapter 8, Feature Engineering for Natural Language Data, explores the first steps
that made the transformer (GPT) technologies so powerful – feature extraction
from natural language data. Natural language is a special kind of data source in
software engineering. With the introduction of GitHub Copilot and ChatGPT, it
became evident that machine learning and artificial intelligence tools for software
engineering tasks are no longer science fiction.
Chapter 10, Training and Evaluation of Classical ML Systems and Neural Networks,
goes a bit deeper into the process of training and evaluation. We start with the
basic theory behind different algorithms and then we show how they are trained.
We start with the classical machine learning models, exemplified by the decision
trees. Then, we gradually move toward deep learning where we explore both the
dense neural networks and some more advanced types of networks.
Chapter 12, Designing Machine Learning Pipelines and their Testing, describes how
the main goal of MLOps is to bridge the gap between data science and operations
teams, fostering collaboration and ensuring that machine learning projects can be
effectively and reliably deployed at scale. MLOps helps to automate and optimize
the entire machine learning life cycle, from model development to deployment and
maintenance, thus improving the efficiency and effectiveness of ML systems in
production. In this chapter, we learn how machine learning systems are designed
and operated in practice. The chapter shows how pipelines are turned into a
software system, with a focus on testing ML pipelines and their deployment at
Hugging Face.
Chapter 15, Ethics in Machine Learning Systems, focuses on the bias in machine
learning systems. We start by exploring sources of bias and briefly discussing
these sources. We then explore ways to spot biases, how to minimize them, and
finally, how to communicate potential biases to the users of our system.
Chapter 17, Summary and Where to Go Next, revisits all the best practices and
summarizes them per chapter. In addition, we also look into what the future of
machine learning and AI may bring to software engineering.
If you are using the digital version of this book, we advise you to type
the code yourself or access the code from the book’s GitHub repository
(a link is available in the next section). Doing so will help you avoid any
potential errors related to the copying and pasting of code.
We also have other code bundles from our rich catalog of books and videos
available at https://github.com/PacktPublishing/. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names,
filenames, file extensions, pathnames, dummy URLs, user input, and Twitter
handles. Here is an example: “The model itself is created one line above, in the
model = LinearRegression() line.”
A block of code is set as follows:
def fibRec(n):
    if n < 2:
        return n
    else:
        return fibRec(n-1) + fibRec(n-2)
Any command-line input or output is written as follows:
>python app.py
BEST PRACTICES
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us
at [email protected] and mention the book title in the subject of your
message.
Errata: Although we have taken every care to ensure the accuracy of our content,
mistakes do happen. If you have found a mistake in this book, we would be
grateful if you would report this to us. Please visit
www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the
internet, we would be grateful if you would provide us with the location address or
website name. Please contact us at [email protected] with a link to the
material.
If you are interested in becoming an author: If there is a topic that you have
expertise in and you are interested in either writing or contributing to a book,
please visit authors.packtpub.com.
Do you like to read on the go but are unable to carry your print books
everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry: now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your
favorite technical books directly into your application.
The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
https://packt.link/free-ebook/978-1-83763-406-4
That’s it! We’ll send your free PDF and other benefits to your email directly.
Part 1:Machine Learning Landscape in
Software Engineering
Traditionally, Machine Learning (ML) was considered to be a niche domain in
software engineering. No large software systems used statistical learning in
production. This began to change in the 2010s, when recommendation systems started to utilize large quantities of data – for example, to recommend movies, books, or music. With the rise of transformer technologies, the shift accelerated: commonly known products such as ChatGPT popularized these techniques and showed that they are no longer niche, but have entered mainstream software products and services. Software engineering needs to keep up, and we need to know how to create software based on these modern machine learning
models. In this first part of the book, we look at how machine learning changes
software development and how we need to adapt to these changes.
In this chapter, we’ll explore where these two types of software systems are most
appropriate. We’ll learn about the software development processes that
programmers use to create both types of software. We’ll also learn about the four
classical types of machine learning software – rule-based learning, supervised
learning, unsupervised learning, and reinforcement learning. Finally, we’ll learn
about the different roles of data in traditional and machine learning software – as
input to pre-programmed algorithms in traditional software and input to training
models in machine learning software.
The best practices introduced in this chapter provide practical guidance on when
to choose each type of software and how to assess the advantages and
disadvantages of these types. By exploring a few modern examples, we’ll
understand how to create an entire software system with machine learning
algorithms at the center.
The first pivotal moment was the focus on big data in the late 2000s and early
2010s. With the introduction of smartphones, companies started to collect and
process increasingly large quantities of data, mostly about our behavior online.
One of the companies that perfected this was Google, which collected data about
our searches, online behavior, and usage of Google’s operating system, Android.
As the volume of the collected data increased (and its speed/velocity), so did its
value and the need for its veracity – the five Vs. These five Vs – volume, velocity,
value, veracity, and variety – required a new approach to working with data. The
classical approach of relational databases (SQL) was no longer sufficient.
Relational databases became too slow in handling high-velocity data streams,
which gave way to map-reduce algorithms, distributed databases, and in-memory
databases. The classical approach of relational schemas became too constraining
for the variety of data, and gave way to NoSQL databases, which store documents.
The second pivotal moment was the rise of modern machine learning algorithms –
deep learning. Deep learning algorithms are designed to handle unstructured data
such as text, images, or music (compared to structured data in the form of tables
and matrices). Classical machine learning algorithms, such as regression, decision
trees, or random forest, require data in a tabular form. Each row is a data point,
and each column is one characteristic of it – a feature. The classical models are
designed to handle relatively small datasets. Deep learning algorithms, on the
other hand, can handle large datasets and find more complex patterns in the data
because of the power of large neural networks and their complex architectures.
First, we import a generic machine learning model from a library. This generic
model has all elements that are specific to it, but it is not trained to solve any
tasks. An example of such a model is a decision tree model, which is designed to
learn dependencies in data in the form of decisions (or data splits), which it uses
later for new data. To make this model somewhat useful, we need to train it. For
that, we need data, which we call the training data.
Second, we evaluate the trained model on new data, which we call the test data.
The evaluation process uses the trained model and applies it to check whether its
inferences are correct. To be precise, it checks to which degree the inferences are
correct. The training data is in the same format as the test data, but the content
of these datasets is different. No data point should be present in both.
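As a minimal sketch of these two steps (my own illustration, assuming scikit-learn and its bundled Iris dataset rather than any dataset from the book), training and evaluation could look like this:

# sketch: train a generic decision tree, then evaluate it on held-out test data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# split the data so that no data point is present in both sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# step 1: import a generic model and train it on the training data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# step 2: apply the trained model to the test data to check to which
# degree its inferences are correct
print(model.score(X_test, y_test))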
In the third step, we use the model as part of a software system. We develop
other non-machine learning components, and we connect them to the trained
model. The entire software system usually consists of data procurement
components, real-time validation components, data cleaning components, user
interfaces, and business logic components. All these components, including the
machine learning model, provide a specific functionality for the end user. Once the
software system has been developed, it needs to be tested, which is where the
input data comes into play. The input data is something that the end user inputs
to the system, such as by filling in a form. The input data is designed in such a
way that has both the input and expected output – to test whether the software
system works correctly.
Finally, the last step is to deploy the entire system. The deployment can be very
different, but most modern machine learning systems are organized into two parts
– the onboard/edge algorithms for non-machine learning components and the user
interface, and the offboard/cloud algorithms for machine learning inferences.
Although it is possible to deploy all parts of the system on the target device (both
machine learning and non-machine learning components), complex machine
learning models require significant computational power for good performance and
seamless user experience. The principle is simple – more data/complex data
means more complex models, which means that more computational power is
needed:
Figure 1.1 – Typical flow of machine learning software development
As shown in Figure 1.1, one of the crucial elements of the machine learning
software is the model, which is one of the generic machine learning models, such
as a neural network, that’s been trained on specific data. Such a model is used to
make predictions and inferences. In most systems, this kind of component – the
model – is often prototyped and developed in Python.
Models are trained for different datasets and, therefore, the core characteristic of
machine learning software is its dependence on that dataset. An example of such
a model is a vision system, where we train a machine learning algorithm such as a
convolutional neural network (CNN) to classify images of cats and dogs.
Since the models are trained on specific datasets, they perform best on similar
datasets when making inferences. For example, if we train a model to recognize
cats and dogs in 160 x 160-pixel grayscale images, the model can recognize cats
and dogs in such images. However, the same model will perform very poorly (if at
all!) if it needs to recognize cats and dogs in colorful images instead of grayscale
images – the accuracy of the classification will be low (close to 0).
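As a small illustration of this constraint (a sketch of my own, assuming the Pillow library; the 160 x 160 grayscale format comes from the example above), inputs can be forced into the training format before inference:

# sketch: convert an arbitrary image into the format the model was trained on
from PIL import Image
import numpy as np

def to_training_format(path):
    # "L" converts to grayscale; resize matches the 160 x 160 training images
    img = Image.open(path).convert("L").resize((160, 160))
    # scale pixels to [0, 1] and add batch and channel dimensions for a CNN
    return np.asarray(img, dtype=np.float32)[None, :, :, None] / 255.0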
On the other hand, when we develop and design traditional software systems, we
do not rely on data that much, as shown in Figure 1.2. This figure provides an
overview of a software development process for traditional, non-machine learning
software. Although it is depicted as a flow, it is usually an iterative process where
Steps 1 to 3 are done in cycles, each one ending with new functionality added to
the product.
The first step is developing the software system. This includes the development of
all its components – user interface, business logic (processing), handling of data,
and communication. The step does not involve much data unless the software
engineer creates data for testing purposes.
The second step is system testing, where we use input data to validate the
software system. In essence, this step is almost identical to testing machine
learning software. The input data is complemented with the expected outcome
data, which allows software testers to assess whether the software works
correctly.
The third step is to deploy the software. The deployment can be done in many
ways. However, if we consider traditional software that is similar in function to
machine learning software, the deployment is usually simpler. Unlike machine learning models, such software usually does not require deployment to the cloud:
Figure 1.2 - Typical flow of traditional software development
One of the main parts of traditional software is the algorithm, which is developed
by software engineers from scratch, based on the requirements or user stories.
The algorithm is usually written as a sequential set of steps that are implemented
in a programming language. Naturally, all algorithms use data to operate on it, but
they do it differently than machine learning systems. They do it based on the
software engineer’s design – if x, then y or something similar.
BEST PRACTICE #1
Use machine learning algorithms when your problem is focused on data, not on the algorithm.
However, if the problem requires traceability and control, use the traditional
approach. Examples of such systems are control software in cars (anti-lock
braking, engine control, and so on) and embedded systems.
If the problem requires new data to be generated based on the existing data, a
process known as data manipulation, use the machine learning approach.
Examples of such systems are image manipulation programs (DALL-E), text
generation programs, deep fake programs, and source code generation programs
(GitHub Copilot).
If the problem requires adaptation over time and optimization, use machine
learning software. Examples of such systems are power grid optimization software,
non-playable character behavior components in computer games, playlist
recommendation systems, and even GPS navigation systems in modern cars.
However, if the problem requires stability and traceability, use the traditional
approach. Examples of such systems are systems to make diagnoses and
recommendation systems in medicine, safety-critical systems in cars, planes, and
trains, and infrastructure controlling and monitoring systems.
Reinforcement learning: This is a group of models that are applied to data to solve a
particular task given a goal. For these models, we need to provide this goal in addition to the
data. It is called the reward function, and it is an expression that defines when we achieve the
goal. The model is trained based on this reward function. Examples of such models are
algorithms that play Go, Chess, or StarCraft. These algorithms are also used to solve hard
programming problems (AlphaCode) or optimize energy consumption.
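To make the role of the reward function concrete, here is a toy sketch of my own (not an example from the book): tabular Q-learning on a five-cell corridor, where the reward function simply encodes "reach the rightmost cell".

# sketch: tabular Q-learning; the reward function defines the goal
import random

n_states = 5                                 # cells 0..4; cell 4 is the goal
Q = [[0.0, 0.0] for _ in range(n_states)]    # Q[state][action]; 0=left, 1=right

def reward(state):
    # the reward function: 1 when the goal is reached, 0 otherwise
    return 1.0 if state == n_states - 1 else 0.0

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy choice between exploring and exploiting
        a = random.randrange(2) if random.random() < 0.1 \
            else max((0, 1), key=lambda i: Q[s][i])
        s_next = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
        # move the estimate toward the reward plus the discounted future value
        Q[s][a] += 0.5 * (reward(s_next) + 0.9 * max(Q[s_next]) - Q[s][a])
        s = s_next

# the learned policy should be "go right" (action 1) in every non-goal cell
print([max((0, 1), key=lambda i: Q[s][i]) for s in range(n_states - 1)])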
BEST PRACTICE #2
Before you start developing a machine learning system, do due diligence and identify the right
group of algorithms to use.
As each of these groups of models has different characteristics, solves different
problems, and requires different data, a mistake in selecting the right algorithm
can be costly. Supervised models are very good at solving problems related to
predictions and classifications. The most powerful models in this area can compete
with humans in selected areas – for example, GitHub Copilot can create programs
that can pass as human-written. Unsupervised models are very powerful if we
want to group entities and make recommendations. Finally, reinforcement learning
models are the best when we want to have continuous optimization with the need
to retrain models every time the data or the environment changes.
Although all these models are based on statistical learning, they are all
components of larger systems to make them useful. Therefore, we need to
understand how this probabilistic and statistical nature of machine learning goes
with traditional, digital software products.
The implementation is very simple and is based on the algorithm – in our case, the fibRec function. It is simplistic, but it has its limitations. The first is its recursive form, which consumes resources; although it can be rewritten iteratively (see the sketch below), it still suffers from the second problem – it is focused on the calculations and not on the data.
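As a quick sketch of that iterative variant (not code from the book), fibRec can be rewritten without recursion:

def fibIter(n):
    # keep only the last two values; runs in linear time, unlike fibRec
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a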
Now, let’s see how the machine learning implementation is done. I’ll explain this
by dividing it into two parts – data preparation and model training/inference:
In the case of machine learning software, we prepare data to train the algorithm.
In our case, this is the dfTrain DataFrame. It is a table that contains the numbers
that the machine learning algorithm needs to find the pattern.
Please note that we prepared two datasets – dfTrain, which contains the numbers
to train the algorithm, and lstSequence, which is the sequence of Fibonacci
numbers that we’ll find later.
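The preparation itself is not shown in this excerpt, but a sketch of it could look as follows (an assumption on my part; the column names mirror the 'first number', 'second number', and 'result' columns used by the training code below):

import pandas as pd

# each row pairs two consecutive Fibonacci numbers with the number that follows
fib = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
dfTrain = pd.DataFrame({
    'first number':  fib[:-2],
    'second number': fib[1:-1],
    'result':        fib[2:],
})

# the seed of the sequence that we'll extend later through inference
lstSequence = [0, 1]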
# import the generic model from scikit-learn
from sklearn.linear_model import LinearRegression

# algorithm to train
# here, we use linear regression
model = LinearRegression()

# now, the actual process of training the model
model.fit(dfTrain[['first number', 'second number']],
          dfTrain['result'])

# printing the score of the model, i.e., how good the model is when trained
print(model.score(dfTrain[['first number', 'second number']],
                  dfTrain['result']))
The magic of the entire code fragment is in the bold-faced code – the model.fit
method call. This method trains the linear regression model based on the data
we prepared for it. The model itself is created one line above, in the model =
LinearRegression() line.
Now, we can make inferences or create new Fibonacci numbers using the
following code fragment:
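The fragment is not reproduced in this excerpt; a sketch consistent with the description below (reconstructed around the model.predict() and lstSequence.append() calls the text refers to, so treat the details as assumptions) could be:

import pandas as pd

# sketch: generate ten new Fibonacci numbers through inference
for _ in range(10):
    # the two most recent numbers form the input features for the prediction
    features = pd.DataFrame({'first number':  [lstSequence[-2]],
                             'second number': [lstSequence[-1]]})
    nextNumber = model.predict(features)[0]
    # the sequence is recursive, so append the new number before
    # the next inference can be made
    lstSequence.append(round(nextNumber))

print(lstSequence)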
This code fragment contains a similar line to the previous one – model.predict().
This line uses the previously created model to make an inference. Since the
Fibonacci sequence is recursive, we need to add the newly created number to the
list before we can make the new inference, which is done in the
lstSequence.append() line.
Now, it is very important to emphasize the difference between these two ways of
solving the same problem. The traditional implementation exposes the algorithm
used to create the numbers. We do not see the Fibonacci sequence there, but we
can see how it is calculated. The machine learning implementation exposes the
data used to create the numbers. We see the first sequence as training data, but
we never see how the model creates that sequence. We do not know whether that
model is always correct – we would need to test it against the real sequence –
simply because we do not know how the algorithm works. This takes us to the
next part, which is about just that – probabilities.
The probability, which is the result of the model, means that the answer we
receive is a probability of something. For example, if we classify an image to check
whether it contains a dog or a cat, the result of this classification is a probability –
for example, there is a 93% probability that the image contains a dog and a 7%
probability that it contains a cat. This is illustrated in Figure 1.3:
To use these probabilistic results in other parts of the software, or other systems,
the machine learning software usually uses thresholds (for example, if x<0.5) to
provide only one result. Such thresholds specify which probability is acceptable to
be able to consider the results to belong to a specific class. For our example of
image classification, this probability would be 50% – if the probability of
identifying a dog in the image is larger than 50%, then the model states that the
image contains a dog (without the probability).
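A minimal sketch of such a threshold (my own illustration of the principle, not the book's code) could be:

# sketch: collapse a probabilistic output into a single class label
probabilities = {'dog': 0.93, 'cat': 0.07}   # example output of a classifier

THRESHOLD = 0.5
if probabilities['dog'] > THRESHOLD:
    label = 'dog'    # the rest of the system sees only the label, not the probability
else:
    label = 'cat'

print(label)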