Author: Sascha Grunert, SUSE Software Solutions

Editor’s note: Sascha is part of SIG Release and is working on many other
different container runtime related topics. Feel free to reach him out on
Twitter @saschagrunert.

A story of data science-ing 90,000 GitHub issues and pull requests by using
Kubeflow, TensorFlow, Prow and a fully automated CI/CD pipeline.


Choosing the right steps when working in the field of data science is truly no
silver bullet. Most data scientists might have their custom workflow, which
could be more or less automated, depending on their area of work. Using
Kubernetes can be a tremendous enhancement when trying to automate
workflows on a large scale. In this blog post, I would like to take you on my
journey of doing data science while integrating the overall workflow into

The target of the research I did in the past few months was to find any
useful information about all those thousands of GitHub issues and pull requests
(PRs) we have in the Kubernetes repository. What I ended up with was a
fully automated, in Kubernetes running Continuous Integration (CI) and
Deployment (CD) data science workflow powered by Kubeflow and Prow.
You may not know both of them, but we get to the point where I explain what
they’re doing in detail. The source code of my work can be found in the
kubernetes-analysis GitHub repository, which contains everything source
code-related as well as the raw data. But how to retrieve this data I’m talking
about? Well, this is where the story begins.

Getting the Data

The foundation for my experiments is the raw GitHub API data in plain JSON
format. The necessary data can be retrieved via the GitHub issues
, which returns all pull requests as well as regular issues in the
REST API. I exported roughly 91000 issues and pull requests in
the first iteration into a massive 650 MiB data blob. This took me about 8
of data retrieval time because for sure, the GitHub API is rate
. To be able to put this data into a GitHub repository, I’d chosen
to compress it via xz(1). The result was a roundabout 25 MiB sized
, which fits well into the repository.

I had to find a way to regularly update the dataset because the Kubernetes
issues and pull requests are updated by the users over time as well as new ones
are created. To achieve the continuous update without having to wait 8 hours
over and over again, I now fetch the delta GitHub API data between the
last update and the current time. This way, a Continuous Integration job
can update the data on a regular basis, whereas I can continue my research with
the latest available set of data.

From a tooling perspective, I’ve written an all-in-one Python executable,
which allows us to trigger the different steps during the data science
experiments separately via dedicated subcommands. For example, to run an export
of the whole data set, we can call:

> ./main export
INFO | Getting GITHUB_TOKEN from environment variable
INFO | Dumping all issues
INFO | Pulling 90929 items
INFO | 1: Unit test coverage in Kubelet is lousy. (~30%)
INFO | 2: Better error messages if go isn't installed, or if gcloud is old.
INFO | 3: Need real cluster integration tests
INFO | 4: kubelet should know which containers it is managing
… [just wait 8 hours] …

To update the data between the last time stamp stored in the repository we can

> ./main export --update-api
INFO | Getting GITHUB_TOKEN from environment variable
INFO | Retrieving issues and PRs
INFO | Updating API
INFO | Got update timestamp: 2020-05-09T10:57:40.854151
INFO | 90786: Automated cherry pick of #90749: fix: azure disk dangling attach issue
INFO | 90674: Switch core master base images from debian to distroless
INFO | 90086: Handling error returned by request.Request.ParseForm()
INFO | 90544: configurable weight on the CPU and memory
INFO | 87746: Support compiling Kubelet w/o docker/docker
INFO | Using already extracted data from data/data.pickle
INFO | Loading pickle dataset
INFO | Parsed 34380 issues and 55832 pull requests (90212 items)
INFO | Updating data
INFO | Updating issue 90786 (updated at 2020-05-09T10:59:43Z)
INFO | Updating issue 90674 (updated at 2020-05-09T10:58:27Z)
INFO | Updating issue 90086 (updated at 2020-05-09T10:58:26Z)
INFO | Updating issue 90544 (updated at 2020-05-09T10:57:51Z)
INFO | Updating issue 87746 (updated at 2020-05-09T11:01:51Z)
INFO | Saving data

This gives us an idea of how fast the project is actually moving: On a Saturday
at noon (European time), 5 issues and pull requests got updated within literally 5

Funnily enough, Joe Beda, one of the founders of Kubernetes, created the
first GitHub issue mentioning that the unit test coverage is too low. The
issue has no further description than the title, and no enhanced labeling
applied, like we know from more recent issues and pull requests. But now we have
to explore the exported data more deeply to do something useful with it.

Exploring the Data

Before we can start creating machine learning models and train them, we have to
get an idea about how our data is structured and what we want to achieve in

To get a better feeling about the amount of data, let’s look at how many issues
and pull requests have been created over time inside the Kubernetes repository:

> ./main analyze --created
INFO | Using already extracted data from data/data.pickle
INFO | Loading pickle dataset
INFO | Parsed 34380 issues and 55832 pull requests (90212 items)

The Python matplotlib module should pop up a graph which looks like this:

created all

Okay, this looks not that spectacular but gives us an impression on how the
project has grown over the past 6 years. To get a better idea about the speed of
development of the project, we can look at the created-vs-closed metric. This
means on our timeline, we add one to the y-axis if an issue or pull request got
created and subtracts one if closed. Now the chart looks like this:

> ./main analyze --created-vs-closed

created vs closed all

At the beginning of 2018, the Kubernetes projects introduced some more enhanced
life-cycle management via the glorious fejta-bot. This automatically
closes issues and pull requests after they got stale over a longer period of
time. This resulted in a massive closing of issues, which does not apply to pull
requests in the same amount. For example, if we look at the created-vs-closed
metric only for pull requests.

> ./main analyze --created-vs-closed --pull-requests

created vs closed pull requests

The overall impact is not that obvious. What we can see is that the increasing
number of peaks in the PR chart indicates that the project is moving faster over
time. Usually, a candlestick chart would be a better choice for showing this kind
of volatility-related information. I’d also like to highlight that it looks like
the development of the project slowed down a bit in the beginning of 2020.

Parsing raw JSON in every analysis iteration is not the fastest approach to do
in Python. This means that I decided to parse the more important information,
for example the content, title and creation time into dedicated issue and
PR classes. This data will be pickle serialized into the repository
as well, which allows an overall faster startup independently of the JSON blob.

A pull request is more or less the same as an issue in my analysis, except that
it contains a release note.

Release notes in Kubernetes are written in the PRs description into a separate
release-note block like this:

I changed something extremely important and you should note that.

Those release notes are parsed by dedicated Release Engineering Tools like
during the release creation process and will be part of the various files and the Release Notes Website. That seems like a
lot of magic, but in the end, the quality of the overall release notes is much
higher because they’re easy to edit, and the PR reviewers can ensure that we
only document real user-facing changes and nothing else.

The quality of the input data is a key aspect when doing data science. I decided
to focus on the release notes because they seem to have the highest amount of
overall quality when comparing them to the plain descriptions in issues and PRs.
Besides that, they’re easy to parse, and we would not need to strip away
the various issue and PR template text noise.

Labels, Labels, Labels

Issues and pull requests in Kubernetes get different labels applied during its
life-cycle. They are usually grouped via a single slash (/). For example, we
have kind/bug and kind/api-change or sig/node and sig/network. An easy
way to understand which label groups exist and how they’re distributed across
the repository is to plot them into a bar chart:

> ./main analyze --labels-by-group

labels by group all top 25

It looks like that sig/, kind/ and area/ labels are pretty common.
Something like size/ can be ignored for now because these labels are
automatically applied based on the amount of the code changes for a pull
request. We said that we want to focus on release notes as input data, which
means that we have to check out the distribution of the labels for the PRs. This
means that the top 25 labels on pull requests are:

> ./main analyze --labels-by-name --pull-requests

labels by name pull requests top 25

Again, we can ignore labels like lgtm (looks good to me), because every PR
which now should get merged has to look good. Pull requests containing release
notes automatically get the release-note label applied, which enables further
filtering more easily. This does not mean that every PR containing that label
also contains the release notes block. The label could have been applied
manually and the parsing of the release notes block did not exist since the
beginning of the project. This means we will probably loose a decent amount of
input data on one hand. On the other hand we can focus on the highest possible
data quality, because applying labels the right way needs some enhanced maturity
of the project and its contributors.

From a label group perspective I have chosen to focus on the kind/ labels.
Those labels are something which has to be applied manually by the author of the
PR, they are available on a good amount of pull requests and they’re related to
user-facing changes as well. Besides that, the kind/ choice has to be done for
every pull request because it is part of the PR template.

Alright, how does the distribution of those labels look like when focusing only
on pull requests which have release notes?

> ./main analyze --release-notes-stats

release notes stats

Interestingly, we have approximately 7,000 overall pull requests containing
release notes, but only ~5,000 have a kind/ label applied. The distribution of
the labels is not equal, and one-third of them are labeled as kind/bug. This
brings me to the next decision in my data science journey: I will build a binary
classifier which, for the sake of simplicity, is only able to distinguish between
bugs (via kind/bug) and non-bugs (where the label is not applied).

The main target is now to be able to classify newly incoming release notes if
they are related to a bug or not, based on the historical data we already have
from the community.

Before doing that, I recommend that you play around with the ./main analyze -h
subcommand as well to explore the latest set of data. You can also check out all
the continuously updated assets I provide within the analysis repository.
For example, those are the top 25 PR creators inside the Kubernetes repository:

users by created pull request

Building the Machine Learning Model

Now we have an idea what the data set is about, and we can start building a first
machine learning model. Before actually building the model, we have to
pre-process all the extracted release notes from the PRs. Otherwise, the model
would not be able to understand our input.

Doing some first Natural Language Processing (NLP)

In the beginning, we have to define a vocabulary for which we want to train. I
decided to choose the TfidfVectorizer from the Python scikit-learn machine
learning library. This vectorizer is able to take our input texts and create a
single huge vocabulary out of it. This is our so-called bag-of-words,
which has a chosen n-gram range of (1, 2) (unigrams and bigrams). Practically
this means that we always use the first word and the next one as a single
vocabulary entry (bigrams). We also use the single word as vocabulary entry
(unigram). The TfidfVectorizer is able to skip words that occur multiple times
(max_df), and requires a minimum amount (min_df) to add a word to the
vocabulary. I decided not to change those values in the first place, just
because I had the intuition that release notes are something unique to a

Parameters like min_df, max_df and the n-gram range can be seen as some of
our hyperparameters. Those parameters have to be optimized in a dedicated step
after the machine learning model has been built. This step is called
hyperparameter tuning and basically means that we train multiple times with
different parameters and compare the accuracy of the model. Afterwards, we choose
the parameters with the best accuracy.

During the training, the vectorizer will produce a data/features.json which
contains the whole vocabulary. This gives us a good understanding of how such a
vocabulary may look like:


"hostname address",
"hostname and",
"hostname as",
"hostname being",
"hostname bug",


This produces round about 50,000 entries in the overall bag-of-words, which is
pretty much. Previous analyses between different data sets showed that it is
simply not necessary to take so many features into account. Some general data
sets state that an overall vocabulary of 20,000 is enough and higher amounts do
not influence the accuracy any more. To do so we can use the SelectKBest
feature selector to strip down the vocabulary to only choose the top features.
Anyway, I still decided to stick to the top 50,000 to not negatively influence
the model accuracy. We have a relatively low amount of data (appr. 7,000
samples) and a low number of words per sample (~15) which already made me wonder
if we have enough data at all.

The vectorizer is not only able to create our bag-of-words, but it is also able to
encode the features in term frequency–inverse document frequency (tf-idf)
format. That is where the vectorizer got its name, whereas the output of that
encoding is something the machine learning model can directly consume. All the
details of the vectorization process can be found in the source code.

Creating the Multi-Layer Perceptron (MLP) Model

I decided to choose a simple MLP based model which is built with the help of the
popular TensorFlow framework. Because we do not have that much input data,
we just use two hidden layers, so that the model basically looks like this:


There have to be multiple other hyperparameters to be taken into account
when creating the model. I will not discuss them in detail here, but they’re
important to be optimized also in relation to the number of classes we want to
have in the model (only two in our case).

Training the Model

Before starting the actual training, we have to split up our input data into
training and validation data sets. I’ve chosen to use ~80% of the data for
training and 20% for validation purposes. We have to shuffle our input data as
well to ensure that the model is not affected by ordering issues. The technical
details of the training process can be found in the GitHub sources. So now
we’re ready to finally start the training:

> ./main train
INFO | Using already extracted data from data/data.pickle
INFO | Loading pickle dataset
INFO | Parsed 34380 issues and 55832 pull requests (90212 items)
INFO | Training for label 'kind/bug'
INFO | 6980 items selected
INFO | Using 5584 training and 1395 testing texts
INFO | Number of classes: 2
INFO | Vocabulary len: 51772
INFO | Wrote features to file data/features.json
INFO | Using units: 1
INFO | Using activation function: sigmoid
INFO | Created model with 2 layers and 64 units
INFO | Compiling model
INFO | Starting training
Train on 5584 samples, validate on 1395 samples
Epoch 1/1000
5584/5584 - 3s - loss: 0.6895 - acc: 0.6789 - val_loss: 0.6856 - val_acc: 0.6860
Epoch 2/1000
5584/5584 - 2s - loss: 0.6822 - acc: 0.6827 - val_loss: 0.6782 - val_acc: 0.6860
Epoch 3/1000
Epoch 68/1000
5584/5584 - 2s - loss: 0.2587 - acc: 0.9257 - val_loss: 0.4847 - val_acc: 0.7728
INFO | Confusion matrix:
[[920 32]
[291 152]]
INFO | Confusion matrix normalized:
[[0.966 0.034]
[0.657 0.343]]
INFO | Saving model to file data/model.h5
INFO | Validation accuracy: 0.7727598547935486, loss: 0.48470408514836355

The output of the Confusion Matrix shows us that we’re pretty good on
training accuracy, but the validation accuracy could be a bit higher. We now
could start a hyperparameter tuning to see if we can optimize the output of the
model even further. I will leave that experiment up to you with the hint to the
./main train --tune flag.

We saved the model (data/model.h5), the vectorizer (data/vectorizer.pickle)
and the feature selector (data/selector.pickle) to disk to be able to use them
later on for prediction purposes without having a need for additional training

A first Prediction

We are now able to test the model by loading it from disk and predicting some
input text:

> ./main predict --test
INFO | Testing positive text:
Fix concurrent map access panic
Don't watch .mount cgroups to reduce number of inotify watches
Fix NVML initialization race condition
Fix brtfs disk metrics when using a subdirectory of a subvolume
INFO | Got prediction result: 0.9940581321716309
INFO | Matched expected positive prediction result
INFO | Testing negative text:
action required
1. Currently, if users were to explicitly specify CacheSize of 0 for
KMS provider, they would end-up with a provider that caches up to
1000 keys. This PR changes this behavior.
Post this PR, when users supply 0 for CacheSize this will result in
a validation error.
2. CacheSize type was changed from int32 to *int32. This allows
defaulting logic to differentiate between cases where users
explicitly supplied 0 vs. not supplied any value.
3. KMS Provider's endpoint (path to Unix socket) is now validated when
the EncryptionConfiguration files is loaded. This used to be handled
by the GRPCService.
INFO | Got prediction result: 0.1251964420080185
INFO | Matched expected negative prediction result

Both tests are real-world examples which already exist. We could also try
something completely different, like this random tweet I found a couple of
minutes ago:

./main predict "My dudes, if you can understand SYN-ACK, you can understand consent"
INFO | Got prediction result: 0.1251964420080185
ERROR | Result is lower than selected threshold 0.6

Looks like it is not classified as bug for a release note, which seems to work.
Selecting a good threshold is also not that easy, but sticking to something >
50% should be the bare minimum.

Automate Everything

The next step is to find some way of automation to continuously update the model
with new data. If I change any source code within my repository, then I’d like
to get feedback about the test results of the model without having a need to run
the training on my own machine. I would like to utilize the GPUs in my
Kubernetes cluster to train faster and automatically update the data set if a PR
got merged.

With the help of Kubeflow pipelines we can fulfill most of these
requirements. The pipeline I built looks like this:


First, we check out the source code of the PR, which will be passed on as output
artifact to all other steps. Then we incrementally update the API and internal
data before we run the training on an always up-to-date data set. The prediction
test verifies after the training that we did not badly influence the model with
our changes.

We also built a container image within our pipeline. This container image
copies the previously built model, vectorizer, and selector into a container and
runs ./main serve. When doing this, we spin up a kfserving web server,
which can be used for prediction purposes. Do you want to try it out by yourself? Simply
do a JSON POST request like this and run the prediction against the endpoint:

> curl 
-d '{"text": "my test text"}'
{"result": 0.1251964420080185}

The custom kfserving implementation is pretty straightforward, whereas the
deployment utilizes Knative Serving and an Istio ingress gateway
under the hood to correctly route the traffic into the cluster and provide the
right set of services.

The commit-changes and rollout step will only run if the pipeline runs on
the master branch. Those steps make sure that we always have the latest data
set available on the master branch as well as in the kfserving deployment. The
rollout step creates a new canary deployment, which only accepts 50% of the
incoming traffic in the first place. After the canary got deployed successfully,
it will be promoted as the new main instance of the service. This is a great way
to ensure that the deployment works as intended and allows additional testing
after rolling out the canary.

But how to trigger Kubeflow pipelines when creating a pull request? Kubeflow has
no feature for that right now. That’s why I decided to use Prow,
Kubernetes test-infrastructure project for CI/CD purposes.

First of all, a 24h periodic job ensures that we have at least daily
up-to-date data available within the repository. Then, if we create a pull
request, Prow will run the whole Kubeflow pipeline without committing or rolling
out any changes. If we merge the pull request via Prow, another job runs on the
master and updates the data as well as the deployment. That’s pretty neat, isn’t

Automatic Labeling of new PRs

The prediction API is nice for testing, but now we need a real-world use case.
Prow supports external plugins which can be used to take action on any GitHub
event. I wrote a plugin which uses the kfserving API to make predictions
based on new pull requests. This means if we now create a new pull request in
the kubernetes-analysis repository, we will see the following:

pr 1

pr 2

Okay cool, so now let’s change the release note based on a real bug from the
already existing dataset:

pr 3

pr 4

The bot edits its own comment, predicts it with round about 90% as kind/bug
and automatically adds the correct label! Now, if we change it back to some
different – obviously wrong – release note:

pr 5

pr 6

The bot does the work for us, removes the label and informs us what it did!
Finally, if we change the release note to None:

pr 7

pr 8

The bot removed the comment, which is nice and reduces the text noise on the PR.
Everything I demonstrated is running inside a single Kubernetes cluster, which
would make it unnecessary at all to expose the kfserving API to the public. This
introduces an indirect API rate limiting because the only usage would be
possible via the Prow bot user.

If you want to try it out for yourself, feel free to open a new test
in kubernetes-analysis. This works because I enabled the plugin
also for issues rather than only for pull requests.

So then, we have a running CI bot which is able to classify new release notes
based on a machine learning model. If the bot would run in the official
Kubernetes repository, then we could correct wrong label predictions manually.
This way, the next training iteration would pick up the correction and result in
a continuously improved model over time. All totally automated!


Thank you for reading down to here! This was my little data science journey
through the Kubernetes GitHub repository. There are a lot of other things to
optimize, for example introducing more classes (than just kind/bug or nothing)
or automatic hyperparameter tuning with Kubeflows Katib. If you have any
questions or suggestions, then feel free to get in touch with me anytime. See you

Read more


Please enter your comment!
Please enter your name here