Getting Started with Cloud-native AI using Kubeflow
In this article, I would like to introduce you to Kubeflow: a complete, cloud-native platform that simplifies AI operations. Join me in setting up Kubeflow on GKE for your organization and get started with cloud-native AI today.
Prerequisites
- Up-to-date versions of git, kubectl and kustomize installed
- Up-to-date version of gcloud and the gke-gcloud-auth-plugin component installed and initialized
- A Google Cloud project with sufficient funds to host the GKE cluster
- A domain name and the capabilities to create DNS records for that domain
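To quickly verify that everything is installed, you can run the commands below. The last one only applies if gcloud manages its own components (e.g. when installed via the official installer); if you installed gcloud through a package manager, install the plugin with that package manager instead.

git version
kubectl version --client
kustomize version
gcloud version
gcloud components install gke-gcloud-auth-plugin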
Background
AI on Kubernetes
Over the course of the last decade, Kubernetes has established itself as one of the foundational pillars of modern cloud computing. Countless enterprises and organizations have adopted it as a platform to serve their internal and external IT infrastructure, benefiting from its declarative management, self-healing and scaling capabilities. But the advance of Artificial Intelligence (AI) in recent years has presented the community with several challenges.
While containerization, workload scheduling, horizontal scaling and the general abstraction of compute resources are all relevant to AI engineers, Kubernetes was designed first and foremost with web applications in mind, not machine-learning training and inference serving. As a result, the primitives Kubernetes provides for managing workloads have turned out to be unintuitive for people running AI workloads who don't already have a deep understanding of the system. Furthermore, AI workloads have very different performance characteristics from the web and database servers we are used to: they are generally very resource-hungry, tightly coupled and often short-lived.
While advanced techniques, like cluster autoscaling, can help with some of these challenges, there is a clear need for Kubernetes primitives representing higher level concepts that AI researchers and data scientists are more familiar with.
Why Kubeflow
This is where Kubeflow enters the picture. Originally developed by Google, Kubeflow is essentially a collection of open-source projects that each provide a solution for a different step of the machine-learning workflow. However, Kubeflow also positions itself as a complete platform, integrating all of those projects into a cohesive, easy-to-use solution while providing additional features and security.
Kubeflow makes it easy to train and serve models, supporting most of the commonly used machine-learning frameworks. It also helps researchers codify their machine-learning workflows as “pipelines”, leading to faster and more reproducible model development cycles.
Kubeflow Platform Architecture
If all of that sounds to you like a very complex system with a lot of moving parts, for better or for worse, you are right. Kubeflow is composed of several components and is built on a base of quite a few third-party projects:
- cert-manager (generating TLS certificates for admission webhooks)
- Istio (internal routing, authorization)
- OAuth2-Proxy (OIDC client)
- Dex (authentication)
- Knative (serving models)
While distributions tailored for specific cloud environments exist, I have found that those actually add even more complexity to this already complex system. Therefore, we are going to deploy a “vanilla” distribution, which is maintained by the Kubeflow contributors at https://github.com/kubeflow/manifests.
Cluster Setup
In the following steps, we will create a Kubernetes cluster with several nodes including some with GPUs. Please make sure you are familiar with the costs involved and ensure that you are authorized to spend the required funds, before you continue.
Considerations
But before we can install anything in a Kubernetes cluster, we will need, well, a cluster! So, let's head over to the Google Cloud Console and create one.
Google Cloud makes it very easy to create Kubernetes clusters with the Google Kubernetes Engine (GKE), but our installation has some special requirements that need to be taken care of. Here are our considerations:
- Resource requirements: Kubeflow itself is quite resource heavy, and the specific requirements will change over time, because workloads will be created and deleted as your users work with it.
- Accelerators: You will probably need to expose one or more GPUs or TPUs in your cluster, but you probably don't want one on all of your general-purpose nodes.
- Permissions: In order to store model artifacts in Google Cloud Storage (GCS), at least some of your nodes need the WRITE permission for GCS.
Configuration
To create your cluster using the Google Cloud Console UI, go to https://console.cloud.google.com/kubernetes/add and make sure the correct project is selected in the top left. Then, choose a name and region and configure your node pools:
- The first node pool (let's call it default) should be configured with:
  - “Number of nodes (per zone)” set to 4 (12 nodes total is a good start for a Kubeflow installation)
  - “Enable cluster autoscaler” checked
  - “Size limit type” set to “Total limits”
  - “Minimum number of nodes” set to 0
  - “Maximum number of nodes” set to 16
  - Machine type e2-medium (or use a different one if desired)
  - “Enable nodes on spot VMs” checked (recommended)
- Add a second node pool using the button in the top bar (let's call this one gpu) and configure it with:
  - “Number of nodes (per zone)” set to 0 (nodes in this group should only be provisioned on-demand)
  - “Enable cluster autoscaler” checked
  - “Size limit type” set to “Total limits”
  - “Minimum number of nodes” set to 0
  - “Maximum number of nodes” set to 3 (or more, depending on anticipated load)
  - “Machine configuration family” set to “GPUs”
  - “GPU type” set to L4 and “Number of GPUs” set to 1
  - “GPU Driver installation” set to “Google-managed” and driver version set to “Latest”
  - Machine type g2-standard-4 (or use a different one if desired)
  - “Enable nodes on spot VMs” checked (again, recommended)
- Under “Security” on the left navigation, select “Set access for each API” and change “Storage” to “Read Write”.
To provision your cluster, click “Create” on the bottom of the page and wait for your cluster to be ready.
If you get an error like “accelerator l4 is not available in region X”, go back to the “Cluster basics” page and check “Specify default node locations”. Then, check all the regions except for the one mentioned in the error message and try creating the cluster again.
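If you prefer working from the command line instead of the console UI, a roughly equivalent cluster can be created with gcloud. Treat the following as a sketch rather than an exact recipe: flag availability can vary between gcloud versions, and the placeholders need to be replaced with your own values.

# Create the cluster with the general-purpose node pool (GKE will name it "default-pool")
gcloud container clusters create "<cluster name>" \
  --project "<cluster project id>" \
  --region "<cluster region>" \
  --num-nodes 4 \
  --machine-type e2-medium \
  --enable-autoscaling --total-min-nodes 0 --total-max-nodes 16 \
  --scopes gke-default,storage-rw \
  --spot

# Add the GPU node pool with one NVIDIA L4 per node and Google-managed drivers
gcloud container node-pools create gpu \
  --project "<cluster project id>" \
  --cluster "<cluster name>" \
  --region "<cluster region>" \
  --num-nodes 0 \
  --machine-type g2-standard-4 \
  --accelerator "type=nvidia-l4,count=1,gpu-driver-version=latest" \
  --enable-autoscaling --total-min-nodes 0 --total-max-nodes 3 \
  --scopes gke-default,storage-rw \
  --spot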
Connect to Your Cluster
Finally, to add the new cluster to your kubectl configuration, run
gcloud container clusters get-credentials "<cluster name>" \
--region "<cluster region>" \
--project "<cluster project id>"
Verify that the configuration works by running a command such as kubectl get nodes.
Need help running Kubeflow in production?
Book a call with us to discuss your Kubernetes deployment and scaling needs for your Cloud-Native AI.
Kubeflow Installation
Basic Installation
Although Kubeflow is a very complex system, its maintainers provide a Kustomize manifest that contains everything required.
You can find it in the kubeflow/manifests repository linked above.
To use this manifest, you need to clone the repository locally.
Additionally, make sure to check out a tagged release of the repository.
At the time of writing this article, the latest release is v1.9.1, so let's use that version.
git clone git@github.com:kubeflow/manifests.git
cd manifests
git checkout v1.9.1
Since this manifest contains several Custom Resource Definitions (CRDs), related controllers, webhook configurations and Custom Resources (CRs), simply applying it usually does not work, because once a webhook configuration exists for a CRD, the appropriate controller must be running before a CR can be created. A simple workaround for this issue is to just apply the manifests multiple times until there is no error. To do this conveniently, you can execute the command in a loop, until it terminates with exit code 0:
Bash:
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done

Fish:
while not kustomize build example | kubectl apply -f -
echo "Retrying to apply resources"
sleep 20
end
After this command has finished, it may take several minutes for the complete platform to become available.
You can observe the status of all components by running kubectl get pods -A or kubectl get pods -n kubeflow.
Once all pods are reported as running and ready, Kubeflow has been successfully installed!
In a terminal, run kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80.
Then, open http://localhost:8080/ in a browser.
Sign in with username user@example.com and password 12341234.
Exposing Via Ingress
If you just want to see what Kubeflow looks like or experiment a little, you can already stop here.
Kubeflow is usable behind kubectl port-forward.
However, you probably intend other users to join the platform, and expecting everyone to install kubectl and have access to the cluster API would be unreasonable.
Fortunately, exposing Kubeflow is pretty straightforward with an Ingress; we just need to jump through some extra hoops to make everything work in GCP.
Specifically, we will also need:
- A BackendConfig that tells GCP how to health-check the service,
- a ManagedCertificate to provide a TLS certificate signed by a trusted Certificate Authority (CA) and
- a FrontendConfig to configure the load balancer to automatically redirect HTTP to HTTPS.
Finally, we need to patch the istio-ingressgateway service to use the BackendConfig and change its type to NodePort.
At this point you should decide which DNS name to assign to your Kubeflow installation.
For this example, I'm going to use kf.example-company.com, so make sure to replace it with your chosen name.
To start with, we will create a new directory (with an ingress subdirectory) inside the directory where you cloned the kubeflow/manifests repository before: mkdir -p custom/ingress.
We do this because it is best practice to not modify any resources in place but instead create a new overlay and make changes there.
Now create the following files; the path of each file is given above its contents:
custom/ingress/frontend-config.yaml:

apiVersion: networking.gke.io/v1beta1
kind: FrontendConfig
metadata:
  name: kubeflow
spec:
  redirectToHttps:
    enabled: true
custom/ingress/managed-certificate.yaml:

apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata:
  name: kubeflow
spec:
  domains:
    - kf.example-company.com
custom/ingress/ingress.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kubeflow
  annotations:
    networking.gke.io/managed-certificates: kubeflow
    networking.gke.io/v1beta1.FrontendConfig: kubeflow
spec:
  rules:
    - host: kf.example-company.com
      http:
        paths:
          - pathType: Prefix
            path: '/'
            backend:
              service:
                name: istio-ingressgateway
                port:
                  number: 80
custom/ingress/backend-config.yaml:

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: istio-healthcheck
spec:
  healthCheck:
    type: HTTP
    checkIntervalSec: 15
    port: 15021
    requestPath: /healthz/ready
custom/ingress/kustomization.yaml:

namespace: istio-system
resources:
  - frontend-config.yaml
  - managed-certificate.yaml
  - ingress.yaml
  - backend-config.yaml
custom/istio-ingressgateway-service-patch.yaml:

apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway
  namespace: istio-system
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    beta.cloud.google.com/backend-config: '{"default": "istio-healthcheck"}'
spec:
  type: NodePort
custom/kustomization.yaml:

resources:
  - ../example
  - ingress
patches:
  - path: istio-ingressgateway-service-patch.yaml
Apply the new manifest by running kustomize build custom | kubectl apply -f -.
Since Kubeflow is already installed, there is no need to do this in a loop, but it will still take a few minutes to complete.
After the command has completed, open https://console.cloud.google.com/kubernetes/ingresses in your web browser, and select the ingress you just created.
Note down the IP address and use it to create a DNS record for this Kubeflow installation.
The exact procedure for this is obviously different for each domain registrar, but in the end, your record should look something like this:
kf.example-company.com. 60 IN A <your ingress IP address>
It will take some time for GCP to register the load balancer backends as healthy and to obtain the TLS certificate from the CA.
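You can keep an eye on both from your terminal. Assuming dig is installed and you kept the resource name kubeflow from the manifest above, the ManagedCertificate status should eventually switch to “Active”:

# verify that the DNS record has propagated
dig +short kf.example-company.com

# check the provisioning status of the Google-managed certificate
kubectl get managedcertificate kubeflow -n istio-system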
Once everything is done, enter the domain in your browser, and you will be able to log in, just like before.
Since our Kubeflow deployment is now reachable from the public internet, you will probably want to set up a more robust authentication flow than the built-in default username and password.
Of course, you could do this by extending the staticPasswords list in the Dex configuration, but where Dex really shines is in delegating authentication to an external Identity Provider (IdP), such as Google, GitHub, Microsoft or basically any provider that implements the OIDC protocol.
So let's do that!
Configuring External Authentication
Like many other organizations, we use several Microsoft products, so it makes sense for us to use Azure Active Directory (AAD) here. But Dex supports a wide variety of SSO providers, including generic SAML and OIDC, so feel free to pick one that works for you.
First, visit AAD App Registrations in your browser and create a new registration.
For the redirect URI, choose the platform “Web” and enter https://kf.example-company.com/dex/callback
(replace with your chosen domain).
Then press “Register”.
Note down the “Application (client) ID” and “Directory (tenant) ID”.
In the side navigation, select “Certificates & secrets”, go to “Client secrets”, press “New client secret” and follow the directions.
Note down the client secret value.
Now we just need to configure Dex to act as this app registration. To do so, create the file custom/microsoft-auth/kustomization.yaml with the following content:

apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
patches:
  - path: config-map.yaml
Next, copy common/dex/overlays/oauth2-proxy/config-map.yaml to custom/microsoft-auth/config-map.yaml.
In your copy, make the following changes:
- Change issuer to https://kf.example-company.com/dex
- Change enablePasswordDB to false
- Remove the staticPasswords list
- Add the microsoft connector, as seen below (replace the client ID, tenant ID and client secret with the ones you noted down before), or add your own connector if you chose to use a different IdP
connectors:
  - type: microsoft
    id: microsoft
    name: Microsoft
    config:
      clientID: xxx
      clientSecret: xxx
      tenant: xxx
      redirectURI: https://kf.example-company.com/dex/callback
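For orientation, here is a rough sketch of what the patched custom/microsoft-auth/config-map.yaml might end up looking like. The surrounding structure (ConfigMap name, namespace and the remaining Dex options) is taken from the upstream manifests at the time of writing and may differ in your checkout, so treat this as illustrative only:

apiVersion: v1
kind: ConfigMap
metadata:
  name: dex
  namespace: auth
data:
  config.yaml: |
    issuer: https://kf.example-company.com/dex
    storage:
      type: kubernetes
      config:
        inCluster: true
    enablePasswordDB: false
    connectors:
      - type: microsoft
        id: microsoft
        name: Microsoft
        config:
          clientID: xxx
          clientSecret: xxx
          tenant: xxx
          redirectURI: https://kf.example-company.com/dex/callback
    # any other keys from the upstream file (web, oauth2, staticClients, ...) stay unchanged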
This is enough for Dex to work with the Microsoft IdP, but because we changed the issuer field in the Dex configuration, it now identifies as a different issuer internally.
So, we have to update the oauth2-proxy configuration to accept the new issuer name.
For this, update the custom/microsoft-auth/kustomization.yaml file you created before, adding a patch for the RequestAuthentication resource:

apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
patches:
  - path: config-map.yaml
  - target:
      kind: RequestAuthentication
      name: dex-jwt
      namespace: istio-system
    patch: |
      - op: replace
        path: /spec/jwtRules/0/issuer
        value: "https://kf.example-company.com/dex"
Then, copy common/oauth2-proxy/base/oauth2_proxy.cfg to custom/oauth2_proxy.cfg and replace http://dex.auth.svc.cluster.local:5556/dex with https://kf.example-company.com/dex everywhere.
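If you want to script that replacement, a simple sed invocation does the trick (GNU sed shown; on macOS/BSD sed, use -i '' instead of -i):

sed -i 's|http://dex.auth.svc.cluster.local:5556/dex|https://kf.example-company.com/dex|g' custom/oauth2_proxy.cfg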
Finally, update the root kustomization file, custom/kustomization.yaml:

resources:
  - ../example
  - ingress
components:
  # import the dex config we just created as a *component*
  - microsoft-auth
patches:
  - path: istio-ingressgateway-service-patch.yaml
# make sure to specify the *merge* behavior here, otherwise Kustomize will report a name conflict error
configMapGenerator:
  - name: oauth2-proxy
    namespace: oauth2-proxy
    files:
      - oauth2_proxy.cfg
    behavior: merge
Again, apply the changes by running kustomize build custom | kubectl apply -f -.
Finally, restart the dex deployment:
kubectl rollout restart -n auth deployment dex
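You can wait for the restart to finish before moving on:

kubectl rollout status -n auth deployment dex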
Open the site in your browser (make sure to clear your cookies if you have signed in with the example credentials before) and log in. On your first login, you will be prompted to authorize Dex to access personal information stored in your Microsoft account, which you must agree to.
Enabling Registration Flow
You should now be able to log in to Kubeflow with your Microsoft account, but as you will quickly realize, you can't access any namespaces!
RBAC in Kubeflow is an evolving topic and one solution likely won't work for every use case, but the simplest way to give users access to a namespace is by enabling the “Registration Flow” feature on the centraldashboard component.
To do this, add the following file as custom/centraldashboard-deployment-patch.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: centraldashboard
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
        - name: centraldashboard
          env:
            - name: REGISTRATION_FLOW
              value: 'true'
Don't forget to add this patch to your kustomization.yaml:
resources:
  - ../example
  - ingress
components:
  - microsoft-auth
patches:
  - path: istio-ingressgateway-service-patch.yaml
  # here we enable the registration flow feature
  - path: centraldashboard-deployment-patch.yaml
configMapGenerator:
  - name: oauth2-proxy
    namespace: oauth2-proxy
    files:
      - oauth2_proxy.cfg
    behavior: merge
And apply the changes by running kustomize build custom | kubectl apply -f -.
Once the command has completed, reload the page in your browser, and you should be presented with the Kubeflow registration flow. Complete it, and you will see that you are now “owner” of a namespace!
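If you are curious, you can also confirm from the cluster side that a Profile resource was created for your user (the profile name is derived from your account, so yours will differ):

kubectl get profiles.kubeflow.org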
Excursion: Exploring distributed training with Kubeflow
We provide some additional information on how one can use Kubeflow in our glasskube/label-prediction repo, where we tried to create a small machine learning project ourselves.
Final Thoughts
This concludes our guide on how to set up vanilla Kubeflow on GCP. If you followed along, you now have a complete and functional Kubeflow installation running on a Kubernetes cluster that automatically scales up and down to accommodate your users' CPU and GPU requirements. We exposed the deployment using a GCP load balancer and secured it using Microsoft as our IdP.
While the instance we created might be opinionated in some respects, I believe that it should serve as a reasonable foundation on which you can build your own variant. Check out our dedicated Kustomize tutorial if you want to learn more about how you can further customize your Kubeflow deployment. If you want to learn more about Kubeflow, why not check out one of the official user-guides: https://www.kubeflow.org/docs/started/introduction/
Need help running Kubeflow in production?
Book a call with us to discuss your Kubernetes deployment and scaling needs for your Cloud-Native AI.