Commit 5a334da5 authored by Tony Wildish's avatar Tony Wildish

Merge branch 'master' into 'master'

Add GCP Slurm demo

See merge request TSI/gcp-ace!10
parents 10a79fb8 04c7ef29
......@@ -6,3 +6,4 @@ env.sh
download
ssh.sh
sync-from-vm.sh
list-of*.txt
......@@ -52,6 +52,10 @@ In your VM, set an environment variable to hold the cookie:
## Downloading an entire specialisation or professional certificate
The hard part here is getting the certification 'slug'. The 'slug' is the part of the URL in the browser that identifies the course or specialisation/professional certification. The only way I've found is to go into the web page for the specialisation and then grub around in the developer view in Safari to pull it out of the data files.
Specifically, look at the 'Sources' view, under www.coursera.org -> Fetches, then look in the graphqlBatch files (probably the second to last file). From the top, you'll see a json object data -> XdpV1Resource -> get -> xdpMetadata -> sdpMetadata -> slug. That's the magic word!
There's another way to get course information, and that's to use the Coursera API. That's documented, somewhat, at https://build.coursera.org/app-platform/catalog. Or you can reverse-engineer the coursera-dl code to figure it out. That leads to scripts like **get-course-list.sh** and **get-specialisation-list.sh**. You can investigate those for yourself if you want.
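If you'd rather poke at the API directly, a single page of the catalogue can be fetched with curl; this is the same endpoint those scripts page through, and the jq filter is just one way to pull out the slugs:
```
# Fetch the first 100 courses from the catalogue API and print their slugs
curl -s "https://api.coursera.org/api/courses.v1?start=0&limit=100" | jq -r '.elements[].slug'
```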
Anyway, for the Associate Cloud Engineer, the slug is **cloud-engineering-gcp**. For the Professional Cloud Architect, it's **gcp-cloud-architect**.
Then, download the specialisation:
......@@ -67,8 +71,6 @@ coursera-dl --cauth $cauth \
This will download the specialisation into the **download/$slug** directory. Note that where specialisations share courses, you will inevitably download one copy of the course per specialisation.
There's a full list of the slugs for the specialisations we can access with our licenses in the **specialisations.txt** file.
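For reference, one plausible full invocation for a single specialisation looks like this (the flags mirror **download-specialisations.sh** elsewhere in this repo; the exact set is a matter of taste):
```
# Hypothetical full command for one specialisation; adjust flags to taste
slug=cloud-engineering-gcp
coursera-dl --cauth $cauth \
            --specialization \
            --subtitle-language en \
            --video-resolution 720p \
            --path download \
            $slug
```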
## Downloading individual courses
### Finding the course 'slug'
To find the course slug, simply go into the course in the browser and look at the URL. For example, in the second course in the specialisation, **Essential Google Cloud Infrastructure: Foundation**, the URL is **https://www.coursera.org/learn/gcp-infrastructure-foundation/home/welcome**. The slug is the part that identifies the course, so **gcp-infrastructure-foundation**.
......@@ -86,7 +88,7 @@ coursera-dl --cauth $cauth --path download --subtitle-language en $slug
_et voila_, the course material appears magically under the **download** subdirectory!
Here's a list of the slugs for the ACE specialisation
Here's a list of the slugs for the ACE specialisation (N.B. This is missing the K8s course that was added)
Course title | slug
-------------|-----
......@@ -104,9 +106,9 @@ for slug in \
gcp-infrastructure-foundation \
gcp-infrastructure-core-services \
gcp-infrastructure-scaling-automation \
preparing-cloud-associate-cloud-engineer-exam \
preparing-cloud-associate-cloud-engineer-exam
do
coursera-dl --cauth $cauth --path download --subtitle-language en $slug
coursera-dl --cauth $cauth --subtitle-language en --download-notebooks --video-resolution 720p --path download $slug
done
```
......@@ -121,4 +123,4 @@ Reliable Google Cloud Infrastructure: Design and Process | cloud-infrastructure-
Preparing for the Google Cloud Professional Cloud Architect Exam | preparing-cloud-professional-cloud-architect-exam
## Syncing the files back to your laptop
How do you get the files back to your laptop? Take a look at the next code example, **../02-synchronising-data**!
#!/bin/bash
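#
# Download each course in the ACE specialisation, skipping any that have
# already been downloaded.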
for course in \
gcp-fundamentals \
gcp-infrastructure-foundation \
gcp-infrastructure-core-services \
gcp-infrastructure-scaling-automation \
preparing-cloud-associate-cloud-engineer-exam
do
[ -d download/courses/$course ] && continue
echo $course
coursera-dl --cauth $cauth \
--subtitle-language en \
--download-notebooks \
--video-resolution 720p \
--path {download/courses/,}$course
echo "Sleeping..."
done
#!/bin/bash
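#
# Page through the Coursera catalogue API, caching each page of results as a
# JSON file, then emit a "Slug,Name" CSV listing every course.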
start=0
step=500
while [ $start -lt 5800 ]
do
echo $start
file=$(printf "course-list-%04d.json" $start)
echo "File: $file"
if [ ! -f $file ]; then
curl -o $file "https://api.coursera.org/api/courses.v1?start=$start&limit=$step"
fi
start=$[$start + $step]
done
(
echo "Slug,Name"
for file in course-list*.json
do
length=$(cat $file | jq ".elements | length")
index=0
while [ $index -lt $length ]
do
slug=$(cat $file | jq ".elements[$index] | .slug" | tr -d '"' )
name=$(cat $file | jq ".elements[$index] | .name" | tr -d '"' )
echo "$slug,\"$name\""
index=$[$index + 1]
done
done
) | tee list-of-courses.txt
#!/bin/bash
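#
# As for the course list, but for specialisations: page through the API,
# cache the JSON, and emit a "Slug,Name" CSV.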
start=0
step=300
while [ $start -lt 999 ]
do
echo $start
file=$(printf "specialization-list-%04d.json" $start)
echo "File: $file"
if [ ! -f $file ]; then
curl -o $file "https://api.coursera.org/api/onDemandSpecializations.v1?start=$start&limit=$step"
fi
start=$[$start + $step]
done
(
echo "Slug,Name"
for file in specialization-list*.json
do
length=$(cat $file | jq ".elements | length")
index=0
while [ $index -lt $length ]
do
slug=$(cat $file | jq ".elements[$index] | .slug" | tr -d '"' )
name=$(cat $file | jq ".elements[$index] | .name" | tr -d '"' )
echo "$slug,\"$name\""
index=$[$index + 1]
done
done
) | tee list-of-specialisations.txt
#!/bin/bash
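#
# Generate 'mv' commands that rename each downloaded course directory from its
# slug to a readable name taken from list-of-courses.txt. The commands are
# written to a.sh for review; they are not executed here.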
cd download/courses
for d in *
do
g=$(egrep "^$d," ../../list-of-courses.txt | \
uniq | \
tr ' ' '-' | \
tr -d ":\"'\(\)" | \
sed -e 's%^[^,]*,%%' -e 's%&%and%' -e 's%-$%%' | \
tr '[A-Z]' '[a-z]'
)
if [ "$d" == "$g" ]; then
echo "# $d: no change"
else
echo mv $d $g
fi
done | tee a | \
egrep -v '^#' | tee a.sh
#!/bin/bash
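#
# The same renaming trick, but for specialisation directories, using
# list-of-specialisations.txt.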
cd download
for d in *
do
g=$(egrep "^$d," ../list-of-specialisations.txt | \
uniq | \
tr ' ' '-' | \
tr -d ":\"'\(\)" | \
sed -e 's%^[^,]*,%%' -e 's%&%and%' -e 's%-$%%' | \
tr '[A-Z]' '[a-z]'
)
if [ "$d" == "$g" ]; then
echo "# $d: no change"
else
echo mv $d $g
fi
done | tee a | \
egrep -v '^#' | tee a.sh
#!/bin/bash
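#
# Download every specialisation in the list below, skipping any that are
# already present under download/.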
for spec in \
architecting-google-kubernetes-engine \
gcp-data-engineering \
cloud-engineering-gcp \
gcp-cloud-architect \
security-google-cloud-platform \
networking-google-cloud-platform \
gcp-data-machine-learning \
machine-learning-tensorflow-gcp \
advanced-machine-learning-tensorflow-gcp \
from-data-to-insights-google-cloud-platform \
architecting-hybrid-cloud-infrastructure-anthos \
developing-apps-gcp
do
[ -d download/$spec ] && continue
echo $spec
......@@ -22,4 +12,5 @@ do
--specialization \
--video-resolution 720p \
--path {download/,}$spec
echo "Sleeping..."
done
specialisations:
architecting-google-kubernetes-engine
gcp-data-engineering
cloud-engineering-gcp
gcp-cloud-architect
security-google-cloud-platform
networking-google-cloud-platform
gcp-data-machine-learning
machine-learning-tensorflow-gcp
advanced-machine-learning-tensorflow-gcp
from-data-to-insights-google-cloud-platform
architecting-hybrid-cloud-infrastructure-anthos
developing-apps-gcp
# Deploying a Slurm cluster on GCP
[Slurm](https://slurm.schedmd.com) is an open-source batch scheduler with a lively support community. It's used by many of the largest computers in the world, and is easy to learn. Many workflow managers (e.g. Nextflow) support Slurm out of the box, so if you're looking to run batch jobs in the cloud, this is a very easy way to get started.
The cluster consists of a **login** node, a **controller** node, and a configurable number of workers. The worker nodes are created on-demand, when there are jobs in the batch queue. They run the job, then, if there's no new job for them within 5 minutes, they're terminated. This means you can configure a rather large cluster and only pay for batch nodes when you're actually running jobs. There's some overhead to the startup and shutdown, but it's a good start.
This demo uses the sample code at https://github.com/SchedMD/slurm-gcp.git to deploy a Slurm cluster. Check that repository for the full documentation; this README only covers the highlights.
There are two options for deploying Slurm from that repository: Deployment Manager or Terraform. The Deployment Manager version is officially deprecated, but it still works, and that's what we use here.
## Configuring the project
There's wrapper code here to set up the project, based on the example in the **../00-managing-projects** directory.
First, edit **env.sh** to set the parent folder ID, based on where you're allowed to create projects in the EBI GCP organisation.
Then there's some logic to set the project name, either using a recorded project name in the **.projectname** file or calculating one based on the username and date. If the name is calculated, it's stored in the **.projectname** file for future use. If you want to use a specific project name, just echo it into the **.projectname** file before you start.
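For example (the project name below is only an illustration):
```
# Pin the project name before running create-project.sh; the name is illustrative
echo "$USER-slurm-demo" > .projectname
```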
Finally, you need to set the **cluster_name**. This is used in a few places, in particular there should be a **${cluster_name}.yaml** file, which we cover later. The example here uses a cluster called **ips2**, hence the **ips2.yaml** file.
Once that's done, run the **./create-project.sh** script to create the project and enable a few APIs etc. You'll need to have the **gcloud SDK** installed for this, see the [README in the managing-projects directory](../00-managing-projects/README.md) for more details.
## Configure the cluster
The **${cluster_name}.yaml** file is a Deployment Manager configuration file which defines the parameters of your Slurm cluster. In this example, we create a cluster called **ips2**, hence the **ips2.yaml** file.
Full documentation for the file contents is with the **slurm-gcp** repository, but this is a fully working example that you can use out of the box. That said, there are a few things you'll probably want to modify:
- For the controller and login machines, you can specify the machine type, the disk type and size, and a few other parameters.
- the batch workers are based on a disk image which is created by the deployment manager. You can tune the parameters of the image if you want.
Slurm uses __partitions__ where LSF uses __queues__. Each partition is associated with a particular machine type (number of CPUs, amount of memory...), and the number of machines in each partition can be configured separately. The **ips2.yaml** example configures 4 partitions. The **debug** partition is the default (first in the list), then there are 3 partitions which differ only in the amount of memory they have: they all have 16 CPUs, but 16, 32, or 64 GB of RAM. Note that the **mem32** partition uses a custom machine type; you won't find it listed in Google's documentation!
## Creating the cluster
Once you have the cluster configured, run the **create-cluster.sh** script. This does some sanity checks and downloads the slurm-gcp repository if it's not there. It copies the config file into place, substituting a couple of values from the **env.sh** file. Then it uses Deployment Manager to create the cluster, and finally copies a slurm test script onto the login node.
Deployment Manager will terminate successfully once the cluster is deployed; however, it can take up to 10 minutes before Slurm is available, since there's a lot more initialising going on behind the scenes. Log in to the login node by running the **./ssh.sh** script and run **ls** to see if the **test-slurm.sh** script is there. If not, log out, wait a few minutes, and try again. Once Slurm is fully installed, the disks from the controller node are exported and mounted on the login node, and you will be able to see the test script.
## Running jobs
There's a very simple test script, **test-slurm.sh**, which will detect the list of partitions and submit several copies of a small job to each partition. You can use this to check that things are working. If all goes well, track the state of your jobs with the **squeue** command, and find their output in the **$HOME/slurm/test** directory on the login node when they're done.
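You can also submit one-off jobs by hand. A minimal sketch, using one of the partitions defined in **ips2.yaml**:
```
# Submit a single throwaway job to the mem16 partition and watch the queue
sbatch --partition=mem16 --wrap="hostname; sleep 60"
squeue
```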
Your /home directory is shared with the batch nodes, so you can install your software and data there and expect to find it when you run.
Spinning up batch nodes can take some time (a few minutes), since it's not only GCP that has to create the VMs; Slurm also has to notice that the new nodes are online. Be patient, it'll get there!
Slurm will, by default in this configuration, submit each job to a separate node. This means that if you allow 10 batch nodes in a partition and you submit 10 jobs, you will spin up 10 worker nodes. If your jobs are extremely short, like the test example here, that's a huge overhead, but if your jobs run for longer than 20 minutes or so the overhead is relatively small.
## Recommendations & best practices
If you want to change some parameters of the cluster, such as the size of the controller or login nodes, you can do that by editing the yaml config file and issuing a **gcloud deployment-manager deployments update ...** command. If you want to change the number of partitions or the machine types associated with the partitions, that won't work, because that information is used post-deployment and Deployment Manager won't detect the need to rebuild the cluster.
If you do decide to update your cluster, be aware that this may destroy/recreate, rather than update in place. You may lose any software or data you've copied onto the cluster.
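A minimal sketch of such an update, assuming the **ips2** cluster and that the substituted config is already in the **slurm-gcp** directory (which is how **create-cluster.sh** leaves things):
```
# Apply yaml edits to the running deployment; 'ips2' is the example cluster name
cd slurm-gcp
gcloud deployment-manager deployments update ips2 --config ips2.yaml
```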
Consider your slurm clusters disposable. You can create new ones any time by creating a new yaml config file and running the deployment manager, you can get the commands from the scripts here. As long as your software is easy to install there's little overhead to this, so you can create and destroy clusters whenever you like.
To destroy your cluster, run **gcloud deployment-manager deployments delete ${cluster_name} --quiet**. Note that this will fail if any batch worker nodes are still running: Deployment Manager didn't create them, so it won't destroy them, but it then can't destroy the network components those workers are still using.
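If the delete fails, check for (and remove) leftover workers first; the node-name prefix below is an assumption based on how **slurm-gcp** names its compute instances:
```
# List any batch workers still running, remove them, then tear down the deployment
gcloud compute instances list --filter="name~^ips2-compute" --format="value(name)"
# ...delete any instances listed above, then:
gcloud deployment-manager deployments delete ips2 --quiet
```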
Start with a small slurm cluster, with only a few batch nodes, and experiment until you have the right types for your workflow. Then destroy your test cluster and build a bigger one for production running.
Start with a small disk on the controller node, and avoid the temptation to make the controller and login machines too powerful. These nodes, and their disks, are permanently running, so they cost money even if you don't use the cluster. Again, you can destroy your cluster and create a new one easily enough, so there's no need to worry about getting the size just right straight away.
When you move to production running, think carefully about your cluster size and your deadlines for job completion; only allow as many workers as you need to get your work done in time. E.g., running 1000 jobs on 1000 machines will cost a lot more than running them on 10 machines and letting them queue, because the longer-running machines benefit from sustained use discounts, which can cut the cost of computing by around 30%.
Running on fewer machines with lots of jobs in the queue will also amortise the startup/shutdown time of the batch nodes, because as soon as one job ends another can start. If your jobs are short, this can have a big effect.
\ No newline at end of file
#!/bin/sh
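#
# Deploy the Slurm cluster described by ${cluster_name}.yaml with Deployment
# Manager, cloning the slurm-gcp repository first if it isn't already present,
# then copy the test script onto the cluster.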
. ./env.sh
config_file="${cluster_name}.yaml"
controller_node="${cluster_name}-controller"
#
# Sanity checks...
if [ "$cluster_name" == "" ]; then
echo "Expected to find \$cluster_name set, but it's not"
exit 1
fi
if [ ! -f $config_file ]; then
echo "Expected to find a deployment configuration file, $config_file, but it's not there"
exit 1
fi
dir="slurm-gcp"
if [ ! -d $dir ]; then
git clone https://github.com/SchedMD/slurm-gcp.git
[ -f $dir/cluster.yaml ] && mv $dir/cluster.yaml{,.orig}
fi
cat $config_file | \
sed \
-e "s/CLUSTER_NAME/$cluster_name/" \
-e "s/ZONE/$zone/" | \
tee $dir/$config_file >/dev/null
cd $dir
gcloud config configurations activate $project && \
gcloud deployment-manager deployments create $cluster_name --config $config_file && \
gcloud compute scp ../test-slurm.sh ${controller_node}:
\ No newline at end of file
#!/bin/bash
cd `dirname $0`
#
# Source the environment script to pick up the variable definitions.
. ./env.sh
#
# Check if this project already exists or not. If it does, don't try
# to create it again.
#
# N.B. the filter argument can take wildcards, otherwise it's an exact
# and complete match, not a substring match
#
# See 'gcloud topic filters' for help
exists=$(gcloud projects list --filter="name=$project")
if [ "$exists" != "" ]; then
echo "Project '$project' exists..."
exit 0
fi
echo "Project '$project' does not exist: creating..."
gcloud projects create $project \
--folder=$folder \
--labels=owner=$USER # labels are arbitrary key/value pairs
#
# Link the billing account to the project, or we are limited to free resources only
gcloud beta billing projects link $project --billing-account=$billing_account
#
# Now that the project exists, create a configuration locally so I can switch to
# it easily. I prefer to overspecify, rather than risk any defaults creeping in
gcloud config configurations create $project
gcloud config set core/project $project
gcloud config set core/account ${USER}@ebi.ac.uk
gcloud config set compute/region $region
gcloud config set compute/zone $zone
#
# It's not enough to create a project, you have to explicitly enable the services
# you want to use. You do this by enabling APIs.
#
# 'gcloud services list --available' gives a full list of the 300+ services that exist!
gcloud services enable \
deploymentmanager.googleapis.com \
compute.googleapis.com \
file.googleapis.com
\ No newline at end of file
#!/bin/bash
cd `dirname $0`
#
# Source the environment script to pick up the variable definitions.
. ./env.sh
#
# First delete the project itself
gcloud config configurations activate $project
gcloud projects delete $project
#
# Then delete the configuration too. For that, you have to be in a
# different configuration, you can't delete the active configuration.
gcloud config configurations activate default
gcloud config configurations delete $project
#
# The billing account is the only one we're allowed to use with our EBI accounts.
billing_account="00F8FC-525826-16C3A7" # Main EBI Billing Account
#
# The folder is the location in the GCP EBI organisation tree. Get it from the
# GCP console -> IAM & Admin -> Manage Resources view
folder=187759886721 # ebi.ac.uk -> Technical Services -> TSI -> Cloud Certification
#
# @Gift, this is for you...
# folder=122124219091 # ebi.ac.uk -> Service Teams -> Protein Sequence Resources
#
# Choose a name for your project. Here I just build a generic name that is likely
# to be unique by adding the current date to it.
project_name_file=".projectname"
if [ -f $project_name_file ]; then
project=`cat $project_name_file`
else
timestamp=`date +%g-%m-%d`
# timestamp=`date +%g-%m-%d-%H:%M` # If I want higher timestamp resolution...
project="$USER-slurm-$timestamp"
echo $project > $project_name_file
fi
#
# Set a region and zone. Choose London unless you have a good reason not to, it's best
# that we stay within UK jurisdiction if we can.
region="europe-west2"
zone="${region}-a"
#
# This is the name of the cluster to create
cluster_name="ips2"
# Copyright 2017 SchedMD LLC.
# Modified for use with the Slurm Resource Manager.
#
# Copyright 2015 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
imports:
- path: slurm.jinja
resources:
- name: slurm-cluster
  type: slurm.jinja
  properties:
    cluster_name : CLUSTER_NAME
    zone : ZONE
    controller_machine_type : n1-standard-2
    controller_disk_type : pd-standard
    controller_disk_size_gb : 200
    external_controller_ip : False
    login_machine_type : n1-standard-1
    login_disk_type : pd-standard
    login_disk_size_gb : 20
    external_login_ips : False
    login_node_count : 1
    login_node_scopes :
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    compute_image_machine_type : n1-standard-2
    compute_image_disk_type : pd-standard
    compute_image_disk_size_gb : 20
    partitions :
    - name : debug
      machine_type : n1-standard-2
      max_node_count : 4
      zone : ZONE
      compute_disk_size_gb : 20
    - name : mem16
      machine_type : n2-highcpu-16
      max_node_count : 10
      zone : ZONE
      compute_disk_size_gb : 20
    - name : mem32
      machine_type : n2-custom-16-32768
      max_node_count : 10
      zone : ZONE
      compute_disk_size_gb : 20
    - name : mem64
      machine_type : n2-standard-16
      max_node_count : 10
      zone : ZONE
      compute_disk_size_gb : 20
export CLOUDSDK_PYTHON=$(which python3)
cd `dirname $0`
source ./env.sh
gcloud config configurations activate $project
shift
exec gcloud compute ssh "${cluster_name}-login0" -- "$@"
#!/bin/bash
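#
# Generate a trivial batch script and submit several copies of it to every
# partition, then show the queue and the partition configuration.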
dest="$HOME/slurm/test"
if [ ! -d $dest ]; then
mkdir -p $dest
fi
(
echo "#!/bin/bash"
echo "#SBATCH --output $dest/test-slurm-batch-%J.out"
echo "#SBATCH --error $dest/test-slurm-batch-%J.err"
echo "#SBATCH --nodes=1"
echo "#SBATCH --exclusive"
echo " "
echo "hostname"
echo "sleep 60"
) | tee test-slurm-batch.sh
for i in `seq 0 5`
do
for partition in $(sinfo --noheader --format "%R")
do
sbatch --partition $partition < test-slurm-batch.sh
done
done
sleep 2
squeue
echo " "
echo "Partition info:"
sinfo --format "%P %z %m"
\ No newline at end of file