High Performance ATS Container In A Multi-Tenant Environment

Note - Below are findings I documented for an internal POC related to ATS and Docker containers. The main focus is launching a high-performance ATS container on the same host as other high-performance applications, or applications that require direct access to CPU/memory/PCI devices/etc.
Note that orchestration and boot-time configuration are not covered here; I only cover the OS/host commands needed to pass resources to the container.
These commands, or similar ones, can be incorporated into the orchestration engine of your choice.
All of these commands are run on the Docker host.
Also, the steps below reflect my lab machine, but they translate easily to any other hosting environment.

+++++

Objective - To create a cache instance using the Docker container framework, ensuring that all CPU/memory/network resources are NUMA aligned.

The resulting container will have the following resources assigned/pinned:

8 CPU cores
32G memory
Bonded SR-IOV VFs within the corresponding NUMA node
18G ramdisk per container to serve as the cache disk



Host information:

HOST OS - Ubuntu 16.04.3 LTS (Xenial Xerus)
Docker Version -  Docker CE 17.x



Steps:

Install the Docker network plugin:

https://github.com/Mellanox/docker-passthrough-plugin


Start the Docker plugin to support NIC passthrough:


docker run -v /run/docker/plugins:/run/docker/plugins --net=host --privileged mellanox/passthrough-plugin

This plugin also supports carving out SR-IOV VFs on the fly and assigning the VFs to the container.
For this document I will use the plugin to pass through a host-created bond interface to each container. Each container will have its own bond interface.
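Before proceeding, it is worth confirming the plugin actually registered with Docker. A quick check (the exact socket name under /run/docker/plugins depends on the plugin, so treat this as a sketch):

```shell
# verify the passthrough plugin container is running
docker ps --filter ancestor=mellanox/passthrough-plugin --format '{{.ID}} {{.Status}}'

# the plugin registers a unix socket under the directory we bind-mounted above
ls /run/docker/plugins
```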

NUMA 0 Steps
(modify the below system paths to reflect your host architecture)


# enable VFs on the NUMA 0 PFs, ports 0 and 1. I used 8 VFs per PF port, but any supported value can be used here
  
echo 8 >  '/sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/sriov_numvfs'
echo 8 >  '/sys/devices/pci0000:00/0000:00:02.0/0000:03:00.1/sriov_numvfs'
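A quick sanity check that the VFs were created (the PCI paths and interface names below are from my lab host; adjust them for yours):

```shell
# each PF port should now report 8 VFs
cat '/sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/sriov_numvfs'
cat '/sys/devices/pci0000:00/0000:00:02.0/0000:03:00.1/sriov_numvfs'

# the PF netdevs list one "vf N" line per VF
ip link show enp3s0f0
ip link show enp3s0f1
```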


# create bond interface for the eventual container
 
ip link add dev bndn0c0 type bond
ip link set down dev bndn0c0

# set bond mode and hashing policy
  
echo 2 > /sys/devices/virtual/net/bndn0c0/bonding/mode
echo 'layer2+3' > /sys/devices/virtual/net/bndn0c0/bonding/xmit_hash_policy


# select VFs to form the bond. set spoofchk off and set the MACs to be the same, since we are using link aggregation. I used VF 3.
# set the VFs down; this is needed to add the VFs to the bond

ip link set enp3s0f0 vf 3 spoofchk off mac 32:fe:d8:7a:83:93
ip link set enp3s0f1 vf 3 spoofchk off mac 32:fe:d8:7a:83:93
ip link set down enp3s16f6
ip link set down enp3s16f7
ip link set enp3s16f6 master bndn0c0
ip link set enp3s16f7 master bndn0c0
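At this point the bond configuration can be read back to confirm the mode and slaves took effect (read-only checks; device names as above):

```shell
# mode 2 should read back as balance-xor
cat /sys/class/net/bndn0c0/bonding/mode

# both VF netdevs should be listed as slaves
cat /sys/class/net/bndn0c0/bonding/slaves

# full bond state, including per-slave link status
cat /proc/net/bonding/bndn0c0
```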


# add the bond interface to the VLAN of choice, in this case 1400.
  
vconfig add bndn0c0 1400


# bring the bond interfaces up
ip link set bndn0c0 up
ip link set bndn0c0.1400 up
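The VLAN subinterface can be verified with the detailed link view:

```shell
# -d shows VLAN details; expect "vlan protocol 802.1Q id 1400"
ip -d link show bndn0c0.1400
```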



# create a cgroup for CPU and memory pinning
# this reserves the CPUs and memory for this cgroup. multiple containers
# can use this cgroup, but we will only assign it to a single container


cgcreate -g cpuset,memory:ats-numa0-cnt0-cgroup
echo 32,34,36,38,40,42,44,46 > /sys/fs/cgroup/cpuset/ats-numa0-cnt0-cgroup/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/ats-numa0-cnt0-cgroup/cpuset.mems
echo 1 > /sys/fs/cgroup/cpuset/ats-numa0-cnt0-cgroup/cpuset.mem_hardwall
echo 32G > /sys/fs/cgroup/memory/ats-numa0-cnt0-cgroup/memory.limit_in_bytes
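Note that memory.limit_in_bytes accepts the G suffix on write but reports the value back in bytes; for a 32G limit:

```shell
# 32G expressed in bytes, as memory.limit_in_bytes will report it back
limit_bytes=$((32 * 1024 * 1024 * 1024))
echo "$limit_bytes"   # -> 34359738368
```

Reading the file back after the write should therefore return 34359738368 (the value is already a page-size multiple, so no rounding occurs).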

# create the docker network, which maps to the just-created bond interface on vlan 1400.
  
docker network create -d passthrough --ip-range 192.168.1.228/30 --gateway=192.168.1.193 --subnet=192.168.1.192/26 -o netdevice=bndn0c0.1400 -o mode=passthrough bndn0c0-1400
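The run command below hands /dev/ram0 to the container as its 18G cache disk, but the ramdisk itself has to exist first. One way to create it (a sketch assuming the brd module is available and not already loaded; note that brd's rd_size parameter is in KiB):

```shell
# 18G in KiB, since brd's rd_size parameter is KiB
rd_size_kib=$((18 * 1024 * 1024))   # 18874368

# load the module to create one ramdisk; this yields /dev/ram0
modprobe brd rd_nr=1 "rd_size=$rd_size_kib"
```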

# create the docker container, attaching it to the docker network created above. i started the container in privileged mode (being lazy); we can set
# permissions which limit access accordingly. we would not launch a production container in privileged mode.
 
docker run --dns=192.168.1.104 --ip=192.168.1.229 --privileged --rm --cgroup-parent=/ats-numa0-cnt0-cgroup/ --name=ats-numa0-cnt0 --net=bndn0c0-1400 --device=/dev/ram0:/dev/xram0 -it 829dabfd1184 /bin/bash
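Once the container is up, the CPU and NUMA pinning can be verified from the host (container name as above):

```shell
# the container's init process should only be allowed to run on the pinned cores
docker exec ats-numa0-cnt0 grep Cpus_allowed_list /proc/1/status

# and should only be allowed to allocate memory from NUMA node 0
docker exec ats-numa0-cnt0 grep Mems_allowed_list /proc/1/status
```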
 
# once the container is started, it can contact a chef server or some other configuration manager to pull its configuration.
# i built this container image manually and set static values within the image. configuration management is no different
# than for a VM or bare metal.

