Ceph, AWS S3, and Multipart uploads using Python

Summary

In this article the following will be demonstrated:

  • Ceph Nano – as the backend storage and S3 interface
  • A Python script that uses the S3 API to multipart upload a file to Ceph Nano using Python multi-threading

Introduction

Ceph Nano is a Docker container providing basic Ceph services (mainly a Ceph Monitor, a Ceph MGR, a Ceph OSD for managing the container storage, and a RADOS Gateway to provide the S3 API interface). It also provides a web UI to view and manage buckets.

Multipart upload lets a client transfer a file as several independent ranges of bytes, much like HTTP/1.1 range requests allow for downloads. For example, a 200 MB file can be downloaded in two rounds: the first round fetches the first 50% of the file (bytes 0 to 104857599), and the second round fetches the remaining 50% starting from byte 104857600.
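
To illustrate the byte-range arithmetic, here is a small standalone sketch (not part of the upload script) that splits the 200 MB file from the example into two parts:

filesize = 200 * 1024 * 1024   # 200 MB, as in the example above
nparts = 2

part_size = filesize // nparts
for i in range(nparts):
        start = i * part_size
        # the last part absorbs any remainder
        end = filesize - 1 if i == nparts - 1 else start + part_size - 1
        print("part %d: bytes %d-%d (%d bytes)" % (i + 1, start, end, end - start + 1))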

The Details

First, Docker must be installed on the local system; then download the Ceph Nano CLI using:

$ curl -L https://github.com/ceph/cn/releases/download/v2.3.1/cn-v2.3.1-linux-amd64 -o cn && chmod +x cn

This downloads the cn binary, version 2.3.1, into the local folder and makes it executable.

To start the Ceph Nano cluster (container), run the following command:

$ ./cn cluster start ceph
2019/12/03 11:59:12 Starting cluster ceph...

Endpoint: http://166.87.163.10:8000
Dashboard: http://166.87.163.10:5000
Access key: 90WFLFQNZQ452XXI6851
Secret key: ISmL6Ru3I3MDiFwZITPCu8b1tL3BWyPDAmLoF0ZP
Working directory: /usr/share/ceph-nano

This downloads the Ceph Nano image and runs it as a Docker container. The web UI can be accessed at http://166.87.163.10:5000, and the S3 API endpoint is at http://166.87.163.10:8000.
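
As a quick sanity check that the S3 endpoint is reachable, the printed endpoint and keys can also be used directly from boto3 (a minimal sketch; substitute the values printed by your own cn run):

import boto3

# Endpoint and keys exactly as printed by `cn cluster start` above;
# replace them with the values from your own run.
s3 = boto3.client('s3',
                  endpoint_url='http://166.87.163.10:8000',
                  aws_access_key_id='90WFLFQNZQ452XXI6851',
                  aws_secret_access_key='ISmL6Ru3I3MDiFwZITPCu8b1tL3BWyPDAmLoF0ZP')

for bucket in s3.list_buckets()['Buckets']:
        print(bucket['Name'])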

We can verify that using:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                            NAMES
0ba17ec716d3        ceph/daemon         "/opt/ceph-contain..."   4 weeks ago         Up 26 seconds       0.0.0.0:5000->5000/tcp, 0.0.0.0:8000->8000/tcp   ceph-nano-ceph

Of course this output is for demonstration purposes; the container shown here was created 4 weeks ago.

The container can be accessed by its name, ceph-nano-ceph, using the command:

$ docker exec -it ceph-nano-ceph /bin/bash

which drops me into a Bash shell inside the Ceph Nano container.

To examine the running processes inside the container:

[root@ceph-nano-ceph-faa32aebf00b /]# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 08:59 ?        00:00:00 /bin/bash /opt/ceph-container/bin/entrypoint.sh
ceph       113     1  0 08:59 ?        00:00:43 /usr/bin/ceph-mon --cluster ceph --default-log-to-file=false --default-mon-cluster-log-to-file=false --setuser ceph --setgroup ceph -i ceph-nano-ceph-faa32aebf00b --mon-data /var/lib/ceph/mo
ceph       194     1  1 08:59 ?        00:02:08 ceph-mgr --cluster ceph --default-log-to-file=false --default-mon-cluster-log-to-file=false --setuser ceph --setgroup ceph -i ceph-nano-ceph-faa32aebf00b
ceph       240     1  0 08:59 ?        00:00:29 ceph-osd --cluster ceph --default-log-to-file=false --default-mon-cluster-log-to-file=false --setuser ceph --setgroup ceph -i 0
ceph       451     1  0 08:59 ?        00:00:17 radosgw --cluster ceph --default-log-to-file=false --default-mon-cluster-log-to-file=false --setuser ceph --setgroup ceph -n client.rgw.ceph-nano-ceph-faa32aebf00b -k /var/lib/ceph/radosgw/c
root       457     1  0 08:59 ?        00:00:02 python app.py
root       461     1  0 08:59 ?        00:00:00 /usr/bin/python2.7 /usr/bin/ceph --cluster ceph -w
root      1093     0  0 11:02 ?        00:00:00 /bin/bash
root      1111  1093  0 11:03 ?        00:00:00 ps -ef

The first thing I need to do is create a bucket, so from inside the Ceph Nano container I use the following command:

# s3cmd mb s3://nano

which creates a bucket called nano.
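
For reference, the same bucket can be created from Python with a single boto3 call (a sketch, assuming an s3 client configured against the Ceph Nano endpoint as in the earlier snippet):

# boto3 equivalent of `s3cmd mb s3://nano`
s3.create_bucket(Bucket='nano')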

Next, create a user on the Ceph Nano cluster to access the S3 buckets. Here I created a user called test, with both the access and secret keys set to test:

$ radosgw-admin user create --uid=test --access-key=test --secret=test --display-name test

Then create a test file and upload it:

# dd if=/dev/zero of=./zeros bs=15M count=1
# s3cmd put ./zeros s3://nano

And list the file in the bucket:

# s3cmd ls s3://nano
2019-10-29 11:58  15728640   s3://nano/zeros
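
The same upload and listing can also be done with boto3. The sketch below is a simple single-shot upload for comparison (reusing the s3 client from the earlier snippet); the multipart version is what the script in the next section implements:

# Single-shot upload plus listing, the boto3 counterpart of the two
# s3cmd commands above.
with open('zeros', 'rb') as f:
        s3.put_object(Bucket='nano', Key='zeros', Body=f)

for obj in s3.list_objects_v2(Bucket='nano').get('Contents', []):
        print("%s  %d" % (obj['Key'], obj['Size']))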

The Python code

#!/usr/bin/env python
#
# Copyright (c) 2019 Tamer Embaby <tamer@redhat.com>
# All rights reserved.
#
# Main reference is: https://stackoverflow.com/questions/34303775/complete-a-multipart-upload-with-boto3
# Good code, but it will take too much time to complete especially for thread synchronization. (DONE)
#
# TODO: 
#       - Check return code of function calls everywhere.
#       - Use logging instead of print's everywhere.
#       - Address the XXX and FIXME's in the code
#

import boto3
import sys, os
import threading
import logging

b3_client = None
b3_s3 = None
mpu = None              # Multipart upload handle

#
# Thread (safe) function responsible for uploading a part of the file
#
def upload_part_r(partid, part_start, part_end, thr_args):
        filename = thr_args['FileName']
        bucket = thr_args['BucketName']
        upload_id = thr_args['UploadId']

        logging.info("%d: >> Uploading part %d", partid, partid)
        logging.info("%d: --> Upload starts at byte %d", partid, part_start)
        logging.info("%d: --> Upload ends at byte %d", partid, part_end)

        f = open(filename, "rb")
        logging.info("%d: DEBUG: Seeking offset: %d", partid, part_start)
        logging.info("%d: DEBUG: Reading size: %d", partid, part_end - part_start)
        f.seek(part_start, 0)
        # XXX: Would the next read fail if the portion is too large?
        data = f.read(part_end - part_start + 1)

        # DO WORK HERE
        # TODO:
        # - Variables like mpu, Bucket, Key should be passed from caller -- DONE
        # - We should collect part['ETag'] from this part into array/list, so we must synchronize access
        #   to that list, this list is then used to construct part_info array to call .complete_multipart_upload(...)
        # TODO.
        #
        # NOTES:
        # - Since part id is zero based (from handle_mp_file function), we add 1 to it here as HTTP parts should start
        #   from 1
        part = b3_client.upload_part(Bucket=bucket, Key=filename, PartNumber=partid+1, UploadId=upload_id, Body=data)

        # Thread critical variable which should hold all information about ETag for all parts, access to this variable
        # should be synchronized.
        lock = thr_args['Lock']
        if lock.acquire():
                thr_args['PartInfo']['Parts'].append({'PartNumber': partid+1, 'ETag': part['ETag']})
                lock.release()

        f.close()
        logging.info("%d: -><- Part ID %d is ending", partid, partid)
        return

#
# Part size calculations.
# Thread dispatcher
#
def handle_mp_file(bucket, filename, nrparts):

        print ">> Uploading file: " + filename + ", nr_parts = " + str(nrparts)

        fsize = os.path.getsize(filename)
        print "+ %s file size = %d " % (filename, fsize)

        # do the part size calculations
        part_size = fsize / nrparts
        print "+ standard part size = " + str(part_size) + " bytes"

        # Initiate multipart uploads for the file under the bucket
        mpu = b3_client.create_multipart_upload(Bucket=bucket, Key=filename)

        threads = list()
        thr_lock = threading.Lock()
        thr_args = { 'PartInfo': { 'Parts': [] } , 'UploadId': mpu['UploadId'], 'BucketName': bucket, 'FileName': filename,
                'Lock': thr_lock }

        for i in range(nrparts):
                print "++ Part ID: " + str(i)

                part_start = i * part_size
                part_end = (part_start + part_size) - 1

                if (i+1) == nrparts:
                        print "DEBUG: last chunk, part-end was/will %d/%d" % (part_end, fsize)
                        part_end = fsize

                print "DEBUG: part_start=%d/part_end=%d" % (part_start, part_end)

                thr = threading.Thread(target=upload_part_r, args=(i, part_start, part_end, thr_args, ) )
                threads.append(thr)
                thr.start()

        # Wait for all threads to complete
        for index, thr in enumerate(threads):
                thr.join()
                print "%d thread finished" % (index)

        part_info = thr_args['PartInfo']
        # Parts must be listed in ascending PartNumber order when completing the
        # multipart upload; threads may have appended them out of order.
        part_info['Parts'].sort(key=lambda p: p['PartNumber'])
        for p in part_info['Parts']:
                print "DEBUG: PartNumber=%d" % (p['PartNumber'])
                print "DEBUG: ETag=%s" % (p['ETag'])

        print "+ Finishing up multi-part uploads"
        b3_client.complete_multipart_upload(Bucket=bucket, Key=filename, UploadId=mpu['UploadId'], MultipartUpload=thr_args['PartInfo'])
        return True

### MAIN ###

if __name__ == "__main__":
        bucket = 'test'                 # XXX FIXME: Pass in arguments

        if len(sys.argv) != 3:
                print "usage: %s <filename to upload> <number of threads/parts>" % (sys.argv[0])
                sys.exit(1)

        # Filename: File to upload
        # NR Parts: Number of parts to divide the file to, which is the number of threads to use
        filename = sys.argv[1]
        nrparts = int(sys.argv[2])

        format = "%(asctime)s: %(message)s"
        logging.basicConfig(format=format, level=logging.INFO, datefmt="%H:%M:%S")

        # Initialize the connection with Ceph RADOS GW
        b3_client = boto3.client(service_name = 's3', endpoint_url = 'http://127.0.0.1:8000', aws_access_key_id = 'test', aws_secret_access_key = 'test')
        b3_s3 = boto3.resource(service_name = 's3', endpoint_url = 'http://127.0.0.1:8000', aws_access_key_id = 'test', aws_secret_access_key = 'test')

        handle_mp_file(bucket, filename, nrparts)

### END ###

This code uses Python multithreading to upload multiple parts of the file simultaneously, much as a modern download manager uses HTTP/1.1 range requests to fetch several parts of a file at once.

To use this Python script, save the above code to a file called boto3-upload-mp.py and run it as:

$ ./boto3-upload-mp.py mp_file_original.bin 6

Here 6 means the script will divide the file into 6 parts and create 6 threads to upload these parts simultaneously (for a 15,728,640-byte file, for instance, each part would be 2,621,440 bytes).

The uploaded file can then be downloaded again and checksummed against the original file to verify it was uploaded successfully.
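
One way to do that verification from Python is sketched below (it reuses the boto3 client settings from the script above; the bucket and key names match the example run):

import hashlib

import boto3

def sha256_of(path):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b''):
                        h.update(chunk)
        return h.hexdigest()

s3 = boto3.client('s3', endpoint_url='http://127.0.0.1:8000',
                  aws_access_key_id='test', aws_secret_access_key='test')

# Download the uploaded object next to the original and compare digests
# (bucket 'test' and key 'mp_file_original.bin' match the example run above).
s3.download_file('test', 'mp_file_original.bin', 'mp_file_downloaded.bin')
print(sha256_of('mp_file_original.bin') == sha256_of('mp_file_downloaded.bin'))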

Using GlusterFS with Docker swarm cluster

In this blog I will create a 3 node Docker swarm cluster and use GlusterFS to share volume storage across Docker swarm nodes.

Introduction

Docker swarm mode creates a cluster of Docker hosts to run containers on. The problem at hand is this: if container “A” runs on “node1” with the named volume “voldata”, all data changes applied to “voldata” are saved locally on “node1”. If container “A” is shut down and happens to start again on a different node, say “node3”, and mounts the named volume “voldata” again, that volume will be empty and will not contain the changes made while it was mounted on “node1”.

In this example I will not use a named volume; instead I will use a shared mount replicated among the cluster nodes. Of course, the same approach can be applied to share the storage backing a named volume folder.

For this exercise I’m using 3 EC2 instances on AWS, each with one attached EBS volume.

How to get around this?

One way to solve this is to use GlusterFS to replicate the volume data across the swarm nodes, making the data available to all nodes at any time. The mount point is still local to each Docker host, while GlusterFS takes care of replicating the data between nodes.

Preparation on each server

I will use Ubuntu 16.04 for this exercise.

First we put friendly names in /etc/hosts:

XX.XX.XX.XX    node1
XX.XX.XX.XX    node2
XX.XX.XX.XX    node3

Then we update the system:

$ sudo apt update
$ sudo apt upgrade

Finally we reboot the servers. Then install the necessary packages on all nodes:

$ sudo apt install -y docker.io
$ sudo apt install -y glusterfs-server

Then start the services:

$ sudo systemctl start glusterfs-server
$ sudo systemctl start docker

Create the directories for the GlusterFS brick storage and the shared volume mount point:

$ sudo mkdir -p /gluster/data /swarm/volumes

GlusterFS setup

First we prepare a filesystem for the Gluster brick storage on all nodes:

$ sudo mkfs.xfs /dev/xvdb 
$ sudo mount /dev/xvdb /gluster/data/

From node1:

$ sudo gluster peer probe node2
peer probe: success. 
$ sudo gluster peer probe node3
peer probe: success.

Create the volume as a mirror:

$ sudo gluster volume create swarm-vols replica 3 node1:/gluster/data node2:/gluster/data node3:/gluster/data force
volume create: swarm-vols: success: please start the volume to access data

Allow mount connections only from localhost:

$ sudo gluster volume set swarm-vols auth.allow 127.0.0.1
volume set: success

Then start the volume:

$ sudo gluster volume start swarm-vols
volume start: swarm-vols: success

Then, on each Gluster node, we mount the shared mirrored GlusterFS volume locally:

$ sudo mount.glusterfs localhost:/swarm-vols /swarm/volumes

Docker swarm setup

Here I will create 1 manager node and 2 worker nodes.

$ sudo docker swarm init
Swarm initialized: current node (82f5ud4z97q7q74bz9ycwclnd) is now a manager.
 
To add a worker to this swarm, run the following command:
 
    docker swarm join \
    --token SWMTKN-1-697xeeiei6wsnsr29ult7num899o5febad143ellqx7mt8avwn-1m7wlh59vunohq45x3g075r2h \
    172.31.24.234:2377
 
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.

Get the token for worker nodes:

$ sudo docker swarm join-token worker
To add a worker to this swarm, run the following command:
 
    docker swarm join \
    --token SWMTKN-1-697xeeiei6wsnsr29ult7num899o5febad143ellqx7mt8avwn-1m7wlh59vunohq45x3g075r2h \
    172.31.24.234:2377

Then on both worker nodes:

$ sudo docker swarm join --token SWMTKN-1-697xeeiei6wsnsr29ult7num899o5febad143ellqx7mt8avwn-1m7wlh59vunohq45x3g075r2h 172.31.24.234:2377
This node joined a swarm as a worker.

Verify the swarm cluster:

$ sudo docker node ls
ID                           HOSTNAME          STATUS  AVAILABILITY  MANAGER STATUS
6he3dgbanee20h7lul705q196    ip-172-31-27-191  Ready   Active        
82f5ud4z97q7q74bz9ycwclnd *  ip-172-31-24-234  Ready   Active        Leader
c7daeowfoyfua2hy0ueiznbjo    ip-172-31-26-52   Ready   Active

Testing

To test, I will add labels to node1 and node3, create a container on node1, shut it down, and then create it again on node3 with the same volume mount. We should then see that the files created by both containers exist in the same shared volume.

Label swarm nodes:

$ sudo docker node update --label-add nodename=node1 ip-172-31-24-234
ip-172-31-24-234
$ sudo docker node update --label-add nodename=node3 ip-172-31-26-52
ip-172-31-26-52

Check the labels:

$ sudo docker node inspect --pretty ip-172-31-26-52
ID:			c7daeowfoyfua2hy0ueiznbjo
Labels:
 - nodename = node3
Hostname:		ip-172-31-26-52
Joined at:		2017-01-06 22:44:17.323236832 +0000 utc
Status:
 State:			Ready
 Availability:		Active
Platform:
 Operating System:	linux
 Architecture:		x86_64
Resources:
 CPUs:			1
 Memory:		1.952 GiB
Plugins:
  Network:		bridge, host, null, overlay
  Volume:		local
Engine Version:		1.12.1

Create a Docker service on node1 that creates a file in the shared volume:

$ sudo docker service create --name testcon --constraint 'node.labels.nodename == node1' --mount type=bind,source=/swarm/volumes/testvol,target=/mnt/testvol ubuntu:latest /bin/touch /mnt/testvol/testfile1.txt
duvqo3btdrrlwf61g3bu5uaom

Verify service creation:

$ sudo docker service ls
ID            NAME     REPLICAS  IMAGE    COMMAND
duvqo3btdrrl  testcon  0/1       busybox  /bin/bash

Check that it’s running on node1:

$ sudo docker service ps testcon
ID                         NAME           IMAGE          NODE              DESIRED STATE  CURRENT STATE           ERROR
6nw6sm8sak512x24bty7fwxwz  testcon.1      ubuntu:latest  ip-172-31-24-234  Ready          Ready 1 seconds ago     
6ctzew4b3rmpkf4barkp1idhx   \_ testcon.1  ubuntu:latest  ip-172-31-24-234  Shutdown       Complete 1 seconds ago

Also check the volume mounts:

$ sudo docker inspect testcon
[
    {
        "ID": "8lnpmwcv56xwmwavu3gc2aay8",
        "Version": {
            "Index": 26
        },
        "CreatedAt": "2017-01-06T23:03:01.93363267Z",
        "UpdatedAt": "2017-01-06T23:03:01.935557744Z",
        "Spec": {
            "ContainerSpec": {
                "Image": "busybox",
                "Args": [
                    "/bin/bash"
                ],
                "Mounts": [
                    {
                        "Type": "bind",
                        "Source": "/swarm/volumes/testvol",
                        "Target": "/mnt/testvol"
                    }
                ]
            },
            "Resources": {
                "Limits": {},
                "Reservations": {}
            },
            "RestartPolicy": {
                "Condition": "any",
                "MaxAttempts": 0
            },
            "Placement": {
                "Constraints": [
                    "nodename == node1"
                ]
            }
        },
        "ServiceID": "duvqo3btdrrlwf61g3bu5uaom",
        "Slot": 1,
        "Status": {
            "Timestamp": "2017-01-06T23:03:01.935553276Z",
            "State": "allocated",
            "Message": "allocated",
            "ContainerStatus": {}
        },
        "DesiredState": "running"
    }
]

Shut down (remove) the service and then create it again on node3:

$ sudo docker service create --name testcon --constraint 'node.labels.nodename == node3' --mount type=bind,source=/swarm/volumes/testvol,target=/mnt/testvol ubuntu:latest /bin/touch /mnt/testvol/testfile3.txt
5y99c0bfmc2fywor3lcsvmm9q

Verify it has run on node3:

$ sudo docker service ps testcon
ID                         NAME           IMAGE          NODE             DESIRED STATE  CURRENT STATE           ERROR
5p57xyottput3w34r7fclamd9  testcon.1      ubuntu:latest  ip-172-31-26-52  Ready          Ready 1 seconds ago     
aniesakdmrdyuq8m2ddn3ga9b   \_ testcon.1  ubuntu:latest  ip-172-31-26-52  Shutdown       Complete 2 seconds ago

Now check that the files created by both containers exist in the same shared volume:

$ ls -l /swarm/volumes/testvol/
total 0
-rw-r--r-- 1 root root 0 Jan  6 23:59 testfile3.txt
-rw-r--r-- 1 root root 0 Jan  6 23:58 testfile1.txt

Containerizing Alfresco

What is Alfresco?

Alfresco is an open source document management system for Microsoft Windows and Unix-like operating systems. It comes in Community and Enterprise editions that differ only in scalability and high-availability features. It’s powerful, easy to use, and more mature than many other open source document management systems. Alfresco is Java-based software; the Community edition comes bundled with the Tomcat application server and uses a PostgreSQL database as the backend.

Introduction

The scope here is to containerize Alfresco. The setup will have two containers: one for PostgreSQL and another for Alfresco. The container entry point detects whether Alfresco is installed; if it is not, the entry point installs it, otherwise it starts the Tomcat server to run the Alfresco application. Here I’m relying on Docker Compose to run the environment.
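
To illustrate that decision, here is a minimal sketch of the entry point logic (written in Python for readability; the real alfresco-entrypoint.sh is a shell script, and the paths and installer flags below are assumptions for illustration only):

import os
import subprocess

ALFRESCO_HOME = '/opt/alfresco'    # assumed install target
INSTALLER = '/opt/alfresco/alfresco/alfresco-community-installer-201605-linux-x64.bin'

def entrypoint():
        tomcat = os.path.join(ALFRESCO_HOME, 'tomcat')
        if not os.path.isdir(tomcat):
                # First run: no Tomcat tree yet, so run the installer
                # (the unattended-mode flag is an assumption about the installer).
                subprocess.check_call([INSTALLER, '--mode', 'unattended'])
        # Installed (or just installed): start Tomcat to serve Alfresco.
        subprocess.check_call([os.path.join(tomcat, 'bin', 'catalina.sh'), 'run'])

if __name__ == '__main__':
        entrypoint()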

Details

– Set up the directory hierarchy. The following is the directory hierarchy for the container folder:

alfresco_root/
  docker-compose.yml
  README.md
  alfresco/                     Alfresco Docker container folder
    alfresco-entrypoint.sh      Docker entry point for Alfresco container
    properties.sh               Configuration parameters for Alfresco installation
    Dockerfile                  Dockerfile to build Alfresco container
    alfresco/                   Shared volume for Alfresco
  db_store/                     PostgreSQL Docker container folder
    data/                       Shared volume for PostgreSQL database storage

– On the container host: create the group “alfresco” (GID: 888) and the user “alfresco” (UID: 888), with home directory /opt/alfresco (the Alfresco root).

– Log in as the user “alfresco”.
– Move the above hierarchy into /opt/alfresco (available at https://github.com/tembaby/alfresco-container).
– Build the Alfresco container:

$ docker-compose build alfresco

– If SELinux is enabled, run the following command on all shared volumes (this was needed on CentOS 7.2 with Docker 1.9.1):

$ chcon --recursive --type=svirt_sandbox_file_t --range=s0 /path/to/volume

– Copy the Alfresco installation file “alfresco-community-installer-201605-linux-x64.bin” to the alfresco shared volume.
– Copy the file properties.sh to the alfresco shared volume and change HOSTNAME to match the hostname of the Docker host.
– Change the hostname as well in the docker-compose.yml file (alfresco service), under the hostname: key.

– Bring the environment up without the -d parameter the first time, to make sure everything installs OK:

$ docker-compose up

– After the installation finishes and Alfresco is accessible on port 8080, press Ctrl-C.
– Start the environment in daemon mode:

$ docker-compose up -d

Now you can enjoy the Alfresco Document Management System running in Docker containers.