Deploying Ceph – I: Initial environment

What is Ceph

Ceph is an open source distributed storage platform. It stores data on a single cluster running on distributed computers and provides interfaces for object, block and file-level storage. Ceph is completely free, aims to have no single point of failure, and is designed to scale to the exabyte level.

Ceph is fault tolerant: it replicates data across the cluster. By default, data stored in Ceph is replicated to 3 different storage devices. The replication level can be changed.

At a high level, when one of the storage devices fails, Ceph re-replicates the missing data to the device that replaces the failed one. An in-depth technical explanation of this process follows later in the document.

Basic Ceph components

Cluster monitors (ceph-mon) keep track of active and failed cluster nodes, monitor the Ceph cluster, and report on its health and status.

Metadata servers (ceph-mds) store the metadata of inodes and directories.

Object storage devices (ceph-osd) store the actual content of files. This daemon is responsible for storing data on a local file system and providing access to this data over the network via different client software or access mediums. The OSD daemon is also responsible for data replication.

Data placement

Ceph stores, replicates and re-balances data objects across the cluster dynamically. To understand how Ceph places data in a cluster, the following terms must be understood:

Pools: Ceph stores data within pools, which are logical groups for storing objects. Pools manage the number of placement groups, the number of replicas, and the rule-set for the pool.

Placement Groups: This is the most important thing to understand. Ceph maps objects to placement groups (PGs). Placement groups (PGs) are shards or fragments of a logical object pool that place objects as a group into OSDs. Placement groups reduce the amount of per-object metadata when Ceph stores the data in OSDs. A larger number of placement groups (e.g., 100 per OSD) leads to better balancing.

In a nutshell, with the default replication level of 3, each placement group is assigned exactly 3 OSDs; with 4-way replication, each PG has 4 OSDs. When data is stored in the cluster, the CRUSH algorithm (described below) assigns it to a PG. The data is then replicated across all OSDs assigned to that PG, so with 3-way replication it is stored 3 times.

Each OSD daemon is responsible for one storage device, so there is a one-to-one mapping between OSDs and storage devices.
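The rule of thumb of about 100 placement groups per OSD can be turned into a quick calculation. The sketch below follows the commonly cited guideline of (OSDs x 100) / replicas, rounded up to the next power of two. The function name suggested_pg_count is made up for this example, and the result is only a starting point for sizing a pool, not a value Ceph computes for you.

```shell
# Rough PG-count estimate: (OSDs * per-OSD target) / replicas,
# rounded up to the next power of two.
# suggested_pg_count is a hypothetical helper, not a Ceph command.
suggested_pg_count() {
    osds=$1
    replicas=$2
    per_osd=${3:-100}    # target PGs per OSD, 100 as in the text
    raw=$(( osds * per_osd / replicas ))
    pgs=1
    while [ "$pgs" -lt "$raw" ]; do
        pgs=$(( pgs * 2 ))
    done
    echo "$pgs"
}

suggested_pg_count 4 3    # 4 OSDs, 3-way replication: prints 256
```

With 4 OSDs and 3 replicas the raw value is 133, which rounds up to 256.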

CRUSH Maps: CRUSH is a big part of what allows Ceph to scale without performance bottlenecks, without limitations to scalability, and without a single point of failure. CRUSH maps provide the physical topology of the cluster to the CRUSH algorithm to determine where the data for an object and its replicas should be stored, and how to do so across failure domains for added data safety among other things.

Explanation Scenario

This is not a real-world example; it is kept simple for explanation purposes only.

A Ceph cluster has 4 OSDs, 3-way replication, and 10 PGs.


When we store data in the Ceph cluster, the CRUSH algorithm selects the PG in which to place the data; for example, it selects PG2. The data is then replicated to the OSDs assigned to PG2, in this case OSD2, OSD3 and OSD4 (3-way replication).
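To make the scenario concrete, here is a minimal sketch of the two lookups involved: object name to placement group, then placement group to OSDs. It is not the CRUSH algorithm. cksum stands in for Ceph's object-name hash, and the round-robin rule in osds_for_pg is an assumption chosen only so that PG2 lands on OSD2, OSD3 and OSD4 as in the text. Both function names are hypothetical.

```shell
# Map an object name to a placement group number in 1..pg_num.
# cksum is a stand-in for Ceph's real hash function.
pg_for_object() {
    name=$1
    pg_num=$2
    crc=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    echo $(( crc % pg_num + 1 ))
}

# List the OSDs assigned to a PG using a simple round-robin rule.
# This is an assumed placement rule for illustration, NOT CRUSH.
osds_for_pg() {
    pg=$1
    replicas=$2
    num_osds=$3
    osds=""
    i=0
    while [ "$i" -lt "$replicas" ]; do
        osds="$osds $(( (pg - 1 + i) % num_osds + 1 ))"
        i=$(( i + 1 ))
    done
    echo $osds    # unquoted on purpose: trims the leading space
}

# Scenario from the text: 10 PGs, 3-way replication, 4 OSDs.
pg=$(pg_for_object "my-object" 10)
echo "my-object -> PG$pg -> OSDs: $(osds_for_pg "$pg" 3 4)"
echo "PG2 -> OSDs: $(osds_for_pg 2 3 4)"
```

The point of the two-step lookup is that any client can compute where an object lives without asking a central directory.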

In the ideal case, the PG is in the clean state when all its OSDs are up and in, and all the data is replicated correctly.

Failure scenario

In the above scenario, suppose the disk served by OSD1 has failed. PG1 and PG3 will be in a degraded and unclean state until the drive is fixed or replaced. A new OSD can be introduced; the degraded PGs will be assigned to it in place of OSD1, and the process of re-balancing and replicating the data will start.
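The effect of a failed OSD on placement groups can be sketched with a small lookup table. The table below is hypothetical: it lists only three PGs, with OSD assignments chosen to agree with the text (PG2 on OSD2, OSD3 and OSD4; PG1 and PG3 both using OSD1). degraded_pgs is an illustrative helper, not a Ceph command.

```shell
# Hypothetical PG-to-OSD assignments (PG name followed by its OSD ids).
pg_table='PG1 1 2 3
PG2 2 3 4
PG3 1 3 4'

# Print every PG whose OSD list contains the failed OSD.
degraded_pgs() {
    failed_osd=$1
    printf '%s\n' "$pg_table" | while read -r pg osds; do
        for osd in $osds; do
            if [ "$osd" = "$failed_osd" ]; then
                echo "$pg"
                break
            fi
        done
    done
}

degraded_pgs 1    # prints PG1 and PG3, one per line
```

PG2 does not use OSD1, so it stays clean while the other two wait for a replacement OSD.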

Deployment steps

The setup below is only an initial one; it is not suitable for running production storage on Ceph. It uses 3-way replication with only two OSDs, so it is degraded and unclean from the start, until we add more OSDs.

First, disable the firewall.

Commands to be run as root are indicated by the # prompt; otherwise the user ceph is used.

In this scenario:

– 1 manager node running the Ceph monitor. It is also used as the administration node. It has the following specs:

  • 2 vCPUs
  • 6 GB RAM
  • 60 GB Storage
  • VM running on XenServer 6.2
  • CentOS 7.2 OS

– 1 node used as storage node with the following specs:

  • 2 x quad-core Intel Xeon processors with hyper-threading enabled (total of 16 vCPUs)
  • 128 GB RAM
  • 2 x 2 TB SAS disk drives
  • CentOS 7.2 OS

Each disk drive above will be assigned to an OSD. More OSDs and monitor nodes will be added later.

We will install Ceph Hammer, although Jewel had already been released at the time of this deployment.

1- Add the user ceph to all nodes and give it passwordless sudo access:

# vi /etc/sudoers.d/ceph
Defaults:ceph !requiretty
ceph ALL = (root) NOPASSWD:ALL

Give it appropriate permissions:

# chmod 440 /etc/sudoers.d/ceph

2- Install the packages that add the yum repositories for EPEL and Ceph on all nodes:

# yum -y install centos-release-ceph-hammer epel-release yum-plugin-priorities

3- Make Ceph repository have priority over other repositories on all nodes.

Add “priority=1” after the “enabled=1” value in the /etc/yum.repos.d/CentOS-Ceph-Hammer.repo file.
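The edit above can also be scripted, which helps when the same change has to be made on every node. This is a minimal sketch, assuming GNU sed (as shipped with CentOS) and a repo file that writes the setting exactly as enabled=1. set_repo_priority is a hypothetical helper name.

```shell
# Insert "priority=1" after every "enabled=1" line of a yum repo file,
# unless the file already sets a priority (so it is safe to re-run).
# Relies on GNU sed's one-line form of the "a" (append) command.
set_repo_priority() {
    repo_file=$1
    if grep -q '^priority=' "$repo_file"; then
        return 0
    fi
    sed -i '/^enabled=1/a priority=1' "$repo_file"
}

# Example (as root):
# set_repo_priority /etc/yum.repos.d/CentOS-Ceph-Hammer.repo
```

Sections of the file that are disabled (enabled=0) are left untouched.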

4- Install the Ceph deployer package on the administration node

# yum -y install ceph-deploy

Now log in as user ceph on the administration machine and continue with the following steps.

5- Generate and distribute SSH keys for the user ceph:

$ ssh-keygen
$ ssh-copy-id ceph@ceph-stor-01

Since the administration node is also the first Ceph monitor, the SSH key must be authorized on the local machine as well (the source file below assumes the default public key name, id_rsa.pub):

$ cp .ssh/id_rsa.pub .ssh/authorized_keys
$ chmod 600 .ssh/authorized_keys
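Copying a key file over authorized_keys replaces any keys already stored there. A slightly more careful variant, sketched here, appends the key only when it is missing and then sets the permissions. authorize_key is a hypothetical helper, and the path in the usage comment assumes ssh-keygen was run with its default file name.

```shell
# Append a public key to an authorized_keys file only if it is not
# there yet, then restrict the file's permissions to the owner.
# authorize_key is a hypothetical helper name.
authorize_key() {
    pub_key=$1
    auth_file=$2
    touch "$auth_file"
    if ! grep -qxF -- "$(cat "$pub_key")" "$auth_file"; then
        cat "$pub_key" >> "$auth_file"
    fi
    chmod 600 "$auth_file"
}

# Example: authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```

Running it twice leaves a single copy of the key in the file.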

6- Prepare a local directory for the installation:

$ mkdir ceph
$ cd ceph/

7- Start deploying a new cluster, and write a .conf and keyring for it.

$ ceph-deploy new --cluster-network --public-network ceph-mon-01

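The cluster network is the private network used for communication within the cluster; the public network is where the Ceph nodes and storage are accessible. Each of the two options takes a subnet in CIDR notation; the values are omitted in the command above.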

8- Start installing Ceph components on the monitor node:

Components to be installed: Ceph Monitor, Ceph MDS, and the Ceph REST gateway (RGW).

$ ceph-deploy install --mon --rgw --mds --release hammer ceph-mon-01

9- Deploy the monitors defined in `ceph.conf:mon_initial_members`, wait until they form a quorum, and then gather the keys, reporting the monitor status along the way. If the monitors don’t form a quorum, the command will eventually time out.

$ ceph-deploy mon create-initial

10- Install the OSD daemon on the storage node that will host the initial 2 OSDs:

$ ceph-deploy install --release hammer --osd ceph-stor-01

11- Copy the keyrings and configuration files to all nodes:

$ ceph-deploy admin ceph-mon-01 ceph-stor-01
$ sudo chmod 644 /etc/ceph/ceph.client.admin.keyring

12- Prepare Object Storage Daemon on the initial OSD node:

The ceph-stor-01 node has 2 disk drives; we dedicate 1.6 TB on each drive to a separate OSD. The file system used is XFS.

$ ceph-deploy osd prepare ceph-stor-01:/ceph/storage01 ceph-stor-01:/ceph/storage02

13- Activate Object Storage Daemon

$ ceph-deploy osd activate ceph-stor-01:/ceph/storage01 ceph-stor-01:/ceph/storage02

14- Some commands to make sure we are on the right track

Check available storage:

$ ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    3214G     3204G     10305M       0.31
POOLS:
    NAME     ID     USED     %USED     MAX AVAIL     OBJECTS
    rbd      0      0        0         1068G         0
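The 1068G reported for the rbd pool is roughly the raw free space divided by the replica count. A quick check, assuming 3-way replication and ignoring the other factors Ceph takes into account (such as full ratios and uneven OSD usage); pool_max_avail_gb is a hypothetical helper:

```shell
# Approximate usable space of a replicated pool: every object is
# stored "replicas" times, so usable space = raw free space / replicas.
pool_max_avail_gb() {
    raw_avail_gb=$1
    replicas=$2
    echo $(( raw_avail_gb / replicas ))
}

pool_max_avail_gb 3204 3    # prints 1068
```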

Check cluster status:

$ ceph -w
cluster 1ad7d5ae-3414-4286-b42e-7082b3355fde
64 pgs degraded
64 pgs stuck degraded
64 pgs stuck inactive
64 pgs stuck unclean
64 pgs stuck undersized
64 pgs undersized
monmap e1: 1 mons at {ceph-mon-01=}
election epoch 2, quorum 0 ceph-mon-01
osdmap e9: 2 osds: 2 up, 2 in
pgmap v15: 64 pgs, 1 pools, 0 bytes data, 0 objects
10305 MB used, 3204 GB / 3214 GB avail
64 undersized+degraded+peered
2016-04-28 23:41:34.137681 mon.0 [INF] pgmap v15: 64 pgs: 64 undersized+degraded+peered; 0 bytes data, 10305 MB used, 3204 GB / 3214 GB avail

The cluster is degraded and will remain so until we add more OSDs (again, because it uses 3-way replication with only 2 OSDs).