This is the third in a series of posts about how DreamHost builds its Cloud products, written by Luke Odom straight from data center operations.
To pick the drives for our new DreamCompute cluster, we started by looking at the IO usage and feedback from the beta cluster. The amount of data moving around the beta cluster was very small. Despite the fact that we had over-provisioned and had very little disk activity, we still received a lot of feedback requesting faster IO. There were three main reasons that throughput and latency in the beta cluster were poor.
The first reason was our choice of processor and amount of RAM. We used machines with the same specs we use for DreamObjects. That works well there, where there is a lot of storage that is not accessed very often. In DreamCompute, however, we store much less data, but that data is comparatively much more active. The second reason was density. Ceph performs better when you have fewer drives per machine, and recovery is also faster when you have smaller drives.
With the original DreamCompute cluster, we had machines that contained 12 Ceph drives. Each drive was 3TB in size, for a total of 36TB of storage per node. This was too dense for our needs. The third problem was with the SAS expanders we used. Our RAID cards were four-channel cards, meaning they can talk to four drives at a time. However, we had 14 drives attached (12 Ceph drives plus two boot drives). This required a "SAS expander," a device that sits between the RAID card and the drives and acts like a traffic light. Unfortunately, the SAS expander we were using could only handle enough traffic to fully utilize two lanes. Imagine 14 cars traveling on one two-lane road: as long as most of the cars are parked most of the time, it's fine, but if they all want to drive at the same time, traffic gets slow. For the new cluster, we wanted to remove the latency of a SAS expander and attach no more drives than the RAID card or motherboard could support directly. We also made sure that we had ample RAM and fast processors to avoid other hardware-related latency.
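To put rough numbers on that traffic-light analogy, here is a back-of-the-envelope sketch. The lane and per-drive throughput figures are illustrative assumptions, not measurements from our hardware:

```python
# Rough oversubscription estimate for drives sharing a SAS expander.
# All figures below are illustrative assumptions, not measured values.

LANE_GBPS = 6          # assumed throughput of a single lane, in Gbit/s
USABLE_LANES = 2       # the expander could only keep two lanes busy
DRIVES = 14            # 12 Ceph drives + 2 boot drives
DRIVE_GBPS = 1.2       # assumed sustained throughput of one drive, in Gbit/s

uplink = USABLE_LANES * LANE_GBPS   # bandwidth available through the expander
demand = DRIVES * DRIVE_GBPS        # bandwidth if every drive is busy at once

print(f"Expander uplink:   {uplink:5.1f} Gbit/s")
print(f"Worst-case demand: {demand:5.1f} Gbit/s")
print(f"Oversubscription:  {demand / uplink:.1f}x")
# With these assumptions the drives could ask for ~1.4x more bandwidth than
# the expander can deliver, so heavy concurrent IO backs up behind it.
```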
There are two broad categories of drives in both the server and consumer markets: traditional hard drives (HDDs), which store data magnetically on spinning platters, and solid state drives (SSDs), which store data in integrated memory circuits. SSDs are much faster and have lower latency than HDDs because they don't have to wait for data to come around on a spinning disk. However, SSDs are currently substantially more expensive. The feedback from our customers told us that fast IO and low latency were very important, so we decided to go with SSDs!
There are three choices on how the SSD will interface with the server:
- The first is SATA. Most consumer desktops and laptops connect to their storage devices over SATA, an improvement on the older PATA interface that was also common on consumer hardware. Historically, SATA isn't used very often in data center environments because it doesn't play well with RAID cards; however, we won't be putting these drives in a RAID array. SATA also has a max throughput of 6 Gbit/s, which is slower than our other options.
- Serial Attached SCSI (SAS) was developed from the more enterprise-focused SCSI (pronounced like "scuzzy"). SAS has many advantages, but the main benefit for us is that its speed limit is 12 Gbit/s, double that of SATA. The major disadvantage of SAS is that it isn't directly supported by any server chipsets, which means you have to put the drives behind a RAID card even if you don't need any RAID functionality. SAS drives are also much more expensive because the SAS controllers they use are very expensive.
- NVMe is an interface designed specifically for SSDs. NVMe drives are more than twice as fast as SATA or SAS SSDs, but they are just beginning to hit the market, so they are much more expensive than SATA or SAS.

All factors considered, we felt that SATA was the best choice. The drives were still four times faster than the drives in our original cluster, and each drive was only 1TB instead of 3TB, so we could now measure recovery time after drives were added or failed in hours instead of days (a rough estimate of that difference is sketched below).
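As an illustration of why smaller, faster drives recover so much more quickly, here is a simplified back-of-the-envelope calculation. The per-drive recovery throughput figures are assumptions for illustration only, not measurements from our cluster, and real Ceph recovery is spread across many OSDs:

```python
# Back-of-the-envelope recovery estimate for a single failed drive.
# The recovery throughput numbers below are assumed, illustrative values.

def recovery_hours(drive_tb: float, recovery_mb_s: float) -> float:
    """Hours to re-replicate the contents of one failed drive."""
    drive_mb = drive_tb * 1_000_000          # TB -> MB (decimal units)
    return drive_mb / recovery_mb_s / 3600   # seconds -> hours

# Old cluster: 3TB drives behind a saturated SAS expander (assume ~15 MB/s
# of recovery traffic per drive once client IO and the expander contend).
old = recovery_hours(3.0, 15)

# New cluster: 1TB SATA SSDs directly attached (assume ~200 MB/s available).
new = recovery_hours(1.0, 200)

print(f"Old cluster: ~{old:.0f} hours (~{old / 24:.1f} days)")
print(f"New cluster: ~{new:.1f} hours")
```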
Another factor to consider with SSDs is how much information each memory cell holds. The three options are one bit per cell (SLC), two bits per cell (MLC), or three bits per cell (TLC). The more information each cell holds, the fewer times it can be rewritten over its lifetime, so durability decreases as data per cell increases. Because of the large price difference between MLC and SLC, manufacturers have also released a couple of intermediate options such as eMLC (MLC with greater endurance and more over-provisioning) and pSLC ("pseudo-SLC," which uses MLC cells but writes only one bit per cell). Based on testing in the original DreamCompute cluster, we wouldn't come anywhere near TLC's limitations for years. However, we purchased MLC-based drives just in case usage was higher than expected. Unfortunately, we had firmware issues with the MLC drives. We got those worked out, but as a precaution we replaced half of the drives in the cluster with a different manufacturer's TLC-based drives, so we now have a 50/50 mixture of MLC and TLC enterprise drives.
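To give a feel for the math behind "we wouldn't be near the TLC limitations for years," here is a hedged sketch. The endurance rating and write-rate figures are placeholder assumptions, not the actual numbers from our drives or our cluster:

```python
# Simplified flash-endurance estimate: how long until a drive exhausts its
# rated write endurance? All numbers here are placeholder assumptions.

DRIVE_TB = 0.96            # usable capacity of one drive, in TB
RATED_DWPD = 1.0           # assumed "drive writes per day" rating over 5 years
AVG_WRITE_MB_S = 5.0       # assumed average write rate hitting one drive

rated_tb_per_day = DRIVE_TB * RATED_DWPD
actual_tb_per_day = AVG_WRITE_MB_S * 86_400 / 1_000_000   # MB/s -> TB/day

lifetime_budget_tb = rated_tb_per_day * 365 * 5            # 5-year write budget
years_to_exhaust = lifetime_budget_tb / (actual_tb_per_day * 365)

print(f"Rated budget:  {lifetime_budget_tb:.0f} TB written over 5 years")
print(f"Actual writes: {actual_tb_per_day:.2f} TB/day")
print(f"Budget lasts: ~{years_to_exhaust:.0f} years at this write rate")
```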
Enterprise SSDs, especially SATA ones, are very similar to the SSDs used in consumer desktops and laptops, with a few critical differences. The biggest difference is the use of an integrated supercapacitor. On an enterprise drive, should power fail in the middle of a write, the supercapacitor keeps the drive powered long enough to finish the write in progress and prevent data corruption. Another difference is the amount of over-provisioning. All SSDs expose less storage than they actually have installed, because memory cells occasionally wear out and need to be replaced; when that happens, the SSD starts using one of its spare cells. Since enterprise SSDs are usually under more stress, more memory cells are allocated as spares. For example, a consumer drive with 512GB of internal flash is usually sold as a 500GB drive with 12GB of over-provisioning, while the drives we buy are sold as either 400GB or 480GB depending on the workload they are rated for. The firmware on enterprise SSDs is also tuned for enterprise environments, which are usually much more write-heavy than consumer workloads. For these reasons, we only use enterprise-rated SSDs in DreamCompute.
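The difference in spare area is easier to see as a percentage. A quick sketch, assuming 512GB of raw flash in every case (the text above only gives the raw size for the consumer example, so the enterprise figures are an assumption for illustration):

```python
# Over-provisioning as a share of raw flash held in reserve.
# Assumes 512GB of raw flash in each drive; illustrative only.

RAW_GB = 512

for label, advertised_gb in [("consumer 500GB", 500),
                             ("enterprise 480GB", 480),
                             ("enterprise 400GB", 400)]:
    spare = RAW_GB - advertised_gb
    print(f"{label:>17}: {spare:3d}GB spare ({spare / RAW_GB:.0%} of raw flash)")
```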
The final factor to consider was whether we should put the SSDs behind a RAID card or attach them directly to the motherboard. A RAID array can be created in software, but to add extra features, arrays are often implemented on a separate card. The redundancy a RAID card provides isn't needed with Ceph, which provides its own redundancy. RAID cards can also protect against data corruption by using a battery or capacitor to power the card so it can hold incomplete writes in memory through an unexpected reboot; this isn't needed either, since enterprise SSDs have capacitors built in. The final advantage of a RAID card is its onboard cache. The RAID card's cache isn't any faster than an SSD's onboard cache, but by setting each disk up as a single-disk RAID array, we can use the drive's capacitor-protected cache as a write cache and the RAID card's cache as a read-ahead cache. Our testing showed a slight speed increase with this setup.
The systems we plan on using have ten SATA ports spread across two controllers on the motherboard (four on CPU 1, six on CPU 2). We could instead use an eight-port RAID card for the eight Ceph drives and directly connect only the two boot drives. In the end, we decided the expense and potential for problems weren't worth the marginal speed increase, and we chose to attach all of the drives directly to the motherboard.
So the final layout is eight 960GB enterprise SSDs (half TLC, half MLC) directly attached to the motherboard, with four Ceph drives and both boot drives going to one CPU and the other four Ceph drives going to the second CPU.
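For anyone curious how drives end up mapped to the two onboard controllers on a Linux host like this, here is a small sketch that groups block devices by the PCI controller they sit behind, using sysfs. It is a generic illustration, not a tool we use in production:

```python
# Group block devices by the PCI controller (e.g. an onboard AHCI/SATA host)
# they hang off of, by resolving each device's sysfs path. Linux-only sketch.
import os
import re
from collections import defaultdict

controllers = defaultdict(list)

for dev in sorted(os.listdir("/sys/block")):
    if not re.match(r"sd[a-z]+$", dev):           # only SATA/SAS-style disks
        continue
    real = os.path.realpath(f"/sys/block/{dev}")  # e.g. .../0000:00:1f.2/ata3/...
    pci = re.findall(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]", real)
    controllers[pci[-1] if pci else "unknown"].append(dev)

for pci_addr, devices in sorted(controllers.items()):
    print(f"{pci_addr}: {', '.join(devices)}")
```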
Stay tuned for more inside looks at our Cloud in the next post, coming soon!