Build servers offline due to failed SSD

Build servers offline due to failed SSD

ryandesign2
Administrator
We got through the winter storms but now there's a new problem. The SSD that the buildmaster VM is stored on and that boots up VMware ESXi is failing. I'm currently setting up a new ESXi startup disk and trying to find a temporary disk I can move that VM to to get us back up and running.

The long-term plan is to rewrite the build system to run under buildbot version 2. I already created a buildbot 2 VM on a new SSD last year but I haven't finished rewriting the configuration so we can't just switch to that yet.



Re: Build servers offline due to failed SSD

ryandesign2
Administrator


On Feb 21, 2021, at 10:08, Ryan Schmidt wrote:

> We got through the winter storms but now there's a new problem. The SSD that the buildmaster VM is stored on and that boots up VMware ESXi is failing. I'm currently setting up a new ESXi startup disk and trying to find a temporary disk I can move that VM to to get us back up and running.

Builds are resuming though the buildmaster web interface is not yet available.


Re: Build servers offline due to failed SSD

ryandesign2
Administrator


On Mar 2, 2021, at 09:03, Ryan Schmidt wrote:

> On Feb 21, 2021, at 10:08, Ryan Schmidt wrote:
>
>> We got through the winter storms but now there's a new problem. The SSD that the buildmaster VM is stored on and that boots up VMware ESXi is failing. I'm currently setting up a new ESXi startup disk and trying to find a temporary disk I can move that VM to to get us back up and running.
>
> Builds are resuming though the buildmaster web interface is not yet available.

The buildbot web interface has been available read-only for a few days and is now back to normal functionality. The backlog of builds has been mostly completed, except on 10.15. The buildmaster is still running on a slow temporary disk so it's probably best not to access it unless you have to.

Re: Build servers offline due to failed SSD

MacPorts - Users mailing list
Isn’t SSD a bad choice for server duty? No server farms use them, apparently due to short lifespan.

Dave

Re: Build servers offline due to failed SSD

Andrew Udvare

> On 2021-03-07, at 01:20, Dave C via macports-users <[hidden email]> wrote:
>
> Isn’t SSD a bad choice for server duty? No server farms use them, apparently due to short lifespan.
>
> Dave

Plenty of servers use SSDs now, usually with HDDs to lower cost. The default option on AWS EC2 is to use an SSD.

There are enterprise grade SSDs that basically have the same characteristics as enterprise HDDs. Usually they are not as cheap as HDDs but sometimes are due to the underlying technology. Spinning discs will remain useful for long term storage that can tolerate large delays, but I don't see them being used for much else soon.

If SSDs become as cheap as HDDs with the same expected enterprise-level tolerances, there will be no reason to keep HDDs. That would mean you get the same benefits as an HDD, but with huge performance increases, and huge decreases in power usage.

Andrew

Re: Build servers offline due to failed SSD

MacPorts - Users mailing list
In reply to this post by ryandesign2
This applies to affordable SSDs. As you say, the ones that are on par (re. reliability) with HDDs are $pendy.

It’s something to do with an SSD’s limited number of write cycles, if I remember...

Dave

- - -

> Isn’t SSD a bad choice for server duty? No server farms use them, apparently due to short lifespan.
>
> Dave


Re: Build servers offline due to failed SSD

ryandesign2
Administrator
In reply to this post by MacPorts - Users mailing list


On Mar 7, 2021, at 00:20, Dave C wrote:

> Isn’t SSD a bad choice for server duty?

My opinion is that it is a good choice in terms of performance. When I first set up this incarnation of our buildbot system in 2016 I had the workers running on SSDs so that builds would be fast (our previous buildbot setup at Apple's macOS forge used a very expensive very-many-hard-disk RAID; we were in no position to purchase any equivalent type of hardware once we left macOS forge) and I had the master and distfiles/packages storage on a hard disk RAID for reliability. The specific RAID that I have turned out to be too slow. Web requests could take many seconds to respond. GitHub Web Hooks being delivered to the server could be marked as failed because GitHub didn't always wait long enough for our server to respond. This was unsatisfying so I moved the buildmaster to an old SSD while keeping the large files on the RAID. This was much faster, though not as fast as if I had used a new SSD, which is what I will ultimately be using. For now, the buildmaster is temporarily running off a USB hard drive and is slow as molasses. This is a terrible choice but all drive bays are already occupied by the RAID.

All of the SSDs we used for the workers have failed as well: two last year and the last one last month. In response to these failures, someone else also suggested that we should not use SSDs. I've run one of the workers off of three independent hard disks for the past year, and my opinion is that the performance and power consumption of SSDs are much better, so I will switch the hard disk-based worker back to an SSD in the future. You can read this discussion here:

https://trac.macports.org/ticket/60178


Re: Build servers offline due to failed SSD

Dave Horsfall
In reply to this post by MacPorts - Users mailing list
On Sat, 6 Mar 2021, Dave C via macports-users wrote:

> Isn’t SSD a bad choice for server duty? No server farms use them,
> apparently due to short lifespan.

If you knew how SSDs worked then you wouldn't use them at all without many
backups.  Give me spinning rust any day...

-- Dave

Re: Build servers offline due to failed SSD

MacPorts - Users mailing list
I’d really love to know more about what you’re saying here. Up until I just read what you wrote, I thought SSDs were the savior of HDDs.

Michael A. Leonetti
As warm as green tea

> On 3/7/21, at 5:26 PM, Dave Horsfall <[hidden email]> wrote:
>
> On Sat, 6 Mar 2021, Dave C via macports-users wrote:
>
>> Isn’t SSD a bad choice for server duty? No server farms use them, apparently due to short lifespan.
>
> If you knew how SSDs worked then you wouldn't use them at all without many backups.  Give me spinning rust any day...
>
> -- Dave


Re: Build servers offline due to failed SSD

John Chivian
The “on/off” switches in SSDs are fragile and essentially break after too many read/write cycles. As pointed out, it’s a get-what-you-pay-for world and cheap SSDs are just that… cheap. The expensive ones are more reliable because they actually make available only a portion of their total capacity, reserving the rest as replacements for such failures. Intelligent software within the firmware manages this so that the end user experiences a much longer device lifespan.
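
The spare-capacity idea described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: real firmware also spreads writes across all blocks (wear leveling), and the class name, block counts, and erase budget here are hypothetical values chosen for illustration.

```python
class ToyFlash:
    """Toy model: logical blocks backed by physical blocks with a fixed erase budget."""

    def __init__(self, visible_blocks, spare_blocks, max_erases):
        self.max_erases = max_erases
        # Logical block i starts out mapped to physical block i.
        self.mapping = list(range(visible_blocks))
        # Spare physical blocks are held back and never exposed to the user.
        self.spares = list(range(visible_blocks, visible_blocks + spare_blocks))
        self.erase_count = [0] * (visible_blocks + spare_blocks)

    def write(self, logical_block):
        """Write one logical block; return False once no healthy block is left for it."""
        physical = self.mapping[logical_block]
        if self.erase_count[physical] >= self.max_erases:
            if not self.spares:
                return False  # worn out and nothing left to swap in
            # Retire the worn block and remap to a reserved spare.
            physical = self.spares.pop()
            self.mapping[logical_block] = physical
        self.erase_count[physical] += 1
        return True


def writes_until_failure(spare_blocks):
    """Hammer a single logical block until the device can no longer accept the write."""
    flash = ToyFlash(visible_blocks=4, spare_blocks=spare_blocks, max_erases=100)
    writes = 0
    while flash.write(0):
        writes += 1
    return writes


print("no spares:", writes_until_failure(0))
print("two spares:", writes_until_failure(2))
```

In this model each reserved spare adds one more erase budget's worth of writes to the hammered block, which is the effect the post attributes to the more expensive drives.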

There’s lots of technical documentation for such.  Google knows.

Regards,


> On Mar 7, 2021, at 18:15, Michael A. Leonetti via macports-users <[hidden email]> wrote:
>
> I’d really love to know more about what you’re saying here. Up until I just read what you wrote, I thought SSDs were the savior of HDDs.
>
> Michael A. Leonetti
> As warm as green tea
>
>> On 3/7/21, at 5:26 PM, Dave Horsfall <[hidden email]> wrote:
>>
>> On Sat, 6 Mar 2021, Dave C via macports-users wrote:
>>
>>> Isn’t SSD a bad choice for server duty? No server farms use them, apparently due to short lifespan.
>>
>> If you knew how SSDs worked then you wouldn't use them at all without many backups.  Give me spinning rust any day...
>>
>> -- Dave
>


Re: Build servers offline due to failed SSD

MacPorts - Users mailing list
To emphasize again, the reason SSDs aren’t recommended for servers is because servers—by definition—see much heavier service, and these read/write cycles are used up more quickly.

For personal use in a PC, or such, SSDs are proving to be the dream they were promised to be.

As mentioned, given time, the technology will overcome this limitation for use in servers and these comments will be just so much past history.

Dave C.

- - -

> The “on/off” switches in SSDs are fragile and essentially break after too many read/write cycles. As pointed out, it’s a get-what-you-pay-for world and cheap SSDs are just that… cheap. The expensive ones are more reliable because they actually make available only a portion of their total capacity, reserving the rest as replacements for such failures. Intelligent software within the firmware manages this so that the end user experiences a much longer device lifespan.
>
> There’s lots of technical documentation for such.  Google knows.
>
> Regards,
>
>
>>> On Mar 7, 2021, at 18:15, Michael A. Leonetti via macports-users <[hidden email]> wrote:
>> I’d really love to know more about what you’re saying here. Up until I just read what you wrote, I thought SSDs were the savior of HDDs.
>> Michael A. Leonetti
>> As warm as green tea
>>> On 3/7/21, at 5:26 PM, Dave Horsfall <[hidden email]> wrote:
>>> On Sat, 6 Mar 2021, Dave C via macports-users wrote:
>>>> Isn’t SSD a bad choice for server duty? No server farms use them, apparently due to short lifespan.
>>> If you knew how SSDs worked then you wouldn't use them at all without many backups.  Give me spinning rust any day...
>>> -- Dave


Re: Build servers offline due to failed SSD

Todd Doucet
I think one can only get so far with purely qualitative analysis of the characteristics of SSDs and HDs, and the end of that analysis will be one-size-fits-all advice, for example "recommended" or "not recommended" for servers.

Surely the answer might vary depending on the particular server usage pattern, the need for performance, the cost of routine maintenance (swapping out aging drives or SSDs), the cost of the devices themselves, etc.

It seems to me that a given server operator can tell how long a particular SSD is likely to last.  They do not fail randomly, at least not very much.  They fail when they are "used up" and you can figure out well in advance, usually, when you will need to swap the old ones out of service.
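
One way to make that kind of advance planning concrete is to extrapolate from two wear readings taken some time apart, for instance the "percentage used" style figure that many drives report about themselves. The sketch below assumes wear accumulates linearly, which is a simplification; the function name and the sample readings are hypothetical.

```python
from datetime import date, timedelta


def projected_wearout(first_day, first_pct, last_day, last_pct):
    """Linearly extrapolate two (date, percent of rated life used) readings to 100%."""
    days = (last_day - first_day).days
    used = last_pct - first_pct
    if days <= 0 or used <= 0:
        return None  # not enough wear between the readings to project anything
    # Days needed to consume the remaining life at the observed rate.
    remaining_days = (100 - last_pct) * days / used
    return last_day + timedelta(days=round(remaining_days))


# Hypothetical readings: 10% used on 2020-03-01, 35% used on 2021-03-01.
swap_by = projected_wearout(date(2020, 3, 1), 10, date(2021, 3, 1), 35)
print("plan to replace the drive before:", swap_by)
```

A projection like this is only as good as the assumption that the workload stays the same, so it would need to be recomputed whenever new readings come in.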

HDs fail also, obviously, but tend not to be so predictable about it.  Whether it makes sense for a given server to use an SSD really does depend on the numbers.  All drives will fail.  All drives will need to be rotated out of service.  It is a matter of cost, convenience, and performance.

The only caveat I can think of is that there might be an issue of malicious use--a server with SSDs might be vulnerable to a wear attack, depending on the server services offered, I suppose.








Re: Build servers offline due to failed SSD

Peter West-2
I’ve been looking at VPS providers, and most of them offer SSD-based VPSs, so they seem to be increasingly popular. I suspect that most VPSs do not get consistently hammered, though.

Peter
“Destroy this temple, and in three days I will raise it up.”

On 8 Mar 2021, at 11:30 am, Todd Doucet <[hidden email]> wrote:

> I think one can only get so far with purely qualitative analysis of the characteristics of SSDs and HDs, and the end of that analysis will be one-size-fits-all advice, for example "recommended" or "not recommended" for servers.
>
> Surely the answer might vary depending on the particular server usage pattern, the need for performance, the cost of routine maintenance (swapping out aging drives or SSDs), the cost of the devices themselves, etc.
>
> It seems to me that a given server operator can tell how long a particular SSD is likely to last.  They do not fail randomly, at least not very much.  They fail when they are "used up" and you can figure out well in advance, usually, when you will need to swap the old ones out of service.
>
> HDs fail also, obviously, but tend not to be so predictable about it.  Whether it makes sense for a given server to use an SSD really does depend on the numbers.  All drives will fail.  All drives will need to be rotated out of service.  It is a matter of cost, convenience, and performance.
>
> The only caveat I can think of is that there might be an issue of malicious use--a server with SSDs might be vulnerable to a wear attack, depending on the server services offered, I suppose.






Re: Build servers offline due to failed SSD

Daniel J. Luke
In reply to this post by Todd Doucet
On Mar 7, 2021, at 8:30 PM, Todd Doucet <[hidden email]> wrote:
> I think one can only get so far with purely qualitative analysis of the characteristics of SSDs and HDs, and the end of that analysis will be one-size-fits-all advice, for example "recommended" or "not recommended" for servers.

this +1000

> Surely the answer might vary depending on the particular server usage pattern, the need for performance, the cost of routine maintenance (swapping out aging drives or SSDs), the cost of the devices themselves, etc.

exactly

There's a reason you don't really see 15k enterprise drives anymore.

> It seems to me that a given server operator can tell how long a particular SSD is likely to last.  They do not fail randomly, at least not very much.  They fail when they are "used up" and you can figure out well in advance, usually, when you will need to swap the old ones out of service.

Back in 2015, someone actually bothered to test this and report some results: https://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead/

> HDs fail also, obviously, but tend not to be so predictable about it.  Whether it makes sense for a given server to use an SSD really does depend on the numbers.  All drives will fail.  All drives will need to be rotated out of service.  It is a matter of cost, convenience, and performance.
>
> The only caveat I can think of is that there might be an issue of malicious use--a server with SSDs might be vulnerable to a wear attack, depending on the server services offered, I suppose.

I'm sure there are worst-case scenarios for spinning disks that (in theory) could be exploited to wear their mechanisms out as well.

I've personally used both enterprise and consumer SSDs in high-write environments where the cost of replacing the SSDs was worthwhile for the performance benefits (or otherwise didn't change the overall cost of the solution), and I've been pleasantly surprised with how much more use I've gotten from them than I originally calculated (based on the drive specs + the planned utilization + overprovisioning).

YMMV of course - but the blanket "you shouldn't use SSDs for servers" or "no one uses SSDs for servers" is wrong. For those who are interested in more details, there are a bunch of good USENIX and ACM papers where people have actually gone and collected data on real-world failure rates.

--
Daniel J. Luke


Re: Build servers offline due to failed SSD

lhaeger
Here's an in-depth discussion of SSD reliability, a little more detailed than "(not) recommended", from someone with a lot of first-hand experience, it seems: https://www.backblaze.com/blog/how-reliable-are-ssds/



Re: Build servers offline due to failed SSD

James Linder
In reply to this post by MacPorts - Users mailing list


> On 7 Mar 2021, at 3:26 pm, Dave C via macports-users <[hidden email]> wrote:
>
> This applies to affordable SSDs. As you say, the ones that are on par (re. reliability) with HDDs are $pendy.
>
> It’s something to do with an SSD’s limited number of write cycles, if I remember...
>
> Dave
>
> - - -
>
>> Isn’t SSD a bad choice for server duty? No server farms use them, apparently due to short lifespan.

The reality needs to be carefully weighed up.

SSDs are rated in TBW, that is, terabytes written.
Cheaper SSDs may be rated at 300 or 600 TBW; more expensive ones may be 1200 TBW or even 2500 TBW.

The TBW rating depends on the drive's size.

I’ve put a 2 TB SSD (600 TBW) in my iMac, and after a year I see a projected life of 65 years. So no, an SSD for a build farm is not a bad idea. The performance benefits far outweigh the 50+ year hassle of replacing it.
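
The arithmetic behind a projection like this can be sketched in a few lines. This is a rough illustration, not how any particular monitoring tool computes its estimate: it assumes writes continue at a constant rate, and the function names and the 1 TB/day build-worker figure are hypothetical.

```python
def years_until_tbw_reached(tbw_rating_tb, tb_written_per_year):
    """Years until the rated terabytes-written figure is reached at a constant write rate."""
    if tb_written_per_year <= 0:
        raise ValueError("write rate must be positive")
    return tbw_rating_tb / tb_written_per_year


def implied_write_rate(tbw_rating_tb, projected_years):
    """Yearly write volume (TB) implied by a rating and a projected lifetime."""
    return tbw_rating_tb / projected_years


# Figures from the post: a 600 TBW drive projected to last 65 years.
desktop_rate = implied_write_rate(600, 65)  # TB per year

# Hypothetical busy build worker writing 1 TB per day (assumed, not measured).
build_worker_years = years_until_tbw_reached(600, 365)

print(f"implied desktop write rate: {desktop_rate:.1f} TB/year")
print(f"600 TBW at 1 TB/day lasts about {build_worker_years:.1f} years")
```

The same rating gives very different lifetimes depending on write volume, so a desktop projection does not carry over directly to a build worker.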

The MTBF of spinning rust is 10-odd years; for SSDs it is many times that. But remembering my uni stats, the expected time until the first failure among light globes with a life of 1000 hours each, when you have a few dozen bulbs (in my test question), was 20 min!
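
The light-globe point can be made concrete under a simplifying assumption. The sketch below treats failures as independent with exponentially distributed lifetimes (an assumption, not something stated in the post), in which case the mean time until the first of n units fails is the single-unit mean divided by n. The bulb count of 36 is a hypothetical stand-in for "a few dozen"; the details of the original exam question are not given.

```python
import math


def mean_time_to_first_failure(mtbf_hours, units):
    """Mean time until the first of `units` independent exponential lifetimes ends."""
    return mtbf_hours / units


def prob_any_failure_within(hours, mtbf_hours, units):
    """Probability that at least one of `units` fails within `hours`."""
    return 1.0 - math.exp(-units * hours / mtbf_hours)


bulbs = 36  # hypothetical value for "a few dozen"
first_failure = mean_time_to_first_failure(1000, bulbs)
within_a_day = prob_any_failure_within(24, 1000, bulbs)

print(f"mean time to first failure: {first_failure:.1f} hours")
print(f"chance of at least one failure within 24 hours: {within_a_day:.0%}")
```

The general lesson holds regardless of the exact numbers: the more units in service, the sooner the first failure is expected, even when each unit has a long rated life.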

Enterprise disks have a longer life, but as I said, it is complicated.

James

Re: Build servers offline due to failed SSD

Dave Horsfall
In reply to this post by MacPorts - Users mailing list
On Sun, 7 Mar 2021, Michael A. Leonetti via macports-users wrote:

> I’d really love to know more about what you’re saying here. Up until I
> just read what you wrote, I thought SSDs were the savior of HDDs.

Real disk drives [tm] have their N/S magnetic poles lined up pretty much
forever; SSDs rely upon capacitors storing their charge forever (hah!).

You need to have an electronics background to understand...

-- Dave (VK2KFU)

Re: Build servers offline due to failed SSD

Dave Horsfall
In reply to this post by Todd Doucet
On Sun, 7 Mar 2021, Todd Doucet wrote:

> HDs fail also, obviously, but tend not to be so predictable about it. 

That of course depends upon the HD and the OS; my (FreeBSD) server's drive
is around 20 years old, and is still going strong.

There's also software that monitors the health of the disk.

-- Dave

Re: Build servers offline due to failed SSD

MacPorts - Users mailing list
In reply to this post by ryandesign2
Old technology drives use magnetism to hold bits. This works for decades, or so I’ve read. Usually the motor or bearings die before the magnetic medium fails.

Solid State Drives use memory chips to hold bits. These “bit holders” can wear out after a few trillion transitions (changing from 1 to 0 and 0 to 1). If you’re using it in your laptop or PC, you’ll likely have no problems for many years. In an internet-connected server, you may exceed those maximum write cycles sooner rather than later.

Dave

- - -

>> On Sun, 7 Mar 2021, Michael A. Leonetti via macports-users wrote:
>>
>> I’d really love to know more about what you’re saying here. Up until I just read what you wrote, I thought SSDs were the savior of HDDs.
>
> Real disk drives [tm] have their N/S magnetic poles lined up pretty much forever; SSDs rely upon capacitors storing their charge forever (hah!).
>
> You need to have an electronics background to understand...
>
> -- Dave (VK2KFU)


Re: Build servers offline due to failed SSD

MacPorts - Users mailing list
In reply to this post by ryandesign2
I think most people who talk about servers and HDs/SSDs are referring to commercial internet-connected servers.

Yes, a private server will likely see a lesser degree of service/use and storage drives can be uprated (the opposite of derated) for greater lifetime.

Dave

> I’ve been looking at VPS providers, and most of them offer SSD-based VPSs, so they seem to be increasingly popular. I suspect that most VPSs do not get consistently hammered, though.
>
> Peter
