The whole push to the cloud has always fascinated me. I get it - most people aren't interested in babysitting their own hardware. On the other hand, a business of just about any size that has any reasonable amount of hosting is better off with their own systems when it comes purely to cost.
All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding, but serve to introduce doubt to non-technical people and to trick people who don't examine what they're told.
What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue against a writeup like this using silly cloud talking points. They don't seem to care much about data or facts, just that they love cloud and want everyone else to be in cloud, too. This happens much more often on sites like Reddit (r/sysadmin, even), but I wouldn't be surprised to see a little of it here.
It makes me wonder: how do people get so sold on a thing that they'll go online and fight about it, even when they lack facts or often even basic understanding?
I can clearly state why I advocate for avoiding cloud: cost, privacy, security, a desire to not centralize the Internet. The reason people advocate for cloud for others? It puzzles me. "You'll save money," "you can't secure your own machines," "it's simpler" all have worlds of assumptions that those people can't possibly know are correct.
So when I read something like this from Fastmail which was written without taking an emotional stance, I respect it. If I didn't already self-host email, I'd consider using Fastmail.
There used to be so much push for cloud everything that an article like this would get fanatical responses. I hope that it's a sign of progress that that fanaticism is waning and people aren't afraid to openly discuss how cloud isn't right for many things.
"All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding,"
This is false. AWS infrastructure is vastly more secure than almost all company data centers. AWS has a rule that the same person cannot have logical access and physical access to the same storage device. Very few companies have enough IT people to have this rule. The AWS KMS is vastly more secure than what almost all companies are doing. The AWS network is vastly better designed and operated than almost all corporate networks. AWS S3 is more reliable and scalable than anything almost any company could create on their own. To create something even close to it you would need to implement something like MinIO using 3 separate data centers.
1. big clouds are very lucrative targets for spooks, your data seem pretty likely to be hoovered up as "bycatch" (or maybe main catch depending on your luck) by various agencies and then traded around as currency
2. you never hear about security problems (incidents or exposure) in the platforms, there's no transparency
I think it's a very relevant bar, though. The top level commenter made points about "a business of just about any size", which seems pretty exactly aligned with "most corporate stuff".
> AWS infrastructure is vastly more secure than almost all company data centers
Secure in what terms? Security is always about a threat model and trade-offs. There's no absolute, objective term of "security".
> AWS has a rule that the same person cannot have logical access and physical access to the same storage device.
Any promises they make aren't worth anything unless there's contractually-stipulated damages that AWS must pay in case of breach, those damages actually corresponding to the cost of said breach for the customer, and a history of actually paying out said damages without shenanigans. They've already got a track record of lying on their status pages, so it doesn't bode well.
But I'm actually wondering what this specific rule even tries to defend against? You presumably care about data protection, so logical access is what matters. Physical access seems completely irrelevant, no?
> Very few companies have enough IT people to have this rule
Maybe, but that doesn't actually mitigate anything from the company's perspective? The company itself would still be in the same position, aka not enough people to reliably separate responsibilities. Just that instead of those responsibilities being physical, they now happen inside the AWS console.
> The AWS KMS is vastly more secure than what almost all companies are doing.
See first point about security. Secure against what - what's the threat model you're trying to protect against by using KMS?
But I'm not necessarily denying that (at least some) AWS services are very good. Question is, is that "goodness" required for your use-case, is it enough to overcome its associated downsides, and is the overall cost worth it?
A pragmatic approach would be to evaluate every component on its merits and fitness to the problem at hand instead of going all in, one way or another.
one of my greatest learnings in life is to differentiate between facts and opinions - sometimes opinions are presented as facts and vice versa. if you think about it, the statement "this is false" is a response to an opinion (presented as a fact), but it is not itself a fact. there is no way one can objectively define and defend what "real technical understanding" means. the cloud space is vast, with millions of people having varied understanding and thus varied opinions.
so let's not fight the battle that will never be won. there is no point in convincing pro-cloud people that cloud isn't the right choice and vice-versa. let people share stories where it made sense and where it didn't.
as someone who has lived in the cloud security space since 2009 (and was founder of redlock - one of the first CSPMs), in my opinion there is no doubt that AWS is indeed designed better than most corp. networks - but is that what you really need? if you run your entire corp and LOB apps on aws but have poor security practices, will it be the right decision? what if you have the best security engineers in the world, but they are best at Cisco-type security - configuring VLANs and managing endpoints - and not good at detecting someone using IMDSv1 on an ec2 instance exposed to the internet and running an app vulnerable to csrf?
when the scope of discussion is as vast as cloud vs on-prem, imo, it is a bad idea to make absolute statements.
Great points. Also, if you end up building your apps as Rube Goldberg machines living up to "AWS Well Architected" criteria (pushed by AWS-certified staff whose paychecks now depend on following AWS recommended practices), the complexity will kill your security, as nobody will understand the systems anymore.
The other part is that when us-east-1 goes down, you can blame AWS, and a third of your customer's vendors will be doing the same. When you unplug the power to your colo rack while installing a new server, that's on you.
about security, most businesses using AWS invest little to nothing in securing their software, or even adopt basic security practices for their employees
having the most secure data center doesn't matter if you load your secrets as env vars in a system that can be easily compromised by a motivated attacker
so i don't buy this argument as a general reason pro-cloud
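To make the env-var point concrete, here's a minimal sketch of the alternative: loading a secret from a permission-restricted file instead of the process environment. The function name and policy are illustrative, not any particular tool's API:

```python
import stat
from pathlib import Path

def load_secret(path: str) -> str:
    """Read a secret from a file, refusing group/world-readable files.

    Unlike an environment variable, a file's permissions can restrict
    who reads it, and it won't leak through `ps e` output, crash dumps,
    or child processes inheriting the environment.
    """
    p = Path(path)
    mode = p.stat().st_mode
    if mode & (stat.S_IRGRP | stat.S_IROTH):
        raise PermissionError(f"{path} is readable by group/others")
    return p.read_text().strip()
```

None of this requires a cloud provider, which is the point: basic secret hygiene is orthogonal to where the servers live.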
It’s like putting something in someone’s desk drawer under the guise of convenience at the expense of security.
Why?
Too often, someone other than the data owner has or can get access to the drawer directly or indirectly.
Also, Cloud vs self hosted to me is a pendulum that has swung back and forth for a number of reasons.
The benefits of the cloud outlined here are often a lot of open source tech packaged up and sold as manageable from a web browser, or a command line.
One of the major reasons the cloud became popular was networking issues in Linux to manage volume at scale. At the time the cloud became very attractive for that reason, plus being able to virtualize bare metal servers to put into any combination of local to cloud hosting.
Self-hosting has become easier by an order of magnitude or two for anyone who knew how to do it, but it's something only people who have done both self-hosting and cloud can really discuss.
Cloud has abstracted away the cost of horsepower, and converted it to transactions. People are discovering a fraction of the horsepower is needed to service their workloads than they thought.
At some point the horsepower got way beyond what they needed and it wasn’t noticed. But paying for a cloud is convenient and standardized.
Company data centres can be reasonably secured using a number of PaaS or IaaS solutions readily available off the shelf. Tools from VMware, Proxmox and others are tremendous.
It may seem like there's a lot to learn, except most problems that are new to someone have already been thought through extensively by people whose experience goes beyond cloud-only.
Isn’t it more like leasing in a public property? Meaning it is yours as long as you are paying the lease? Analogous to renting an apartment instead of owning a condo?
But isn't using Fastmail akin to using a cloud provider (managed email vs managed everything else)? They are similarly a service provider, and as a customer, you don't really care "who their ISP is?"
The discussion matters when we are talking about building things: whether you self-host or use managed services is a set of interesting trade-offs.
Yes, FastMail is a SaaS. But there are adepts of a religion who will tell you that companies like FastMail should be built on top of AWS and that it is the only true way. It is good to have some counter-narrative to this.
<ctoHatTime>
Dunno man, it's really really easy to set up an S3 bucket and use it to share datasets with users authorized via IAM....
And IAM and other cloud security and management considerations are where the opex/capex and capability argument can start to break down. Turns out, the "cloud" savings come from not having the capabilities in house to manage hardware. And sometimes, for most businesses, you want some of that lovely reliability.
(In short, I agree with you, substantially).
Like code. It is easy to get something basic up, but substantially more resources are needed for non-trivial things.
I strongly agree with this and also strongly lament it.
I find IAM to be a terrible implementation of a foundationally necessary system. It feels tacked on to me, except now it's tacked onto thousands of other things and there's no way out.
That's essentially why "platform engineering" is a hot topic. There are great FOSS tools for this, largely in the Kubernetes ecosystem.
To be clear, authentication could still be outsourced, but authorizing access to (on-prem) resources in a multi-tenant environment is something that "platforms" are frequently designed for.
> All the pro-cloud talking points... don't persuade anyone with any real technical understanding
This is a very engineer-centric take. The cloud has some big advantages that are entirely non-technical:
- You don't need to pay for hardware upfront. This is critical for many early-stage startups, who have no real ability to predict CapEx until they find product/market fit.
- You have someone else to point the SOC2/HIPAA/etc auditors at. For anyone launching a company in a regulated space, being able to checkbox your entire infrastructure based on AWS/Azure/etc existing certifications is huge.
You can over-provision your own baremetal resources 20x and it will be still cheaper than cloud. The capex talking point is just that, a talking point.
The real cost wins of self-hosting come from friction: anything requiring new hardware becomes an ordeal, and engineers won't reach for high-cost, value-added services. I agree that there's often too little restraint in cloud architectures, but if a business truly believes in a project, it shouldn't be held up for six months waiting for server budget, with engineers doing ops work to get three nines of DB reliability.
There is a size where self-hosting makes sense, but it's much larger than you think.
Most companies severely understaff ops, infra, and security. Your talking points might be good but, in practice, won’t apply in many cases because of the intractability of that management mindset. Even when they should know better.
I’ve worked at tech companies with hundreds of developers and single digit ops staff. Those people will struggle to build and maintain mature infra. By going cloud, you get access to mature infra just by including it in build scripts. Devops is an effective way to move infra back to project teams and cut out infra orgs (this isn’t great but I see it happen everywhere). Companies will pay cloud bills but not staffing salaries.
I'm curious about what "reasonable amount of hosting" means to you, because from my experience, as your internal network's complexity goes up, it's far better for you to move systems to a hyperscaler. The current estimate is >90% of Fortune 500 companies are cloud-based. What is it that you know that they don't?
>What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue with a writeup like this with silly cloud talking points. They don't seem to care much about data or facts, just that they love cloud and want everyone else to be in cloud, too.
The irony is absolutely dripping off this comment, wow.
Commenter makes an emotionally charged comment with no data or facts, and dismisses anyone who disagrees with them as offering "silly talking points" and not caring about data and facts.
I’m not convinced this is entirely true. The upfront cost if you don’t have the skills, sure – it takes time to learn Linux administration, not to mention management tooling like Ansible, Puppet, etc.
But once those are set up, how is it different? AWS is quite clear with their responsibility model that you still have to tune your DB, for example. And for the setup, just as there are Terraform modules to do everything under the sun, there are Ansible (or Chef, or Salt…) playbooks to do the same. For both, you _should_ know what all of the options are doing.
The only way I see this sentiment being true is that a dev team, with no infrastructure experience, can more easily spin up a lot of infra – likely in a sub-optimal fashion – to run their application. When it inevitably breaks, they can then throw money at the problem via vertical scaling, rather than addressing the root cause.
I think this is only true for teams and apps of a certain size.
I've worked on plenty of teams with relatively small apps, and the difference between:
1. Cloud: "open up the cloud console and start a VM"
2. Owned hardware: "price out a server, order it, find a suitable datacenter, sign a contract, get it racked, etc."
Is quite large.
#1 is 15 minutes for a single team lead.
#2 requires the team to agree on hardware specs, get management approval, finance approval, executives signing contracts. And through all this you don't have anything online yet for... weeks?
If your team or your app is large, this probably all averages out in favor of #2. But small teams often don't have the bandwidth or the budget.
I work for a 50 person subsidiary of a 30k person organisation. I needed a domain name. I put in the purchase request and 6 months later eventually gave up, bought it myself and expensed it.
Our AWS account is managed by an SRE team. It’s a 3 day turnaround process to get any resources provisioned, and if you don’t get the exact spec right (you forgot to specify the iops on the volume? Oops) 3 day turnaround. Already started work when you request an adjustment? Better hope as part of your initial request you specified backups correctly or you’re starting again.
The overhead is absolutely enormous, and I actually don’t even have billing access to the AWS account that I’m responsible for.
You gave me flashbacks to a far worse bureaucratic nightmare with #2 in my last job.
I supported an application with a team of about three people for a regional headquarters in the DoD. We had one stack of aging hardware that was racked, on a handshake agreement with another team, in a nearby facility under that other team's control. We had to periodically request physical access for maintenance tasks and the facility routinely lost power, suffered local network outages, etc. So we decided that we needed new hardware and more of it spread across the region to avoid the shaky single-point-of-failure.
That began a three year process of: waiting for budget to be available for the hardware / licensing / support purchases; pitching PowerPoints to senior management to argue for that budget (and getting updated quotes every time from the vendors); working out agreements with other teams at new facilities to rack the hardware; traveling to those sites to install stuff; and working through the cybersecurity compliance stuff for each site. I left before everything was finished, so I don't know how they ultimately dealt with needing, say, someone in Japan to physically reseat a cable or something.
I’ve never worked at a company with these particular problems, but:
#1: A cloud VM comes with an obligation for someone at the company to maintain it. The cloud does not excuse anyone from doing this.
#2: Sounds like a dysfunctional system. Sure, it may be common, but a medium sized org could easily have some datacenter space and allow any team to rent a server or an instance, or to buy a server and pay some nominal price for the IT team to keep it working. This isn’t actually rocket science.
Sure, keeping a fifteen-year-old server working safely is a chore, but so is maintaining a fifteen-year-old VM instance!
Obligation? Far from it. I've worked at some poorly staffed companies. Nobody is maintaining old VMs or container images. If it works, nobody touches it.
I worked at a supposedly properly staffed company that had raised hundreds of millions in investment, and it was the same thing. VMs running 5 year old distros that hadn't been updated in years. 600 day uptimes, no kernel patches, ancient versions of Postgres, Python 2.7 code everywhere, etc. This wasn't 10 years ago. This was 2 years ago!
The SMB I work for runs a small on-premise data center that is shared between teams and projects, with maybe 3-4 FTEs managing it (the respective employees also do dev and other work). This includes self-hosting email, storage, databases, authentication, source control, ticketing, company wiki, and other services. The current infrastructure didn’t start out that way and developed over many years, so it’s not necessarily something a small startup can start out with, but beyond a certain company size (a couple dozen employees or more) it shouldn’t really be a problem to develop that, if management shares the philosophy. I certainly find it preferable culturally if not technically to maximize independence in that way, have the local expertise and much better control over everything.
One (the only?) indisputable benefit of cloud is the ability to scale up faster (elasticity), but most companies don’t really need that. And if you do end up needing it after all, then it’s a good problem to have, as they say.
You're assuming that hosting something in-house implies that each application gets its own physical server.
You buy a couple of beastly things with dozens of cores. You can buy twice as much capacity as you actually use and still be well under the cost of cloud VMs. Then it's still VMs and adding one is just as fast. When the load gets above 80% someone goes through the running VMs and decides if it's time to do some house cleaning or it's time to buy another host, but no one is ever waiting on approval because you can use the reserve capacity immediately while sorting it out.
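The headroom rule of thumb in that comment can be written down in a few lines. This is just a sketch of the commenter's 80% figure, not anyone's production policy:

```python
def capacity_action(used_cores: int, total_cores: int,
                    threshold: float = 0.80) -> str:
    """Decide when it's time to clean house or order another host.

    Below the threshold, spare capacity absorbs new VMs immediately;
    above it, someone reviews idle VMs or kicks off a hardware order,
    while the reserve still serves requests in the meantime.
    """
    load = used_cores / total_cores
    return "ok" if load <= threshold else "review"
```

The key property is that the "review" branch never blocks anyone: provisioning stays instant because the buffer was bought up front.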
There is a large gap between "own the hardware" and "use cloud hosting". Many people rent the hardware, for example, and you can use managed databases, which is one step up from "starting a VM".
But your comparison isn't fair. The difference between running your own hardware and using the cloud (which is perhaps not even the relevant comparison but let's run with it) is the difference between:
1. Open up the cloud console, and
2. You already have the hardware so you just run "virsh" or, more likely, do nothing at all because you own the API so you have already included this in your Ansible or Salt or whatever you use for setting up a server.
Because ordering a new physical box isn't really comparable to starting a new VM, is it?
Before the cloud, you could get a VM provisioned (virtual servers) or a couple of apps set up (LAMP stack on a shared host ;)) in a few minutes over a web interface already.
"Cloud" has changed that by providing an API to do this, thus enabling IaC approach to building combined hardware and software architectures.
For purposes of this discussion, isn't AWS just a very large hosting provider?
I.e. most hosting providers give you the option for virtual or dedicated hardware. So does Amazon (metal instances).
Like, "cloud" was always an ill-defined term, but in the case of "how do I provision full servers" I think there's no qualitative difference between Amazon and other hosting providers. Quantitative, sure.
But you still get nickel & dimed and pay insane costs, including on bandwidth (which is free in most conventional hosting providers, and overages are 90x cheaper than AWS' costs).
You can get pretty far without any of that fancy stuff. You can get plenty done by using parallel-ssh and then focusing on the actual thing you develop instead of endless tooling and docker and terraform and kubernetes and salt and puppet and ansible. Sure, if you know why you need them and know what value you get from them OK. But many people just do it because it's the thing to do...
Do you need those tools? It seems that for fundamental web hosting, you need your application server, nginx or similar, postgres or similar, and a CLI. (And an interpreter etc if your application is in an interpreted lang)
I suppose that depends on your RTO. With cloud providers, even on a bare VM, you can to some extent get away with having no IaC, since your data (and therefore config) is almost certainly on networked storage which is redundant by design. If an EC2 instance fails, or even if one of the drives backing your EBS volume fails, it'll probably come back up as it was.
If it's your own hardware, if you don't have IaC of some kind – even something as crude as a shell script – then a failure may well mean you need to manually set everything up again.
Well, sure – I was trying to do a comparison in favor of cloud, because the fact that EBS Volumes can magically detach and attach is admittedly a neat trick. You can of course accomplish the same (to a certain scale) with distributed storage systems like Ceph, Longhorn, etc. but then you have to have multiple servers, and if you have multiple servers, you probably also have your application load balanced with failover.
- Some sort of firewall or network access control. Being able to say "allow http/s from the world (optionally minus some abuser IPs that cause problems), and allow SSH from developers (by IP, key, or both)" at a separate layer from nginx is prudent. Can be ip/tables config on servers or a separate firewall appliance.
- Some mechanism of managing storage persistence for the database, e.g. backups, RAID, data files stored on fast network-attached storage, db-level replication. Not losing all user data if you lose the DB server is table stakes.
- Something watching external logging or telemetry to let administrators know when errors (e.g. server failures, overload events, spikes in 500s returned) occur. This could be as simple as Pingdom or as involved as automated alerting based on load balancer metrics. Relying on users to report downtime events is not a good approach.
- Some sort of CDN, for applications with a frontend component. This isn't required for fundamental web hosting, but for sites with a frontend and even moderate (10s/sec) hit rates, it can become required for cost/performance; CDNs help with egress congestion (and fees, if you're paying for metered bandwidth).
- Some means of replacing infrastructure from nothing. If the server catches fire or the hosting provider nukes it, having a way to get back to where you were is important. Written procedures are fine if you can handle long downtime while replacing things, but even for a handful of application components those procedures get pretty lengthy, so you start wishing for automation.
- Some mechanism for deploying new code, replacing infrastructure, or migrating data. Again, written procedures are OK, but start to become unwieldy very early on ('stop app, stop postgres, upgrade the postgres version, start postgres, then apply application migrations to ensure compatibility with new version of postgres, then start app--oops, forgot to take a postgres backup/forgot that upgrading postgres would break the replication stream, gotta write that down for next time...').
...and that's just for a very, very basic web hosting application--one that doesn't need caches, blob stores, the ability to quickly scale out application server or database capacity.
Each of those things can be accomplished the traditional way--and you're right that sometimes that way is easier for a given item in the list (especially if your maintainers have expertise in that item)! But in aggregate, having a cloud provider handle each of those concerns tends to be easier overall and not require nearly as much in-house expertise.
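To make the monitoring bullet concrete: the "something watching external logging or telemetry" item can start as small as a script like the one below. The endpoints and the alert rule are assumptions for illustration, not a recommended production setup:

```python
import urllib.request

def probe(url: str, timeout: float = 5.0):
    """Return the HTTP status code of an endpoint, or None on any failure
    (DNS error, connection refused, timeout, TLS problem, ...)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except Exception:
        return None

def needs_alert(status) -> bool:
    """Alert on connection failures and 5xx responses; a hypothetical
    cron job would call probe() and page someone when this is True."""
    return status is None or status >= 500
```

A loop over your health-check URLs plus an email or webhook on `needs_alert` already beats relying on users to report downtime, which is the bar the bullet sets.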
You are focusing on technology. And sure of course you can get most of the benefits of AWS a lot cheaper when self-hosting.
But when you start factoring internal processes and incompetent IT departments, suddenly that's not actually a viable option in many real-world scenarios.
I have never ever worked somewhere with one of these "cloud-like but custom on our own infrastructure" setups that didn't leak infrastructure concerns through the abstraction, to a significantly larger degree than AWS.
I believe it can work, so maybe there are really successful implementations of this out there, I just haven't seen it myself yet!
> Cloud expands the capabilities of what one team can manage by themselves, enabling them to avoid a huge amount of internal politics.
It's related to the first part. Re: the second, IME if you let dev teams run wild with "managing their own infra," the org as a whole eventually pays for that when the dozen bespoke stacks all hit various bottlenecks, and no one actually understands how they work, or how to troubleshoot them.
I keep being told that "reducing friction" and "increasing velocity" are good things; I vehemently disagree. It might be good for short-term profits, but it is poison for long-term success.
Our big company locked all cloud resources behind a floating/company-wide DevOps team (git and CI too). We have an old on-prem server that we jealously guard because it allows us to create remotes for new git repos and deploy prototypes without consulting anyone.
(To be fair, I can see why they did it - a lot of deployments were an absolute mess before.)
Self-hosted software also has APIs, and Terraform libraries, and Ansible playbooks, etc. It’s just that you have to know what it is you’re trying to do, instead of asking AWS what collection of XaaS you should use.
Well, cloud providers often give you more than just VMs in a data center somewhere. You may not be able to find good equivalents if you aren't using the cloud. Some third-party products are also only available on clouds. How much of a difference those things make will depend on what you're trying to do.
I think there are accounting reasons for companies to prefer paying opex to run things on the cloud instead of more capex-intensive self-hosting, but I don’t understand the dynamics well.
It’s certainly the case that clouds tend to be more expensive than self-hosting, even when taking account of the discounts that moderately sized customers can get, and some of the promises around elastic scaling don’t really apply when you are bigger.
To some of your other points: the main customers of companies like AWS are businesses. Businesses generally don’t care about the centralisation of the internet. Businesses are capable of reading the contracts they are signing and not signing them if privacy (or, typically more relevant to businesses, their IP) cannot be sufficiently protected. It’s not really clear to me that using a cloud is going to be less secure than doing things on-prem.
It seems that the preference is less about understanding or misunderstanding the technical requirements but more that it moves a capital expenditure with some recurring operational expenditure entirely into the opex column.
The fact is, managing your own hardware is a pita and a distraction from focusing on the core product. I loathe messing with servers and even opt for "overpriced" paas like fly, render, vercel. Because every minute messing with and monitoring servers is time not spent on product. My tune might change past a certain size and a massive cloud bill and there's room for full time ops people, but to offset their salary, it would have to be huge.
That argument makes sense for PaaS services like the ones you mention. But for bare "cloud" like AWS, I'm not convinced it is saving any effort, it's merely swapping one kind of complexity with another. Every place I've been in had full-time people messing with YAML files or doing "something" with the infrastructure - generally trying to work around the (self-inflicted) problems introduced by their cloud provider - whether it's the fact you get 2010s-era hardware or that you get nickel & dimed on absolutely arbitrary actions that have no relationship to real-world costs.
How do you configure S3 access control? You need to learn & understand how their IAM works.
How do you even point a pretty URL to a lambda? Last time I looked you need to stick an "API gateway" in front (which I'm sure you also get nickel & dimed for).
How do you go from "here's my git repo, deploy this on Fargate" with AWS? You need a CI pipeline which will run a bunch of awscli commands.
And I'm not even talking about VPCs, security groups, etc.
Somewhat different skillsets than old-school sysadmin (although once you know sysadmin basics, you realize a lot of these are just the same concepts under a branded name and arbitrary nickel & diming sprinkled on top), but equivalent in complexity.
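To illustrate the S3/IAM point: even "make one prefix public-read" means writing a policy document in IAM's JSON grammar. The bucket name and prefix below are placeholders; the `Version` string and statement fields are IAM's actual schema:

```python
import json

# Hypothetical bucket and prefix; the structure is AWS IAM's policy grammar.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "PublicReadForAssets",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-bucket/assets/*",
    }],
}

print(json.dumps(policy, indent=2))
```

Conceptually this is just an ACL, the same idea an old-school sysadmin would express with filesystem permissions or an nginx `location` block, wrapped in a provider-specific document format.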
Counterpoint: if you’re never “messing with servers,” you probably don’t have a great understanding of how their metrics map to those of your application’s, and so if you bottleneck on something, it can be difficult to figure out what to fix. The result is usually that you just pay more money to vertically scale.
To be fair, you did say “my tune might change past a certain size.” At small scale, nothing you do within reason really matters. World’s worst schema, but your DB is only seeing 100 QPS? Yeah, it doesn’t care.
I don’t think you’re correct. I’ve watched junior/mid-level engineers figure things out solely by working on the cloud and scaling things to a dramatic degree. It’s really not rocket science.
I didn't say it's rocket science, nor that it's impossible to do without having practical server experience, only that it's more difficult.
Take disks, for example. Most cloud-native devs I've worked with have no clue what IOPS are. If you saturate your disk, that's likely to cause knock-on effects like increased CPU utilization from IOWAIT, and since "CPU is high" is pretty easy to understand for anyone, the seemingly obvious solution is to get a bigger instance, which depending on the application, may inadvertently solve the problem. For RDBMS, a larger instance means a bigger buffer pool / shared buffers, which means fewer disk reads. Problem solved, even though actually solving the root cause would've cost 1/10th or less the cost of bumping up the entire instance.
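The arithmetic behind the IOPS example is simple once you know to do it. A rough sketch, with illustrative numbers (the 3000-IOPS figure matches a gp3 volume's baseline, but treat the workload numbers as made up):

```python
def disk_utilization(reads_per_s: float, writes_per_s: float,
                     provisioned_iops: float) -> float:
    """Fraction of provisioned IOPS being consumed.

    Values at or above 1.0 mean the disk is saturated: requests queue,
    threads sit in IOWAIT, and CPU graphs look busy even though the
    CPU isn't the bottleneck.
    """
    return (reads_per_s + writes_per_s) / provisioned_iops

# Illustrative: 2400 reads/s + 900 writes/s against a 3000 IOPS volume
util = disk_utilization(reads_per_s=2400, writes_per_s=900,
                        provisioned_iops=3000)
# util is 1.1, i.e. 10% over the limit: provision more IOPS (cheap),
# don't upsize the whole instance (expensive).
```

Reading the per-device numbers out of `iostat -x` and comparing against the provisioned limit is the whole trick; the point of the comment is that cloud-only devs often never learn to look there.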
A small app (or a larger one, for that matter) can quite easily run on infra that's instantiated from canned IaC, like TF AWS Modules [0]. If you can read docs, you should be able to quite trivially get some basic infra up in a day, even with zero prior experience managing it.
Yes, I've used several of these modules myself. They save tons of time! Unfortunately, for legacy projects, I inherited a bunch of code from individuals who built everything "by hand" and then copy-pasted it everywhere. No reusability.
Anecdotal - but I once worked for a company where the product line I built for them after acquisition was delayed by 5 months because that's how long it took to get the hardware ordered and installed in the datacenter. Getting it up on AWS would have been a day's work, maybe two.
Yes, it is death by 1000 cuts. Speccing, negotiating with hardware vendors, data center selection and negotiating, DC engineer/remote hands, managing security cage access, designing your network, network gear, IP address ranges, BGP, secure remote console access, cables, shipping, negotiating with bandwidth providers (multiple, for redundancy), redundant hardware, redundant power sources, UPS. And then you get to plug your server in. Now duplicate other stuff your cloud might provide, like offsite backups, recovery procedures, HA storage, geographic redundancy. And do it again when you outgrow your initial DC. Or build your own DC (power, climate, fire protection, security, fiber, flooring, racks).
Much of this is still required in cloud. Also, I think you're missing the middle ground where 99.99% of companies could happily exist indefinitely: colo. It makes little to no financial or practical sense for most to run their own data centers.
I'm with you there, with stuff like fly.io, there's really no reason to worry about infrastructure.
AWS, on the other hand, seems about as time consuming and hard as using root servers. You're at a higher level of abstraction, but the complexity is about the same I'd say. At least that's my experience.
> On the other hand, a business of just about any size that has any reasonable amount of hosting is better off with their own systems when it comes purely to cost
From a cost PoV, sure, but when you're taking money out of capex it represents a big hit to the cash flow, while taking out twice that amount from opex has a lower impact on the company finances.
There is a whole ecosystem that pushes cloud to ignorant/fresh graduates/developers. Just take a look at the sponsors for all the most popular frameworks. When your system is super complex and depends on the cloud, they make more money. Just look at the PHP ecosystem: Laravel needs 4 times the servers to serve something that a pure PHP system would need. Most projects don't need the cloud. Only around 10% of projects actually need what the cloud provides. But they were able to brainwash a whole generation of developers/managers to think that they do. And so it goes.
I want to see an article like this, but written from a Fortune 500 CTO perspective
It seems like they all abandoned their VMware farms or physical server farms for Azure (they love Microsoft).
Are they actually saving money? Are things faster? How's performance? What was the re-training/hiring like?
In one case I know of, we got rid of our old database greybeards and replaced them with "DevOps" people who knew nothing about performance, etc.
And the developers (and many of the admins) we had knew nothing about hardware or anything so keeping the physical hardware around probably wouldn't have made sense anyways
Complicating this analysis is that computers have still been making exponential improvements in capability as clouds became popular (e.g. disks are 1000-10000x faster than they were 15 years ago), so you'd naturally expect things to become easier to manage over time as you need fewer machines, assuming of course that your developers focus on e.g. learning how to use a database well instead of how to scale to use massive clusters.
That is, even if things became cheaper/faster, they might have been even better without cloud infrastructure.
The one convincing argument from technical people I saw, one that could be made in reply to your comment, is that by now you don't find enough experienced engineers to reliably set up some really big systems. Because so much went to the cloud, a lot of the knowledge is buried there.
That came from technical people who I didn't perceive as being dogmatically pro-cloud.
Yep. I had someone tell me last week that they didn't want a more rigid schema because other teams rely on it, and anything adding "friction" to using it would be poorly received.
As an industry, we are largely trading correctness and performance for convenience, and this is not seen as a negative by most. What kills me is that at every cloud-native place I've worked at, the infra teams were both responsible for maintaining and fixing the infra that product teams demanded, but were not empowered to push back on unreasonable requests or usage patterns. It's usually not until either the limits of vertical scaling are reached, or a SEV0 occurs where these decisions were the root cause does leadership even begin to consider changes.
The thing that frustrates me is it’s possible to know how to do both. I have worked with multiple people who are quite proficient in both areas.
Cloud has definite advantages in some circumstances, but so does self-hosting; moreover, understanding the latter makes the former much, much easier to reason about. It’s silly to limit your career options.
Being good at both is twice the work, because even if some concepts translate well, IME people won't hire someone based on that. "Oh you have experience with deploying RabbitMQ but not AWS SQS? Sorry, we're looking for someone more qualified."
As someone who ran a startup with 100s of hosts: as soon as I start to count the salaries, hiring, desk space, etc. of the people needed to manage the hosts, AWS looks cheap again. Yeah, on hardware costs they are aggressively expensive. But TCO-wise, they're cheap for any decent-sized company.
Add in compliance, auditing, etc. - all things that you can set up out of the box (PCI, HIPAA, lawsuit retention). Gets even cheaper.
There was a time when cloud was significantly cheaper than owning.
I'd expect that there are people who moved to the cloud then, and over time started using services offered by their cloud provider (e.g., load balancers, secret management, databases, storage, backup) instead of running those services themselves on virtual machines, and now even if it would be cheaper to run everything on owned servers they find it would be too much effort to add all those services back to their own servers.
The cloud wasn’t about cheap, it was about fast. If you’re VC funded, time is everything, and developer velocity above all else to hyperscale and exit. That time has passed (ZIRP), and the public cloud margin just doesn’t make sense when you can own and operate (their margin is your opportunity) on prem with similar cloud primitives around storage and compute.
Elasticity is a component, but has always been from a batch job bin packing scheduling perspective, not much new there. Before k8s and Nomad, there was Globus.org.
(Infra/DevOps in a previous life at a unicorn, large worker cluster for a physics experiment prior, etc; what is old is a new again, you’re just riding hype cycle waves from junior to retirement [mainframe->COTS on prem->cloud->on prem cloud, and so on])
Also, by the way, I found it interesting that you framed your side of this disagreement as the technically correct one, but then included this:
> a desire to not centralize the Internet
This is an ideological stance! I happen to share this desire. But you should be aware of your own non-technical - "emotional" - biases when dismissing the arguments of others on the grounds that they are "emotional" and "fanatical".
Only if you’re literally running your own datacenters, which is in no way required for the majority of companies. Colo giants like Equinix already have the infrastructure in place, with a proven track record.
If you enable Multi-AZ for RDS, your bill doubles until you cancel. If you set up two servers in two DCs, your initial bill doubles from the CapEx, and then a very small percentage of your OpEx goes up every month for the hosting. You very, very quickly make this back compared to cloud.
It depends on how deep you want to go. Equinix for one (I'm sure others as well, but I'm most familiar with them) offers managed cross-DC fiber. You will probably need to manage the networking, to be fair, and I will readily admit that's not trivial.
Yep. Cross-region RDBMS is a hard problem, even when you're using a managed service – you practically always have to deal with eventual consistency, or increased latency for writes.
It can be useful. I run a latency sensitive service with global users. A cloud lets me run it in 35 locations dealing with one company only. Most of those locations only have traffic to justify a single, smallish, instance.
In the locations where there's more traffic, and we need more servers, there are more cost effective providers, but there's value in consistency.
Elasticity is nice too, we doubled our instance count for the holidays, and will return to normal in January. And our deployment style starts a whole new cluster, moves traffic, then shuts down the old cluster. If we were on owned hardware, adding extra capacity for the holidays would be trickier, and we'd have to have a more sensible deployment method. And the minimum service deployment size would probably not be a little quad processor box with 2GB ram.
Using cloud for the lower traffic locations and a cost effective service for the high traffic locations would probably save a bunch of money, but add a lot of deployment pain. And a) it's not my decision and b) the cost difference doesn't seem to be quite enough to justify the pain at our traffic levels. But if someone wants to make a much lower margin, much simpler service with lots of locations and good connectivity, be sure to post about it. But, I think the big clouds have an advantage in geographic expansion, because their other businesses can provide capital and justification to build out, and high margins at other locations help cross subsidize new locations when they start.
I agree it can be useful (latency, availability, using off-peak resources), but running globally should be a default and people should opt-in into fine-grained control and responsibility.
From outside it seems that either AWS picked the wrong default to present their customers, or that it's unreasonably expensive and it drives everyone into the in-depth handling to try to keep cloud costs down.
Cloud is more than instances. If all you need is a bunch of boxes, then cloud is a terrible fit.
I use AWS cloud a lot, and almost never use any VMs or instances. Most instances I use are along the lines of a simple anemic box for a bastion host or some such.
I use higher level abstractions (services) to simplify solutions and outsource maintenance of these services to AWS.
In the public sector, cloud solves the procurement problem. You just need to go through the yearlong process once to use a cloud service, instead of for each purchase > 1000€.
> What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue with a writeup like this with silly cloud talking points.
I’m sure I’ll be downvoted to hell for this, but I’m convinced that it’s largely their insecurities being projected.
Running your own hardware isn’t tremendously difficult, as anyone who’s done it can attest, but it does require a much deeper understanding of Linux (and of course, any services which previously would have been XaaS), and that’s a vanishing trait these days. So for someone who may well be quite skilled at K8s administration, serverless (lol) architectures, etc. it probably is seen as an affront to suggest that their skill set is lacking something fundamental.
> So for someone who may well be quite skilled at K8s administration ...
And running your own hardware is not incompatible with Kubernetes: on the contrary. You can fully well have your infra spin up VMs and then do container orchestration if that's your thing.
And part of your hardware monitoring and reporting tooling can work perfectly fine from containers.
Bare metal -> Hypervisor -> VM -> container orchestration -> a container running a "stateless" hardware monitoring service. And VMs themselves are "orchestrated" too. Everything can be automated.
Anyway, say a hard disk begins to show errors? Notifications get sent (email/SMS/Telegram/whatever) by another service in another container, and the dashboard shall show it too (dashboards are cool).
Go to the machine once the spare disk has already been resilvered, move it where the failed disk was, plug in a new disk that becomes the new spare.
Boom, done.
I'm not saying all self-hosted hardware should do container orchestration: there are valid use cases for bare metal too.
But something has to be said about controlling everything on your own infra: from the bare metal to the VMs to container orchestration. To even, potentially, your own IP address space.
This is all within reach of an individual, both skill-wise and price-wise (including obtaining your own IP address space). People who drank the cloud kool-aid should ponder this and wonder how good their skills truly are if they cannot get this up and working.
Fully agree. And if you want to take it to the next level (and have a large budget), Oxide [0] seems to have neatly packaged this into a single coherent product. They don't quite have K8s fully running, last I checked, but there are of course other container orchestration systems.
> Go to the machine once the spare disk has already been resilvered
> And running your own hardware is not incompatible with Kubernetes: on the contrary
Kubernetes actually makes so much more sense on bare-metal hardware.
On the cloud, I think the value prop is dubious - your cloud provider is already giving you VMs, why would you need to subdivide them further and add yet another layer of orchestration?
Not to mention that you're getting 2010s-era performance on those VMs, so subdividing them is terrible from a performance point of view too.
> All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding, but serve to introduce doubt to non-technical people and to trick people who don't examine what they're told.
This feels like "no true scotsman" to me. I've been building software for close to two decades, but I guess I don't have "any real technical understanding" because I think there's a compelling case for using "cloud" services for many (honestly I would say most) businesses.
Nobody is "afraid to openly discuss how cloud isn't right for many things". This is extremely commonly discussed. We're discussing it right now! I truly cannot stand this modern innovation in discourse of yelling "nobody can talk about XYZ thing!" while noisily talking about XYZ thing on the lowest-friction publishing platforms ever devised by humanity. Nobody is afraid to talk about your thing! People just disagree with you about it! That's ok, differing opinions are normal!
Your comment focuses a lot on cost. But that's just not really what this is all about. Everyone knows that on a long enough timescale with a relatively stable business, the total cost of having your own infrastructure is usually lower than cloud hosting.
But cost is simply not the only thing businesses care about. Many businesses, especially new ones, care more about time to market and flexibility. Questions like "how many servers do we need? with what specs? and where should we put them?" are a giant distraction for a startup, or even for a new product inside a mature firm.
Cloud providers provide the service of "don't worry about all that, figure it out after you have customers and know what you actually need".
It is also true that this (purposefully) creates lock-in that is expensive either to leave in place or unwind later, and it definitely behooves every company to keep that in mind when making architecture decisions, but lots of products never make it to that point, and very few of those teams regret the time they didn't spend building up their own infrastructure in order to save money later.
The problem with your claims here is they can only be right if the entire industry is experiencing mass psychosis. I reject a theory that requires that, because my ego just isn't that large.
I once worked for several years at a publicly traded firm well-known for their return-to-on-prem stance, and honestly it was a complete disaster. The first-party hardware designs didn't work right because they didn't have the hardware design staffing levels to have de-risked the possibility that AMD would fumble the performance of Zen 1, leaving them with a generation of useless hardware they nonetheless paid for. The OEM hardware didn't work right because they didn't have the chops to qualify it either, leaving them scratching their heads for months over a cohort of servers they eventually discovered were contaminated with metal chips. And, most crucially, for all the years I worked there, the only thing they wanted to accomplish was failover from West Coast to East Coast, which never worked, not even once. When I left that company they were negotiating with the data center owner, who wanted to triple the rent.
These experiences tell me that cloud skeptics are sometimes missing a few terms in their equations.
"Vendor problems" is a red herring, IMO; you can have those in the cloud, too.
It's been my experience that those who can build good, reliable, high-quality systems, can do so either in the cloud or on-prem, generally with equal ability. It's just another platform to such people, and they will use it appropriately and as needed.
Those who can only make it work in the cloud are either building very simple systems (which is one place where the cloud can be appropriate), or are building a house of cards that will eventually collapse (or just cost them obscene amounts of money to keep on life support).
Engineering is engineering. Not everyone in the business does it, unfortunately.
Like everything, the cloud has its place -- but don't underestimate the number of decisions that get taken out of the hands of technical people by the business people who went golfing with their buddy yesterday. He just switched to Azure, and it made his accountants really happy!
The whole CapEx vs. OpEx issue drives me batty; it's the number one cause of cloud migrations in my career. For someone who feels like spent money should count as spent money regardless of the bucket it comes out of, this twists my brain in knots.
> or are building a house of cards that will eventually collapse (or just cost them obscene amounts of money to keep on life support)
Ding ding ding. It's this.
> The whole CapEx vs. OpEx issue drives me batty
Seconded. I can't help but feel like it's not just a "I don't understand money" thing, but more of a "the way Wall Street assigns value is fundamentally broken." Spending $100K now, once, vs. spending $25K/month indefinitely does not take a genius to figure out.
It's all about painting the right picture for your investors, so you make up shit and classify it as COGS or opex depending on what is most beneficial for you in the moment.
There's however a middle-ground between run your own colocated hardware and cloud. It's called "dedicated" servers and many hosting providers (from budget bottom-of-the-barrel to "contact us" pricing) offer it.
They take on the liability of sourcing, managing, and maintaining the hardware for a flat monthly fee, and assume the associated risk. If they make a bad bet purchasing hardware, you won't be on the hook for it.
This seems like a point many pro-cloud people (intentionally?) overlook.
> All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding ...
And moreover, most of the actually interesting things, like having VM templates and stateless containers, orchestration, etc., are very easy to run yourself and get you 99.9% of the benefits of the cloud.
Just about any and every service is available as a container file already written for you. And if one doesn't exist, it's not hard to plumb up.
A friend of mine runs more than 700 containers (yup, seven hundred), split between his own rack at home (half of them) and dedicated servers (he runs stuff like FlightRadar, AI models, etc.). He'll soon get his own IP address space. Complete "chaos monkey"-ready infra where you can cut any cable and the thing shall keep working: everything is duplicated, can be spun up on demand, etc. Someone could steal his entire rack and all his dedicated servers, and he'd still be back operational in no time.
If an individual can do that, a company, no matter its size, can do it too. And arguably 99.9% of all the companies out there don't need infra as powerful as what most homelab enthusiasts have.
And another thing: there are even two in-betweens between "cloud" and "our own hardware located at our company". The first is colocating your own hardware in a datacenter. The second is renting dedicated servers from a datacenter.
They're often ready to accept cloud-init directly.
And it's not hard. I'd say learning to configure hypervisors on bare metal, then spinning up VMs from templates, then running containers inside the VMs is actually much easier than learning all the idiosyncrasies of all the different cloud vendors' APIs and whatnot.
Funnily enough when the pendulum swung way too far on the "cloud all the things" side, those saying at some point we'd read story about repatriation were being made fun of.
> If an individual can do that, a company, no matter its size, can do it too.
Fully agreed. I don't have physical HA – if someone stole my rack, I would be SOL – but I can easily ride out a power outage for as long as I'm willing to haul cans of gasoline to my house. The rack's UPS can keep it up at full load for at least 30 minutes, and I can get my generator running and hooked up in under 10. I've done it multiple times. I can lose a single server without issue. My only SPOF is internet, and that's only by choice, since I can get both AT&T and Spectrum here, and my router supports dual-WAN with auto-failover.
> And arguably 99.9% of all the companies out there don't have the need for an infra as powerful as the one most homelab enthusiast have.
THIS. So many people have no idea how tremendously fast computers are, and how much of an impact latency has on speed. I've benchmarked my 12-year old Dells against the newest and shiniest RDS and Aurora instances on both MySQL and Postgres, and the only ones that kept up were the ones with local NVMe disks. Mine don't even technically have _local_ disks; they're NVMe via Ceph over Infiniband.
Does that scale? Of course not; as soon as you want geo-redundant, consistent writes, you _will_ have additional latency. But most smaller and medium companies don't _need_ that.
Such an awesome article. I like how they didn't just go with the Cloud wave but kept sysadmin'ing, like ol' Unix graybeards. Two interesting things they wrote about their SSDs:
1) "At this rate, we’ll replace these [SSD] drives due to increased drive sizes, or entirely new physical drive formats (such E3.S which appears to finally be gaining traction) long before they get close to their rated write capacity."
and
2) "We’ve also anecdotally found SSDs just to be much more reliable compared to HDDs (..) easily less than one tenth the failure rate we used to have with HDDs."
To avoid sysadmin tasks, and keep costs down, you've got to go so deep in the cloud, that it becomes just another arcane skill set. I run most of my stuff on virtual Linux servers, but some on AWS, and that's hard to learn, and doesn't transfer to GCP or Azure. Unless your needs are extreme, I think sysadmin'ing is the easier route in most cases.
For so many things the cloud isn't really easier or cheaper, and most cloud providers stopped advertising it as such. My assumption is that cloud adoption is mainly driven by 3 forces:
- for small companies: free credits
- for large companies: moving prices as far away as possible from the deploy button, allowing dev and IT to just deploy stuff without purchase orders
- self-perpetuating due to hype, cv-driven development, and ease of hiring
All of these are decent reasons, but none of them may apply to a company like fastmail
Also CYA. If you run your own servers and something goes wrong, it's your fault. If it's an outage at AWS, it's their fault.
Also a huge element of follow the crowd, branding non-technical management are familiar with, and so on. I have also found some developers (front end devs, or back end devs who do not have sysadmin skills) feel cloud is the safe choice. This is very common for small companies as they may have limited sysadmin skills (people who know how to keep windows desktops running are not likely to be who you want to deploy servers) and a web GUI looks a lot easier to learn.
There are other, if often at least tangentially related, reasons but more than I can give justice to in a comment.
Many people largely got a lot of things wrong about cloud that I've been meaning to write about for a while. I'll get to it after the holidays. But probably none more than the idea that massive centralized computing (which was wrongly characterized as a utility like the electric grid) would have economics with which more local computing options could never compete.
In small companies, cloud also provides the ability to work around technical debt and to reduce risk.
For example, I have seen several cases where a poorly designed system unexpectedly used too much memory and there was no time to fix it, so the company increased the memory on all instances with a few clicks. When you need to do this immediately to avoid a botched release that has already been called "successful" and announced as such to stakeholders, that is a capability that saves the day.
An example of de-risking is using a cloud filesystem like EFS to provide a pseudo-infinite volume. No risk of an outage due to an unexpectedly full disk.
Another example would be using a managed database system like RDS vs self-managing the same RDBMS: using the managed version saves on labor and reduces risk for things like upgrades. What would ordinarily be a significant effort for a small company becomes automatic, and RDS includes various sanity checks to help prevent you from making mistakes.
The reality of the industry is that many companies are just trying to hit the next milestone of their business by a deadline, and the cloud can help despite the downsides.
> For example, I have seen several cases where a poorly designed system unexpectedly used too much memory
> using a managed database system like RDS vs self-managing the same RDBMS: using the managed version saves on labor
As a DBRE / SRE, I can confidently assert that belief in the latter is often directly responsible for the former. AWS is quite clear in their shared responsibility model [0] that you are still responsible for making sound decisions, tuning various configurations, etc. Having staff that knows how to do these things often prevents the poor decisions from being made in the first place.
I'm very interested in approaches that avoid cloud, so please don't read this as me saying cloud is superior. I can think of some other advantages of cloud:
- easy to setup different permissions for users (authorisation considerations).
- able to transfer assets to another owner (e.g., if there's a sale of a business) without needing to move physical hardware.
- other outsiders (consultants, auditors, whatever) can come in and verify the security (or other) of your setup, because it's using a standard well known cloud platform.
It never disappeared in some places. In my region there's been zero interest in "the cloud" because of physical remoteness from all major GCP/AWS/Azure datacenters (resulting in high latency), for compliance reasons, and because it's easier and faster to solve problems by dealing with a local company than pleading with a global giant that gives zero shits about you because you're less than a rounding error in its books.
The fact that Fastmail work like this, are transparent about what they're up to and how they're storing my email and the fact that they're making logical decisions and have been doing so for quite a long time is exactly the reason I practically trip over myself to pay them for my email. Big fan of Fastmail.
Aside: Fastmail was the best email provider I ever used. The interface was intuitive and responsive, both on mobile and web. They have extensive documentation for everything. I was able to set up a custom domain and a catch-all email address in a few minutes. Customer support is great, too. I emailed them about an issue and they responded within the hour (turns out it was my fault). I feel like it's a really mature product/company and they really know what they're doing, and have a plan for where they're going.
I ended up switching to Protonmail, because of privacy (Fastmail is within the Five Eyes (Australia)), which is the only thing I really like about Protonmail. But I'm considering switching back to Fastmail, because I liked it so much.
I was told Fastmail is excellent, and I am not a big fan of Gmail. Once locked out of Gmail for good, your email, and the apps associated with it, are gone forever. Source? Personal experience.
"A private inbox $60 for 12 months". I assume it is USD, not AU$ (AFAIK, Fastmail is based in Australia.) Still pricey.
At https://www.infomaniak.com/ I can buy email service for an (in my case external) domain for 18 Euro a year and I get 5 inboxes. And it is based in Switzerland, so no EU or US jurisdiction.
I have a few websites and Fastmail would just be prohibitively expensive for me.
I have seen a common sentiment that self-hosting is almost always better than cloud. What these discussions do not mention is how to effectively run your business applications on this infrastructure.
Things like identity management (AAD/IAM), provisioning and running VMs, and deployments. The network side of things like VNets, DNS, securely opening ports, etc. Monitoring setup across the stack. There is so much functionality required to safely expose an application externally that I can't even coherently list it all here. Are people just using SaaS for everything (which I think would defeat the purpose of on-prem infra), or can a competent sysadmin handle all this to give a cloud-like experience to end developers?
Can someone share their experience or share any write ups on this topic?
For more context, I briefly worked at a very large hedge fund which had a small DC's worth of VERY beefy machines but absolutely no platform on top of them. Hosting an application was done by copying the binaries onto a particular well-known machine, running npm commands, and restarting nginx. You'd log a ticket with the sysadmin to reserve an IP and point an internal DNS entry at this machine (no load balancer). Deployment was a shell script which rcp'd new binaries and restarted nginx. No monitoring or observability stack.
There was a script which would log you into a random machine to run your workloads (be ready to get angry IMs from more senior quants running their workloads on that machine if your development build takes up enough resources to affect their work). I can go on and on, but I think you get the idea.
Do you mean for administrative access to the machines (over SSH, etc) or for "normal" access to the hosted applications?
Admin access: Ansible-managed set of UNIX users & associated SSH public keys, combined with remote logging so every access is audited and a malicious operator wiping the machine can't cover their tracks will generally get you pretty far. Beyond that, there are commercial solutions like Teleport which provide integration with an IdP, management web UI, session logging & replay, etc.
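The Ansible side of that can be a couple of tasks. A minimal sketch, assuming an `admin_users` list defined in group_vars (names and structure here are illustrative, not from the comment above):

```yaml
# Illustrative Ansible tasks: create admin UNIX users and install their SSH keys.
# The admin_users list would live in group_vars, kept in git for auditability.
- name: Create admin UNIX users
  ansible.builtin.user:
    name: "{{ item.name }}"
    groups: sudo
    append: true
    shell: /bin/bash
  loop: "{{ admin_users }}"

- name: Install each admin's SSH public key
  ansible.posix.authorized_key:
    user: "{{ item.name }}"
    key: "{{ item.ssh_public_key }}"
    # exclusive removes any key not in the list, so offboarding
    # is just deleting the entry and re-running the playbook
    exclusive: true
  loop: "{{ admin_users }}"
```

With `exclusive: true`, onboarding and offboarding both become a git commit plus a playbook run.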
Normal line-of-business access: this would be managed by whatever application you're running, not much different to the cloud. But if your application isn't auth-aware or is unsafe to expose to the wider internet, you can stick it behind various auth proxies such as Pomerium - it will effectively handle auth against an IdP and only pass through traffic to the underlying app once the user is authenticated. This is also useful for isolating potentially vulnerable apps.
> provisioning and running VMs
Provisioning: once a VM (or even a physical server) is up and running enough to be SSH'd into, you should have a configuration management tool (Ansible, etc) apply whatever configuration you want. This would generally involve provisioning users, disabling some stupid defaults (SSH password authentication, etc), installing required packages, etc.
To get a VM to an SSH'able state in the first place, you can configure your hypervisor to pass through "user data" which will be picked up by something like cloud-init (integrated by most distros) and interpreted at first boot - this allows you to do things like include an initial SSH key, create a user, etc.
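A minimal cloud-init user-data sketch along those lines (the user name, key, and package list are placeholders):

```yaml
#cloud-config
# First-boot provisioning: create an initial user with an SSH key,
# so configuration management can take over from there.
users:
  - name: ops
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... ops@example.com
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
ssh_pwauth: false   # disable SSH password authentication from the start
package_update: true
packages:
  - python3         # needed on the target for Ansible modules to run
```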
To run VMs on self-managed hardware: libvirt, proxmox in the Linux world. bhyve in the BSD world. Unfortunately most of these have rough edges, so commercial solutions there are worth exploring. Alternatively, consider if you actually need VMs or if things like containers (which have much nicer tooling and a better performance profile) would fit your use-case.
> deployments
Depends on your application. But let's assume it can fit in a container - there's nothing wrong with a systemd service that just reads a container image reference in /etc/... and uses `docker run` to run it. Your deployment task can just SSH into the server, update that reference in /etc/ and bounce the service. Evaluate Kamal which is a slightly fancier version of the above. Need more? Explore cluster managers like Hashicorp Nomad or even Kubernetes.
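A sketch of that systemd approach, assuming the image reference is written to a hypothetical /etc/myapp/image file (service name, ports, and paths are illustrative):

```ini
# /etc/systemd/system/myapp.service
[Unit]
Description=myapp container
After=network-online.target docker.service
Requires=docker.service

[Service]
# /etc/myapp/image contains e.g. IMAGE=registry.example.com/myapp:abc123
EnvironmentFile=/etc/myapp/image
# clean up any leftover container from a previous run ("-" ignores failure)
ExecStartPre=-/usr/bin/docker rm -f myapp
ExecStart=/usr/bin/docker run --rm --name myapp -p 127.0.0.1:8080:8080 ${IMAGE}
Restart=always

[Install]
WantedBy=multi-user.target
```

Deploying is then just SSH'ing in, rewriting /etc/myapp/image with the new tag, and running `systemctl restart myapp`.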
> Network side of things like VNet
Wireguard tunnels set up (by your config management tool) between your machines, which will appear as standard network interfaces with their own (typically non-publicly-routable) IP addresses, and anything sent over them will transparently be encrypted.
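One node's config might look like this sketch (keys, hostnames, and addresses are placeholders that your config management tool would template per host):

```ini
# /etc/wireguard/wg0.conf - private mesh interface for this node
[Interface]
# non-publicly-routable mesh address of this node
Address = 10.0.0.1/24
PrivateKey = <this-node-private-key>
ListenPort = 51820

# one [Peer] block per other machine in the mesh
[Peer]
PublicKey = <peer-public-key>
AllowedIPs = 10.0.0.2/32
Endpoint = peer1.example.com:51820
PersistentKeepalive = 25
```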
> DNS
Generally very little reason not to outsource that to a cloud provider or even your (reputable!) domain registrar. DNS is mostly static data though, which also means if you do need to do it in-house for whatever reason, it's just a matter of getting a CoreDNS/etc container running on multiple machines (maybe even distributed across the world). But really, there's no reason not to outsource that and hosted offerings are super cheap - so go open an AWS account and configure Route53.
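If you do bring it in-house, the CoreDNS config is correspondingly small. A sketch, with the zone name and file path illustrative:

```
# Corefile: serve example.com from a static zone file
example.com {
    file /etc/coredns/db.example.com
    log
}
```

Run the same container image with the same Corefile on a few machines and list them all as NS records.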
> securely opening ports
To begin with, you shouldn't have anything listening that you don't want to be accessible. Then it's not a matter of "opening" or closing ports - the only ports that actually listen are the ones you want open by definition because it's your application listening for outside traffic. But you can configure iptables/nftables as a second layer of defense, in case you accidentally start something that unexpectedly exposes some control socket you're not aware of.
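A sketch of that second layer in nftables (the ports here are illustrative; keep only what your app actually serves):

```
# /etc/nftables.conf - default-deny inbound, allow only what's intended
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif lo accept
    tcp dport { 22, 80, 443 } accept   # SSH plus the app's public ports
    udp dport 51820 accept             # WireGuard, if used
    icmp type echo-request accept
  }
}
```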
> Monitoring setup across the stack
collectd running on each machine (deployed by your configuration management tool) sending metrics to a central machine. That machine runs Grafana/etc. You can also explore "modern" stuff that the cool kids play with nowadays like VictoriaMetrics, etc, but metrics is mostly a solved problem so there's nothing wrong with using old tools if they work and fit your needs.
For logs, configure rsyslogd to log to a central machine - on that one, you can have log rotation. Or look into an ELK stack. Or use a hosted service - again nothing prevents you from picking the best of cloud and bare-metal, it's not one or the other.
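The rsyslog side of that is only a few lines. A sketch, with the central host name being a placeholder:

```
# /etc/rsyslog.d/50-forward.conf on every machine:
# forward everything to the central log host over TCP (@@ = TCP, @ = UDP)
*.*  @@logs.internal.example.com:514

# on the central machine only, enable the TCP listener:
module(load="imtcp")
input(type="imtcp" port="514")
```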
> safely expose an application externally
There's a lot of snake oil and fear-mongering around this. First off, you need to differentiate between vulnerabilities of your application and vulnerabilities of the underlying infrastructure/host system/etc.
App vulnerabilities, in your code or dependencies: cloud won't save you. It runs your application just like it's been told. If your app has an SQL injection vuln or one of your dependencies has an RCE, you're screwed either way. To manage this you'd do the same as you do in cloud - code reviews, pentesting, monitoring & keeping dependencies up to date, etc.
Infrastructure-level vulnerabilities: cloud providers are responsible for keeping the host OS and their provided services (load balancers, etc) up to date and secure. You can do the same. Some distros provide unattended updates, which your config management tool can enable. Stuff that doesn't need to be reachable from the internet shouldn't be (bind internal stuff to your Wireguard interfaces). Put admin stuff behind some strong auth - TLS client certificates are the gold standard but have management overheads. Otherwise, use an IdP-aware proxy (like mentioned above). Don't always trust app-level auth. Beyond that, it's the usual - common sense, monitoring for "spooky action at a distance", and luck. Not too much different from your cloud provider, because they won't compensate you either if they do get hacked.
> For more context, I worked at a very large hedge fund briefly which had a small DC worth of VERY beefy machines but absolutely no platform on top of it...
No, using Ansible to distribute public keys does not get you very far. It's fine for a personal project or even a team of 5-6 with a handful of machines, but beyond that you really need a better way to onboard, offboard, and modify accounts. If you're doing anything but a toy project, you're better off starting with something like IPA for host access controls.
Why do you think that? I did something similar at a previous job for something bordering on 1k employees.
User administration was done by modifying a yaml file in git. Nothing bad to say about it really. It sure beats point-and-click Active Directory any day of the week. Commit log handy for audits.
If there are no externalities demanding anything else, I'd happily do it again.
What's the risk you're trying to protect against, that a "better" (which one?) way would mitigate that this one wouldn't?
> IPA
Do you mean https://en.wikipedia.org/wiki/FreeIPA ? That seems like a huge amalgamation of complexity in a non-memory-safe language that I feel would introduce a much bigger security liability than the problem it's trying to solve.
I'd rather pony up the money and use Teleport at that point.
> which are technologies old and reliable as dirt.
Technologies, sure. Implementations? Not so much.
I can trust OpenSSH because it's deployed everywhere and I can be confident all the low-hanging fruits are gone by now, and if not, its widespreadness means I'm unlikely to be the most interesting target, so I am more likely to escape a potential zero-day unscathed.
What's the market share of IPA in comparison? Has it seen any meaningful action in the last decade, and the same attention, from both white-hats (audits, pentesting, etc) and black-hats (trying to break into every exposed service)? I very much doubt it, so the safe thing to assume is that it's nowhere near as bulletproof as OpenSSH and that it's more likely for a dedicated attacker to find a vuln there.
Love this article, and I'm also running some stuff on old enterprise servers in some racks somewhere. Over the last year I've had to dive into Azure as we have customers using it (we're a B2B company), and I finally understood why everyone is doing cloud despite the price:
Global permissions, seamless organization and IaC. If you are Fastmail or a small startup - go buy some used Dell PowerEdges with Epycs in some colo rack with 10GbE transit and save tons of money.
If you are a company with tons of customers and tons of requirements, it's powerful to put each concern into a landing zone, run some Bicep/Terraform, have a resource group to control costs, get savings on overall core count, and be done with it.
Assign permissions in a namespace for your employee or customer, have some back and forth about requirements, and it's done. No need to sysadmin across servers. No need to check for broken disks.
I also blame the hell of VMware and virtual machines for everything that is a PITA to maintain as a sysadmin but is loved because it's common knowledge. I would only do k8s on bare metal today and skip the whole virtualization thing completely. I guess it's also these pains that are softened in the cloud.
Because the default for companies today is cloud, even though it almost never makes sense. Sure, if you have really spiky load, need to dynamically scale at any point, and don't care about your spend, it might make sense.
I've even worked in companies where the engineering team spent effort and time building "scalable infrastructure" before the product itself had even found product-market fit...
Nobody said it's surprising though, they are well aware of it having done it for more than two decades. Many newcomers are not aware of it though, as their default is "cloud" and they never even shopped for servers, colocation or looked around on the dedicated server market.
I don't think it's just that they're not aware. Purely from a scaling and distribution perspective, it can be wiser to start in the cloud while you're still in the product-market-fit phase. Also, bare metal requires more on the capex end, and with how our corporate tax system is set up, it's discouraging to go down this lane first; it'd be better to spend on acquiring clients.
Also, I'd guess a lot of technical founders are more familiar with cloud/server-side work than with handling or delegating sysadmin tasks that might require adding members to the team.
I agree, the cloud definitely has a lot of use cases and when you are building more complicated systems it makes sense to just have to do a few clicks to get a new stack setup vs. having someone evaluate solutions and getting familiar with operating them on a deep level (backups etc.).
Would be interesting to know how files get stored. They don't mention any distributed FS solutions like SeaweedFS, so once a drive is full, does the file get sent to another one via some service? Also, ZFS seems an odd choice since deletions (especially of small files) at 80%+ full are crazy slow.
Unlike ext4, which locks the directory when unlinking, ZFS is able to scale on parallel unlinking. Specifically, ZFS has range locks that permit directory entries to be removed in parallel from the extendible hash trees that store them. While this is relatively slow for sequential workloads, it is fast on parallel workloads. If you want to delete a large directory subtree fast on ZFS, do the rm operations in parallel. For example, this will run faster on ZFS than a naive rm operation:
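One way to do that, sketched here with find and xargs (GNU parallel works similarly; the directory tree below is a throwaway example):

```shell
# make a throwaway tree: 8 subdirectories of 100 files each (paths illustrative)
demo=/tmp/zfs_rm_demo
mkdir -p "$demo"
for i in $(seq 1 8); do
  mkdir -p "$demo/sub$i"
  for j in $(seq 1 100); do : > "$demo/sub$i/f$j"; done
done

# parallel unlink: one rm -r per top-level entry, 8 at a time,
# instead of a single sequential rm -r
find "$demo" -mindepth 1 -maxdepth 1 -print0 | xargs -0 -P 8 -n 1 rm -r
```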
A friend had this issue on spinning disks the other day. I suggested he do this and the remaining files were gone in seconds when at the rate his naive rm was running, it should have taken minutes. It is a shame that rm does not implement a parallel unlink option internally (e.g. -j), which would be even faster, since it would eliminate the execve overhead and likely would eliminate some directory lookup overhead too, versus using find and parallel to run many rm processes.
For something like Fastmail, which has many users, unlinking should be parallel already, so unlinking on ZFS will not be slow for them.
By the way, that 80% figure has not been true for more than a decade. You are referring to the best fit allocator being used to minimize external fragmentation under low space conditions. The new figure is 96%. It is controlled by metaslab_df_free_pct in metaslab.c.
Modification operations become slow when you are at/above 96% space filled, but that is to prevent even worse problems from happening. Note that my friend’s pool was below the 96% threshold when he was suffering from a slow rm -r. He just had a directory subtree with a large amount of directory entries he wanted to remove.
For what it is worth, I am the ryao listed here and I was around when the 80% to 96% change was made.
I discovered this yesterday! Blew my mind. I had to check 3 times that the files were actually gone and that I specified the correct directory as I couldn't believe how quick it ran. Super cool
If you don't have high bandwidth requirements, like for background / batch processing, the OVH Eco family [1] of bare metal servers is incredibly cheap.
> Fastmail has some of the best uptime in the business, plus a comprehensive multi data center backup system. It starts with real-time replication to geographically dispersed data centers, with additional daily backups and checksummed copies of everything. Redundant mirrors allow us to failover a server or even entire rack in the case of hardware failure, keeping your mail running.
I absolutely love Fastmail. I moved off of Gmail years ago with zero regrets. Better UI, better apps, better company, and need I say better service? I still maintain and fetch from a Gmail account so it all just works seamlessly for receiving and sending Gmail, so you don’t have to give anything up either.
I moved from my own colocated 1U running Mailcow to Fastmail and don't regret it one bit. This was an interesting read, glad to see they think things through nice and carefully.
The only things I wish FM had are all software:
1. A takeout-style API to let me grab a complete snapshot once a week with one call
I use Fastmail for my personal mail, and I don’t regret it, but I’m not quite as sold as you are, I guess maybe because I still have a few Google work accounts I need to use. Spam filtering in Fastmail is a little worse, and the search is _terrible_. The iOS app is usable but buggy. The easy masked emails are a big win though, and setting up new domains feels like less of a hassle with FM. I don’t regret using Fastmail, and I’d use them again for my personal email, but it doesn’t feel like a slam dunk.
I take that back; this is (to me) the most interesting part:
"Although we’ve only ever used datacenter class SSDs and HDDs, failures and replacements every few weeks were a regular occurrence on the old fleet of servers. Over the last 3+ years, we’ve only seen a couple of SSD failures in total across the entire upgraded fleet of servers. This is easily less than one tenth the failure rate we used to have with HDDs."
I am working on a personal project (some would call it a startup, but I have no intention of getting external financing and other Americanisms) where I have set up my own CDN and video encoding, among other things. These days, whenever you have a problem, everyone answers "just use cloud", and that results in people really knowing nothing any more. It is saddening. But on the other hand, it ensures all my decades of knowledge will be very well paid in the future, if I ever need to get a job.
I'm a little surprised it seems they didn't have some existing compression solution before moving to zfs. With so much repetitive text across emails I would think there would be a LOT to gain, such as from dictionaries, compressing many emails into bigger blobs, and fine-tuning compression options.
Better an asshole than a moron in my opinion. If he maneuvers his cursor over the link he can click it and transform from “very confused” to “unconfused” provided he can comprehend English - which is admittedly in question.
> So after the success of our initial testing, we decided to go all in on ZFS for all our large data storage needs. We’ve now been using ZFS for all our email servers for over 3 years and have been very happy with it. We’ve also moved over all our database, log and backup servers to using ZFS on NVMe SSDs as well with equally good results.
If you're looking at ZFS on NVMe you may want to look at Alan Jude's talk on the topic, "Scaling ZFS for the future", from the 2024 OpenZFS User and Developer Summit:
gmail does spam filtering very well for me. fastmail, on the other hand, puts lots of legit emails into the spam folder. manually marking them "not spam" doesn't help
If I look at my Gmail SPAM folder, there is very rarely something genuinely important in it. What there is a fair bit of, though, is random newsletters and announcements that I may have signed up for in some way, shape or form that I don't really care about or generally look at. I assume they've been reported as SPAM by enough people, rather than simply unsubscribed from, that Google now labels them as such.
The cloud providers really kill you on IO for your VMs. Even if 'remote' SSDs are available with configurable ($$) IOPs/bandwidth limits, the size of your VM usually dictates a pitiful max IO/BW limit. In Azure, something like a 4-core 16GB RAM VM will be limited to 150MB/s across all attached disks. For most hosting tasks, you're going to hit that limit far before you max out '4 cores' of a modern CPU or 16GB of RAM.
On the other hand, if you buy a server from Dell and run your own hypervisor, you get a massive reserve of IO, especially with modern SSDs. Sure, you have to share it between your VMs, but you own all of the IO of the hardware, not some pathetic slice of it like in the cloud.
As is always said in these discussions, unless you're able to move your workload to PaaS offerings in the cloud (serverless), you're not taking advantage of what large public clouds are good at.
Biggest issue isn't even sequential speed but latency. In the cloud all persistent storage is networked and has significantly more latency than direct-attached disks. This is a physical (speed of light) limit, you can't pay your way out of it, or throw more CPU at it. This has a huge impact for certain workloads like relational databases.
And then come the weird aspects of bad cloud service providers like IONOS: broken OS images; a provisioning API that is a bottleneck, where what other people do and how much they do can slow down your own provisioning, and creating network interfaces can take minutes, with customer service saying "That's how it is, cannot change it."; and a very shitty web user interface that desperately tries to be a single-page app yet has all the default browser functionality, like the back button, broken. Yet they still cost literally 10x what Hetzner Cloud costs, while Hetzner basically does everything better.
And then it is still also about other people's hardware in addition to that.
Yeah, Cloud is a bit of a scam innit? Oxide is looking more and more attractive every day as the industry corrects itself from overspending on capabilities they would never need.
Fake news.
I've got my bare metal server deployed and installed with my ansible playbook even before you manage to log into the bazillion layers of abstraction that is AWS.
A company hosting an online service seems to think it deserves a medal for discovering that S3 buckets from a cloud provider are crap and cost a fortune.
The heading in this space makes you think they're running custom FPGAs, such as with Gmail, not just running on metal... As for drive failures, welcome to storage at scale. Build your solution so that replacing 10 disks at a time is a weekly task, not a critical incident at 2am when a single disk dies...
Storing/Accessing tonnes of <4kB files is difficult, but other providers are doing this on their own metal with CEPH at the PB scale.
I love ZFS; it's great with per-disk redundancy, but CEPH is really the only game in town for inter-rack/DC resilience, which I would hope my email provider has.
A mail-cloud provider uses its own hardware? Well, that’s to be expected, it would be a refreshing article if it was written by one of their customers.
But what about the cost and complexity of a room with the racks and the cooling needs of running these machines? And the uninterrupted power setup? The wiring mess behind the racks.
Yes they have, and they feel they deserve credit for discovering a WiFi cable is more reliable than the new shiny kit that was sold to them by a vendor...
There is a very competitive market for colo providers in basically every major metropolitan area in the US, Europe, and Asia. The racks, power, cooling, and network to your machines is generally very robust and clearly documented on how to connect. Deploying servers in house or in a colo is a well understood process with many experts who can help if you don’t have these skills.
Colo offers the ability to ship and deploy and keep latencies down if you're global, but if you're local yes you should just get someone on site and the modern equivalent of a T1 line setup to your premises if you're running "online" services.
Since I moved from gmail to fastmail, my mailbox is full of spam. I tried setting up rules but there are just too many of them, so I abandoned that strategy after a month. Now I just label mail from senders that are not in my contacts differently. But it's still a mess. I'm at the point that I prefer WhatsApp over email.
So, Fastmail please fix this or tell me what I'm doing wrong. IMHO when uninteresting mail arrives it should take at most two clicks to install a new rule and apply it.
Your comment is confusing because you start this one saying your inbox is full of spam, but respond to a suggestion to mark it as spam by saying it's not actually spam.
If something is not spam but you want it out of your inbox there's a few options:
- click Unsubscribe next to the sender. This should be possible for essentially all promotional email.
- click Actions -> click Block <sender>. Messages from this address will now immediately go to trash.
- click Actions -> click Add rule from message (-> optionally change the suggested conditions) -> check Archive (or if you don't use labels click Move to) -> click Save. Messages matching the conditions will now skip your inbox.
There's not much they could do to make that easier without magically knowing what you care about and what you don't.
This last week, gmail failed to filter as spam an email with subject "#T Anitra", body,
> oF1 d 4440 - 2 B 32677 83
> R Teri E x E q
>
> k 50347733 Safoorabegum
and an attachment "7330757559.pdf". It let through 8 similar emails in the same week, and many more even more egregiously gibberish emails over the years. I'm not pleased with the quality of gmail's spam filter.
I moved to FastMail three years ago, and, for a contrasting experience, found that spam filtering was almost on a par with Gmail. I had feared it would be otherwise.
Fastmail has wildcard email support, so it’s pretty easy to have an email per purchase you make (for example). This makes it easy to see who leaked your email to spammers. Anyway, I have nowhere near the volume of spam with Fastmail that I had with Gmail.
Never had that after the first few years, but I hear other people do have that. Maybe it's because I used it for 2 decades now? I tried alternatives including fastmail but I always leave them because I get swamped by spam while gmail works fine.
I don't want to report everything as spam. For example, promotional emails from businesses that I bought something from. I don't want to punish those businesses; and those emails might contain vouchers that I could use later. But I want those emails moved out of the way without any action from my side.
That's like Spotify telling me to "keep disliking" when I complained to them about why songs in a certain language (which I never liked or listened to, and certainly don't speak) keep filling the home screen, after I had told them in the first complaint that I had been doing that for months.
If you meet someone new at a social event and give them your email address, where do you want your email provider to put the message that this person sent?
I get no spam on Fastmail. I assume this is because I never give out my email to anyone, and I create new addresses for every interaction. This way I keep track of who I'm interacting with and also who's selling my alias emails.
Just wish there was a decent way to do this with mobile numbers!
Same, I religiously create a masked email for every website (just checked, it's now at 163!). I simply don't give my "main" email out.
Oddly enough, simply unsubscribing via the websites themselves has kept things clean; I've yet to notice any true spam from a random source aimed at any of my emails since I joined last year.
> All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding
This is false. AWS infrastructure is vastly more secure than almost all company data centers. AWS has a rule that the same person cannot have logical access and physical access to the same storage device. Very few companies have enough IT people to have this rule. The AWS KMS is vastly more secure than what almost all companies are doing. The AWS network is vastly better designed and operated than almost all corporate networks. AWS S3 is more reliable and scalable than anything almost any company could create on their own. To create something even close to it you would need to implement something like MinIO using 3 separate data centers.
OTOH:
1. big clouds are very lucrative targets for spooks; your data seems pretty likely to be hoovered up as "bycatch" (or maybe main catch, depending on your luck) by various agencies and then traded around as currency
2. you never hear about security problems (incidents or exposure) in the platforms; there's no transparency
3. better than most corporate stuff is a low bar
>3. better than most corporate stuff is a low bar
I think it's a very relevant bar, though. The top level commenter made points about "a business of just about any size", which seems pretty exactly aligned with "most corporate stuff".
> AWS infrastructure is vastly more secure than almost all company data centers
Secure in what terms? Security is always about a threat model and trade-offs. There's no absolute, objective term of "security".
> AWS has a rule that the same person cannot have logical access and physical access to the same storage device.
Any promises they make aren't worth anything unless there are contractually stipulated damages that AWS must pay in case of a breach, with those damages actually corresponding to the cost of said breach for the customer, and a history of actually paying out said damages without shenanigans. They've already got a track record of lying on their status pages, so it doesn't bode well.
But I'm actually wondering what this specific rule even tries to defend against? You presumably care about data protection, so logical access is what matters. Physical access seems completely irrelevant no?
> Very few companies have enough IT people to have this rule
Maybe, but that doesn't actually mitigate anything from the company's perspective? The company itself would still be in the same position, aka not enough people to reliably separate responsibilities. Just that instead of those responsibilities being physical, they now happen inside the AWS console.
> The AWS KMS is vastly more secure than what almost all companies are doing.
See first point about security. Secure against what - what's the threat model you're trying to protect against by using KMS?
But I'm not necessarily denying that (at least some) AWS services are very good. Question is, is that "goodness" required for your use-case, is it enough to overcome its associated downsides, and is the overall cost worth it?
A pragmatic approach would be to evaluate every component on its merits and fitness to the problem at hand instead of going all in, one way or another.
one of my greatest learnings in life is to differentiate between facts and opinions - sometimes opinions are presented as facts and vice versa. if you think about it, the statement "this is false" is a response to an opinion (presented as a fact), but is not itself a fact. there is no way one can objectively define and defend what "real technical understanding" means. the cloud space is vast, with millions of people having varied understanding and thus varied opinions.
so let's not fight a battle that will never be won. there is no point in convincing pro-cloud people that cloud isn't the right choice and vice versa. let people share stories where it made sense and where it didn't.
as someone who has lived in the cloud security space since 2009 (and was a founder of RedLock - one of the first CSPMs), in my opinion there is no doubt that AWS is indeed better designed than most corp networks - but is that what you really need? if you run your entire corp and LOB apps on aws but have poor security practices, will it be the right decision? what if you have the best security engineers in the world, but they are best at Cisco-type security - configuring VLANs and managing endpoints - and are not good at detecting someone using IMDSv1 on an ec2 instance exposed to the internet and running an app vulnerable to csrf?
when the scope of discussion is as vast as cloud vs on-prem, imo, it is a bad idea to make absolute statements.
Great points. Also, if you end up building your apps as Rube Goldberg machines living up to "AWS Well-Architected" criteria (pushed by lots of AWS-certified staff whose paychecks now depend on following AWS recommended practices), the complexity will kill your security, as nobody will understand the systems anymore.
The other part is that when us-east-1 goes down, you can blame AWS, and a third of your customer's vendors will be doing the same. When you unplug the power to your colo rack while installing a new server, that's on you.
AWS is so complicated, we usually find more impactful permission problems than in any company using their own hardware
Making API calls from a VM on shared hardware to KMS is vastly more secure than doing AES locally? I'm skeptical to say the least.
Encrypting data is easy, securely managing keys is the hard part. KMS is the Key Management Service. And AWS put a lot of thought and work into it.
https://docs.aws.amazon.com/kms/latest/cryptographic-details...
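The linked doc describes envelope encryption: the master key never leaves the service, and callers only ever receive data keys (one plaintext copy to use, one wrapped copy to store). A toy Python sketch of that pattern - emphatically not real cryptography; a hash-based XOR keystream stands in for AES and a plain object stands in for the HSM boundary:

```python
# Toy illustration of the envelope-encryption pattern KMS implements.
# NOT real crypto: a SHA-256 counter-mode XOR keystream stands in for
# AES, and the "HSM" is just a Python object. The point is the shape:
# the master key never leaves ToyKMS; callers only handle data keys.
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Symmetric toy cipher: XOR data with a hash-derived keystream."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

class ToyKMS:
    """The master key never leaves this object (the 'HSM boundary')."""
    def __init__(self):
        self._master = secrets.token_bytes(32)

    def generate_data_key(self):
        plaintext_key = secrets.token_bytes(32)
        wrapped_key = keystream_xor(self._master, plaintext_key)
        return plaintext_key, wrapped_key

    def unwrap(self, wrapped_key: bytes) -> bytes:
        return keystream_xor(self._master, wrapped_key)

kms = ToyKMS()
data_key, wrapped = kms.generate_data_key()
ciphertext = keystream_xor(data_key, b"customer record")
# Store ciphertext + wrapped key together; discard the plaintext key.
recovered = keystream_xor(kms.unwrap(wrapped), ciphertext)
assert recovered == b"customer record"
```

The value KMS adds on top of this shape is exactly the "hard part" above: audited access to unwrap, rotation, and keys that never exist in your process memory.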
about security, most businesses using AWS invest little to nothing in securing their software, or even adopt basic security practices for their employees
having the most secure data center doesn't matter if you load your secrets as env vars in a system that can be easily compromised by a motivated attacker
so i don't buy this argument as a general reason pro-cloud
The cloud is someone else’s computer.
It’s like putting something in someone’s desk drawer under the guise of convenience at the expense of security.
Why?
Too often, someone other than the data owner has or can get access to the drawer directly or indirectly.
Also, Cloud vs self hosted to me is a pendulum that has swung back and forth for a number of reasons.
The benefits of the cloud outlined here are often a lot of open source tech packaged up and sold as manageable from a web browser, or a command line.
One of the major reasons the cloud became popular was the difficulty of managing Linux networking at volume and scale. At the time, the cloud became very attractive for that reason, plus the ability to virtualize bare-metal servers into any combination of local and cloud hosting.
Self-hosting has become easier by an order of magnitude or two for anyone who knew how to do it - but it's not something people who haven't done both self-hosting and cloud can really discuss.
Cloud has abstracted away the cost of horsepower and converted it into transactions. People are discovering that a fraction of the horsepower they thought they needed is enough to service their workloads.
At some point the horsepower got way beyond what they needed, and nobody noticed. But paying for cloud is convenient and standardized.
Company data centres can be reasonably secured using a number of PaaS or IaaS solutions readily available off the shelf. Tools from VMware, Proxmox and others are tremendous.
It may seem like there's a lot to learn, but most problems that are new to someone have already been thought through extensively by people whose experience goes beyond cloud-only.
> The cloud is someone else’s computer.
And in the case of AWS it is someone else's extremely well designed and managed computer and network.
> The cloud is someone else’s computer
Isn’t it more like leasing in a public property? Meaning it is yours as long as you are paying the lease? Analogous to renting an apartment instead of owning a condo?
Not at all. You can inspect the apartment you rent. The cloud is totally opaque in that regard.
<citations needed>
But isn't using Fastmail akin to using a cloud provider (managed email vs managed everything else)? They are similarly a service provider, and as a customer, you don't really care "who their ISP is?"
The discussion matters when we are talking about building things: whether you self-host or use managed services is a set of interesting trade-offs.
Yes, FastMail is a SaaS. But there are adherents of a religion who will tell you that companies like FastMail should be built on top of AWS, and that this is the only true way. It is good to have some counter-narrative to this.
Being cloud compatible (packaged well) can be as important as being cloud-agnostic (work on any cloud).
Too many projects become beholden to one cloud.
<ctoHatTime> Dunno man, it's really really easy to set up an S3 and use it to share datasets for users authorized with IAM....
And IAM and other cloud security and management considerations are where the opex/capex and capability argument starts to break down. It turns out the "cloud" savings come from not having the in-house capability to manage hardware - but for most businesses, you still want some of that lovely reliability.
(In short, I agree with you, substantially).
Like code. It is easy to get something basic up, but substantially more resources are needed for non-trivial things.
I feel like IAM may be the sleeper killer-app of cloud.
I self-host a lot of things, but boy oh boy if I were running a company it would be a helluvalotta work to get IAM properly set up.
I strongly agree with this and also strongly lament it.
I find IAM to be a terrible implementation of a foundationally necessary system. It feels tacked on to me, except now it's tacked onto thousands of other things and there's no way out.
like terraform! isn't pulumi 100% better but there's no way out of terraform.
That's essentially why "platform engineering" is a hot topic. There are great FOSS tools for this, largely in the Kubernetes ecosystem.
To be clear, authentication could still be outsourced, but authorizing access to (on-prem) resources in a multi-tenant environment is something that "platforms" are frequently designed for.
> All the pro-cloud talking points... don't persuade anyone with any real technical understanding
This is a very engineer-centric take. The cloud has some big advantages that are entirely non-technical:
- You don't need to pay for hardware upfront. This is critical for many early-stage startups, who have no real ability to predict CapEx until they find product/market fit.
- You have someone else to point the SOC2/HIPAA/etc auditors at. For anyone launching a company in a regulated space, being able to checkbox your entire infrastructure based on AWS/Azure/etc existing certifications is huge.
You can over-provision your own bare-metal resources 20x and it will still be cheaper than cloud. The capex talking point is just that - a talking point.
The real cost wins of self-hosting are that anything needing new hardware becomes an ordeal, and engineers won't use high-cost, value-added services. I agree that there's often too little restraint in cloud architectures, but if a business truly believes in a project, it shouldn't be held up for six months waiting for server budget, with engineers spending their time doing ops work to get three nines of DB reliability.
There is a size where self-hosting makes sense, but it's much larger than you think.
Most companies severely understaff ops, infra, and security. Your talking points might be good but, in practice, won’t apply in many cases because of the intractability of that management mindset. Even when they should know better.
I’ve worked at tech companies with hundreds of developers and single digit ops staff. Those people will struggle to build and maintain mature infra. By going cloud, you get access to mature infra just by including it in build scripts. Devops is an effective way to move infra back to project teams and cut out infra orgs (this isn’t great but I see it happen everywhere). Companies will pay cloud bills but not staffing salaries.
Using a commercial cloud provider only cements understaffing in, in too many cases.
I'm curious about what "reasonable amount of hosting" means to you, because in my experience, as your internal network's complexity goes up, it's far better for you to move systems to a hyperscaler. The current estimate is that >90% of Fortune 500 companies are cloud-based. What is it that you know that they don't?
>What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue with a writeup like this with silly cloud talking points. They don't seem to care much about data or facts, just that they love cloud and want everyone else to be in cloud, too.
The irony is absolutely dripping off this comment, wow.
The commenter makes an emotionally charged comment with no data or facts, and decries anyone who disagrees with them as offering "silly talking points" for not caring about data and facts.
Your comment is entirely talking about itself.
Cloud expands the capabilities of what one team can manage by themselves, enabling them to avoid a huge amount of internal politics.
This is worth astronomical amounts of money in big corps.
This is absolutely spot on.
What do you mean, I can't scale up because I've used my hardware capex budget for the year?
I’m not convinced this is entirely true. The upfront cost if you don’t have the skills, sure – it takes time to learn Linux administration, not to mention management tooling like Ansible, Puppet, etc.
But once those are set up, how is it different? AWS is quite clear with their responsibility model that you still have to tune your DB, for example. And for the setup, just as there are Terraform modules to do everything under the sun, there are Ansible (or Chef, or Salt…) playbooks to do the same. For both, you _should_ know what all of the options are doing.
The only way I see this sentiment being true is that a dev team, with no infrastructure experience, can more easily spin up a lot of infra – likely in a sub-optimal fashion – to run their application. When it inevitably breaks, they can then throw money at the problem via vertical scaling, rather than addressing the root cause.
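As a sketch of the parity being argued here, a minimal hypothetical Ansible playbook covering roughly what a console click-through would do - hostnames, package names, and paths are all illustrative, not from any real setup:

```yaml
# Hypothetical minimal playbook: nginx + postgres on an owned server.
- hosts: webservers
  become: true
  tasks:
    - name: Install nginx and postgres
      ansible.builtin.apt:
        name: [nginx, postgresql]
        state: present
        update_cache: true

    - name: Deploy site config from a template
      ansible.builtin.template:
        src: templates/myapp.conf.j2
        dest: /etc/nginx/sites-enabled/myapp.conf
      notify: reload nginx

  handlers:
    - name: reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```

As with Terraform modules, the point is that once written, rerunning this is as repeatable as any cloud provisioning call.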
I think this is only true for teams and apps of a certain size.
I've worked on plenty of teams with relatively small apps, and the difference between:
1. Cloud: "open up the cloud console and start a VM"
2. Owned hardware: "price out a server, order it, find a suitable datacenter, sign a contract, get it racked, etc."
Is quite large.
#1 is 15 minutes for a single team lead.
#2 requires the team to agree on hardware specs, get management approval, finance approval, executives signing contracts. And through all this you don't have anything online yet for... weeks?
If your team or your app is large, this probably all averages out in favor of #2. But small teams often don't have the bandwidth or the budget.
I work for a 50 person subsidiary of a 30k person organisation. I needed a domain name. I put in the purchase request and 6 months later eventually gave up, bought it myself and expensed it.
Our AWS account is managed by an SRE team. It’s a 3 day turnaround process to get any resources provisioned, and if you don’t get the exact spec right (you forgot to specify the iops on the volume? Oops) 3 day turnaround. Already started work when you request an adjustment? Better hope as part of your initial request you specified backups correctly or you’re starting again.
The overhead is absolutely enormous, and I actually don’t even have billing access to the AWS account that I’m responsible for.
Managing cloud without a dedicated resource is a form of resource creep, with shadow labour costs that aren't factored in.
How many things don’t end up happening because of this? When they need a sliver of resources in the start?
You gave me flashbacks to a far worse bureaucratic nightmare with #2 in my last job.
I supported an application with a team of about three people for a regional headquarters in the DoD. We had one stack of aging hardware that was racked, on a handshake agreement with another team, in a nearby facility under that other team's control. We had to periodically request physical access for maintenance tasks and the facility routinely lost power, suffered local network outages, etc. So we decided that we needed new hardware and more of it spread across the region to avoid the shaky single-point-of-failure.
That began a three year process of: waiting for budget to be available for the hardware / licensing / support purchases; pitching PowerPoints to senior management to argue for that budget (and getting updated quotes every time from the vendors); working out agreements with other teams at new facilities to rack the hardware; traveling to those sites to install stuff; and working through the cybersecurity compliance stuff for each site. I left before everything was finished, so I don't know how they ultimately dealt with needing, say, someone in Japan to physically reseat a cable or something.
I’ve never worked at a company with these particular problems, but:
#1: A cloud VM comes with an obligation for someone at the company to maintain it. The cloud does not excuse anyone from doing this.
#2: Sounds like a dysfunctional system. Sure, it may be common, but a medium sized org could easily have some datacenter space and allow any team to rent a server or an instance, or to buy a server and pay some nominal price for the IT team to keep it working. This isn’t actually rocket science.
Sure, keeping a fifteen year old server working safely is a chore, but so is maintaining a fifteen-year-old VM instance!
Obligation? Far from it. I've worked at some poorly staffed companies. Nobody is maintaining old VMs or container images. If it works, nobody touches it.
I worked at a supposedly properly staffed company that had raised 100's of millions in investment, and it was the same thing. VMs running 5 year old distros that hadn't been updated in years. 600 day uptimes, no kernel patches, ancient versions of Postgres, Python 2.7 code everywhere, etc. This wasn't 10 years ago. This was 2 years ago!
The cloud is someone else’s computer.
Renting VMs from a provider, or installing a hypervisor on your own equipment, is another thing.
The SMB I work for runs a small on-premise data center that is shared between teams and projects, with maybe 3-4 FTEs managing it (the respective employees also do dev and other work). This includes self-hosting email, storage, databases, authentication, source control, ticketing, company wiki, and other services. The current infrastructure didn’t start out that way and developed over many years, so it’s not necessarily something a small startup can start out with, but beyond a certain company size (a couple dozen employees or more) it shouldn’t really be a problem to develop that, if management shares the philosophy. I certainly find it preferable culturally if not technically to maximize independence in that way, have the local expertise and much better control over everything.
One (the only?) indisputable benefit of cloud is the ability to scale up faster (elasticity), but most companies don’t really need that. And if you do end up needing it after all, then it’s a good problem to have, as they say.
You're assuming that hosting something in-house implies that each application gets its own physical server.
You buy a couple of beastly things with dozens of cores. You can buy twice as much capacity as you actually use and still be well under the cost of cloud VMs. Then it's still VMs and adding one is just as fast. When the load gets above 80% someone goes through the running VMs and decides if it's time to do some house cleaning or it's time to buy another host, but no one is ever waiting on approval because you can use the reserve capacity immediately while sorting it out.
There is a large gap between "own the hardware" and "use cloud hosting". Many people rent the hardware, for example, and you can use managed databases, which is one step up from "starting a VM".
But your comparison isn't fair. The difference between running your own hardware and using the cloud (which is perhaps not even the relevant comparison but let's run with it) is the difference between:
1. Open up the cloud console, and
2. You already have the hardware so you just run "virsh" or, more likely, do nothing at all because you own the API so you have already included this in your Ansible or Salt or whatever you use for setting up a server.
Because ordering a new physical box isn't really comparable to starting a new VM, is it?
I've always liked the theory of #2, I just haven't worked anywhere yet that has executed it well.
Before the cloud, you could get a VM provisioned (virtual servers) or a couple of apps set up (LAMP stack on a shared host ;)) in a few minutes over a web interface already.
"Cloud" has changed that by providing an API to do this, thus enabling IaC approach to building combined hardware and software architectures.
3. "Dedicated server" at any hosting provider
Open their management console, press order now, 15 mins later get your server's IP address.
For purposes of this discussion, isn't AWS just a very large hosting provider?
I.e. most hosting providers give you the option for virtual or dedicated hardware. So does Amazon (metal instances).
Like, "cloud" was always an ill-defined term, but in the case of "how do I provision full servers" I think there's no qualitative difference between Amazon and other hosting providers. Quantitative, sure.
> Amazon (metal instances)
But you still get nickel & dimed and pay insane costs, including on bandwidth (which is free in most conventional hosting providers, and overages are 90x cheaper than AWS' costs).
More like 15 seconds.
You have omitted the option between the two, which is renting a server. No hardware to purchase, maintain or set up. Easily available in 15 minutes.
There is. Middle ground between the extremes of those pendulums of all cloud or physical metal.
You can start with using a cloud only for VMs and only run services on it using IaaS or PaaS. Very serviceable.
You can get pretty far without any of that fancy stuff. You can get plenty done by using parallel-ssh and then focusing on the actual thing you develop instead of endless tooling and docker and terraform and kubernetes and salt and puppet and ansible. Sure, if you know why you need them and know what value you get from them OK. But many people just do it because it's the thing to do...
Do you need those tools? It seems that for fundamental web hosting, you need your application server, nginx or similar, postgres or similar, and a CLI. (And an interpreter etc if your application is in an interpreted lang)
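For a sense of scale, the "nginx or similar" piece of that stack is a single server block - this is an illustrative reverse-proxy config where the domain, certificate paths, and upstream port are placeholders:

```nginx
# Illustrative: TLS termination + reverse proxy to a local app server.
server {
    listen 443 ssl;
    server_name example.com;

    ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8000;   # gunicorn/uwsgi/node, etc.
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```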
I suppose that depends on your RTO. With cloud providers, even on a bare VM, you can to some extent get away with having no IaC, since your data (and therefore config) is almost certainly on networked storage, which is redundant by design. If an EC2 instance fails, or even if one of the drives backing your EBS volume fails, it'll probably come back up as it was.
If it's your own hardware, if you don't have IaC of some kind – even something as crude as a shell script – then a failure may well mean you need to manually set everything up again.
Get two servers (or three, etc)?
Well, sure – I was trying to do a comparison in favor of cloud, because the fact that EBS Volumes can magically detach and attach is admittedly a neat trick. You can of course accomplish the same (to a certain scale) with distributed storage systems like Ceph, Longhorn, etc. but then you have to have multiple servers, and if you have multiple servers, you probably also have your application load balanced with failover.
For fundamentals, that list is missing:
- Some sort of firewall or network access control. Being able to say "allow http/s from the world (optionally minus some abuser IPs that cause problems), and allow SSH from developers (by IP, key, or both)" at a separate layer from nginx is prudent. Can be ip/tables config on servers or a separate firewall appliance.
- Some mechanism of managing storage persistence for the database, e.g. backups, RAID, data files stored on fast network-attached storage, db-level replication. Not losing all user data if you lose the DB server is table stakes.
- Something watching external logging or telemetry to let administrators know when errors (e.g. server failures, overload events, spikes in 500s returned) occur. This could be as simple as Pingdom or as involved as automated alerting based on load balancer metrics. Relying on users to report downtime events is not a good approach.
- Some sort of CDN, for applications with a frontend component. This isn't required for fundamental web hosting, but for sites with a frontend and even moderate (10s/sec) hit rates, it can become required for cost/performance; CDNs help with egress congestion (and fees, if you're paying for metered bandwidth).
- Some means of replacing infrastructure from nothing. If the server catches fire or the hosting provider nukes it, having a way to get back to where you were is important. Written procedures are fine if you can handle long downtime while replacing things, but even for a handful of application components those procedures get pretty lengthy, so you start wishing for automation.
- Some mechanism for deploying new code, replacing infrastructure, or migrating data. Again, written procedures are OK, but start to become unwieldy very early on ('stop app, stop postgres, upgrade the postgres version, start postgres, then apply application migrations to ensure compatibility with the new version of postgres, then start app--oops, forgot to take a postgres backup/forgot that upgrading postgres would break the replication stream, gotta write that down for next time...').
...and that's just for a very, very basic web hosting application--one that doesn't need caches, blob stores, the ability to quickly scale out application server or database capacity.
Each of those things can be accomplished the traditional way--and you're right, that sometimes that way is easier for a given item in the list (especially if your maintainers have expertise in that item)! But in aggregate, having a cloud provider handle each of those concerns tends to be easier overall and not require nearly as much in-house expertise.
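To make the firewall item above concrete, the "allow http/s from the world, SSH from developers" rule is a short ruleset in something like nftables - the developer IP range here is a placeholder from the documentation block:

```nft
# Illustrative nftables ruleset: web open to the world,
# SSH restricted to a (placeholder) developer range.
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif lo accept
    tcp dport { 80, 443 } accept
    ip saddr 203.0.113.0/24 tcp dport 22 accept
    icmp type echo-request accept
  }
}
```

The equivalent in a cloud is a security group; the concept is the same, only who maintains the enforcement layer differs.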
You are focusing on technology. And sure of course you can get most of the benefits of AWS a lot cheaper when self-hosting.
But when you start factoring internal processes and incompetent IT departments, suddenly that's not actually a viable option in many real-world scenarios.
Exactly. With the cloud you can suddenly do all the things your tyrannical Windows IT admin has been saying are impossible for the last 30 years.
It is similar to cooking at home vs ordering cooked food every day. If someone guarantees the taste and quality, people are happy to outsource it.
I have never ever worked somewhere with one of these "cloud-like but custom on our own infrastructure" setups that didn't leak infrastructure concerns through the abstraction, to a significantly larger degree than AWS.
I believe it can work, so maybe there are really successful implementations of this out there, I just haven't seen it myself yet!
All of that is... completely unrelated to the GP's post.
Did you reply to the right comment? Do you think "politics" is something you solve with Ansible?
> Cloud expands the capabilities of what one team can manage by themselves, enabling them to avoid a huge amount of internal politics.
It's related to the first part. Re: the second, IME if you let dev teams run wild with "managing their own infra," the org as a whole eventually pays for that when the dozen bespoke stacks all hit various bottlenecks, and no one actually understands how they work, or how to troubleshoot them.
I keep being told that "reducing friction" and "increasing velocity" are good things; I vehemently disagree. It might be good for short-term profits, but it is poison for long-term success.
Our big company locked all cloud resources behind a floating/company-wide DevOps team (git and CI too). We have an old on-prem server that we jealously guard because it allows us to create remotes for new git repos and deploy prototypes without consulting anyone.
(To be fair, I can see why they did it - a lot of deployments were an absolute mess before.)
I have said for years that the value of cloud is mainly its API; that's the selling point in large enterprises.
Self-hosted software also has APIs, and Terraform libraries, and Ansible playbooks, etc. It’s just that you have to know what it is you’re trying to do, instead of asking AWS what collection of XaaS you should use.
Well, cloud providers often give you more than just VMs in a data center somewhere. You may not be able to find good equivalents if you aren't using the cloud. Some third-party products are also only available on clouds. How much of a difference those things make will depend on what you're trying to do.
I think there are accounting reasons for companies to prefer paying opex to run things on the cloud instead of more capex-intensive self-hosting, but I don’t understand the dynamics well.
It’s certainly the case that clouds tend to be more expensive than self-hosting, even when taking account of the discounts that moderately sized customers can get, and some of the promises around elastic scaling don’t really apply when you are bigger.
To some of your other points: the main customers of companies like AWS are businesses. Businesses generally don’t care about the centralisation of the internet. Businesses are capable of reading the contracts they are signing and not signing them if privacy (or, typically more relevant to businesses, their IP) cannot be sufficiently protected. It’s not really clear to me that using a cloud is going to be less secure than doing things on-prem.
It seems that the preference is less about understanding or misunderstanding the technical requirements but more that it moves a capital expenditure with some recurring operational expenditure entirely into the opex column.
The fact is, managing your own hardware is a pita and a distraction from focusing on the core product. I loathe messing with servers and even opt for "overpriced" paas like fly, render, vercel. Because every minute messing with and monitoring servers is time not spent on product. My tune might change past a certain size and a massive cloud bill and there's room for full time ops people, but to offset their salary, it would have to be huge.
That argument makes sense for PaaS services like the ones you mention. But for bare "cloud" like AWS, I'm not convinced it is saving any effort, it's merely swapping one kind of complexity with another. Every place I've been in had full-time people messing with YAML files or doing "something" with the infrastructure - generally trying to work around the (self-inflicted) problems introduced by their cloud provider - whether it's the fact you get 2010s-era hardware or that you get nickel & dimed on absolutely arbitrary actions that have no relationship to real-world costs.
In what sense is AWS "bare cloud"? S3, DynamoDB, Lambda, ECS?
How do you configure S3 access control? You need to learn & understand how their IAM works.
How do you even point a pretty URL to a lambda? Last time I looked you need to stick an "API gateway" in front (which I'm sure you also get nickel & dimed for).
How do you go from "here's my git repo, deploy this on Fargate" with AWS? You need a CI pipeline which will run a bunch of awscli commands.
And I'm not even talking about VPCs, security groups, etc.
Somewhat different skillsets than old-school sysadmin (although once you know sysadmin basics, you realize a lot of these are just the same concepts under a branded name and arbitrary nickel & diming sprinkled on top), but equivalent in complexity.
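To make the "learn how their IAM works" step concrete: even read-only access to a single bucket means authoring a policy document like the following, where the bucket name and statement ID are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyDatasets",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-datasets",
        "arn:aws:s3:::example-datasets/*"
      ]
    }
  ]
}
```

Note the trap even in this minimal example: ListBucket applies to the bucket ARN while GetObject applies to the object ARNs, so dropping either Resource line silently breaks one of the two actions.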
EC2
Counterpoint: if you’re never “messing with servers,” you probably don’t have a great understanding of how their metrics map to those of your application’s, and so if you bottleneck on something, it can be difficult to figure out what to fix. The result is usually that you just pay more money to vertically scale.
To be fair, you did say “my tune might change past a certain size.” At small scale, nothing you do within reason really matters. World’s worst schema, but your DB is only seeing 100 QPS? Yeah, it doesn’t care.
I don’t think you’re correct. I’ve watched junior/mid-level engineers figure things out solely by working on the cloud and scaling things to a dramatic degree. It’s really not rocket science.
I didn't say it's rocket science, nor that it's impossible to do without having practical server experience, only that it's more difficult.
Take disks, for example. Most cloud-native devs I've worked with have no clue what IOPS are. If you saturate your disk, that's likely to cause knock-on effects like increased CPU utilization from IOWAIT, and since "CPU is high" is pretty easy to understand for anyone, the seemingly obvious solution is to get a bigger instance, which depending on the application, may inadvertently solve the problem. For RDBMS, a larger instance means a bigger buffer pool / shared buffers, which means fewer disk reads. Problem solved, even though actually solving the root cause would've cost 1/10th or less the cost of bumping up the entire instance.
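As a sketch of how one might check for that iowait signal directly - Linux-specific, and the busy-time denominator here is an illustrative choice rather than a standard metric:

```python
# Sketch: read the CPU time breakdown from /proc/stat (Linux-only)
# and report what share of non-idle time is iowait -- the signal
# described above that a "high CPU" alert may really be a saturated
# disk. Denominator (user+nice+system+iowait) is illustrative.
import os

def cpu_iowait_share(stat_path="/proc/stat"):
    with open(stat_path) as f:
        fields = f.readline().split()
    # Field order per proc(5): cpu user nice system idle iowait ...
    user, nice, system, idle, iowait = (int(x) for x in fields[1:6])
    busy = user + nice + system + iowait
    return iowait / busy if busy else 0.0

if os.path.exists("/proc/stat"):
    print(f"iowait share of non-idle CPU time: {cpu_iowait_share():.1%}")
```

In practice you would watch this (or the equivalent from iostat/node_exporter) over time rather than sample it once.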
Writing piles of IaC code like Terraform and CloudFormation is also a PITA and a distraction from focusing on your core product.
PaaS is probably the way to go for small apps.
A small app (or a larger one, for that matter) can quite easily run on infra that's instantiated from canned IaC, like TF AWS Modules [0]. If you can read docs, you should be able to quite trivially get some basic infra up in a day, even with zero prior experience managing it.
[0]: https://github.com/terraform-aws-modules
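For instance, a minimal hypothetical use of the community VPC module from that collection - the name, CIDRs, AZs, and version pin are placeholders to adjust:

```hcl
# Illustrative: canned networking from terraform-aws-modules.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name            = "example"
  cidr            = "10.0.0.0/16"
  azs             = ["us-east-1a", "us-east-1b"]
  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24"]
  private_subnets = ["10.0.101.0/24", "10.0.102.0/24"]
}
```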
Yes, I've used several of these modules myself. They save tons of time! Unfortunately, for legacy projects, I inherited a bunch of code from individuals that built everything "by hand" then copy-pasted everything. No re-usability.
But that effort has a huge payoff in that it can be used to disaster recovery in a new region and to spin up testing environments.
Anecdotal - but I once worked for a company where the product line I built for them after acquisition was delayed by 5 months, because that's how long it took to get the hardware ordered and installed in the datacenter. Getting it up on AWS would have been a day's work, maybe two.
Yes, it is death by 1000 cuts. Speccing, negotiating with hardware vendors, data center selection and negotiating, DC engineer/remote hands, managing security cage access, designing your network, network gear, IP address ranges, BGP, secure remote console access, cables, shipping, negotiating with bandwidth providers (multiple, for redundancy), redundant hardware, redundant power sources, UPS. And then you get to plug your server in. Now duplicate other stuff your cloud might provide, like offsite backups, recovery procedures, HA storage, geographic redundancy. And do it again when you outgrown your initial DC. Or build your own DC (power, climate, fire protection, security, fiber, flooring, racks)
Much of this is still required in cloud. Also, I think you're missing the middle ground where 99.99% of companies could happily exist indefinitely: colo. It makes little to no financial or practical sense for most to run their own data centers.
Oh, absolutely, with your own hardware you need planning. Time to deployment is definitely a thing.
Really, the one major thing that bites with cloud providers is their 99.9% margin on egress. The markup is insane.
I'm with you there, with stuff like fly.io, there's really no reason to worry about infrastructure.
AWS, on the other hand, seems about as time consuming and hard as using root servers. You're at a higher level of abstraction, but the complexity is about the same I'd say. At least that's my experience.
I agree with this position and actively avoid AWS complexity.
> every minute messing with and monitoring servers
You're not monitoring your deployments because "cloud"?
> On the other hand, a business of just about any size that has any reasonable amount of hosting is better off with their own systems when it comes purely to cost
From a cost PoV, sure, but when you're taking money out of capex it represents a big hit to the cash flow, while taking out twice that amount from opex has a lower impact on the company finances.
There is a whole ecosystem that pushes cloud onto ignorant/fresh graduates and developers. Just take a look at the sponsors of all the most popular frameworks. When your system is super complex and depends on the cloud, they make more money. Just look at the PHP ecosystem: Laravel needs 4 times the servers to serve something that a pure PHP system would need. Most projects don't need the cloud - only around 10% of projects actually need what the cloud provides. But they were able to brainwash a whole generation of developers/managers into thinking that they do. And so it goes.
I want to see an article like this, but written from a Fortune 500 CTO perspective
It seems like they all abandoned their VMware farms or physical server farms for Azure (they love Microsoft).
Are they actually saving money? Are things faster? How's performance? What was the re-training/hiring like?
In one case I know we got rid of our old database greybeards and replaced them with "DevOps" people that knew nothing about performance etc
And the developers (and many of the admins) we had knew nothing about hardware or anything so keeping the physical hardware around probably wouldn't have made sense anyways
Complicating this analysis is that computers have still been making exponential improvements in capability as clouds became popular (e.g. disks are 1000-10000x faster than they were 15 years ago), so you'd naturally expect things to become easier to manage over time as you need fewer machines, assuming of course that your developers focus on e.g. learning how to use a database well instead of how to scale to use massive clusters.
That is, even if things became cheaper/faster, they might have been even better without cloud infrastructure.
1. People are credulous
2. People therefore repeat talking points which seem in their interest
3. With enough repetition these become their beliefs
4. People will defend their beliefs as theirs against attack
5. Goto 1
The one convincing argument from technical people I've seen, which would be repeated in reply to your comment, is that by now you don't find enough experienced engineers to reliably set up some really big systems. Because so much went to the cloud, a lot of the knowledge is buried there.
That came from technical people who I didn't perceive as being dogmatically pro-cloud.
I think part of it was a way for dev teams to get an infra team that was not empowered to say no. Plus organizational theory, empire building, etc.
Yep. I had someone tell me last week that they didn't want a more rigid schema because other teams rely on it, and anything adding "friction" to using it would be poorly received.
As an industry, we are largely trading correctness and performance for convenience, and this is not seen as a negative by most. What kills me is that at every cloud-native place I've worked at, the infra teams were both responsible for maintaining and fixing the infra that product teams demanded, but were not empowered to push back on unreasonable requests or usage patterns. It's usually not until either the limits of vertical scaling are reached, or a SEV0 occurs where these decisions were the root cause does leadership even begin to consider changes.
They spent time and career points learning cloud things and dammit it's going to matter!
You can't even blame them too much, the amount of cash poured into cloud marketing is astonishing.
The thing that frustrates me is it’s possible to know how to do both. I have worked with multiple people who are quite proficient in both areas.
Cloud has definite advantages in some circumstances, but so does self-hosting; moreover, understanding the latter makes the former much, much easier to reason about. It’s silly to limit your career options.
Being good at both is twice the work, because even if some concepts translate well, IME people won't hire someone based on that. "Oh you have experience with deploying RabbitMQ but not AWS SQS? Sorry, we're looking for someone more qualified."
That's a great filter for places I don't want to work at, then.
As someone who ran a startup with hundreds of hosts: as soon as I start to count the salaries, hiring, desk space, etc. of the people needed to manage the hosts, AWS looks cheap again. Yeah, on hardware costs they are aggressively expensive. But TCO-wise, they're cheap for any decent-sized company.
Add in compliance, auditing, etc., all things that you can set up out of the box (PCI, HIPAA, lawsuit retention). Gets even cheaper.
There was a time when cloud was significantly cheaper than owning.
I'd expect that there are people who moved to the cloud then, and over time started using services offered by their cloud provider (e.g., load balancers, secret management, databases, storage, backup) instead of running those services themselves on virtual machines, and now even if it would be cheaper to run everything on owned servers they find it would be too much effort to add all those services back to their own servers.
The cloud wasn’t about cheap, it was about fast. If you’re VC funded, time is everything, and developer velocity above all else to hyperscale and exit. That time has passed (ZIRP), and the public cloud margin just doesn’t make sense when you can own and operate (their margin is your opportunity) on prem with similar cloud primitives around storage and compute.
Elasticity is a component, but has always been from a batch job bin packing scheduling perspective, not much new there. Before k8s and Nomad, there was Globus.org.
(Infra/DevOps in a previous life at a unicorn, large worker cluster for a physics experiment prior, etc; what is old is a new again, you’re just riding hype cycle waves from junior to retirement [mainframe->COTS on prem->cloud->on prem cloud, and so on])
That was never true except in the case that the required hardware resources were significantly smaller than a typical physical machine.
Also, by the way, I found it interesting that you framed your side of this disagreement as the technically correct one, but then included this:
> a desire to not centralize the Internet
This is an ideological stance! I happen to share this desire. But you should be aware of your own non-technical - "emotional" - biases when dismissing the arguments of others on the grounds that they are "emotional" and "fanatical".
> If I didn't already self-host email, I'd consider using Fastmail.
Same sentiment on all of what you said.
Cloud solves one problem quite well: Geographic redundancy. It's extremely costly with on-prem.
Only if you’re literally running your own datacenters, which is in no way required for the majority of companies. Colo giants like Equinix already have the infrastructure in place, with a proven track record.
If you enable Multi-AZ for RDS, your bill doubles until you cancel. If you set up two servers in two DCs, your initial bill doubles from the CapEx, and then a very small percentage of your OpEx goes up every month for the hosting. You very, very quickly make this back compared to cloud.
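To make that break-even concrete, here's a rough sketch of the math. All figures are illustrative placeholders (a made-up Multi-AZ bill delta, server price, and colo fee), not real quotes:

```python
# Hypothetical break-even: the recurring delta of doubling a managed-DB
# bill for Multi-AZ, vs. buying a second server (one-time CapEx) plus a
# small monthly colo/hosting delta (OpEx). All figures are assumptions.

CLOUD_DELTA_PER_MONTH = 1_500  # assumed "bill doubles" increase
SERVER_CAPEX = 10_000          # assumed one-time cost of the second server
COLO_DELTA_PER_MONTH = 200     # assumed extra hosting/power per month

def breakeven_months(capex: int, colo_delta: int, cloud_delta: int) -> int:
    """First month where cumulative cloud spend exceeds owned spend.
    Assumes cloud_delta > colo_delta, otherwise there is no break-even."""
    month = 0
    while True:
        month += 1
        if cloud_delta * month > capex + colo_delta * month:
            return month

print(breakeven_months(SERVER_CAPEX, COLO_DELTA_PER_MONTH, CLOUD_DELTA_PER_MONTH))  # -> 8
```

With these made-up numbers the owned pair pays for itself in well under a year; plug in your own quotes to see where the crossover actually lands.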
But reliable connectivity between regions/datacenters remains a challenge, right? Compute is only one part of the equation.
Disclaimer: I work on a cloud networking product.
It depends on how deep you want to go. Equinix for one (I'm sure others as well, but I'm most familiar with them) offers managed cross-DC fiber. You will probably need to manage the networking, to be fair, and I will readily admit that's not trivial.
Except, almost nobody, outside of very large players, does cross region redundancy. us-east-1 is like a SPOF for the entire Internet.
Cloud noob here. But if I have a central database what can I distribute across geographic regions? Static assets? Maybe a cache?
Yep. Cross-region RDBMS is a hard problem, even when you're using a managed service – you practically always have to deal with eventual consistency, or increased latency for writes.
Does it? I've seen outages around "Sorry, us-west_carolina-3 is down". AWS is particularly good at keeping you aware of their datacenters.
It can be useful. I run a latency sensitive service with global users. A cloud lets me run it in 35 locations dealing with one company only. Most of those locations only have traffic to justify a single, smallish, instance.
In the locations where there's more traffic, and we need more servers, there are more cost effective providers, but there's value in consistency.
Elasticity is nice too, we doubled our instance count for the holidays, and will return to normal in January. And our deployment style starts a whole new cluster, moves traffic, then shuts down the old cluster. If we were on owned hardware, adding extra capacity for the holidays would be trickier, and we'd have to have a more sensible deployment method. And the minimum service deployment size would probably not be a little quad processor box with 2GB ram.
Using cloud for the lower traffic locations and a cost effective service for the high traffic locations would probably save a bunch of money, but add a lot of deployment pain. And a) it's not my decision and b) the cost difference doesn't seem to be quite enough to justify the pain at our traffic levels. But if someone wants to make a much lower margin, much simpler service with lots of locations and good connectivity, be sure to post about it. But, I think the big clouds have an advantage in geographic expansion, because their other businesses can provide capital and justification to build out, and high margins at other locations help cross subsidize new locations when they start.
I agree it can be useful (latency, availability, using off-peak resources), but running globally should be a default and people should opt-in into fine-grained control and responsibility.
From outside it seems that either AWS picked the wrong default to present their customers, or that it's unreasonably expensive and it drives everyone into the in-depth handling to try to keep cloud costs down.
if you see that, you are doing it wrong :)
AWS has had multiple outages which were caused by a single AZ failing.
My company used to do everything on-prem. Until a literal earthquake and tsunami took down a bunch of systems.
After that, yeah we’ll let AWS do the hard work of enabling redundancy for us.
Cloud is more than instances. If all you need is a bunch of boxes, then cloud is a terrible fit.
I use AWS cloud a lot, and almost never use any VMs or instances. Most instances I use are along the lines of a simple anemic box for a bastion host or some such.
I use higher level abstractions (services) to simplify solutions and outsource maintenance of these services to AWS.
The bottom line > babysitting hardware. Businesses are transitioning to cloud because it's better for business.
In the public sector, cloud solves the procurement problem. You just need to go through the yearlong process once to use a cloud service, instead of for each purchase > 1000€.
> What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue with a writeup like this with silly cloud talking points.
I’m sure I’ll be downvoted to hell for this, but I’m convinced that it’s largely their insecurities being projected.
Running your own hardware isn’t tremendously difficult, as anyone who’s done it can attest, but it does require a much deeper understanding of Linux (and of course, any services which previously would have been XaaS), and that’s a vanishing trait these days. So for someone who may well be quite skilled at K8s administration, serverless (lol) architectures, etc. it probably is seen as an affront to suggest that their skill set is lacking something fundamental.
> So for someone who may well be quite skilled at K8s administration ...
And running your own hardware is not incompatible with Kubernetes: on the contrary. You can fully well have your infra spin up VMs and then do container orchestration if that's your thing.
And part of your hardware monitoring and reporting tooling can work perfectly fine from containers.
Bare metal -> Hypervisor -> VM -> container orchestration -> a container running a "stateless" hardware monitoring service. And VMs themselves are "orchestrated" too. Everything can be automated.
Anyway, say a hard disk begins to show errors? Notifications get sent (email/SMS/Telegram/whatever) by another service in another container, and the dashboard shall show it too (dashboards are cool).
Go to the machine once the spare disk has already been resilvered, move it where the failed disk was, plug in a new disk that becomes the new spare.
Boom, done.
I'm not saying all self-hosted hardware should do container orchestration: there are valid use cases for bare metal too.
But something has to be said for controlling everything on your own infra: from the bare metal to the VMs to container orchestration. To even potentially your own IP address space.
This is all within reach of an individual, both skill-wise and price-wise (including obtaining your own IP address space). People who drank the cloud kool-aid should ponder this and wonder how good their skills truly are if they cannot get this up and working.
Fully agree. And if you want to take it to the next level (and have a large budget), Oxide [0] seems to have neatly packaged this into a single coherent product. They don't quite have K8s fully running, last I checked, but there are of course other container orchestration systems.
> Go to the machine once the spare disk has already been resilvered
Hi, fellow ZFS enthusiast :-)
[0]: https://oxide.computer
> And running your own hardware is not incompatible with Kubernetes: on the contrary
Kubernetes actually makes so much more sense on bare-metal hardware.
On the cloud, I think the value prop is dubious - your cloud provider is already giving you VMs, why would you need to subdivide them further and add yet another layer of orchestration?
Not to mention that you're getting 2010s-era performance on those VMs, so subdividing them is terrible from a performance point of view too.
> All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding, but serve to introduce doubt to non-technical people and to trick people who don't examine what they're told.
This feels like "no true scotsman" to me. I've been building software for close to two decades, but I guess I don't have "any real technical understanding" because I think there's a compelling case for using "cloud" services for many (honestly I would say most) businesses.
Nobody is "afraid to openly discuss how cloud isn't right for many things". This is extremely commonly discussed. We're discussing it right now! I truly cannot stand this modern innovation in discourse of yelling "nobody can talk about XYZ thing!" while noisily talking about XYZ thing on the lowest-friction publishing platforms ever devised by humanity. Nobody is afraid to talk about your thing! People just disagree with you about it! That's ok, differing opinions are normal!
Your comment focuses a lot on cost. But that's just not really what this is all about. Everyone knows that on a long enough timescale with a relatively stable business, the total cost of having your own infrastructure is usually lower than cloud hosting.
But cost is simply not the only thing businesses care about. Many businesses, especially new ones, care more about time to market and flexibility. Questions like "how many servers do we need? with what specs? and where should we put them?" are a giant distraction for a startup, or even for a new product inside a mature firm.
Cloud providers provide the service of "don't worry about all that, figure it out after you have customers and know what you actually need".
It is also true that this (purposefully) creates lock-in that is expensive either to leave in place or unwind later, and it definitely behooves every company to keep that in mind when making architecture decisions, but lots of products never make it to that point, and very few of those teams regret the time they didn't spend building up their own infrastructure in order to save money later.
The problem with your claims here is they can only be right if the entire industry is experiencing mass psychosis. I reject a theory that requires that, because my ego just isn't that large.
I once worked for several years at a publicly traded firm well-known for their return-to-on-prem stance, and honestly it was a complete disaster. The first-party hardware designs didn't work right because they didn't have the hardware-design staffing levels to have de-risked the possibility that AMD would fumble the performance of Zen 1, leaving them with a generation of useless hardware they nonetheless paid for. The OEM hardware didn't work right because they didn't have the chops to qualify it either, leaving them scratching their heads for months over a cohort of servers they eventually discovered were contaminated with metal chips. And, most crucially, for all the years I worked there, the only thing they wanted to accomplish was failover from West Coast to East Coast, which never worked, not even once. When I left that company they were negotiating with the data center owner, who wanted to triple the rent.
These experiences tell me that cloud skeptics are sometimes missing a few terms in their equations.
> The problem with your claims here is they can only be right if the entire industry is experiencing mass psychosis.
Yes. Mass psychosis explains an incredible number of different and apparently unrelated problems with the industry.
"Vendor problems" is a red herring, IMO; you can have those in the cloud, too.
It's been my experience that those who can build good, reliable, high-quality systems, can do so either in the cloud or on-prem, generally with equal ability. It's just another platform to such people, and they will use it appropriately and as needed.
Those who can only make it work in the cloud are either building very simple systems (which is one place where the cloud can be appropriate), or are building a house of cards that will eventually collapse (or just cost them obscene amounts of money to keep on life support).
Engineering is engineering. Not everyone in the business does it, unfortunately.
Like everything, the cloud has its place -- but don't underestimate the number of decisions that get taken out of the hands of technical people by the business people who went golfing with their buddy yesterday. He just switched to Azure, and it made his accountants really happy!
The whole CapEx vs. OpEx issue drives me batty; it's the number one cause of cloud migrations in my career. For someone who feels like spent money should count as spent money regardless of the bucket it comes out of, this twists my brain in knots.
I'm clearly not a finance guy...
> or are building a house of cards that will eventually collapse (or just cost them obscene amounts of money to keep on life support)
Ding ding ding. It's this.
> The whole CapEx vs. OpEx issue drives me batty
Seconded. I can't help but feel like it's not just a "I don't understand money" thing, but more of a "the way Wall Street assigns value is fundamentally broken." Spending $100K now, once, vs. spending $25K/month indefinitely does not take a genius to figure out.
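The arithmetic from the comment above, spelled out (the $100K/$25K figures are the parent's; the function is just a sketch):

```python
# One-time $100K CapEx vs. $25K/month OpEx: count the month in which
# cumulative recurring spend first exceeds the one-time purchase.
def months_until_opex_exceeds(capex: int, opex_per_month: int) -> int:
    month, total = 0, 0
    while total <= capex:
        month += 1
        total += opex_per_month
    return month

print(months_until_opex_exceeds(100_000, 25_000))  # -> 5
```

Five months in, the recurring option has already cost more, and it keeps accruing indefinitely, yet the accounting treatment often makes the second look preferable.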
You forgot COGS.
It's all about painting the right picture for your investors, so you make up shit and classify it as COGS or opex depending on what is most beneficial for you in the moment.
> The problem with your claims here is they can only be right if the entire industry is experiencing mass psychosis.
What's the market share of Windows again? ;)
There's however a middle-ground between run your own colocated hardware and cloud. It's called "dedicated" servers and many hosting providers (from budget bottom-of-the-barrel to "contact us" pricing) offer it.
Those take on the liability of sourcing, managing and maintaining the hardware for a flat monthly fee, and would take on such risk. If they make a bad bet purchasing hardware, you won't be on the hook for it.
This seems like a point many pro-cloud people (intentionally?) overlook.
> All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding ...
And moreover most of the actual interesting things, like having VM templates and stateless containers, orchestration, etc. is very easy to run yourself and gets you 99.9% of the benefits of the cloud.
Just about any service you could want is available as a container definition, already written for you. And if it doesn't exist, it's not hard to plumb up.
A friend of mine runs more than 700 containers (yup, seven hundred), split between his own rack at home (half of them) and dedicated servers (he runs stuff like FlightRadar, AI models, etc.). He'll soon get his own IP address space. Complete "chaos monkey"-ready infra where you can cut any cable and the thing keeps working: everything is duplicated, can be spun up on demand, etc. Someone could steal his entire rack and all his dedicated servers, and he'd still be back operational in no time.
If an individual can do that, a company, no matter its size, can do it too. And arguably 99.9% of all the companies out there don't need an infra as powerful as the one most homelab enthusiasts have.
And another thing: there's even two in-betweens between "cloud" and "our own hardware located at our company". First is colocating your own hardware but in a datacenter. Second is renting dedicated servers from a datacenter.
They're often ready to accept cloud-init directly.
And it's not hard. I'd say learning to configure hypervisors on bare metal, then spin VMs from templates, then running containers inside the VMs is actually much easier than learning all the idiosyncrasies of all the different cloud vendors APIs and whatnots.
Funnily enough, when the pendulum swung way too far to the "cloud all the things" side, those saying we'd at some point read stories about repatriation were being made fun of.
> If an individual can do that, a company, no matter its size, can do it too.
Fully agreed. I don't have physical HA – if someone stole my rack, I would be SOL – but I can easily ride out a power outage for as long as I want to be hauling cans of gasoline to my house. The rack's UPS can keep it up at full load for at least 30 minutes, and I can get my generator running and hooked up in under 10. I've done it multiple times. I can lose a single server without issue. My only SPOF is internet, and that's only by choice, since I can get both AT&T and Spectrum here, and my router supports dual-WAN with auto-failover.
> And arguably 99.9% of all the companies out there don't have the need for an infra as powerful as the one most homelab enthusiast have.
THIS. So many people have no idea how tremendously fast computers are, and how much of an impact latency has on speed. I've benchmarked my 12-year-old Dells against the newest and shiniest RDS and Aurora instances on both MySQL and Postgres, and the only ones that kept up were the ones with local NVMe disks. Mine don't even technically have _local_ disks; they're NVMe via Ceph over InfiniBand.
Does that scale? Of course not; as soon as you want geo-redundant, consistent writes, you _will_ have additional latency. But most smaller and medium companies don't _need_ that.
Plugging https://BareMetalSavings.com
in case you want to ballpark-estimate your move off of the cloud
Bonus points: I'm a Fastmail customer, so it tangentially tracks
----
Quick note about the article: ZFS encryption can be flaky, so be sure you know what you're doing before deploying it in your infrastructure.
Relevant Reddit discussion: https://www.reddit.com/r/zfs/comments/1f59zp6/is_zfs_encrypt...
A spreadsheet of related issues that I can't remember who made:
https://docs.google.com/spreadsheets/d/1OfRSXibZ2nIE9DGK6sww...
Such an awesome article. I like how they didn't just go with the Cloud wave but kept sysadmin'ing, like ol' Unix graybeards. Two interesting things they wrote about their SSDs:
1) "At this rate, we’ll replace these [SSD] drives due to increased drive sizes, or entirely new physical drive formats (such E3.S which appears to finally be gaining traction) long before they get close to their rated write capacity."
and
2) "We’ve also anecdotally found SSDs just to be much more reliable compared to HDDs (..) easily less than one tenth the failure rate we used to have with HDDs."
To avoid sysadmin tasks, and keep costs down, you've got to go so deep in the cloud, that it becomes just another arcane skill set. I run most of my stuff on virtual Linux servers, but some on AWS, and that's hard to learn, and doesn't transfer to GCP or Azure. Unless your needs are extreme, I think sysadmin'ing is the easier route in most cases.
For so many things the cloud isn't really easier or cheaper, and most cloud providers stopped advertising it as such. My assumption is that cloud adoption is mainly driven by 3 forces:
- for small companies: free credits
- for large companies: moving prices as far away as possible from the deploy button, allowing dev and IT to just deploy stuff without purchase orders
- self-perpetuating due to hype, cv-driven development, and ease of hiring
All of these are decent reasons, but none of them may apply to a company like fastmail
Also CYA. If you run your own servers and something goes wrong, it's your fault. If it's an outage at AWS, it's their fault.
Also a huge element of following the crowd, branding non-technical management are familiar with, and so on. I have also found some developers (front-end devs, or back-end devs who do not have sysadmin skills) feel cloud is the safe choice. This is very common for small companies, as they may have limited sysadmin skills (people who know how to keep Windows desktops running are not likely to be who you want deploying servers) and a web GUI looks a lot easier to learn.
> If it's an outage at AWS it's their fault.
Well, still your fault, but easy to judo the risk into clients saying supporting multi-cloud is expensive and not a priority.
Management in many places will not even know what multi-cloud is (or even multi-region).
As CrowdStrike showed, if you follow the crowd and tick the right boxes you will not be blamed.
There are other, if often at least tangentially related, reasons but more than I can give justice to in a comment.
Many people got a lot of things wrong about cloud, which I've been meaning to write about for a while. I'll get to it after the holidays. But probably none more than the idea that massive centralized computing (which was wrongly characterized as a utility like the electric grid) would have economics with which more local computing options could never compete.
In small companies, cloud also provides the ability to work around technical debt and to reduce risk.
For example, I have seen several cases where poorly designed systems that unexpectedly used too much memory, and there was no time to fix it, so the company increased the memory on all instances with a few clicks. When you need to do this immediately to avoid a botched release that has already been called "successful" and announced as such to stakeholders, that is a capability that saves the day.
An example of de-risking is using a cloud filesystem like EFS to provide a pseudo-infinite volume. No risk of an outage due to an unexpectedly full disk.
Another example would be using a managed database system like RDS vs self-managing the same RDBMS: using the managed version saves on labor and reduces risk for things like upgrades. What would ordinarily be a significant effort for a small company becomes automatic, and RDS includes various sanity checks to help prevent you from making mistakes.
The reality of the industry is that many companies are just trying to hit the next milestone of their business by a deadline, and the cloud can help despite the downsides.
> For example, I have seen several cases where poorly designed systems that unexpectedly used too much memory
> using a managed database system like RDS vs self-managing the same RDBMS: using the managed version saves on labor
As a DBRE / SRE, I can confidently assert that belief in the latter is often directly responsible for the former. AWS is quite clear in their shared responsibility model [0] that you are still responsible for making sound decisions, tuning various configurations, etc. Having staff that knows how to do these things often prevents the poor decisions from being made in the first place.
[0]: https://aws.amazon.com/compliance/shared-responsibility-mode...
I'm very interested in approaches that avoid cloud, so please don't read this as me saying cloud is superior. I can think of some other advantages of cloud:
- easy to setup different permissions for users (authorisation considerations).
- able to transfer assets to another owner (e.g., if there's a sale of a business) without needing to move physical hardware.
- other outsiders (consultants, auditors, whatever) can come in and verify the security (or other) of your setup, because it's using a standard well known cloud platform.
I predict a slow but unstoppable comeback of the sysadmin job over the next 5-10 years.
It never disappeared in some places. In my region there's been zero interest in "the cloud" because of physical remoteness from all major GCP/AWS/Azure datacenters (resulting in high latency), for compliance reasons, and because it's easier and faster to solve problems by dealing with a local company than pleading with a global giant that gives zero shits about you because you're less than a rounding error in its books.
> it becomes just another arcane skill set
It's an arcane skill set with a GUI. That makes it look much easier to learn.
The power of Moore's law.
I don't see how point 2 could have come as a surprise to anyone.
The fact that Fastmail work like this, are transparent about what they're up to and how they're storing my email and the fact that they're making logical decisions and have been doing so for quite a long time is exactly the reason I practically trip over myself to pay them for my email. Big fan of Fastmail.
They are also active in contributing to cyrus-imap
Aside: Fastmail was the best email provider I ever used. The interface was intuitive and responsive, both on mobile and web. They have extensive documentation for everything. I was able to set up a custom domain and a catch-all email address in a few minutes. Customer support is great, too. I emailed them about an issue and they responded within the hour (turns out it was my fault). I feel like it's a really mature product/company and they really know what they're doing, and have a plan for where they're going.
I ended up switching to Protonmail, because of privacy (Fastmail is within the Five Eyes (Australia)), which is the only thing I really like about Protonmail. But I'm considering switching back to Fastmail, because I liked it so much.
I was told Fastmail is excellent, and I am not a big fan of Gmail. Once locked out of Gmail for good, your email and the apps associated with it are gone forever. Source? Personal experience.
"A private inbox $60 for 12 months". I assume it is USD, not AU$ (AFAIK, Fastmail is based in Australia.) Still pricey.
At https://www.infomaniak.com/ I can buy email service for an (in my case external) domain for 18 Euro a year and I get 5 inboxes. And it is based in Switzerland, so no EU or US jurisdiction.
I have a few websites, and Fastmail would just be prohibitively expensive for me.
I have seen a common sentiment that self-hosting is almost always better than cloud. What these discussions do not mention is how to effectively run your business applications on this infrastructure.
Things like identity management (AAD/IAM), provisioning and running VMs, deployments. Network-side things like VNets, DNS, securely opening ports, etc. Monitoring setup across the stack. There is so much functionality required to safely expose an application externally that I can't even coherently list it all here. Are people just using SaaS for everything (which I think would defeat the purpose of on-prem infra), or can a competent sysadmin handle all this to give a cloud-like experience to end developers?
Can someone share their experience or share any write ups on this topic?
For more context, I briefly worked at a very large hedge fund which had a small DC's worth of VERY beefy machines but absolutely no platform on top of them. Hosting an application was done by copying the binaries onto a particular well-known machine, running npm commands, and restarting nginx. You'd log a ticket with a sysadmin to reserve a DNS entry and point an internal DNS name at this machine (no load balancer). Deployment was a shell script which rcp'd new binaries and restarted nginx. No monitoring or observability stack. There was a script which would log you into a random machine to run your workloads (be ready for angry IMs from more senior quants running their workloads on that random machine if your development build takes up enough resources to affect their work). I can go on and on, but I think you get the idea.
> identity management (AAD/IAM)
Do you mean for administrative access to the machines (over SSH, etc) or for "normal" access to the hosted applications?
Admin access: an Ansible-managed set of UNIX users & associated SSH public keys, combined with remote logging (so every access is audited and a malicious operator wiping the machine can't cover their tracks), will generally get you pretty far. Beyond that, there are commercial solutions like Teleport which provide integration with an IdP, a management web UI, session logging & replay, etc.
Normal line-of-business access: this would be managed by whatever application you're running, not much different to the cloud. But if your application isn't auth-aware or is unsafe to expose to the wider internet, you can stick it behind various auth proxies such as Pomerium - it will effectively handle auth against an IdP and only pass through traffic to the underlying app once the user is authenticated. This is also useful for isolating potentially vulnerable apps.
> provisioning and running VMs
Provisioning: once a VM (or even a physical server) is up and running enough to be SSH'd into, you should have a configuration management tool (Ansible, etc) apply whatever configuration you want. This would generally involve provisioning users, disabling some stupid defaults (SSH password authentication, etc), installing required packages, etc.
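As a hedged sketch, a baseline playbook of the kind described might look like this (the `ops` user, key path, and play layout are invented for illustration; the modules are standard Ansible ones):

```shell
cat > /tmp/baseline.yml <<'EOF'
- hosts: all
  become: true
  tasks:
    - name: Disable SSH password authentication
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?PasswordAuthentication'
        line: 'PasswordAuthentication no'
      notify: restart sshd

    - name: Provision the operator's SSH key
      ansible.posix.authorized_key:
        user: ops
        key: "{{ lookup('file', 'keys/ops.pub') }}"

  handlers:
    - name: restart sshd
      ansible.builtin.service:
        name: sshd
        state: restarted
EOF
# Apply with: ansible-playbook -i inventory /tmp/baseline.yml
```

The same playbook run against a fresh machine and an old one converges both to the same state, which is the point of using a config management tool instead of ad-hoc SSH sessions.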
To get a VM to an SSH'able state in the first place, you can configure your hypervisor to pass through "user data" which will be picked up by something like cloud-init (integrated by most distros) and interpreted at first boot - this allows you to do things like include an initial SSH key, create a user, etc.
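For example, a tiny cloud-init user-data file of the sort described (the username and key are placeholders); on a local hypervisor, `cloud-localds` from cloud-image-utils can turn it into a seed disk attached to the VM at first boot:

```shell
cat > /tmp/user-data <<'EOF'
#cloud-config
users:
  - name: ops
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-ed25519 AAAAC3...placeholder ops@workstation
EOF
# e.g. cloud-localds /tmp/seed.img /tmp/user-data
```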
To run VMs on self-managed hardware: libvirt, proxmox in the Linux world. bhyve in the BSD world. Unfortunately most of these have rough edges, so commercial solutions there are worth exploring. Alternatively, consider if you actually need VMs or if things like containers (which have much nicer tooling and a better performance profile) would fit your use-case.
> deployments
Depends on your application. But let's assume it can fit in a container - there's nothing wrong with a systemd service that just reads a container image reference in /etc/... and uses `docker run` to run it. Your deployment task can just SSH into the server, update that reference in /etc/ and bounce the service. Evaluate Kamal which is a slightly fancier version of the above. Need more? Explore cluster managers like Hashicorp Nomad or even Kubernetes.
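A minimal sketch of that systemd-wrapper pattern (the unit name, port, and paths are made up for illustration):

```shell
# /etc/myapp/image holds a single line: the container image reference.
cat > /tmp/myapp.service <<'EOF'
[Unit]
Description=myapp container
After=docker.service
Requires=docker.service

[Service]
ExecStart=/bin/sh -c 'exec docker run --rm --name myapp -p 8080:8080 "$(cat /etc/myapp/image)"'
ExecStop=/usr/bin/docker stop myapp
Restart=always

[Install]
WantedBy=multi-user.target
EOF
```

A "deployment" is then just overwriting /etc/myapp/image with the new reference over SSH and running `systemctl restart myapp`.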
> Network side of things like VNet
Wireguard tunnels set up (by your config management tool) between your machines, which will appear as standard network interfaces with their own (typically non-publicly-routable) IP addresses, and anything sent over them will transparently be encrypted.
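For illustration, one end of such a tunnel is just a small config file that a config-management tool would template out (the keys, names, and addresses below are placeholders):

```shell
cat > /tmp/wg0.conf <<'EOF'
[Interface]
# Private, non-publicly-routable address on the overlay network
Address = 10.0.0.1/24
PrivateKey = <this-machine's-private-key>
ListenPort = 51820

[Peer]
PublicKey = <other-machine's-public-key>
AllowedIPs = 10.0.0.2/32
Endpoint = peer2.example.com:51820
PersistentKeepalive = 25
EOF
# Bring it up with: wg-quick up /tmp/wg0.conf
```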
> DNS
Generally very little reason not to outsource that to a cloud provider or even your (reputable!) domain registrar. DNS is mostly static data though, which also means if you do need to do it in-house for whatever reason, it's just a matter of getting a CoreDNS/etc container running on multiple machines (maybe even distributed across the world). But really, there's no reason not to outsource that and hosted offerings are super cheap - so go open an AWS account and configure Route53.
> securely opening ports
To begin with, you shouldn't have anything listening that you don't want to be accessible. Then it's not a matter of "opening" or closing ports - the only ports that actually listen are the ones you want open by definition because it's your application listening for outside traffic. But you can configure iptables/nftables as a second layer of defense, in case you accidentally start something that unexpectedly exposes some control socket you're not aware of.
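A minimal default-drop nftables ruleset of that kind might look like this (the port numbers are just examples):

```shell
cat > /tmp/filter.nft <<'EOF'
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept   # allow replies to our own traffic
    iif lo accept                         # allow loopback
    tcp dport { 22, 443 } accept          # SSH and the app itself
  }
}
EOF
# Load it with: nft -f /tmp/filter.nft
```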
> Monitoring setup across the stack
collectd running on each machine (deployed by your configuration management tool) sending metrics to a central machine. That machine runs Grafana/etc. You can also explore "modern" stuff that the cool kids play with nowadays like VictoriaMetrics, etc, but metrics is mostly a solved problem so there's nothing wrong with using old tools if they work and fit your needs.
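The per-machine side of that can be as small as pointing collectd's network plugin at the central box (the hostname here is a placeholder):

```shell
cat > /tmp/collectd-network.conf <<'EOF'
LoadPlugin network
<Plugin network>
  # Ship metrics to the central collector on collectd's default port
  Server "metrics.internal" "25826"
</Plugin>
EOF
```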
For logs, configure rsyslogd to log to a central machine - on that one, you can have log rotation. Or look into an ELK stack. Or use a hosted service - again nothing prevents you from picking the best of cloud and bare-metal, it's not one or the other.
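The rsyslog forwarding side of that is a one-liner dropped into /etc/rsyslog.d/ (the hostname is a placeholder):

```shell
cat > /tmp/50-forward.conf <<'EOF'
# Forward everything to the central log host; @@ = TCP, a single @ = UDP
*.* @@loghost.internal:514
EOF
```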
> safely expose an application externally
There's a lot of snake oil and fear-mongering around this. First off, you need to differentiate between vulnerabilities of your application and vulnerabilities of the underlying infrastructure/host system/etc.
App vulnerabilities, in your code or dependencies: cloud won't save you. It runs your application just like it's been told. If your app has an SQL injection vuln or one of your dependencies has an RCE, you're screwed either way. To manage this you'd do the same as you do in cloud - code reviews, pentesting, monitoring & keeping dependencies up to date, etc.
Infrastructure-level vulnerabilities: cloud providers are responsible for keeping the host OS and their provided services (load balancers, etc) up to date and secure. You can do the same. Some distros provide unattended updates, which your config management tool can enable. Stuff that doesn't need to be reachable from the internet shouldn't be (bind internal stuff to your Wireguard interfaces). Put admin stuff behind some strong auth - TLS client certificates are the gold standard but have management overheads. Otherwise, use an IdP-aware proxy (like mentioned above). Don't always trust app-level auth. Beyond that, it's the usual - common sense, monitoring for "spooky action at a distance", and luck. Not too much different from your cloud provider, because they won't compensate you either if they do get hacked.
> For more context, I worked at a very large hedge fund briefly which had a small DC worth of VERY beefy machines but absolutely no platform on top of it...
Nomad or Kubernetes.
No, using Ansible to distribute public keys does not get you very far. It's fine for a personal project or even a team of 5-6 with a handful of machines, but beyond that you really need a better way to onboard, offboard, and modify accounts. If you're doing anything but a toy project, you're better off starting with something like IPA for host access controls.
Why do you think that? I did something similar at a previous job for something bordering on 1k employees.
User administration was done by modifying a yaml file in git. Nothing bad to say about it really. It sure beats point-and-click Active Directory any day of the week. Commit log handy for audits.
If there are no externalities demanding anything else, I'd happily do it again.
What's the risk you're trying to protect against, that a "better" (which one?) way would mitigate that this one wouldn't?
> IPA
Do you mean https://en.wikipedia.org/wiki/FreeIPA ? That seems like a huge amalgamation of complexity in a non-memory-safe language that I feel like would introduce a much bigger security liability than the problem it's trying to solve.
I'd rather pony up the money and use Teleport at that point.
It's basically Kerberos and an LDAP server, which are technologies old and reliable as dirt.
This sort of FUD is why people needlessly spend so much money on cloud.
> which are technologies old and reliable as dirt.
Technologies, sure. Implementations? Not so much.
I can trust OpenSSH because it's deployed everywhere, and I can be confident all the low-hanging fruit is gone by now; if not, its ubiquity means I'm unlikely to be the most interesting target, so I am more likely to escape a potential zero-day unscathed.
What's the market share of IPA in comparison? Has it seen any meaningful action in the last decade, and the same attention, from both white-hats (audits, pentesting, etc.) as well as black-hats (trying to break into every exposed service)? I very much doubt it, so the safe thing to assume is that it's nowhere near as bulletproof as OpenSSH and that it's more likely for a dedicated attacker to find a vuln there.
Love this article, and I'm also running some stuff on old enterprise servers in some racks somewhere. Now, over the last year, I've had to dive into Azure cloud as we have customers using it (B2B company), and I finally understood why everyone is doing cloud despite the price:
Global permissions, seamless organization, and IaC. If you are Fastmail or a small startup - go buy some used Dell PowerEdge with Epycs in some colo rack with 10GbE transit and save tons of money.
If you are a company with tons of customers and tons of requirements, it's powerful to put each concern into a landing zone, run some Bicep/Terraform, have a resource group to control costs, get savings on overall core count, and be done with it.
Assign permissions into a namespace for your employee or customer, have some back and forth about requirements, and it's done. No need to sysadmin across servers. No need to check for broken disks.
I also blame the hell of VMware and virtual machines for everything that is a PITA to maintain as a sysadmin but is loved because it's common knowledge. I would only do k8s on bare metal today and skip the whole virtualization thing completely. I guess it's also these pains that are softened in the cloud.
Why is it surprising? It's well known cloud is 3 times the price.
Because the default for companies today is cloud, even though it almost never makes sense. Sure, if you have really spiky load, need to dynamically scale at any point, and don't care about your spend, it might make sense.
I've even worked in companies where the engineering team spent effort and time on building "scalable infrastructure" before the product itself even found product-market fit...
Nobody said it's surprising though; they are well aware of it, having done it for more than two decades. Many newcomers are not aware of it though, as their default is "cloud" and they never even shopped for servers, colocation, or looked around on the dedicated server market.
I don't think it's just that they're not aware. Purely from a scaling and distribution perspective, it'd be wiser to start on cloud while you're still in the product-market-fit phase. Also, "bare metal" requires more on the capex end, and with how our corporate tax system is set up, it's just discouraging to go down this lane first; you'd be better off spending on acquiring clients.
Also, I'd guess a lot of technical founders are more familiar with cloud/server-side work than with handling or delegating sysadmin tasks that might require adding members to the team.
I agree, the cloud definitely has a lot of use cases and when you are building more complicated systems it makes sense to just have to do a few clicks to get a new stack setup vs. having someone evaluate solutions and getting familiar with operating them on a deep level (backups etc.).
Would be interesting to know how files get stored. They don't mention any distributed FS solutions like SeaweedFS, so once a drive is full, does the file get sent to another one via some service? Also, ZFS seems an odd choice since deletions (esp. of small files) on an 80%+ full drive are crazy slow.
Unlike ext4, which locks the directory when unlinking, ZFS is able to scale on parallel unlinking. Specifically, ZFS has range locks that permit directory entries to be removed in parallel from the extendible hash trees that store them. While this is relatively slow for sequential workloads, it is fast on parallel workloads. If you want to delete a large directory subtree fast on ZFS, do the rm operations in parallel. For example, this will run faster on ZFS than a naive rm operation:
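Something along these lines (the paths are illustrative, and this uses find with xargs -P rather than GNU parallel) runs one rm per batch of paths, with as many jobs as there are CPUs:

```shell
# Create a demo directory with many small files...
mkdir -p /tmp/zfs_rm_demo
( cd /tmp/zfs_rm_demo && seq 1 1000 | xargs touch )

# ...then unlink them in parallel: one rm per 100 paths, $(nproc) jobs at once.
find /tmp/zfs_rm_demo -mindepth 1 -maxdepth 1 -print0 \
  | xargs -0 -P "$(nproc)" -n 100 rm -f
```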
A friend had this issue on spinning disks the other day. I suggested he do this, and the remaining files were gone in seconds, when at the rate his naive rm was running, it should have taken minutes. It is a shame that rm does not implement a parallel unlink option internally (e.g. -j), which would be even faster, since it would eliminate the execve overhead and likely some directory lookup overhead too, versus using find and parallel to run many rm processes. For something like Fastmail, which has many users, unlinking should be parallel already, so unlinking on ZFS will not be slow for them.
By the way, that 80% figure has not been true for more than a decade. You are referring to the best fit allocator being used to minimize external fragmentation under low space conditions. The new figure is 96%. It is controlled by metaslab_df_free_pct in metaslab.c:
https://github.com/openzfs/zfs/blob/zfs-2.2.0/module/zfs/met...
Modification operations become slow when you are at/above 96% space filled, but that is to prevent even worse problems from happening. Note that my friend’s pool was below the 96% threshold when he was suffering from a slow rm -r. He just had a directory subtree with a large amount of directory entries he wanted to remove.
For what it is worth, I am the ryao listed here and I was around when the 80% to 96% change was made:
https://github.com/openzfs/zfs/graphs/contributors
I discovered this yesterday! Blew my mind. I had to check 3 times that the files were actually gone and that I specified the correct directory as I couldn't believe how quick it ran. Super cool
Thank you very much for sharing this, very insightful.
The open-source Cyrus IMAP server, which they mention using, has replication built in. ZFS also has built-in replication available.
Deletion of files depends on how they have configured the message store - they may be storing a lot of data into a database, for example.
ZFS replication is quite unreliable when used with ZFS native encryption, in my experience. Didn't lose data but constant bugs.
Keeping enough free space should be much less of a problem with SSDs. They can tune it so the array needs to be 95% full before the slower best-fit allocator kicks in. https://openzfs.readthedocs.io/en/latest/performance-tuning....
I think that 80% figure is from when drives were much smaller and finding free space over that threshold with the first-fit allocator was harder.
if you don't have high bandwidth requirements, like for background / batch processing, the ovh eco family [1] of bare metal servers is incredibly cheap
[1] https://eco.ovhcloud.com/en/
Didn’t see this in the article: do they have multi-AZ redundancy? I.e., if the entire RAID goes up in flames, what’s the recovery process?
Looks like they do mention that elsewhere: https://www.fastmail.com/features/reliability/
> Fastmail has some of the best uptime in the business, plus a comprehensive multi data center backup system. It starts with real-time replication to geographically dispersed data centers, with additional daily backups and checksummed copies of everything. Redundant mirrors allow us to failover a server or even entire rack in the case of hardware failure, keeping your mail running.
Yeah, that makes me feel uneasy as a long time fastmail user.
I absolutely love Fastmail. I moved off of Gmail years ago with zero regrets. Better UI, better apps, better company, and need I say better service? I still maintain and fetch from a Gmail account so it all just works seamlessly for receiving and sending Gmail, so you don’t have to give anything up either.
I moved from my own colocated 1U running Mailcow to Fastmail and don't regret it one bit. This was an interesting read, glad to see they think things through nice and carefully.
The only things I wish FM had are all software:
1. A takeout-style API to let me grab a complete snapshot once a week with one call
2. The ability to be an IdP for Tailscale.
Their UI is definitely faster but I do prefer the gmail UI, for example how new messages are displayed in threads is quite useless in fastmail.
Their android app has always been much snappier than Gmail, it's the little things that drew me to it years ago
I use Fastmail for my personal mail, and I don’t regret it, but I’m not quite as sold as you are, I guess maybe because I still have a few Google work accounts I need to use. Spam filtering in Fastmail is a little worse, and the search is _terrible_. The iOS app is usable but buggy. The easy masked emails are a big win though, and setting up new domains feels like less of a hassle with FM. I don’t regret using Fastmail, and I’d use them again for my personal email, but it doesn’t feel like a slam dunk.
I’ve started to host my own sites and stuff on an old MacBook in a cupboard with a shitty old external hard drive, running microk8s, and it’s great!
Another homelabber joins the ranks!!
"WHY we use our own hardware..."
The why is the interesting part of this article.
I take that back; this is (to me) the most interesting part:
"Although we’ve only ever used datacenter class SSDs and HDDs, failures and replacements every few weeks were a regular occurrence on the old fleet of servers. Over the last 3+ years, we’ve only seen a couple of SSD failures in total across the entire upgraded fleet of servers. This is easily less than one tenth the failure rate we used to have with HDDs."
I am working on a personal project (some would call it a startup, but I have no intention of getting external financing and other Americanisms) where I have set up my own CDN and video encoding, among other things. These days, whenever you have a problem, everyone answers "just use cloud", and that results in people really knowing nothing anymore. It is saddening. But on the other hand, it ensures all my decades of knowledge will be very well paid in the future, if I'd need to get a job.
I'm a little surprised it seems they didn't have some existing compression solution before moving to zfs. With so much repetitive text across emails I would think there would be a LOT to gain, such as from dictionaries, compressing many emails into bigger blobs, and fine-tuning compression options.
They use ZFS with zstd which likely compresses well enough.
Custom compression code can introduce bugs that could kill Fastmail's reputation for reliability.
It's better to use a well-tested solution that costs a bit more.
FYI - Fastmail web client has Offline support in beta right now.
https://www.fastmail.com/blog/offline-in-beta/
Very confused by this. What is in beta? I've had "offline" email access for 25 years. It's called an IMAP client.
[flagged]
Hey, this response makes you look like an adolescent asshole. Parent poster was clearly asking about prioritization.
Better an asshole than a moron in my opinion. If he maneuvers his cursor over the link he can click it and transform from “very confused” to “unconfused” provided he can comprehend English - which is admittedly in question.
> So after the success of our initial testing, we decided to go all in on ZFS for all our large data storage needs. We’ve now been using ZFS for all our email servers for over 3 years and have been very happy with it. We’ve also moved over all our database, log and backup servers to using ZFS on NVMe SSDs as well with equally good results.
If you're looking at ZFS on NVMe you may want to look at Alan Jude's talk on the topic, "Scaling ZFS for the future", from the 2024 OpenZFS User and Developer Summit:
* https://www.youtube.com/watch?v=wA6hL4opG4I
* https://openzfs.org/wiki/OpenZFS_Developer_Summit_2024
There are some bottlenecks that get in the way of getting all the performance that the hardware often is capable of.
gmail does spam filtering very well for me. fastmail, on the other hand, puts lots of legit emails into the spam folder. manually marking "not spam" doesn't help
other than that, i'm happy with fastmail.
If I look at my Gmail spam folder, there is very rarely something genuinely important in it. What there is, though, is a fair bit of random newsletters and announcements that I may have signed up for in some way, shape or form and that I don't really care about or generally look at. I assume they've been reported as spam by enough people, rather than simply unsubscribed from, that Google now labels them as such.
iCloud is just as bad, sends important things to spam constantly and marking as “not spam” has never done anything perceivable.
Anyone know what are some good data centers or providers to host your bare metal servers?
Are those backups geographically distributed?
The biggest win with running your own infra is disk/IO speeds, as noted here and in DHH's series on leaving cloud (https://world.hey.com/dhh/we-have-left-the-cloud-251760fb)
The cloud providers really kill you on IO for your VMs. Even if 'remote' SSDs are available with configurable ($$) IOPs/bandwidth limits, the size of your VM usually dictates a pitiful max IO/BW limit. In Azure, something like a 4-core 16GB RAM VM will be limited to 150MB/s across all attached disks. For most hosting tasks, you're going to hit that limit far before you max out '4 cores' of a modern CPU or 16GB of RAM.
On the other hand, if you buy a server from Dell and run your own hypervisor, you get a massive reserve of IO, especially with modern SSDs. Sure, you have to share it between your VMs, but you own all of the IO of the hardware, not some pathetic slice of it like in the cloud.
As is always said in these discussions, unless you're able to move your workload to PaaS offerings in the cloud (serverless), you're not taking advantage of what large public clouds are good at.
Biggest issue isn't even sequential speed but latency. In the cloud all persistent storage is networked and has significantly more latency than direct-attached disks. This is a physical (speed of light) limit, you can't pay your way out of it, or throw more CPU at it. This has a huge impact for certain workloads like relational databases.
I like this writeup, informative and to-the-point.
Today, the cloud isn’t about other people’s hardware.
It’s about infrastructure being an API call away. Not just virtual machines but also databases, load-balancers, storage, and so on.
The cost isn’t the DC or the hardware, but the hours spent on operations.
And you can abuse developers to do operations on the side :-)
And then come the weird aspects of bad cloud service providers, like IONOS: broken OS images; a provisioning API that is a bottleneck, where what other people do (and how much they do) can slow down your own provisioning; network interfaces that can take minutes to create via their API, with customer service saying "That's how it is, cannot change it."; and a very shitty web user interface that desperately tries to be a single-page app yet has all the default browser functionality, like the back button, broken. Yet they still cost literally 10x what Hetzner Cloud costs, while Hetzner basically does everything better.
And then it is still also about other people's hardware in addition to that.
Yeah, Cloud is a bit of a scam innit? Oxide is looking more and more attractive every day as the industry corrects itself from overspending on capabilities they would never need.
It’s trading time for money
Fake news. I've got my bare metal server deployed and installed with my ansible playbook even before you manage to log into the bazillion layers of abstraction that is AWS.
But can you do that on demand, in minutes, for 1000 application teams that have unique snowflake needs? Because Terraform or Bicep can.
Yes, welcome to business. But frankly, an email provider needs to have their own metal; if they don't, they're not worth doing business with.
longtime FM user here
good on them, understanding infrastructure and cost/benefit is essential in any business you hope to run for the long haul
I would like to know the tech stack behind it.
Host of an online service seems to think they deserve a medal for discovering that S3 buckets from a cloud provider are crap and cost a fortune.
The heading in this space makes you think they're running custom FPGAs, as with Gmail, not just running on metal... As for drive failures: welcome to storage at scale. Build your solution so that replacing 10 disks at a time is a weekly task, not a critical 2am incident when a single disk dies...
Storing/Accessing tonnes of <4kB files is difficult, but other providers are doing this on their own metal with CEPH at the PB scale.
I love ZFS, it's great with per-disk redundancy but CEPH is really the only game in town for inter-rack/DC resilience which I would hope my email provider has.
A mail-cloud provider uses its own hardware? Well, that’s to be expected, it would be a refreshing article if it was written by one of their customers.
But what about the cost and complexity of a room with the racks and the cooling needs of running these machines? And the uninterrupted power setup? The wiring mess behind the racks.
I'm not fastmail but this is not rocket science. Has everyone forgotten how datacentre services work in 2024?
Yes they have, and they feel they deserve credit for discovering a WiFi cable is more reliable than the new shiny kit that was sold to them by a vendor...
There is a very competitive market for colo providers in basically every major metropolitan area in the US, Europe, and Asia. The racks, power, cooling, and network to your machines is generally very robust and clearly documented on how to connect. Deploying servers in house or in a colo is a well understood process with many experts who can help if you don’t have these skills.
Colo offers the ability to ship and deploy and keep latencies down if you're global, but if you're local, yes, you should just get someone on site and the modern equivalent of a T1 line set up to your premises if you're running "online" services.
Even for cloud providers, these are mostly other people's problems, eg: Equinix
Own hardware doesn't mean own data center. Many data centers offer colocation.
Do colocation facilities solve that?
Since I moved from gmail to fastmail, my mailbox is full of spam. I tried setting up rules but there are just too many of them, so I abandoned that strategy after a month. Now I just label mail from senders that are not in my contacts differently. But it's still a mess. I'm at the point that I prefer WhatsApp over email.
So, Fastmail please fix this or tell me what I'm doing wrong. IMHO when uninteresting mail arrives it should take at most two clicks to install a new rule and apply it.
Your comment is confusing because you start this one saying your inbox is full of spam, but respond to a suggestion to mark it as spam by saying it's not actually spam.
If something is not spam but you want it out of your inbox there's a few options:
- click Unsubscribe next to the sender. This should be possible for essentially all promotional email.
- click Actions -> click Block <sender>. Messages from this address will now immediately go to trash.
- click Actions -> click Add rule from message (-> optionally change the suggested conditions) -> check Archive (or if you don't use labels click Move to) -> click Save. Messages matching the conditions will now skip your inbox.
There's not much they could do to make that easier without magically knowing what you care about and what you don't.
This last week, gmail failed to filter as spam an email with subject "#T Anitra", body,
> oF1 d 4440 - 2 B 32677 83
> R Teri E x E q
>
> k 50347733 Safoorabegum
and an attachment "7330757559.pdf". It let through 8 similar emails in the same week, and many more even more egregiously gibberish emails over the years. I'm not pleased with the quality of gmail's spam filter.
I moved to FastMail three years ago, and, for a contrasting experience, found that spam filtering was almost on a par with Gmail. I had feared it would be otherwise.
my inbox at fastmail is near empty from spam. the main spam i see in my inbox is forwarded from my gmail.
That probably says more about the email address that’s out there than anything else.
Fastmail has wildcard email support, so it’s pretty easy to have an email per purchase you make (for example). This makes it easy to see who leaked your email to spammers. Anyway, I have nowhere near the volume of spam with Fastmail that I had with Gmail.
Gmail puts most of my email in the spam folder, including a lot of non-spam. Manually labeling it as non-spam is not helping.
Never had that after the first few years, but I hear other people do have that. Maybe it's because I've used it for 2 decades now? I tried alternatives, including Fastmail, but I always leave them because I get swamped by spam while Gmail works fine.
There is a "Report Spam" function which is two clicks away (it's in the "More" menu).
I don't want to report everything as spam. For example, promotional emails from businesses that I bought something from. I don't want to punish those businesses; and those emails might contain vouchers that I could use later. But I want those emails moved out of the way without any action from my side.
That's like Spotify telling me to "keep disliking" when I complained about songs in a certain language (which I never liked or listened to, and certainly don't speak) filling my home screen, after I had told them in the first complaint that I had been doing exactly that for months.
What can I say, "Report Spam" seems to work for me. I'm just a customer of Fastmail.
If you get 12 spam mails every day and after 3 months of clicking "report spam" it still doesn't filter them, then it's not on par with Gmail.
If you meet someone new at a social event and give them your email address, where do you want your email provider to put the message that this person sent?
I get no spam on Fastmail. I assume this is because I never give out my email to anyone and create new ones for every interaction. This way I keep track of who I'm interacting with and also who's selling my alias emails.
Just wish there was a decent way to do this with mobile numbers!
Same, I religiously create a masked email for every website (just checked, it's now at 163!). I simply don't give my "main" email out.
Oddly enough, simply unsubscribing via the websites themselves has kept things clean; I've yet to notice any true spam from a random source aimed at any of my emails since I joined last year.
Cost isn’t always the most important metric. If that was the case, people would always buy the cheapest option of everything.