The whole push to the cloud has always fascinated me. I get it - most people aren't interested in babysitting their own hardware. On the other hand, a business of just about any size that has any reasonable amount of hosting is better off with their own systems when it comes purely to cost.
All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding, but serve to introduce doubt to non-technical people and to trick people who don't examine what they're told.
What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue against a writeup like this using silly cloud talking points. They don't seem to care much about data or facts, just that they love cloud and want everyone else to be in cloud, too. This happens much more often on sites like Reddit (r/sysadmin, even), but I wouldn't be surprised to see a little of it here.
It makes me wonder: how do people get so sold on a thing that they'll go online and fight about it, even when they lack facts or often even basic understanding?
I can clearly state why I advocate for avoiding cloud: cost, privacy, security, a desire to not centralize the Internet. The reason people advocate for cloud for others? It puzzles me. "You'll save money," "you can't secure your own machines," "it's simpler" all have worlds of assumptions that those people can't possibly know are correct.
So when I read something like this from Fastmail which was written without taking an emotional stance, I respect it. If I didn't already self-host email, I'd consider using Fastmail.
There used to be so much push for cloud everything that an article like this would get fanatical responses. I hope that it's a sign of progress that that fanaticism is waning and people aren't afraid to openly discuss how cloud isn't right for many things.
"All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding,"
This is false. AWS infrastructure is vastly more secure than almost all company data centers. AWS has a rule that the same person cannot have logical access and physical access to the same storage device. Very few companies have enough IT people to have this rule. The AWS KMS is vastly more secure than what almost all companies are doing. The AWS network is vastly better designed and operated than almost all corporate networks. AWS S3 is more reliable and scalable than anything almost any company could create on their own. To create something even close to it you would need to implement something like MinIO using 3 separate data centers.
1. big clouds are very lucrative targets for spooks, your data seem pretty likely to be hoovered up as "bycatch" (or maybe main catch depending on your luck) by various agencies and then traded around as currency
2. you never hear about security problems (incidents or exposure) in the platforms, there's no transparency
I think it's a very relevant bar, though. The top level commenter made points about "a business of just about any size", which seems pretty exactly aligned with "most corporate stuff".
> AWS infrastructure is vastly more secure than almost all company data centers
Secure in what terms? Security is always about a threat model and trade-offs. There's no absolute, objective term of "security".
> AWS has a rule that the same person cannot have logical access and physical access to the same storage device.
Any promises they make aren't worth anything unless there's contractually-stipulated damages that AWS must pay in case of breach, those damages actually corresponding to the cost of said breach for the customer, and a history of actually paying out said damages without shenanigans. They've already got a track record of lying on their status pages, so it doesn't bode well.
But I'm actually wondering what this specific rule even tries to defend against? You presumably care about data protection, so logical access is what matters. Physical access seems completely irrelevant, no?
> Very few companies have enough IT people to have this rule
Maybe, but that doesn't actually mitigate anything from the company's perspective? The company itself would still be in the same position, aka not enough people to reliably separate responsibilities. Just that instead of those responsibilities being physical, they now happen inside the AWS console.
> The AWS KMS is vastly more secure than what almost all companies are doing.
See first point about security. Secure against what - what's the threat model you're trying to protect against by using KMS?
But I'm not necessarily denying that (at least some) AWS services are very good. Question is, is that "goodness" required for your use-case, is it enough to overcome its associated downsides, and is the overall cost worth it?
A pragmatic approach would be to evaluate every component on its merits and fitness to the problem at hand instead of going all in, one way or another.
one of my greatest learnings in life is to differentiate between facts and opinions - sometimes opinions are presented as facts and vice versa. if you think about it, the statement "this is false" is a response to an opinion (presented as a fact), but it is not itself a fact. there is no way one can objectively define and defend what "real technical understanding" means. the cloud space is vast, with millions of people having varied understanding and thus varied opinions.
so let's not fight the battle that will never be won. there is no point in convincing pro-cloud people that cloud isn't the right choice and vice-versa. let people share stories where it made sense and where it didn't.
as someone who has lived in the cloud security space since 2009 (and was founder of redlock - one of the first CSPMs), in my opinion there is no doubt that AWS is indeed designed better than most corp. networks - but is that what you really need? if you run your entire corp and LOB apps on aws but have poor security practices, will it be the right decision? what if you have the best security engineers in the world, but they are best at Cisco-type security - configuring VLANs and managing endpoints - and not good at detecting someone using IMDSv1 on an ec2 instance exposed to the internet and running an app vulnerable to csrf?
when the scope of discussion is as vast as cloud vs on-prem, imo, it is a bad idea to make absolute statements.
Great points. Also, if you end up building your apps as Rube Goldberg machines living up to "AWS Well Architected" criteria (pushed by AWS-certified staff whose paychecks now depend on following AWS recommended practices), the complexity will kill your security, as nobody will understand the systems anymore.
The other part is that when us-east-1 goes down, you can blame AWS, and a third of your customer's vendors will be doing the same. When you unplug the power to your colo rack while installing a new server, that's on you.
about security, most businesses using AWS invest little to nothing in securing their software, or even adopt basic security practices for their employees
having the most secure data center doesn't matter if you load your secrets as env vars in a system that can be easily compromised by a motivated attacker
so i don't buy this argument as a general reason pro-cloud
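To make the env-var point concrete, here's a minimal sketch of the alternative: loading a secret from a permission-restricted file instead of the process environment. The function name and policy are illustrative, not any particular tool's API:

```python
import stat
from pathlib import Path

def load_secret(path: str) -> str:
    """Read a secret from a file, refusing group/world-readable files.

    Unlike an environment variable, a file's permissions can restrict
    who reads it, and it won't leak through `ps e` output, crash dumps,
    or child processes inheriting the environment.
    """
    p = Path(path)
    mode = p.stat().st_mode
    if mode & (stat.S_IRGRP | stat.S_IROTH):
        raise PermissionError(f"{path} is readable by group/others")
    return p.read_text().strip()
```

None of this requires a cloud provider, which is the point: basic secret hygiene is orthogonal to where the servers live.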
It’s like putting something in someone’s desk drawer under the guise of convenience at the expense of security.
Why?
Too often, someone other than the data owner has or can get access to the drawer directly or indirectly.
Also, Cloud vs self hosted to me is a pendulum that has swung back and forth for a number of reasons.
The benefits of the cloud outlined here are often a lot of open source tech packaged up and sold as manageable from a web browser, or a command line.
One of the major reasons the cloud became popular was networking issues in Linux to manage volume at scale. At the time the cloud became very attractive for that reason, plus being able to virtualize bare metal servers to put into any combination of local to cloud hosting.
Self-hosting has become easier by an order of magnitude or two for anyone who knew how to do it, but it's something only people who have done both self-hosting and cloud can really discuss.
Cloud has abstracted away the cost of horsepower, and converted it to transactions. People are discovering a fraction of the horsepower is needed to service their workloads than they thought.
At some point the horsepower got way beyond what they needed and it wasn’t noticed. But paying for a cloud is convenient and standardized.
Company data centres can be reasonably secured using a number of PaaS or IaaS solutions readily available off the shelf. Tools from VMware, Proxmox and others are tremendous.
It may seem like there's a lot to learn, except most problems that are new to someone have already been thought through extensively by people whose experience goes beyond cloud-only.
Isn’t it more like leasing in a public property? Meaning it is yours as long as you are paying the lease? Analogous to renting an apartment instead of owning a condo?
But isn't using Fastmail akin to using a cloud provider (managed email vs managed everything else)? They are similarly a service provider, and as a customer, you don't really care "who their ISP is?"
The discussion matters when we are talking about building things: whether you self-host or use managed services is a set of interesting trade-offs.
Yes, FastMail is a SaaS. But there are adepts of a religion who will tell you that companies like FastMail should be built on top of AWS and that it is the only true way. It is good to have some counter-narrative to this.
<ctoHatTime>
Dunno man, it's really really easy to set up an S3 bucket and use it to share datasets with users authorized via IAM....
And IAM and other cloud security and management considerations are where the opex/capex and capability argument can start to break down. Turns out, the "cloud" savings come from not having the capabilities in house to manage hardware. And sometimes, for most businesses, you want some of that lovely reliability.
(In short, I agree with you, substantially).
Like code. It is easy to get something basic up, but substantially more resources are needed for non-trivial things.
I strongly agree with this and also strongly lament it.
I find IAM to be a terrible implementation of a foundationally necessary system. It feels tacked on to me, except now it's tacked onto thousands of other things and there's no way out.
That's essentially why "platform engineering" is a hot topic. There are great FOSS tools for this, largely in the Kubernetes ecosystem.
To be clear, authentication could still be outsourced, but authorizing access to (on-prem) resources in a multi-tenant environment is something that "platforms" are frequently designed for.
> All the pro-cloud talking points... don't persuade anyone with any real technical understanding
This is a very engineer-centric take. The cloud has some big advantages that are entirely non-technical:
- You don't need to pay for hardware upfront. This is critical for many early-stage startups, who have no real ability to predict CapEx until they find product/market fit.
- You have someone else to point the SOC2/HIPAA/etc auditors at. For anyone launching a company in a regulated space, being able to checkbox your entire infrastructure based on AWS/Azure/etc existing certifications is huge.
You can over-provision your own baremetal resources 20x and it will be still cheaper than cloud. The capex talking point is just that, a talking point.
The real cost wins of self-hosting come from friction: anything requiring new hardware becomes an ordeal, and engineers won't reach for high-cost, value-added services. I agree that there's often too little restraint in cloud architectures, but if a business truly believes in a project, it shouldn't be held up for six months waiting for server budget, with engineers doing ops work to get three nines of DB reliability.
There is a size where self-hosting makes sense, but it's much larger than you think.
Most companies severely understaff ops, infra, and security. Your talking points might be good but, in practice, won’t apply in many cases because of the intractability of that management mindset. Even when they should know better.
I’ve worked at tech companies with hundreds of developers and single digit ops staff. Those people will struggle to build and maintain mature infra. By going cloud, you get access to mature infra just by including it in build scripts. Devops is an effective way to move infra back to project teams and cut out infra orgs (this isn’t great but I see it happen everywhere). Companies will pay cloud bills but not staffing salaries.
I'm curious about what "reasonable amount of hosting" means to you, because from my experience, as your internal network's complexity goes up, it's far better for you to move systems to a hyperscaler. The current estimate is >90% of Fortune 500 companies are cloud-based. What is it that you know that they don't?
>What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue with a writeup like this with silly cloud talking points. They don't seem to care much about data or facts, just that they love cloud and want everyone else to be in cloud, too.
The irony is absolutely dripping off this comment, wow.
Commenter makes an emotionally charged comment with no data or facts, and dismisses anyone who disagrees with them as offering "silly talking points" and not caring about data and facts.
I’m not convinced this is entirely true. The upfront cost if you don’t have the skills, sure – it takes time to learn Linux administration, not to mention management tooling like Ansible, Puppet, etc.
But once those are set up, how is it different? AWS is quite clear with their responsibility model that you still have to tune your DB, for example. And for the setup, just as there are Terraform modules to do everything under the sun, there are Ansible (or Chef, or Salt…) playbooks to do the same. For both, you _should_ know what all of the options are doing.
The only way I see this sentiment being true is that a dev team, with no infrastructure experience, can more easily spin up a lot of infra – likely in a sub-optimal fashion – to run their application. When it inevitably breaks, they can then throw money at the problem via vertical scaling, rather than addressing the root cause.
I think this is only true for teams and apps of a certain size.
I've worked on plenty of teams with relatively small apps, and the difference between:
1. Cloud: "open up the cloud console and start a VM"
2. Owned hardware: "price out a server, order it, find a suitable datacenter, sign a contract, get it racked, etc."
Is quite large.
#1 is 15 minutes for a single team lead.
#2 requires the team to agree on hardware specs, get management approval, finance approval, executives signing contracts. And through all this you don't have anything online yet for... weeks?
If your team or your app is large, this probably all averages out in favor of #2. But small teams often don't have the bandwidth or the budget.
I work for a 50 person subsidiary of a 30k person organisation. I needed a domain name. I put in the purchase request and 6 months later eventually gave up, bought it myself and expensed it.
Our AWS account is managed by an SRE team. It’s a 3 day turnaround process to get any resources provisioned, and if you don’t get the exact spec right (you forgot to specify the iops on the volume? Oops) 3 day turnaround. Already started work when you request an adjustment? Better hope as part of your initial request you specified backups correctly or you’re starting again.
The overhead is absolutely enormous, and I actually don’t even have billing access to the AWS account that I’m responsible for.
You gave me flashbacks to a far worse bureaucratic nightmare with #2 in my last job.
I supported an application with a team of about three people for a regional headquarters in the DoD. We had one stack of aging hardware that was racked, on a handshake agreement with another team, in a nearby facility under that other team's control. We had to periodically request physical access for maintenance tasks and the facility routinely lost power, suffered local network outages, etc. So we decided that we needed new hardware and more of it spread across the region to avoid the shaky single-point-of-failure.
That began a three year process of: waiting for budget to be available for the hardware / licensing / support purchases; pitching PowerPoints to senior management to argue for that budget (and getting updated quotes every time from the vendors); working out agreements with other teams at new facilities to rack the hardware; traveling to those sites to install stuff; and working through the cybersecurity compliance stuff for each site. I left before everything was finished, so I don't know how they ultimately dealt with needing, say, someone in Japan to physically reseat a cable or something.
I’ve never worked at a company with these particular problems, but:
#1: A cloud VM comes with an obligation for someone at the company to maintain it. The cloud does not excuse anyone from doing this.
#2: Sounds like a dysfunctional system. Sure, it may be common, but a medium sized org could easily have some datacenter space and allow any team to rent a server or an instance, or to buy a server and pay some nominal price for the IT team to keep it working. This isn’t actually rocket science.
Sure, keeping a fifteen-year-old server working safely is a chore, but so is maintaining a fifteen-year-old VM instance!
Obligation? Far from it. I've worked at some poorly staffed companies. Nobody is maintaining old VMs or container images. If it works, nobody touches it.
I worked at a supposedly properly staffed company that had raised hundreds of millions in investment, and it was the same thing. VMs running 5 year old distros that hadn't been updated in years. 600 day uptimes, no kernel patches, ancient versions of Postgres, Python 2.7 code everywhere, etc. This wasn't 10 years ago. This was 2 years ago!
The SMB I work for runs a small on-premise data center that is shared between teams and projects, with maybe 3-4 FTEs managing it (the respective employees also do dev and other work). This includes self-hosting email, storage, databases, authentication, source control, ticketing, company wiki, and other services. The current infrastructure didn’t start out that way and developed over many years, so it’s not necessarily something a small startup can start out with, but beyond a certain company size (a couple dozen employees or more) it shouldn’t really be a problem to develop that, if management shares the philosophy. I certainly find it preferable culturally if not technically to maximize independence in that way, have the local expertise and much better control over everything.
One (the only?) indisputable benefit of cloud is the ability to scale up faster (elasticity), but most companies don’t really need that. And if you do end up needing it after all, then it’s a good problem to have, as they say.
You're assuming that hosting something in-house implies that each application gets its own physical server.
You buy a couple of beastly things with dozens of cores. You can buy twice as much capacity as you actually use and still be well under the cost of cloud VMs. Then it's still VMs and adding one is just as fast. When the load gets above 80% someone goes through the running VMs and decides if it's time to do some house cleaning or it's time to buy another host, but no one is ever waiting on approval because you can use the reserve capacity immediately while sorting it out.
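The headroom rule of thumb in that comment can be written down in a few lines. This is just a sketch of the commenter's 80% figure, not anyone's production policy:

```python
def capacity_action(used_cores: int, total_cores: int,
                    threshold: float = 0.80) -> str:
    """Decide when it's time to clean house or order another host.

    Below the threshold, spare capacity absorbs new VMs immediately;
    above it, someone reviews idle VMs or kicks off a hardware order,
    while the reserve still serves requests in the meantime.
    """
    load = used_cores / total_cores
    return "ok" if load <= threshold else "review"
```

The key property is that the "review" branch never blocks anyone: provisioning stays instant because the buffer was bought up front.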
There is a large gap between "own the hardware" and "use cloud hosting". Many people rent the hardware, for example, and you can use managed databases, which is one step up from "starting a VM".
But your comparison isn't fair. The difference between running your own hardware and using the cloud (which is perhaps not even the relevant comparison but let's run with it) is the difference between:
1. Open up the cloud console, and
2. You already have the hardware so you just run "virsh" or, more likely, do nothing at all because you own the API so you have already included this in your Ansible or Salt or whatever you use for setting up a server.
Because ordering a new physical box isn't really comparable to starting a new VM, is it?
Before the cloud, you could get a VM provisioned (virtual servers) or a couple of apps set up (LAMP stack on a shared host ;)) in a few minutes over a web interface already.
"Cloud" has changed that by providing an API to do this, thus enabling IaC approach to building combined hardware and software architectures.
For purposes of this discussion, isn't AWS just a very large hosting provider?
I.e. most hosting providers give you the option for virtual or dedicated hardware. So does Amazon (metal instances).
Like, "cloud" was always an ill-defined term, but in the case of "how do I provision full servers" I think there's no qualitative difference between Amazon and other hosting providers. Quantitative, sure.
But you still get nickel & dimed and pay insane costs, including on bandwidth (which is free in most conventional hosting providers, and overages are 90x cheaper than AWS' costs).
You can get pretty far without any of that fancy stuff. You can get plenty done by using parallel-ssh and then focusing on the actual thing you develop instead of endless tooling and docker and terraform and kubernetes and salt and puppet and ansible. Sure, if you know why you need them and know what value you get from them OK. But many people just do it because it's the thing to do...
Do you need those tools? It seems that for fundamental web hosting, you need your application server, nginx or similar, postgres or similar, and a CLI. (And an interpreter etc if your application is in an interpreted lang)
I suppose that depends on your RTO. With cloud providers, even on a bare VM, you can to some extent get away with having no IaC, since your data (and therefore config) is almost certainly on networked storage which is redundant by design. If an EC2 instance fails, or even if one of the drives backing your EBS volume fails, it'll probably come back up as it was.
If it's your own hardware, if you don't have IaC of some kind – even something as crude as a shell script – then a failure may well mean you need to manually set everything up again.
Well, sure – I was trying to do a comparison in favor of cloud, because the fact that EBS Volumes can magically detach and attach is admittedly a neat trick. You can of course accomplish the same (to a certain scale) with distributed storage systems like Ceph, Longhorn, etc. but then you have to have multiple servers, and if you have multiple servers, you probably also have your application load balanced with failover.
- Some sort of firewall or network access control. Being able to say "allow http/s from the world (optionally minus some abuser IPs that cause problems), and allow SSH from developers (by IP, key, or both)" at a separate layer from nginx is prudent. Can be ip/tables config on servers or a separate firewall appliance.
- Some mechanism of managing storage persistence for the database, e.g. backups, RAID, data files stored on fast network-attached storage, db-level replication. Not losing all user data if you lose the DB server is table stakes.
- Something watching external logging or telemetry to let administrators know when errors (e.g. server failures, overload events, spikes in 500s returned) occur. This could be as simple as Pingdom or as involved as automated alerting based on load balancer metrics. Relying on users to report downtime events is not a good approach.
- Some sort of CDN, for applications with a frontend component. This isn't required for fundamental web hosting, but for sites with a frontend and even moderate (10s/sec) hit rates, it can become required for cost/performance; CDNs help with egress congestion (and fees, if you're paying for metered bandwidth).
- Some means of replacing infrastructure from nothing. If the server catches fire or the hosting provider nukes it, having a way to get back to where you were is important. Written procedures are fine if you can handle long downtime while replacing things, but even for a handful of application components those procedures get pretty lengthy, so you start wishing for automation.
- Some mechanism for deploying new code, replacing infrastructure, or migrating data. Again, written procedures are OK, but start to become unwieldy very early on ('stop app, stop postgres, upgrade the postgres version, start postgres, then apply application migrations to ensure compatibility with new version of postgres, then start app--oops, forgot to take a postgres backup/forgot that upgrading postgres would break the replication stream, gotta write that down for next time...').
...and that's just for a very, very basic web hosting application--one that doesn't need caches, blob stores, the ability to quickly scale out application server or database capacity.
Each of those things can be accomplished the traditional way--and you're right that sometimes that way is easier for a given item in the list (especially if your maintainers have expertise in that item)! But in aggregate, having a cloud provider handle each of those concerns tends to be easier overall and not require nearly as much in-house expertise.
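To make the monitoring bullet concrete: the "something watching external logging or telemetry" item can start as small as a script like the one below. The endpoints and the alert rule are assumptions for illustration, not a recommended production setup:

```python
import urllib.request

def probe(url: str, timeout: float = 5.0):
    """Return the HTTP status code of an endpoint, or None on any failure
    (DNS error, connection refused, timeout, TLS problem, ...)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except Exception:
        return None

def needs_alert(status) -> bool:
    """Alert on connection failures and 5xx responses; a hypothetical
    cron job would call probe() and page someone when this is True."""
    return status is None or status >= 500
```

A loop over your health-check URLs plus an email or webhook on `needs_alert` already beats relying on users to report downtime, which is the bar the bullet sets.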
You are focusing on technology. And sure of course you can get most of the benefits of AWS a lot cheaper when self-hosting.
But when you start factoring internal processes and incompetent IT departments, suddenly that's not actually a viable option in many real-world scenarios.
I have never ever worked somewhere with one of these "cloud-like but custom on our own infrastructure" setups that didn't leak infrastructure concerns through the abstraction, to a significantly larger degree than AWS.
I believe it can work, so maybe there are really successful implementations of this out there, I just haven't seen it myself yet!
> Cloud expands the capabilities of what one team can manage by themselves, enabling them to avoid a huge amount of internal politics.
It's related to the first part. Re: the second, IME if you let dev teams run wild with "managing their own infra," the org as a whole eventually pays for that when the dozen bespoke stacks all hit various bottlenecks, and no one actually understands how they work, or how to troubleshoot them.
I keep being told that "reducing friction" and "increasing velocity" are good things; I vehemently disagree. It might be good for short-term profits, but it is poison for long-term success.
Our big company locked all cloud resources behind a floating/company-wide DevOps team (git and CI too). We have an old on-prem server that we jealously guard because it allows us to create remotes for new git repos and deploy prototypes without consulting anyone.
(To be fair, I can see why they did it - a lot of deployments were an absolute mess before.)
Self-hosted software also has APIs, and Terraform libraries, and Ansible playbooks, etc. It’s just that you have to know what it is you’re trying to do, instead of asking AWS what collection of XaaS you should use.
Well, cloud providers often give you more than just VMs in a data center somewhere. You may not be able to find good equivalents if you aren't using the cloud. Some third-party products are also only available on clouds. How much of a difference those things make will depend on what you're trying to do.
I think there are accounting reasons for companies to prefer paying opex to run things on the cloud instead of more capex-intensive self-hosting, but I don’t understand the dynamics well.
It’s certainly the case that clouds tend to be more expensive than self-hosting, even when taking account of the discounts that moderately sized customers can get, and some of the promises around elastic scaling don’t really apply when you are bigger.
To some of your other points: the main customers of companies like AWS are businesses. Businesses generally don’t care about the centralisation of the internet. Businesses are capable of reading the contracts they are signing and not signing them if privacy (or, typically more relevant to businesses, their IP) cannot be sufficiently protected. It’s not really clear to me that using a cloud is going to be less secure than doing things on-prem.
It seems that the preference is less about understanding or misunderstanding the technical requirements but more that it moves a capital expenditure with some recurring operational expenditure entirely into the opex column.
The fact is, managing your own hardware is a pita and a distraction from focusing on the core product. I loathe messing with servers and even opt for "overpriced" paas like fly, render, vercel. Because every minute messing with and monitoring servers is time not spent on product. My tune might change past a certain size and a massive cloud bill and there's room for full time ops people, but to offset their salary, it would have to be huge.
That argument makes sense for PaaS services like the ones you mention. But for bare "cloud" like AWS, I'm not convinced it is saving any effort, it's merely swapping one kind of complexity with another. Every place I've been in had full-time people messing with YAML files or doing "something" with the infrastructure - generally trying to work around the (self-inflicted) problems introduced by their cloud provider - whether it's the fact you get 2010s-era hardware or that you get nickel & dimed on absolutely arbitrary actions that have no relationship to real-world costs.
How do you configure S3 access control? You need to learn & understand how their IAM works.
How do you even point a pretty URL to a lambda? Last time I looked you need to stick an "API gateway" in front (which I'm sure you also get nickel & dimed for).
How do you go from "here's my git repo, deploy this on Fargate" with AWS? You need a CI pipeline which will run a bunch of awscli commands.
And I'm not even talking about VPCs, security groups, etc.
Somewhat different skillsets than old-school sysadmin (although once you know sysadmin basics, you realize a lot of these are just the same concepts under a branded name and arbitrary nickel & diming sprinkled on top), but equivalent in complexity.
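To illustrate the S3/IAM point: even "make one prefix public-read" means writing a policy document in IAM's JSON grammar. The bucket name and prefix below are placeholders; the `Version` string and statement fields are IAM's actual schema:

```python
import json

# Hypothetical bucket and prefix; the structure is AWS IAM's policy grammar.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "PublicReadForAssets",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-bucket/assets/*",
    }],
}

print(json.dumps(policy, indent=2))
```

Conceptually this is just an ACL, the same idea an old-school sysadmin would express with filesystem permissions or an nginx `location` block, wrapped in a provider-specific document format.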
Counterpoint: if you’re never “messing with servers,” you probably don’t have a great understanding of how their metrics map to those of your application’s, and so if you bottleneck on something, it can be difficult to figure out what to fix. The result is usually that you just pay more money to vertically scale.
To be fair, you did say “my tune might change past a certain size.” At small scale, nothing you do within reason really matters. World’s worst schema, but your DB is only seeing 100 QPS? Yeah, it doesn’t care.
I don’t think you’re correct. I’ve watched junior/mid-level engineers figure things out solely by working on the cloud and scaling things to a dramatic degree. It’s really not rocket science.
I didn't say it's rocket science, nor that it's impossible to do without having practical server experience, only that it's more difficult.
Take disks, for example. Most cloud-native devs I've worked with have no clue what IOPS are. If you saturate your disk, that's likely to cause knock-on effects like increased CPU utilization from IOWAIT, and since "CPU is high" is pretty easy to understand for anyone, the seemingly obvious solution is to get a bigger instance, which depending on the application, may inadvertently solve the problem. For RDBMS, a larger instance means a bigger buffer pool / shared buffers, which means fewer disk reads. Problem solved, even though actually solving the root cause would've cost 1/10th or less the cost of bumping up the entire instance.
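The arithmetic behind the IOPS example is simple once you know to do it. A rough sketch, with illustrative numbers (the 3000-IOPS figure matches a gp3 volume's baseline, but treat the workload numbers as made up):

```python
def disk_utilization(reads_per_s: float, writes_per_s: float,
                     provisioned_iops: float) -> float:
    """Fraction of provisioned IOPS being consumed.

    Values at or above 1.0 mean the disk is saturated: requests queue,
    threads sit in IOWAIT, and CPU graphs look busy even though the
    CPU isn't the bottleneck.
    """
    return (reads_per_s + writes_per_s) / provisioned_iops

# Illustrative: 2400 reads/s + 900 writes/s against a 3000 IOPS volume
util = disk_utilization(reads_per_s=2400, writes_per_s=900,
                        provisioned_iops=3000)
# util is 1.1, i.e. 10% over the limit: provision more IOPS (cheap),
# don't upsize the whole instance (expensive).
```

Reading the per-device numbers out of `iostat -x` and comparing against the provisioned limit is the whole trick; the point of the comment is that cloud-only devs often never learn to look there.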
A small app (or a larger one, for that matter) can quite easily run on infra that's instantiated from canned IaC, like TF AWS Modules [0]. If you can read docs, you should be able to quite trivially get some basic infra up in a day, even with zero prior experience managing it.
Yes, I've used several of these modules myself. They save tons of time! Unfortunately, for legacy projects, I inherited a bunch of code from individuals who built everything "by hand" and then copy-pasted it everywhere. No reusability.
Anecdotal - but I once worked for a company where the product line I built for them after acquisition was delayed by 5 months because that's how long it took to get the hardware ordered and installed in the datacenter. Getting it up on AWS would have been a day's work, maybe two.
Yes, it is death by 1000 cuts. Speccing, negotiating with hardware vendors, data center selection and negotiating, DC engineer/remote hands, managing security cage access, designing your network, network gear, IP address ranges, BGP, secure remote console access, cables, shipping, negotiating with bandwidth providers (multiple, for redundancy), redundant hardware, redundant power sources, UPS. And then you get to plug your server in. Now duplicate other stuff your cloud might provide, like offsite backups, recovery procedures, HA storage, geographic redundancy. And do it again when you outgrow your initial DC. Or build your own DC (power, climate, fire protection, security, fiber, flooring, racks).
Much of this is still required in cloud. Also, I think you're missing the middle ground where 99.99% of companies could happily exist indefinitely: colo. It makes little to no financial or practical sense for most to run their own data centers.
I'm with you there, with stuff like fly.io, there's really no reason to worry about infrastructure.
AWS, on the other hand, seems about as time consuming and hard as using root servers. You're at a higher level of abstraction, but the complexity is about the same I'd say. At least that's my experience.
> On the other hand, a business of just about any size that has any reasonable amount of hosting is better off with their own systems when it comes purely to cost
From a cost PoV, sure, but when you're taking money out of capex it represents a big hit to the cash flow, while taking out twice that amount from opex has a lower impact on the company finances.
There is a whole ecosystem that pushes cloud to ignorant/fresh graduates/developers. Just take a look at the sponsors for all the most popular frameworks. When your system is super complex and depends on the cloud, they make more money. Just look at the PHP ecosystem: Laravel needs 4 times the servers to serve something that a pure PHP system would need. Most projects don't need the cloud. Only around 10% of projects actually need what the cloud provides. But they were able to brainwash a whole generation of developers/managers to think that they do. And so it goes.
I want to see an article like this, but written from a Fortune 500 CTO perspective
It seems like they all abandoned their VMware farms or physical server farms for Azure (they love Microsoft).
Are they actually saving money? Are things faster? How's performance? What was the re-training/hiring like?
In one case I know of, we got rid of our old database greybeards and replaced them with "DevOps" people who knew nothing about performance, etc.
And the developers (and many of the admins) we had knew nothing about hardware or anything so keeping the physical hardware around probably wouldn't have made sense anyways
Complicating this analysis is that computers have still been making exponential improvements in capability as clouds became popular (e.g. disks are 1000-10000x faster than they were 15 years ago), so you'd naturally expect things to become easier to manage over time as you need fewer machines, assuming of course that your developers focus on e.g. learning how to use a database well instead of how to scale to use massive clusters.
That is, even if things became cheaper/faster, they might have been even better without cloud infrastructure.
The one convincing argument from technical people I saw, one that could be made in reply to your comment, is that by now you don't find enough experienced engineers to reliably set up some really big systems. Because so much went to the cloud, a lot of the knowledge is buried there.
That came from technical people who I didn't perceive as being dogmatically pro-cloud.
Yep. I had someone tell me last week that they didn't want a more rigid schema because other teams rely on it, and anything adding "friction" to using it would be poorly received.
As an industry, we are largely trading correctness and performance for convenience, and this is not seen as a negative by most. What kills me is that at every cloud-native place I've worked at, the infra teams were both responsible for maintaining and fixing the infra that product teams demanded, but were not empowered to push back on unreasonable requests or usage patterns. It's usually not until either the limits of vertical scaling are reached, or a SEV0 occurs where these decisions were the root cause does leadership even begin to consider changes.
The thing that frustrates me is it’s possible to know how to do both. I have worked with multiple people who are quite proficient in both areas.
Cloud has definite advantages in some circumstances, but so does self-hosting; moreover, understanding the latter makes the former much, much easier to reason about. It’s silly to limit your career options.
Being good at both is twice the work, because even if some concepts translate well, IME people won't hire someone based on that. "Oh you have experience with deploying RabbitMQ but not AWS SQS? Sorry, we're looking for someone more qualified."
As someone who ran a startup with 100s of hosts: as soon as I start to count the salaries, hiring, desk space, etc. of the people needed to manage the hosts, AWS looks cheap again. Yeah, on hardware costs they are aggressively expensive. But TCO-wise, they're cheap for any decent-sized company.
Add in compliance, auditing, etc. - all things that you can set up out of the box (PCI, HIPAA, lawsuit retention). Gets even cheaper.
There was a time when cloud was significantly cheaper than owning.
I'd expect that there are people who moved to the cloud then, and over time started using services offered by their cloud provider (e.g., load balancers, secret management, databases, storage, backup) instead of running those services themselves on virtual machines, and now even if it would be cheaper to run everything on owned servers they find it would be too much effort to add all those services back to their own servers.
The cloud wasn’t about cheap, it was about fast. If you’re VC funded, time is everything, and developer velocity above all else to hyperscale and exit. That time has passed (ZIRP), and the public cloud margin just doesn’t make sense when you can own and operate (their margin is your opportunity) on prem with similar cloud primitives around storage and compute.
Elasticity is a component, but has always been from a batch job bin packing scheduling perspective, not much new there. Before k8s and Nomad, there was Globus.org.
(Infra/DevOps in a previous life at a unicorn, large worker cluster for a physics experiment prior, etc; what is old is a new again, you’re just riding hype cycle waves from junior to retirement [mainframe->COTS on prem->cloud->on prem cloud, and so on])
Also, by the way, I found it interesting that you framed your side of this disagreement as the technically correct one, but then included this:
> a desire to not centralize the Internet
This is an ideological stance! I happen to share this desire. But you should be aware of your own non-technical - "emotional" - biases when dismissing the arguments of others on the grounds that they are "emotional" and "fanatical".
Only if you’re literally running your own datacenters, which is in no way required for the majority of companies. Colo giants like Equinix already have the infrastructure in place, with a proven track record.
If you enable Multi-AZ for RDS, your bill doubles until you cancel. If you set up two servers in two DCs, your initial bill doubles from the CapEx, and then a very small percentage of your OpEx goes up every month for the hosting. You very, very quickly make this back compared to cloud.
It depends on how deep you want to go. Equinix for one (I'm sure others as well, but I'm most familiar with them) offers managed cross-DC fiber. You will probably need to manage the networking, to be fair, and I will readily admit that's not trivial.
Yep. Cross-region RDBMS is a hard problem, even when you're using a managed service – you practically always have to deal with eventual consistency, or increased latency for writes.
It can be useful. I run a latency sensitive service with global users. A cloud lets me run it in 35 locations dealing with one company only. Most of those locations only have traffic to justify a single, smallish, instance.
In the locations where there's more traffic, and we need more servers, there are more cost effective providers, but there's value in consistency.
Elasticity is nice too, we doubled our instance count for the holidays, and will return to normal in January. And our deployment style starts a whole new cluster, moves traffic, then shuts down the old cluster. If we were on owned hardware, adding extra capacity for the holidays would be trickier, and we'd have to have a more sensible deployment method. And the minimum service deployment size would probably not be a little quad processor box with 2GB ram.
Using cloud for the lower traffic locations and a cost effective service for the high traffic locations would probably save a bunch of money, but add a lot of deployment pain. And a) it's not my decision and b) the cost difference doesn't seem to be quite enough to justify the pain at our traffic levels. But if someone wants to make a much lower margin, much simpler service with lots of locations and good connectivity, be sure to post about it. But, I think the big clouds have an advantage in geographic expansion, because their other businesses can provide capital and justification to build out, and high margins at other locations help cross subsidize new locations when they start.
I agree it can be useful (latency, availability, using off-peak resources), but running globally should be a default and people should opt-in into fine-grained control and responsibility.
From outside it seems that either AWS picked the wrong default to present their customers, or that it's unreasonably expensive and it drives everyone into the in-depth handling to try to keep cloud costs down.
Cloud is more than instances. If all you need is a bunch of boxes, then cloud is a terrible fit.
I use AWS cloud a lot, and almost never use any VMs or instances. Most instances I use are along the lines of a simple anemic box for a bastion host or some such.
I use higher level abstractions (services) to simplify solutions and outsource maintenance of these services to AWS.
In the public sector, cloud solves the procurement problem. You just need to go through the yearlong process once to use a cloud service, instead of for each purchase > 1000€.
> What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue with a writeup like this with silly cloud talking points.
I’m sure I’ll be downvoted to hell for this, but I’m convinced that it’s largely their insecurities being projected.
Running your own hardware isn’t tremendously difficult, as anyone who’s done it can attest, but it does require a much deeper understanding of Linux (and of course, any services which previously would have been XaaS), and that’s a vanishing trait these days. So for someone who may well be quite skilled at K8s administration, serverless (lol) architectures, etc. it probably is seen as an affront to suggest that their skill set is lacking something fundamental.
> So for someone who may well be quite skilled at K8s administration ...
And running your own hardware is not incompatible with Kubernetes: on the contrary. You can fully well have your infra spin up VMs and then do container orchestration if that's your thing.
And part of your hardware monitoring and reporting tooling can work perfectly fine from containers.
Bare metal -> Hypervisor -> VM -> container orchestration -> a container running a "stateless" hardware monitoring service. And VMs themselves are "orchestrated" too. Everything can be automated.
Anyway, say a hard disk begins to show errors? Notifications get sent (email/SMS/Telegram/whatever) by another service in another container, and the dashboard shall show it too (dashboards are cool).
Go to the machine once the spare disk has already been resilvered, move it where the failed disk was, plug in a new disk that becomes the new spare.
Boom, done.
I'm not saying all self-hosted hardware should do container orchestration: there are valid use cases for bare metal too.
But something has to be said about controlling everything on your own infra: from the bare metal to the VMs to container orchestration. To even, potentially, your own IP address space.
This is all within reach of an individual, both skill-wise and price-wise (including obtaining your own IP address space). People who drank the cloud kool-aid should ponder this and wonder how good their skills truly are if they cannot get this up and working.
Fully agree. And if you want to take it to the next level (and have a large budget), Oxide [0] seems to have neatly packaged this into a single coherent product. They don't quite have K8s fully running, last I checked, but there are of course other container orchestration systems.
> Go to the machine once the spare disk has already been resilvered
> And running your own hardware is not incompatible with Kubernetes: on the contrary
Kubernetes actually makes so much more sense on bare-metal hardware.
On the cloud, I think the value prop is dubious - your cloud provider is already giving you VMs, why would you need to subdivide them further and add yet another layer of orchestration?
Not to mention that you're getting 2010s-era performance on those VMs, so subdividing them is terrible from a performance point of view too.
> All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding, but serve to introduce doubt to non-technical people and to trick people who don't examine what they're told.
This feels like "no true scotsman" to me. I've been building software for close to two decades, but I guess I don't have "any real technical understanding" because I think there's a compelling case for using "cloud" services for many (honestly I would say most) businesses.
Nobody is "afraid to openly discuss how cloud isn't right for many things". This is extremely commonly discussed. We're discussing it right now! I truly cannot stand this modern innovation in discourse of yelling "nobody can talk about XYZ thing!" while noisily talking about XYZ thing on the lowest-friction publishing platforms ever devised by humanity. Nobody is afraid to talk about your thing! People just disagree with you about it! That's ok, differing opinions are normal!
Your comment focuses a lot on cost. But that's just not really what this is all about. Everyone knows that on a long enough timescale with a relatively stable business, the total cost of having your own infrastructure is usually lower than cloud hosting.
But cost is simply not the only thing businesses care about. Many businesses, especially new ones, care more about time to market and flexibility. Questions like "how many servers do we need? with what specs? and where should we put them?" are a giant distraction for a startup, or even for a new product inside a mature firm.
Cloud providers provide the service of "don't worry about all that, figure it out after you have customers and know what you actually need".
It is also true that this (purposefully) creates lock-in that is expensive either to leave in place or unwind later, and it definitely behooves every company to keep that in mind when making architecture decisions, but lots of products never make it to that point, and very few of those teams regret the time they didn't spend building up their own infrastructure in order to save money later.
The problem with your claims here is they can only be right if the entire industry is experiencing mass psychosis. I reject a theory that requires that, because my ego just isn't that large.
I once worked for several years at a publicly traded firm well-known for their return-to-on-prem stance, and honestly it was a complete disaster. The first-party hardware designs didn't work right because they didn't have the hardware design staffing levels to have de-risked the possibility that AMD would fumble the performance of Zen 1, leaving them with a generation of useless hardware they nonetheless paid for. The OEM hardware didn't work right because they didn't have the chops to qualify it either, leaving them scratching their heads for months over a cohort of servers they eventually discovered were contaminated with metal chips. And, most crucially, for all the years I worked there, the only thing they wanted to accomplish was failover from West Coast to East Coast, which never worked, not even once. When I left that company they were negotiating with the data center owner, who wanted to triple the rent.
These experiences tell me that cloud skeptics are sometimes missing a few terms in their equations.
"Vendor problems" is a red herring, IMO; you can have those in the cloud, too.
It's been my experience that those who can build good, reliable, high-quality systems, can do so either in the cloud or on-prem, generally with equal ability. It's just another platform to such people, and they will use it appropriately and as needed.
Those who can only make it work in the cloud are either building very simple systems (which is one place where the cloud can be appropriate), or are building a house of cards that will eventually collapse (or just cost them obscene amounts of money to keep on life support).
Engineering is engineering. Not everyone in the business does it, unfortunately.
Like everything, the cloud has its place -- but don't underestimate the number of decisions that get taken out of the hands of technical people by the business people who went golfing with their buddy yesterday. He just switched to Azure, and it made his accountants really happy!
The whole CapEx vs. OpEx issue drives me batty; it's the number one cause of cloud migrations in my career. For someone who feels like spent money should count as spent money regardless of the bucket it comes out of, this twists my brain in knots.
> or are building a house of cards that will eventually collapse (or just cost them obscene amounts of money to keep on life support)
Ding ding ding. It's this.
> The whole CapEx vs. OpEx issue drives me batty
Seconded. I can't help but feel like it's not just a "I don't understand money" thing, but more of a "the way Wall Street assigns value is fundamentally broken." Spending $100K now, once, vs. spending $25K/month indefinitely does not take a genius to figure out.
It's all about painting the right picture for your investors, so you make up shit and classify it as COGS or opex depending on what is most beneficial for you in the moment.
There's however a middle-ground between run your own colocated hardware and cloud. It's called "dedicated" servers and many hosting providers (from budget bottom-of-the-barrel to "contact us" pricing) offer it.
They take on the liability of sourcing, managing, and maintaining the hardware for a flat monthly fee, and assume the associated risk. If they make a bad bet purchasing hardware, you won't be on the hook for it.
This seems like a point many pro-cloud people (intentionally?) overlook.
> All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding ...
And moreover, most of the actually interesting things, like having VM templates and stateless containers, orchestration, etc., are very easy to run yourself and get you 99.9% of the benefits of the cloud.
Just about any and every service is available as a container file already written for you. And if one doesn't exist, it's not hard to plumb up.
A friend of mine runs more than 700 containers (yup, seven hundred), split between his own rack at home (half of them) and dedicated servers (he runs stuff like FlightRadar, AI models, etc.). He'll soon get his own IP address space. Complete "chaos monkey"-ready infra where you can cut any cable and the thing shall keep working: everything is duplicated, can be spun up on demand, etc. Someone could steal his entire rack and all his dedicated servers, and he'd still be back operational in no time.
If an individual can do that, a company, no matter its size, can do it too. And arguably 99.9% of all the companies out there don't need infra as powerful as what most homelab enthusiasts have.
And another thing: there are even two in-betweens between "cloud" and "our own hardware located at our company". The first is colocating your own hardware in a datacenter. The second is renting dedicated servers from a datacenter.
They're often ready to accept cloud-init directly.
And it's not hard. I'd say learning to configure hypervisors on bare metal, then spinning up VMs from templates, then running containers inside the VMs is actually much easier than learning all the idiosyncrasies of all the different cloud vendors' APIs and whatnot.
Funnily enough when the pendulum swung way too far on the "cloud all the things" side, those saying at some point we'd read story about repatriation were being made fun of.
> If an individual can do that, a company, no matter its size, can do it too.
Fully agreed. I don't have physical HA – if someone stole my rack, I would be SOL – but I can easily ride out a power outage for as long as I'm willing to haul cans of gasoline to my house. The rack's UPS can keep it up at full load for at least 30 minutes, and I can get my generator running and hooked up in under 10. I've done it multiple times. I can lose a single server without issue. My only SPOF is internet, and that's only by choice, since I can get both AT&T and Spectrum here, and my router supports dual-WAN with auto-failover.
> And arguably 99.9% of all the companies out there don't have the need for an infra as powerful as the one most homelab enthusiast have.
THIS. So many people have no idea how tremendously fast computers are, and how much of an impact latency has on speed. I've benchmarked my 12-year old Dells against the newest and shiniest RDS and Aurora instances on both MySQL and Postgres, and the only ones that kept up were the ones with local NVMe disks. Mine don't even technically have _local_ disks; they're NVMe via Ceph over Infiniband.
Does that scale? Of course not; as soon as you want geo-redundant, consistent writes, you _will_ have additional latency. But most smaller and medium companies don't _need_ that.
Such an awesome article. I like how they didn't just go with the Cloud wave but kept sysadmin'ing, like ol' Unix graybeards. Two interesting things they wrote about their SSDs:
1) "At this rate, we’ll replace these [SSD] drives due to increased drive sizes, or entirely new physical drive formats (such E3.S which appears to finally be gaining traction) long before they get close to their rated write capacity."
and
2) "We’ve also anecdotally found SSDs just to be much more reliable compared to HDDs (..) easily less than one tenth the failure rate we used to have with HDDs."
To avoid sysadmin tasks, and keep costs down, you've got to go so deep in the cloud, that it becomes just another arcane skill set. I run most of my stuff on virtual Linux servers, but some on AWS, and that's hard to learn, and doesn't transfer to GCP or Azure. Unless your needs are extreme, I think sysadmin'ing is the easier route in most cases.
For so many things the cloud isn't really easier or cheaper, and most cloud providers stopped advertising it as such. My assumption is that cloud adoption is mainly driven by 3 forces:
- for small companies: free credits
- for large companies: moving prices as far away as possible from the deploy button, allowing dev and IT to just deploy stuff without purchase orders
- self-perpetuating due to hype, cv-driven development, and ease of hiring
All of these are decent reasons, but none of them may apply to a company like fastmail
Also CYA. If you run your own servers and something goes wrong, it's your fault. If it's an outage at AWS, it's their fault.
Also a huge element of follow the crowd, branding non-technical management are familiar with, and so on. I have also found some developers (front end devs, or back end devs who do not have sysadmin skills) feel cloud is the safe choice. This is very common for small companies as they may have limited sysadmin skills (people who know how to keep windows desktops running are not likely to be who you want to deploy servers) and a web GUI looks a lot easier to learn.
There are other, if often at least tangentially related, reasons but more than I can give justice to in a comment.
Many people largely got a lot of things wrong about cloud that I've been meaning to write about for a while. I'll get to it after the holidays. But probably none more than the idea that massive centralized computing (which was wrongly characterized as a utility like the electric grid) would have economics with which more local computing options could never compete.
In small companies, cloud also provides the ability to work around technical debt and to reduce risk.
For example, I have seen several cases where a poorly designed system unexpectedly used too much memory and there was no time to fix it, so the company increased the memory on all instances with a few clicks. When you need to do this immediately to avoid a botched release that has already been called "successful" and announced as such to stakeholders, that is a capability that saves the day.
An example of de-risking is using a cloud filesystem like EFS to provide a pseudo-infinite volume. No risk of an outage due to an unexpectedly full disk.
Another example would be using a managed database system like RDS vs self-managing the same RDBMS: using the managed version saves on labor and reduces risk for things like upgrades. What would ordinarily be a significant effort for a small company becomes automatic, and RDS includes various sanity checks to help prevent you from making mistakes.
The reality of the industry is that many companies are just trying to hit the next milestone of their business by a deadline, and the cloud can help despite the downsides.
> For example, I have seen several cases where a poorly designed system unexpectedly used too much memory
> using a managed database system like RDS vs self-managing the same RDBMS: using the managed version saves on labor
As a DBRE / SRE, I can confidently assert that belief in the latter is often directly responsible for the former. AWS is quite clear in their shared responsibility model [0] that you are still responsible for making sound decisions, tuning various configurations, etc. Having staff that knows how to do these things often prevents the poor decisions from being made in the first place.
I'm very interested in approaches that avoid cloud, so please don't read this as me saying cloud is superior. I can think of some other advantages of cloud:
- easy to setup different permissions for users (authorisation considerations).
- able to transfer assets to another owner (e.g., if there's a sale of a business) without needing to move physical hardware.
- other outsiders (consultants, auditors, whatever) can come in and verify the security (or other) of your setup, because it's using a standard well known cloud platform.
It never disappeared in some places. In my region there's been zero interest in "the cloud" because of physical remoteness from all major GCP/AWS/Azure datacenters (resulting in high latency), for compliance reasons, and because it's easier and faster to solve problems by dealing with a local company than pleading with a global giant that gives zero shits about you because you're less than a rounding error in its books.
The fact that Fastmail work like this, are transparent about what they're up to and how they're storing my email and the fact that they're making logical decisions and have been doing so for quite a long time is exactly the reason I practically trip over myself to pay them for my email. Big fan of Fastmail.
Aside: Fastmail was the best email provider I ever used. The interface was intuitive and responsive, both on mobile and web. They have extensive documentation for everything. I was able to set up a custom domain and a catch-all email address in a few minutes. Customer support is great, too. I emailed them about an issue and they responded within the hour (turns out it was my fault). I feel like it's a really mature product/company and they really know what they're doing, and have a plan for where they're going.
I ended up switching to Protonmail, because of privacy (Fastmail is within the Five Eyes (Australia)), which is the only thing I really like about Protonmail. But I'm considering switching back to Fastmail, because I liked it so much.
I was told Fastmail is excellent, and I am not a big fan of Gmail. Once locked out of Gmail for good, your email, and the apps associated with it, are gone forever. Source? Personal experience.
"A private inbox $60 for 12 months". I assume it is USD, not AU$ (AFAIK, Fastmail is based in Australia.) Still pricey.
At https://www.infomaniak.com/ I can buy email service for an (in my case external) domain for 18 Euro a year and I get 5 inboxes. And it is based in Switzerland, so no EU or US jurisdiction.
I have a few websites and Fastmail would just be prohibitively expensive for me.
I have seen a common sentiment that self-hosting is almost always better than cloud. What these discussions do not mention is how to effectively run your business applications on this infrastructure.
Things like identity management (AAD/IAM), provisioning and running VMs, and deployments. The network side of things like VNets, DNS, securely opening ports, etc. Monitoring setup across the stack. There is so much functionality required to safely expose an application externally that I can't even coherently list it all here. Are people just using SaaS for everything (which I think would defeat the purpose of on-prem infra), or can a competent sysadmin handle all this to give a cloud-like experience to end developers?
Can someone share their experience or share any write ups on this topic?
For more context, I briefly worked at a very large hedge fund which had a small DC's worth of VERY beefy machines but absolutely no platform on top of them. Hosting an application was done by copying the binaries onto a particular well-known machine, running npm commands, and restarting nginx. You'd log a ticket with the sysadmin to reserve an IP and point an internal DNS entry at this machine (no load balancer). Deployment was a shell script which rcp'd new binaries and restarted nginx. No monitoring or observability stack.
There was a script which would log you into a random machine to run your workloads (be ready to get angry IMs from more senior quants running their workloads on that machine if your development build takes up enough resources to affect their work). I can go on and on, but I think you get the idea.
Do you mean for administrative access to the machines (over SSH, etc) or for "normal" access to the hosted applications?
Admin access: Ansible-managed set of UNIX users & associated SSH public keys, combined with remote logging so every access is audited and a malicious operator wiping the machine can't cover their tracks will generally get you pretty far. Beyond that, there are commercial solutions like Teleport which provide integration with an IdP, management web UI, session logging & replay, etc.
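The Ansible side of that can be a couple of tasks. A minimal sketch, assuming an `admin_users` list defined in group_vars (names and structure here are illustrative, not from the comment above):

```yaml
# Illustrative Ansible tasks: create admin UNIX users and install their SSH keys.
# The admin_users list would live in group_vars, kept in git for auditability.
- name: Create admin UNIX users
  ansible.builtin.user:
    name: "{{ item.name }}"
    groups: sudo
    append: true
    shell: /bin/bash
  loop: "{{ admin_users }}"

- name: Install each admin's SSH public key
  ansible.posix.authorized_key:
    user: "{{ item.name }}"
    key: "{{ item.ssh_public_key }}"
    # exclusive removes any key not in the list, so offboarding
    # is just deleting the entry and re-running the playbook
    exclusive: true
  loop: "{{ admin_users }}"
```

With `exclusive: true`, onboarding and offboarding both become a git commit plus a playbook run.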
Normal line-of-business access: this would be managed by whatever application you're running, not much different to the cloud. But if your application isn't auth-aware or is unsafe to expose to the wider internet, you can stick it behind various auth proxies such as Pomerium - it will effectively handle auth against an IdP and only pass through traffic to the underlying app once the user is authenticated. This is also useful for isolating potentially vulnerable apps.
> provisioning and running VMs
Provisioning: once a VM (or even a physical server) is up and running enough to be SSH'd into, you should have a configuration management tool (Ansible, etc) apply whatever configuration you want. This would generally involve provisioning users, disabling some stupid defaults (SSH password authentication, etc), installing required packages, etc.
To get a VM to an SSH'able state in the first place, you can configure your hypervisor to pass through "user data" which will be picked up by something like cloud-init (integrated by most distros) and interpreted at first boot - this allows you to do things like include an initial SSH key, create a user, etc.
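A minimal cloud-init user-data sketch along those lines (the user name, key, and package list are placeholders):

```yaml
#cloud-config
# First-boot provisioning: create an initial user with an SSH key,
# so configuration management can take over from there.
users:
  - name: ops
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... ops@example.com
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
ssh_pwauth: false   # disable SSH password authentication from the start
package_update: true
packages:
  - python3         # needed on the target for Ansible modules to run
```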
To run VMs on self-managed hardware: libvirt, proxmox in the Linux world. bhyve in the BSD world. Unfortunately most of these have rough edges, so commercial solutions there are worth exploring. Alternatively, consider if you actually need VMs or if things like containers (which have much nicer tooling and a better performance profile) would fit your use-case.
> deployments
Depends on your application. But let's assume it can fit in a container - there's nothing wrong with a systemd service that just reads a container image reference in /etc/... and uses `docker run` to run it. Your deployment task can just SSH into the server, update that reference in /etc/ and bounce the service. Evaluate Kamal which is a slightly fancier version of the above. Need more? Explore cluster managers like Hashicorp Nomad or even Kubernetes.
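A sketch of that systemd approach, assuming the image reference is written to a hypothetical /etc/myapp/image file (service name, ports, and paths are illustrative):

```ini
# /etc/systemd/system/myapp.service
[Unit]
Description=myapp container
After=network-online.target docker.service
Requires=docker.service

[Service]
# /etc/myapp/image contains e.g. IMAGE=registry.example.com/myapp:abc123
EnvironmentFile=/etc/myapp/image
# clean up any leftover container from a previous run ("-" ignores failure)
ExecStartPre=-/usr/bin/docker rm -f myapp
ExecStart=/usr/bin/docker run --rm --name myapp -p 127.0.0.1:8080:8080 ${IMAGE}
Restart=always

[Install]
WantedBy=multi-user.target
```

Deploying is then just SSH'ing in, rewriting /etc/myapp/image with the new tag, and running `systemctl restart myapp`.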
> Network side of things like VNet
Wireguard tunnels set up (by your config management tool) between your machines, which will appear as standard network interfaces with their own (typically non-publicly-routable) IP addresses, and anything sent over them will transparently be encrypted.
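One node's config might look like this sketch (keys, hostnames, and addresses are placeholders that your config management tool would template per host):

```ini
# /etc/wireguard/wg0.conf - private mesh interface for this node
[Interface]
# non-publicly-routable mesh address of this node
Address = 10.0.0.1/24
PrivateKey = <this-node-private-key>
ListenPort = 51820

# one [Peer] block per other machine in the mesh
[Peer]
PublicKey = <peer-public-key>
AllowedIPs = 10.0.0.2/32
Endpoint = peer1.example.com:51820
PersistentKeepalive = 25
```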
> DNS
Generally very little reason not to outsource that to a cloud provider or even your (reputable!) domain registrar. DNS is mostly static data though, which also means if you do need to do it in-house for whatever reason, it's just a matter of getting a CoreDNS/etc container running on multiple machines (maybe even distributed across the world). But really, there's no reason not to outsource that and hosted offerings are super cheap - so go open an AWS account and configure Route53.
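If you do bring it in-house, the CoreDNS config is correspondingly small. A sketch, with the zone name and file path illustrative:

```
# Corefile: serve example.com from a static zone file
example.com {
    file /etc/coredns/db.example.com
    log
}
```

Run the same container image with the same Corefile on a few machines and list them all as NS records.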
> securely opening ports
To begin with, you shouldn't have anything listening that you don't want to be accessible. Then it's not a matter of "opening" or closing ports - the only ports that actually listen are the ones you want open by definition because it's your application listening for outside traffic. But you can configure iptables/nftables as a second layer of defense, in case you accidentally start something that unexpectedly exposes some control socket you're not aware of.
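A sketch of that second layer in nftables (the ports here are illustrative; keep only what your app actually serves):

```
# /etc/nftables.conf - default-deny inbound, allow only what's intended
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif lo accept
    tcp dport { 22, 80, 443 } accept   # SSH plus the app's public ports
    udp dport 51820 accept             # WireGuard, if used
    icmp type echo-request accept
  }
}
```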
> Monitoring setup across the stack
collectd running on each machine (deployed by your configuration management tool) sending metrics to a central machine. That machine runs Grafana/etc. You can also explore "modern" stuff that the cool kids play with nowadays like VictoriaMetrics, etc, but metrics is mostly a solved problem so there's nothing wrong with using old tools if they work and fit your needs.
For logs, configure rsyslogd to log to a central machine - on that one, you can have log rotation. Or look into an ELK stack. Or use a hosted service - again nothing prevents you from picking the best of cloud and bare-metal, it's not one or the other.
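The rsyslog side of that is only a few lines. A sketch, with the central host name being a placeholder:

```
# /etc/rsyslog.d/50-forward.conf on every machine:
# forward everything to the central log host over TCP (@@ = TCP, @ = UDP)
*.*  @@logs.internal.example.com:514

# on the central machine only, enable the TCP listener:
module(load="imtcp")
input(type="imtcp" port="514")
```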
> safely expose an application externally
There's a lot of snake oil and fear-mongering around this. First off, you need to differentiate between vulnerabilities of your application and vulnerabilities of the underlying infrastructure/host system/etc.
App vulnerabilities, in your code or dependencies: cloud won't save you. It runs your application just like it's been told. If your app has an SQL injection vuln or one of your dependencies has an RCE, you're screwed either way. To manage this you'd do the same as you do in cloud - code reviews, pentesting, monitoring & keeping dependencies up to date, etc.
Infrastructure-level vulnerabilities: cloud providers are responsible for keeping the host OS and their provided services (load balancers, etc) up to date and secure. You can do the same. Some distros provide unattended updates, which your config management tool can enable. Stuff that doesn't need to be reachable from the internet shouldn't be (bind internal stuff to your Wireguard interfaces). Put admin stuff behind some strong auth - TLS client certificates are the gold standard but have management overheads. Otherwise, use an IdP-aware proxy (like mentioned above). Don't always trust app-level auth. Beyond that, it's the usual - common sense, monitoring for "spooky action at a distance", and luck. Not too much different from your cloud provider, because they won't compensate you either if they do get hacked.
> For more context, I worked at a very large hedge fund briefly which had a small DC worth of VERY beefy machines but absolutely no platform on top of it...
No, using Ansible to distribute public keys does not get you very far. It's fine for a personal project or even a team of 5-6 with a handful of machines, but beyond that you really need a better way to onboard, offboard, and modify accounts. If you're doing anything but a toy project, you're better off starting with something like IPA for host access controls.
Why do you think that? I did something similar at a previous job for something bordering on 1k employees.
User administration was done by modifying a yaml file in git. Nothing bad to say about it really. It sure beats point-and-click Active Directory any day of the week. Commit log handy for audits.
If there are no externalities demanding anything else, I'd happily do it again.
What's the risk you're trying to protect against, that a "better" (which one?) way would mitigate that this one wouldn't?
> IPA
Do you mean https://en.wikipedia.org/wiki/FreeIPA ? That seems like a huge amalgamation of complexity in a non-memory-safe language that I feel would introduce a much bigger security liability than the problem it's trying to solve.
I'd rather pony up the money and use Teleport at that point.
> which are technologies old and reliable as dirt.
Technologies, sure. Implementations? Not so much.
I can trust OpenSSH because it's deployed everywhere and I can be confident all the low-hanging fruits are gone by now, and if not, its widespreadness means I'm unlikely to be the most interesting target, so I am more likely to escape a potential zero-day unscathed.
What's the market share of IPA in comparison? Has it seen any meaningful action in the last decade, and the same attention, from both white-hats (audits, pentesting, etc) and black-hats (trying to break into every exposed service)? I very much doubt it, so the safe thing to assume is that it's nowhere near as bulletproof as OpenSSH and that it's more likely for a dedicated attacker to find a vuln there.
Love this article, and I'm also running some stuff on old enterprise servers in some racks somewhere. Over the last year I've had to dive into Azure as we have customers using it (we're a B2B company), and I finally understood why everyone is doing cloud despite the price:
Global permissions, seamless organization and IaC. If you are Fastmail or a small startup - go buy some used Dell PowerEdges with Epycs in some colo rack with 10GbE transit and save tons of money.
If you are a company with tons of customers and tons of requirements, it's powerful to put each concern into a landing zone, run some Bicep/Terraform, have a resource group to control costs, get savings on overall core count, and be done with it.
Assign permissions in a namespace for your employee or customer, have some back and forth about requirements, and it's done. No need to sysadmin across servers. No need to check for broken disks.
I also blame the hell of VMware and virtual machines for everything that is a PITA to maintain as a sysadmin but is loved because it's common knowledge. I would only do k8s on bare metal today and skip the whole virtualization thing completely. I guess it's also these pains that are softened in the cloud.
Because the default for companies today is cloud, even though it almost never makes sense. Sure, if you have really spiky load, need to dynamically scale at any point, and don't care about your spend, it might make sense.
I've even worked in companies where the engineering team spent effort and time building "scalable infrastructure" before the product itself had even found product-market fit...
Nobody said it's surprising though, they are well aware of it having done it for more than two decades. Many newcomers are not aware of it though, as their default is "cloud" and they never even shopped for servers, colocation or looked around on the dedicated server market.
I don't think it's just that they're not aware. Purely from a scaling and distribution perspective, it can be wiser to start in the cloud while you're still in the product-market-fit phase. Also, bare metal requires more on the capex end, and with how our corporate tax system is set up, it's discouraging to go down this lane first; it'd be better to spend on acquiring clients.
Also, I'd guess a lot of technical founders are more familiar with cloud/server-side work than with handling or delegating sysadmin tasks that might require adding members to the team.
I agree, the cloud definitely has a lot of use cases and when you are building more complicated systems it makes sense to just have to do a few clicks to get a new stack setup vs. having someone evaluate solutions and getting familiar with operating them on a deep level (backups etc.).
Would be interesting to know how files get stored. They don't mention any distributed FS solutions like SeaweedFS, so once a drive is full, does the file get sent to another one via some service? Also, ZFS seems an odd choice since deletions (especially of small files) at 80%+ full are crazy slow.
Unlike ext4, which locks the directory when unlinking, ZFS is able to scale on parallel unlinking. Specifically, ZFS has range locks that permit directory entries to be removed in parallel from the extendible hash trees that store them. While this is relatively slow for sequential workloads, it is fast on parallel workloads. If you want to delete a large directory subtree fast on ZFS, do the rm operations in parallel. For example, this will run faster on ZFS than a naive rm operation:
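One way to do that, sketched here with find and xargs (GNU parallel works similarly; the directory tree below is a throwaway example):

```shell
# make a throwaway tree: 8 subdirectories of 100 files each (paths illustrative)
demo=/tmp/zfs_rm_demo
mkdir -p "$demo"
for i in $(seq 1 8); do
  mkdir -p "$demo/sub$i"
  for j in $(seq 1 100); do : > "$demo/sub$i/f$j"; done
done

# parallel unlink: one rm -r per top-level entry, 8 at a time,
# instead of a single sequential rm -r
find "$demo" -mindepth 1 -maxdepth 1 -print0 | xargs -0 -P 8 -n 1 rm -r
```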
A friend had this issue on spinning disks the other day. I suggested he do this and the remaining files were gone in seconds when at the rate his naive rm was running, it should have taken minutes. It is a shame that rm does not implement a parallel unlink option internally (e.g. -j), which would be even faster, since it would eliminate the execve overhead and likely would eliminate some directory lookup overhead too, versus using find and parallel to run many rm processes.
For something like Fastmail, which has many users, unlinking should be parallel already, so unlinking on ZFS will not be slow for them.
By the way, that 80% figure has not been true for more than a decade. You are referring to the best fit allocator being used to minimize external fragmentation under low space conditions. The new figure is 96%. It is controlled by metaslab_df_free_pct in metaslab.c.
Modification operations become slow when you are at/above 96% space filled, but that is to prevent even worse problems from happening. Note that my friend’s pool was below the 96% threshold when he was suffering from a slow rm -r. He just had a directory subtree with a large amount of directory entries he wanted to remove.
For what it is worth, I am the ryao listed here and I was around when the 80% to 96% change was made.
I discovered this yesterday! Blew my mind. I had to check 3 times that the files were actually gone and that I specified the correct directory as I couldn't believe how quick it ran. Super cool
If you don't have high bandwidth requirements, like for background / batch processing, the OVH Eco family [1] of bare metal servers is incredibly cheap.
> Fastmail has some of the best uptime in the business, plus a comprehensive multi data center backup system. It starts with real-time replication to geographically dispersed data centers, with additional daily backups and checksummed copies of everything. Redundant mirrors allow us to failover a server or even entire rack in the case of hardware failure, keeping your mail running.
I absolutely love Fastmail. I moved off of Gmail years ago with zero regrets. Better UI, better apps, better company, and need I say better service? I still maintain and fetch from a Gmail account so it all just works seamlessly for receiving and sending Gmail, so you don’t have to give anything up either.
I moved from my own colocated 1U running Mailcow to Fastmail and don't regret it one bit. This was an interesting read, glad to see they think things through nice and carefully.
The only things I wish FM had are all software:
1. A takeout-style API to let me grab a complete snapshot once a week with one call
I use Fastmail for my personal mail, and I don’t regret it, but I’m not quite as sold as you are, I guess maybe because I still have a few Google work accounts I need to use. Spam filtering in Fastmail is a little worse, and the search is _terrible_. The iOS app is usable but buggy. The easy masked emails are a big win though, and setting up new domains feels like less of a hassle with FM. I don’t regret using Fastmail, and I’d use them again for my personal email, but it doesn’t feel like a slam dunk.
I take that back; this is (to me) the most interesting part:
"Although we’ve only ever used datacenter class SSDs and HDDs, failures and replacements every few weeks were a regular occurrence on the old fleet of servers. Over the last 3+ years, we’ve only seen a couple of SSD failures in total across the entire upgraded fleet of servers. This is easily less than one tenth the failure rate we used to have with HDDs."
I am working on a personal project (some would call it a startup, but I have no intention of getting external financing and other Americanisms) where I have set up my own CDN and video encoding, among other things. These days, whenever you have a problem, everyone answers "just use cloud", and that results in people really knowing nothing any more. It is saddening. But on the other hand, it ensures all my decades of knowledge will be very well paid in the future, if I ever need to get a job.
I'm a little surprised it seems they didn't have some existing compression solution before moving to zfs. With so much repetitive text across emails I would think there would be a LOT to gain, such as from dictionaries, compressing many emails into bigger blobs, and fine-tuning compression options.
Better an asshole than a moron in my opinion. If he maneuvers his cursor over the link he can click it and transform from “very confused” to “unconfused” provided he can comprehend English - which is admittedly in question.
> So after the success of our initial testing, we decided to go all in on ZFS for all our large data storage needs. We’ve now been using ZFS for all our email servers for over 3 years and have been very happy with it. We’ve also moved over all our database, log and backup servers to using ZFS on NVMe SSDs as well with equally good results.
If you're looking at ZFS on NVMe you may want to look at Alan Jude's talk on the topic, "Scaling ZFS for the future", from the 2024 OpenZFS User and Developer Summit:
gmail does spam filtering very well for me. fastmail, on the other hand, puts lots of legit emails into the spam folder. manually marking them "not spam" doesn't help
If I look at my Gmail SPAM folder, there is very rarely something genuinely important in it. What there is a fair bit of, though, is random newsletters and announcements that I may have signed up for in some way, shape or form that I don't really care about or generally look at. I assume they've been reported as SPAM by enough people, rather than simply unsubscribed from, that Google now labels them as such.
The cloud providers really kill you on IO for your VMs. Even if 'remote' SSDs are available with configurable ($$) IOPs/bandwidth limits, the size of your VM usually dictates a pitiful max IO/BW limit. In Azure, something like a 4-core 16GB RAM VM will be limited to 150MB/s across all attached disks. For most hosting tasks, you're going to hit that limit far before you max out '4 cores' of a modern CPU or 16GB of RAM.
On the other hand, if you buy a server from Dell and run your own hypervisor, you get a massive reserve of IO, especially with modern SSDs. Sure, you have to share it between your VMs, but you own all of the IO of the hardware, not some pathetic slice of it like in the cloud.
As is always said in these discussions, unless you're able to move your workload to PaaS offerings in the cloud (serverless), you're not taking advantage of what large public clouds are good at.
Biggest issue isn't even sequential speed but latency. In the cloud all persistent storage is networked and has significantly more latency than direct-attached disks. This is a physical (speed of light) limit, you can't pay your way out of it, or throw more CPU at it. This has a huge impact for certain workloads like relational databases.
And then come the weird aspects of bad cloud service providers like IONOS: broken OS images; a provisioning API that is a bottleneck, where what other people do and how much they do can slow down your own provisioning, and creating network interfaces can take minutes, with customer service saying "That's how it is, cannot change it."; and a very shitty web user interface that desperately tries to be a single-page app yet has all the default browser functionality, like the back button, broken. Yet they still cost literally 10x what Hetzner Cloud costs, while Hetzner basically does everything better.
And then it is still also about other people's hardware in addition to that.
Yeah, Cloud is a bit of a scam innit? Oxide is looking more and more attractive every day as the industry corrects itself from overspending on capabilities they would never need.
Fake news.
I've got my bare metal server deployed and installed with my ansible playbook even before you manage to log into the bazillion layers of abstraction that is AWS.
A company hosting an online service seems to think it deserves a medal for discovering that S3 buckets from a cloud provider are crap and cost a fortune.
The heading in this space makes you think they're running custom FPGAs, such as with Gmail, not just running on metal... As for drive failures, welcome to storage at scale. Build your solution so that replacing 10 disks at a time is a weekly task, not a critical incident at 2am when a single disk dies...
Storing/Accessing tonnes of <4kB files is difficult, but other providers are doing this on their own metal with CEPH at the PB scale.
I love ZFS; it's great with per-disk redundancy, but CEPH is really the only game in town for inter-rack/DC resilience, which I would hope my email provider has.
A mail-cloud provider uses its own hardware? Well, that’s to be expected, it would be a refreshing article if it was written by one of their customers.
But what about the cost and complexity of a room with the racks and the cooling needs of running these machines? And the uninterrupted power setup? The wiring mess behind the racks.
Yes they have, and they feel they deserve credit for discovering a WiFi cable is more reliable than the new shiny kit that was sold to them by a vendor...
There is a very competitive market for colo providers in basically every major metropolitan area in the US, Europe, and Asia. The racks, power, cooling, and network to your machines is generally very robust and clearly documented on how to connect. Deploying servers in house or in a colo is a well understood process with many experts who can help if you don’t have these skills.
Colo offers the ability to ship and deploy and keep latencies down if you're global, but if you're local yes you should just get someone on site and the modern equivalent of a T1 line setup to your premises if you're running "online" services.
Since I moved from gmail to fastmail, my mailbox is full of spam. I tried setting up rules but there are just too many of them, so I abandoned that strategy after a month. Now I just label mail from senders that are not in my contacts differently. But it's still a mess. I'm at the point that I prefer WhatsApp over email.
So, Fastmail please fix this or tell me what I'm doing wrong. IMHO when uninteresting mail arrives it should take at most two clicks to install a new rule and apply it.
Your comment is confusing because you start this one saying your inbox is full of spam, but respond to a suggestion to mark it as spam by saying it's not actually spam.
If something is not spam but you want it out of your inbox there's a few options:
- click Unsubscribe next to the sender. This should be possible for essentially all promotional email.
- click Actions -> click Block <sender>. Messages from this address will now immediately go to trash.
- click Actions -> click Add rule from message (-> optionally change the suggested conditions) -> check Archive (or if you don't use labels click Move to) -> click Save. Messages matching the conditions will now skip your inbox.
There's not much they could do to make that easier without magically knowing what you care about and what you don't.
This last week, gmail failed to filter as spam an email with subject "#T Anitra", body,
> oF1 d 4440 - 2 B 32677 83
> R Teri E x E q
>
> k 50347733 Safoorabegum
and an attachment "7330757559.pdf". It let through 8 similar emails in the same week, and many more even more egregiously gibberish emails over the years. I'm not pleased with the quality of gmail's spam filter.
I moved to FastMail three years ago, and, for a contrasting experience, found that spam filtering was almost on a par with Gmail. I had feared it would be otherwise.
Fastmail has wildcard email support, so it’s pretty easy to have an email per purchase you make (for example). This makes it easy to see who leaked your email to spammers. Anyway, I have nowhere near the volume of spam with Fastmail that I had with Gmail.
Never had that after the first few years, but I hear other people do have that. Maybe it's because I used it for 2 decades now? I tried alternatives including fastmail but I always leave them because I get swamped by spam while gmail works fine.
I don't want to report everything as spam. For example, promotional emails from businesses that I bought something from. I don't want to punish those businesses; and those emails might contain vouchers that I could use later. But I want those emails moved out of the way without any action from my side.
That's like Spotify telling me to "keep disliking" when I complained to them about why songs in a certain language (which I never liked or listened to, and certainly don't speak) keep filling the home screen, after I had told them in the first complaint that I had been doing that for months.
If you meet someone new at a social event and give them your email address, where do you want your email provider to put the message that this person sent?
I get no spam on Fastmail. I assume this is because I never give out my email to anyone, and I create new addresses for every interaction. This way I keep track of who I'm interacting with and also who's selling my alias emails.
Just wish there was a decent way to do this with mobile numbers!
Same, I religiously create a masked email for every website (just checked, it's now at 163!). I simply don't give my "main" email out.
Oddly enough, simply unsubscribing via the websites themselves has kept things clean; I've yet to notice any true spam from a random source aimed at any of my emails since I joined last year.
> All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding
This is false. AWS infrastructure is vastly more secure than almost all company data centers. AWS has a rule that the same person cannot have logical access and physical access to the same storage device. Very few companies have enough IT people to have this rule. The AWS KMS is vastly more secure than what almost all companies are doing. The AWS network is vastly better designed and operated than almost all corporate networks. AWS S3 is more reliable and scalable than anything almost any company could create on their own. To create something even close to it you would need to implement something like MinIO using 3 separate data centers.
OTOH:
1. big clouds are very lucrative targets for spooks; your data seems pretty likely to be hoovered up as "bycatch" (or maybe main catch, depending on your luck) by various agencies and then traded around as currency
2. you never hear about security problems (incidents or exposure) in the platforms; there's no transparency
3. better than most corporate stuff is a low bar
>3. better than most corporate stuff is a low bar
I think it's a very relevant bar, though. The top level commenter made points about "a business of just about any size", which seems pretty exactly aligned with "most corporate stuff".
> AWS infrastructure is vastly more secure than almost all company data centers
Secure in what terms? Security is always about a threat model and trade-offs. There's no absolute, objective term of "security".
> AWS has a rule that the same person cannot have logical access and physical access to the same storage device.
Any promises they make aren't worth anything unless there are contractually stipulated damages that AWS must pay in case of a breach, with those damages actually corresponding to the cost of said breach for the customer, and a history of actually paying out said damages without shenanigans. They've already got a track record of lying on their status pages, so it doesn't bode well.
But I'm actually wondering what this specific rule even tries to defend against? You presumably care about data protection, so logical access is what matters. Physical access seems completely irrelevant no?
> Very few companies have enough IT people to have this rule
Maybe, but that doesn't actually mitigate anything from the company's perspective? The company itself would still be in the same position, aka not enough people to reliably separate responsibilities. Just that instead of those responsibilities being physical, they now happen inside the AWS console.
> The AWS KMS is vastly more secure than what almost all companies are doing.
See first point about security. Secure against what - what's the threat model you're trying to protect against by using KMS?
But I'm not necessarily denying that (at least some) AWS services are very good. Question is, is that "goodness" required for your use-case, is it enough to overcome its associated downsides, and is the overall cost worth it?
A pragmatic approach would be to evaluate every component on its merits and fitness to the problem at hand instead of going all in, one way or another.
one of my greatest learnings in life is to differentiate between facts and opinions - sometimes opinions are presented as facts and vice versa. if you think about it, the statement "this is false" is a response to an opinion (presented as a fact), but is not itself a fact. there is no way one can objectively define and defend what "real technical understanding" means. the cloud space is vast, with millions of people having varied understanding and thus varied opinions.
so let's not fight a battle that will never be won. there is no point in convincing pro-cloud people that cloud isn't the right choice and vice versa. let people share stories where it made sense and where it didn't.
as someone who has lived in the cloud security space since 2009 (and was a founder of RedLock - one of the first CSPMs), in my opinion there is no doubt that AWS is indeed better designed than most corp networks - but is that what you really need? if you run your entire corp and LOB apps on aws but have poor security practices, will it be the right decision? what if you have the best security engineers in the world, but they are best at Cisco-type security - configuring VLANs and managing endpoints - and are not good at detecting someone using IMDSv1 on an ec2 instance exposed to the internet and running an app vulnerable to csrf?
when the scope of discussion is as vast as cloud vs on-prem, imo, it is a bad idea to make absolute statements.
Great points. Also, if you end up building your apps as Rube Goldberg machines living up to "AWS Well-Architected" criteria (pushed by lots of AWS-certified staff whose paychecks now depend on following AWS recommended practices), the complexity will kill your security, as nobody will understand the systems anymore.
The other part is that when us-east-1 goes down, you can blame AWS, and a third of your customer's vendors will be doing the same. When you unplug the power to your colo rack while installing a new server, that's on you.
AWS is so complicated, we usually find more impactful permission problems than in any company using their own hardware
Making API calls from a VM on shared hardware to KMS is vastly more secure than doing AES locally? I'm skeptical to say the least.
Encrypting data is easy, securely managing keys is the hard part. KMS is the Key Management Service. And AWS put a lot of thought and work into it.
https://docs.aws.amazon.com/kms/latest/cryptographic-details...
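The linked doc describes envelope encryption: the master key never leaves the service, and callers only ever receive data keys (one plaintext copy to use, one wrapped copy to store). A toy Python sketch of that pattern - emphatically not real cryptography; a hash-based XOR keystream stands in for AES and a plain object stands in for the HSM boundary:

```python
# Toy illustration of the envelope-encryption pattern KMS implements.
# NOT real crypto: a SHA-256 counter-mode XOR keystream stands in for
# AES, and the "HSM" is just a Python object. The point is the shape:
# the master key never leaves ToyKMS; callers only handle data keys.
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Symmetric toy cipher: XOR data with a hash-derived keystream."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

class ToyKMS:
    """The master key never leaves this object (the 'HSM boundary')."""
    def __init__(self):
        self._master = secrets.token_bytes(32)

    def generate_data_key(self):
        plaintext_key = secrets.token_bytes(32)
        wrapped_key = keystream_xor(self._master, plaintext_key)
        return plaintext_key, wrapped_key

    def unwrap(self, wrapped_key: bytes) -> bytes:
        return keystream_xor(self._master, wrapped_key)

kms = ToyKMS()
data_key, wrapped = kms.generate_data_key()
ciphertext = keystream_xor(data_key, b"customer record")
# Store ciphertext + wrapped key together; discard the plaintext key.
recovered = keystream_xor(kms.unwrap(wrapped), ciphertext)
assert recovered == b"customer record"
```

The value KMS adds on top of this shape is exactly the "hard part" above: audited access to unwrap, rotation, and keys that never exist in your process memory.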
about security, most businesses using AWS invest little to nothing in securing their software, or even adopt basic security practices for their employees
having the most secure data center doesn't matter if you load your secrets as env vars in a system that can be easily compromised by a motivated attacker
so i don't buy this argument as a general reason pro-cloud
The cloud is someone else’s computer.
It’s like putting something in someone’s desk drawer under the guise of convenience at the expense of security.
Why?
Too often, someone other than the data owner has or can get access to the drawer directly or indirectly.
Also, Cloud vs self hosted to me is a pendulum that has swung back and forth for a number of reasons.
The benefits of the cloud outlined here are often a lot of open source tech packaged up and sold as manageable from a web browser, or a command line.
One of the major reasons the cloud became popular was the difficulty of managing Linux networking at volume and scale. At the time, the cloud became very attractive for that reason, plus the ability to virtualize bare-metal servers into any combination of local and cloud hosting.
Self-hosting has become easier by an order of magnitude or two for anyone who knew how to do it - but it's not something people who haven't done both self-hosting and cloud can really discuss.
Cloud has abstracted away the cost of horsepower and converted it into transactions. People are discovering that a fraction of the horsepower they thought they needed is enough to service their workloads.
At some point the horsepower got way beyond what they needed, and nobody noticed. But paying for cloud is convenient and standardized.
Company data centres can be reasonably secured using a number of PaaS or IaaS solutions readily available off the shelf. Tools from VMware, Proxmox and others are tremendous.
It may seem like there's a lot to learn, but most problems that are new to someone have already been thought through extensively by people whose experience goes beyond cloud-only.
> The cloud is someone else’s computer.
And in the case of AWS it is someone else's extremely well designed and managed computer and network.
> The cloud is someone else’s computer
Isn’t it more like leasing in a public property? Meaning it is yours as long as you are paying the lease? Analogous to renting an apartment instead of owning a condo?
Not at all. You can inspect the apartment you rent. The cloud is totally opaque in that regard.
<citations needed>
But isn't using Fastmail akin to using a cloud provider (managed email vs managed everything else)? They are similarly a service provider, and as a customer, you don't really care "who their ISP is?"
The discussion matters when we are talking about building things: whether you self-host or use managed services is a set of interesting trade-offs.
Yes, FastMail is a SaaS. But there are adherents of a religion who will tell you that companies like FastMail should be built on top of AWS, and that this is the only true way. It is good to have some counter-narrative to this.
Being cloud compatible (packaged well) can be as important as being cloud-agnostic (work on any cloud).
Too many projects become beholden to one cloud.
<ctoHatTime> Dunno man, it's really really easy to set up an S3 and use it to share datasets for users authorized with IAM....
And IAM and other cloud security and management considerations are where the opex/capex and capability argument starts to break down. It turns out the "cloud" savings come from not having the in-house capability to manage hardware - but for most businesses, you still want some of that lovely reliability.
(In short, I agree with you, substantially).
Like code. It is easy to get something basic up, but substantially more resources are needed for non-trivial things.
I feel like IAM may be the sleeper killer-app of cloud.
I self-host a lot of things, but boy oh boy if I were running a company it would be a helluvalotta work to get IAM properly set up.
I strongly agree with this and also strongly lament it.
I find IAM to be a terrible implementation of a foundationally necessary system. It feels tacked on to me, except now it's tacked onto thousands of other things and there's no way out.
like terraform! isn't pulumi 100% better but there's no way out of terraform.
That's essentially why "platform engineering" is a hot topic. There are great FOSS tools for this, largely in the Kubernetes ecosystem.
To be clear, authentication could still be outsourced, but authorizing access to (on-prem) resources in a multi-tenant environment is something that "platforms" are frequently designed for.
> All the pro-cloud talking points... don't persuade anyone with any real technical understanding
This is a very engineer-centric take. The cloud has some big advantages that are entirely non-technical:
- You don't need to pay for hardware upfront. This is critical for many early-stage startups, who have no real ability to predict CapEx until they find product/market fit.
- You have someone else to point the SOC2/HIPAA/etc auditors at. For anyone launching a company in a regulated space, being able to checkbox your entire infrastructure based on AWS/Azure/etc existing certifications is huge.
You can over-provision your own bare-metal resources 20x and it will still be cheaper than cloud. The capex talking point is just that - a talking point.
The real cost wins of self-hosting are that anything needing new hardware becomes an ordeal, and engineers won't use high-cost, value-added services. I agree that there's often too little restraint in cloud architectures, but if a business truly believes in a project, it shouldn't be held up for six months waiting for server budget, with engineers spending their time doing ops work to get three nines of DB reliability.
There is a size where self-hosting makes sense, but it's much larger than you think.
Most companies severely understaff ops, infra, and security. Your talking points might be good but, in practice, won’t apply in many cases because of the intractability of that management mindset. Even when they should know better.
I’ve worked at tech companies with hundreds of developers and single digit ops staff. Those people will struggle to build and maintain mature infra. By going cloud, you get access to mature infra just by including it in build scripts. Devops is an effective way to move infra back to project teams and cut out infra orgs (this isn’t great but I see it happen everywhere). Companies will pay cloud bills but not staffing salaries.
Using a commercial cloud provider only cements understaffing in, in too many cases.
I'm curious about what "reasonable amount of hosting" means to you, because in my experience, as your internal network's complexity goes up, it's far better for you to move systems to a hyperscaler. The current estimate is that >90% of Fortune 500 companies are cloud-based. What is it that you know that they don't?
>What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue with a writeup like this with silly cloud talking points. They don't seem to care much about data or facts, just that they love cloud and want everyone else to be in cloud, too.
The irony is absolutely dripping off this comment, wow.
The commenter makes an emotionally charged comment with no data or facts, and decries anyone who disagrees with them as offering "silly talking points" for not caring about data and facts.
Your comment is entirely talking about itself.
Cloud expands the capabilities of what one team can manage by themselves, enabling them to avoid a huge amount of internal politics.
This is worth astronomical amounts of money in big corps.
This is absolutely spot on.
What do you mean, I can't scale up because I've used my hardware capex budget for the year?
I’m not convinced this is entirely true. The upfront cost if you don’t have the skills, sure – it takes time to learn Linux administration, not to mention management tooling like Ansible, Puppet, etc.
But once those are set up, how is it different? AWS is quite clear with their responsibility model that you still have to tune your DB, for example. And for the setup, just as there are Terraform modules to do everything under the sun, there are Ansible (or Chef, or Salt…) playbooks to do the same. For both, you _should_ know what all of the options are doing.
The only way I see this sentiment being true is that a dev team, with no infrastructure experience, can more easily spin up a lot of infra – likely in a sub-optimal fashion – to run their application. When it inevitably breaks, they can then throw money at the problem via vertical scaling, rather than addressing the root cause.
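As a sketch of the parity being argued here, a minimal hypothetical Ansible playbook covering roughly what a console click-through would do - hostnames, package names, and paths are all illustrative, not from any real setup:

```yaml
# Hypothetical minimal playbook: nginx + postgres on an owned server.
- hosts: webservers
  become: true
  tasks:
    - name: Install nginx and postgres
      ansible.builtin.apt:
        name: [nginx, postgresql]
        state: present
        update_cache: true

    - name: Deploy site config from a template
      ansible.builtin.template:
        src: templates/myapp.conf.j2
        dest: /etc/nginx/sites-enabled/myapp.conf
      notify: reload nginx

  handlers:
    - name: reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```

As with Terraform modules, the point is that once written, rerunning this is as repeatable as any cloud provisioning call.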
I think this is only true for teams and apps of a certain size.
I've worked on plenty of teams with relatively small apps, and the difference between:
1. Cloud: "open up the cloud console and start a VM"
2. Owned hardware: "price out a server, order it, find a suitable datacenter, sign a contract, get it racked, etc."
Is quite large.
#1 is 15 minutes for a single team lead.
#2 requires the team to agree on hardware specs, get management approval, finance approval, executives signing contracts. And through all this you don't have anything online yet for... weeks?
If your team or your app is large, this probably all averages out in favor of #2. But small teams often don't have the bandwidth or the budget.
I work for a 50 person subsidiary of a 30k person organisation. I needed a domain name. I put in the purchase request and 6 months later eventually gave up, bought it myself and expensed it.
Our AWS account is managed by an SRE team. It’s a 3 day turnaround process to get any resources provisioned, and if you don’t get the exact spec right (you forgot to specify the iops on the volume? Oops) 3 day turnaround. Already started work when you request an adjustment? Better hope as part of your initial request you specified backups correctly or you’re starting again.
The overhead is absolutely enormous, and I actually don’t even have billing access to the AWS account that I’m responsible for.
Managing cloud without a dedicated resource is a form of resource creep, with shadow labour costs that aren't factored in.
How many things don’t end up happening because of this? When they need a sliver of resources in the start?
You gave me flashbacks to a far worse bureaucratic nightmare with #2 in my last job.
I supported an application with a team of about three people for a regional headquarters in the DoD. We had one stack of aging hardware that was racked, on a handshake agreement with another team, in a nearby facility under that other team's control. We had to periodically request physical access for maintenance tasks and the facility routinely lost power, suffered local network outages, etc. So we decided that we needed new hardware and more of it spread across the region to avoid the shaky single-point-of-failure.
That began a three year process of: waiting for budget to be available for the hardware / licensing / support purchases; pitching PowerPoints to senior management to argue for that budget (and getting updated quotes every time from the vendors); working out agreements with other teams at new facilities to rack the hardware; traveling to those sites to install stuff; and working through the cybersecurity compliance stuff for each site. I left before everything was finished, so I don't know how they ultimately dealt with needing, say, someone in Japan to physically reseat a cable or something.
I’ve never worked at a company with these particular problems, but:
#1: A cloud VM comes with an obligation for someone at the company to maintain it. The cloud does not excuse anyone from doing this.
#2: Sounds like a dysfunctional system. Sure, it may be common, but a medium sized org could easily have some datacenter space and allow any team to rent a server or an instance, or to buy a server and pay some nominal price for the IT team to keep it working. This isn’t actually rocket science.
Sure, keeping a fifteen year old server working safely is a chore, but so is maintaining a fifteen-year-old VM instance!
Obligation? Far from it. I've worked at some poorly staffed companies. Nobody is maintaining old VMs or container images. If it works, nobody touches it.
I worked at a supposedly properly staffed company that had raised 100's of millions in investment, and it was the same thing. VMs running 5 year old distros that hadn't been updated in years. 600 day uptimes, no kernel patches, ancient versions of Postgres, Python 2.7 code everywhere, etc. This wasn't 10 years ago. This was 2 years ago!
The cloud is someone else’s computer.
Renting VMs from a provider, or installing a hypervisor on your own equipment, is another thing.
The SMB I work for runs a small on-premise data center that is shared between teams and projects, with maybe 3-4 FTEs managing it (the respective employees also do dev and other work). This includes self-hosting email, storage, databases, authentication, source control, ticketing, company wiki, and other services. The current infrastructure didn’t start out that way and developed over many years, so it’s not necessarily something a small startup can start out with, but beyond a certain company size (a couple dozen employees or more) it shouldn’t really be a problem to develop that, if management shares the philosophy. I certainly find it preferable culturally if not technically to maximize independence in that way, have the local expertise and much better control over everything.
One (the only?) indisputable benefit of cloud is the ability to scale up faster (elasticity), but most companies don’t really need that. And if you do end up needing it after all, then it’s a good problem to have, as they say.
You're assuming that hosting something in-house implies that each application gets its own physical server.
You buy a couple of beastly things with dozens of cores. You can buy twice as much capacity as you actually use and still be well under the cost of cloud VMs. Then it's still VMs and adding one is just as fast. When the load gets above 80% someone goes through the running VMs and decides if it's time to do some house cleaning or it's time to buy another host, but no one is ever waiting on approval because you can use the reserve capacity immediately while sorting it out.
There is a large gap between "own the hardware" and "use cloud hosting". Many people rent the hardware, for example, and you can use managed databases, which is one step up from "starting a VM".
But your comparison isn't fair. The difference between running your own hardware and using the cloud (which is perhaps not even the relevant comparison but let's run with it) is the difference between:
1. Open up the cloud console, and
2. You already have the hardware so you just run "virsh" or, more likely, do nothing at all because you own the API so you have already included this in your Ansible or Salt or whatever you use for setting up a server.
Because ordering a new physical box isn't really comparable to starting a new VM, is it?
I've always liked the theory of #2, I just haven't worked anywhere yet that has executed it well.
Before the cloud, you could get a VM provisioned (virtual servers) or a couple of apps set up (LAMP stack on a shared host ;)) in a few minutes over a web interface already.
"Cloud" has changed that by providing an API to do this, thus enabling IaC approach to building combined hardware and software architectures.
3. "Dedicated server" at any hosting provider
Open their management console, press order now, 15 mins later get your server's IP address.
For purposes of this discussion, isn't AWS just a very large hosting provider?
I.e. most hosting providers give you the option for virtual or dedicated hardware. So does Amazon (metal instances).
Like, "cloud" was always an ill-defined term, but in the case of "how do I provision full servers" I think there's no qualitative difference between Amazon and other hosting providers. Quantitative, sure.
> Amazon (metal instances)
But you still get nickel & dimed and pay insane costs, including on bandwidth (which is free in most conventional hosting providers, and overages are 90x cheaper than AWS' costs).
More like 15 seconds.
You have omitted the option between the two, which is renting a server. No hardware to purchase, maintain or set up. Easily available in 15 minutes.
There is. Middle ground between the extremes of those pendulums of all cloud or physical metal.
You can start with using a cloud only for VMs and only run services on it using IaaS or PaaS. Very serviceable.
You can get pretty far without any of that fancy stuff. You can get plenty done by using parallel-ssh and then focusing on the actual thing you develop instead of endless tooling and docker and terraform and kubernetes and salt and puppet and ansible. Sure, if you know why you need them and know what value you get from them OK. But many people just do it because it's the thing to do...
Do you need those tools? It seems that for fundamental web hosting, you need your application server, nginx or similar, postgres or similar, and a CLI. (And an interpreter etc if your application is in an interpreted lang)
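For a sense of scale, the "nginx or similar" piece of that stack is a single server block - this is an illustrative reverse-proxy config where the domain, certificate paths, and upstream port are placeholders:

```nginx
# Illustrative: TLS termination + reverse proxy to a local app server.
server {
    listen 443 ssl;
    server_name example.com;

    ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8000;   # gunicorn/uwsgi/node, etc.
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```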
I suppose that depends on your RTO. With cloud providers, even on a bare VM, you can to some extent get away with having no IaC, since your data (and therefore config) is almost certainly on networked storage, which is redundant by design. If an EC2 instance fails, or even if one of the drives backing your EBS volume fails, it'll probably come back up as it was.
If it's your own hardware, if you don't have IaC of some kind – even something as crude as a shell script – then a failure may well mean you need to manually set everything up again.
Get two servers (or three, etc)?
Well, sure – I was trying to do a comparison in favor of cloud, because the fact that EBS Volumes can magically detach and attach is admittedly a neat trick. You can of course accomplish the same (to a certain scale) with distributed storage systems like Ceph, Longhorn, etc. but then you have to have multiple servers, and if you have multiple servers, you probably also have your application load balanced with failover.
For fundamentals, that list is missing:
- Some sort of firewall or network access control. Being able to say "allow http/s from the world (optionally minus some abuser IPs that cause problems), and allow SSH from developers (by IP, key, or both)" at a separate layer from nginx is prudent. Can be ip/tables config on servers or a separate firewall appliance.
- Some mechanism of managing storage persistence for the database, e.g. backups, RAID, data files stored on fast network-attached storage, db-level replication. Not losing all user data if you lose the DB server is table stakes.
- Something watching external logging or telemetry to let administrators know when errors (e.g. server failures, overload events, spikes in 500s returned) occur. This could be as simple as Pingdom or as involved as automated alerting based on load balancer metrics. Relying on users to report downtime events is not a good approach.
- Some sort of CDN, for applications with a frontend component. This isn't required for fundamental web hosting, but for sites with a frontend and even moderate (10s/sec) hit rates, it can become required for cost/performance; CDNs help with egress congestion (and fees, if you're paying for metered bandwidth).
- Some means of replacing infrastructure from nothing. If the server catches fire or the hosting provider nukes it, having a way to get back to where you were is important. Written procedures are fine if you can handle long downtime while replacing things, but even for a handful of application components those procedures get pretty lengthy, so you start wishing for automation.
- Some mechanism for deploying new code, replacing infrastructure, or migrating data. Again, written procedures are OK, but start to become unwieldy very early on ('stop app, stop postgres, upgrade the postgres version, start postgres, then apply application migrations to ensure compatibility with the new version of postgres, then start app--oops, forgot to take a postgres backup/forgot that upgrading postgres would break the replication stream, gotta write that down for next time...').
...and that's just for a very, very basic web hosting application--one that doesn't need caches, blob stores, the ability to quickly scale out application server or database capacity.
Each of those things can be accomplished the traditional way--and you're right, that sometimes that way is easier for a given item in the list (especially if your maintainers have expertise in that item)! But in aggregate, having a cloud provider handle each of those concerns tends to be easier overall and not require nearly as much in-house expertise.
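To make the firewall item above concrete, the "allow http/s from the world, SSH from developers" rule is a short ruleset in something like nftables - the developer IP range here is a placeholder from the documentation block:

```nft
# Illustrative nftables ruleset: web open to the world,
# SSH restricted to a (placeholder) developer range.
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif lo accept
    tcp dport { 80, 443 } accept
    ip saddr 203.0.113.0/24 tcp dport 22 accept
    icmp type echo-request accept
  }
}
```

The equivalent in a cloud is a security group; the concept is the same, only who maintains the enforcement layer differs.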
You are focusing on technology. And sure of course you can get most of the benefits of AWS a lot cheaper when self-hosting.
But when you start factoring internal processes and incompetent IT departments, suddenly that's not actually a viable option in many real-world scenarios.
Exactly. With the cloud you can suddenly do all the things your tyrannical Windows IT admin has been saying are impossible for the last 30 years.
It is similar to cooking at home vs ordering cooked food every day. If someone guarantees the taste and quality, people are happy to outsource it.
I have never ever worked somewhere with one of these "cloud-like but custom on our own infrastructure" setups that didn't leak infrastructure concerns through the abstraction, to a significantly larger degree than AWS.
I believe it can work, so maybe there are really successful implementations of this out there, I just haven't seen it myself yet!
All of that is... completely unrelated to the GP's post.
Did you reply to the right comment? Do you think "politics" is something you solve with Ansible?
> Cloud expands the capabilities of what one team can manage by themselves, enabling them to avoid a huge amount of internal politics.
It's related to the first part. Re: the second, IME if you let dev teams run wild with "managing their own infra," the org as a whole eventually pays for that when the dozen bespoke stacks all hit various bottlenecks, and no one actually understands how they work, or how to troubleshoot them.
I keep being told that "reducing friction" and "increasing velocity" are good things; I vehemently disagree. It might be good for short-term profits, but it is poison for long-term success.
Our big company locked all cloud resources behind a floating/company-wide DevOps team (git and CI too). We have an old on-prem server that we jealously guard because it allows us to create remotes for new git repos and deploy prototypes without consulting anyone.
(To be fair, I can see why they did it - a lot of deployments were an absolute mess before.)
I have said for years that the value of cloud is mainly its API; that's the selling point in large enterprises.
Self-hosted software also has APIs, and Terraform libraries, and Ansible playbooks, etc. It’s just that you have to know what it is you’re trying to do, instead of asking AWS what collection of XaaS you should use.
Well, cloud providers often give you more than just VMs in a data center somewhere. You may not be able to find good equivalents if you aren't using the cloud. Some third-party products are also only available on clouds. How much of a difference those things make will depend on what you're trying to do.
I think there are accounting reasons for companies to prefer paying opex to run things on the cloud instead of more capex-intensive self-hosting, but I don’t understand the dynamics well.
It’s certainly the case that clouds tend to be more expensive than self-hosting, even when taking account of the discounts that moderately sized customers can get, and some of the promises around elastic scaling don’t really apply when you are bigger.
To some of your other points: the main customers of companies like AWS are businesses. Businesses generally don’t care about the centralisation of the internet. Businesses are capable of reading the contracts they are signing and not signing them if privacy (or, typically more relevant to businesses, their IP) cannot be sufficiently protected. It’s not really clear to me that using a cloud is going to be less secure than doing things on-prem.
It seems that the preference is less about understanding or misunderstanding the technical requirements but more that it moves a capital expenditure with some recurring operational expenditure entirely into the opex column.
The fact is, managing your own hardware is a pita and a distraction from focusing on the core product. I loathe messing with servers and even opt for "overpriced" paas like fly, render, vercel. Because every minute messing with and monitoring servers is time not spent on product. My tune might change past a certain size and a massive cloud bill and there's room for full time ops people, but to offset their salary, it would have to be huge.
That argument makes sense for PaaS services like the ones you mention. But for bare "cloud" like AWS, I'm not convinced it is saving any effort, it's merely swapping one kind of complexity with another. Every place I've been in had full-time people messing with YAML files or doing "something" with the infrastructure - generally trying to work around the (self-inflicted) problems introduced by their cloud provider - whether it's the fact you get 2010s-era hardware or that you get nickel & dimed on absolutely arbitrary actions that have no relationship to real-world costs.
In what sense is AWS "bare cloud"? S3, DynamoDB, Lambda, ECS?
How do you configure S3 access control? You need to learn & understand how their IAM works.
How do you even point a pretty URL to a lambda? Last time I looked you need to stick an "API gateway" in front (which I'm sure you also get nickel & dimed for).
How do you go from "here's my git repo, deploy this on Fargate" with AWS? You need a CI pipeline which will run a bunch of awscli commands.
And I'm not even talking about VPCs, security groups, etc.
Somewhat different skillsets than old-school sysadmin (although once you know sysadmin basics, you realize a lot of these are just the same concepts under a branded name and arbitrary nickel & diming sprinkled on top), but equivalent in complexity.
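To make the "learn how their IAM works" step concrete: even read-only access to a single bucket means authoring a policy document like the following, where the bucket name and statement ID are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyDatasets",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-datasets",
        "arn:aws:s3:::example-datasets/*"
      ]
    }
  ]
}
```

Note the trap even in this minimal example: ListBucket applies to the bucket ARN while GetObject applies to the object ARNs, so dropping either Resource line silently breaks one of the two actions.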
EC2
Counterpoint: if you’re never “messing with servers,” you probably don’t have a great understanding of how their metrics map to those of your application’s, and so if you bottleneck on something, it can be difficult to figure out what to fix. The result is usually that you just pay more money to vertically scale.
To be fair, you did say “my tune might change past a certain size.” At small scale, nothing you do within reason really matters. World’s worst schema, but your DB is only seeing 100 QPS? Yeah, it doesn’t care.
I don’t think you’re correct. I’ve watched junior/mid-level engineers figure things out solely by working on the cloud and scaling things to a dramatic degree. It’s really not rocket science.
I didn't say it's rocket science, nor that it's impossible to do without having practical server experience, only that it's more difficult.
Take disks, for example. Most cloud-native devs I've worked with have no clue what IOPS are. If you saturate your disk, that's likely to cause knock-on effects like increased CPU utilization from IOWAIT, and since "CPU is high" is pretty easy to understand for anyone, the seemingly obvious solution is to get a bigger instance, which depending on the application, may inadvertently solve the problem. For RDBMS, a larger instance means a bigger buffer pool / shared buffers, which means fewer disk reads. Problem solved, even though actually solving the root cause would've cost 1/10th or less the cost of bumping up the entire instance.
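As a sketch of how one might check for that iowait signal directly - Linux-specific, and the busy-time denominator here is an illustrative choice rather than a standard metric:

```python
# Sketch: read the CPU time breakdown from /proc/stat (Linux-only)
# and report what share of non-idle time is iowait -- the signal
# described above that a "high CPU" alert may really be a saturated
# disk. Denominator (user+nice+system+iowait) is illustrative.
import os

def cpu_iowait_share(stat_path="/proc/stat"):
    with open(stat_path) as f:
        fields = f.readline().split()
    # Field order per proc(5): cpu user nice system idle iowait ...
    user, nice, system, idle, iowait = (int(x) for x in fields[1:6])
    busy = user + nice + system + iowait
    return iowait / busy if busy else 0.0

if os.path.exists("/proc/stat"):
    print(f"iowait share of non-idle CPU time: {cpu_iowait_share():.1%}")
```

In practice you would watch this (or the equivalent from iostat/node_exporter) over time rather than sample it once.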
Writing piles of IaC code like Terraform and CloudFormation is also a PITA and a distraction from focusing on your core product.
PaaS is probably the way to go for small apps.
A small app (or a larger one, for that matter) can quite easily run on infra that's instantiated from canned IaC, like TF AWS Modules [0]. If you can read docs, you should be able to quite trivially get some basic infra up in a day, even with zero prior experience managing it.
[0]: https://github.com/terraform-aws-modules
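For instance, a minimal hypothetical use of the community VPC module from that collection - the name, CIDRs, AZs, and version pin are placeholders to adjust:

```hcl
# Illustrative: canned networking from terraform-aws-modules.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name            = "example"
  cidr            = "10.0.0.0/16"
  azs             = ["us-east-1a", "us-east-1b"]
  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24"]
  private_subnets = ["10.0.101.0/24", "10.0.102.0/24"]
}
```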
Yes, I've used several of these modules myself. They save tons of time! Unfortunately, for legacy projects, I inherited a bunch of code from individuals that built everything "by hand" then copy-pasted everything. No re-usability.
But that effort has a huge payoff in that it can be used to disaster recovery in a new region and to spin up testing environments.
Anecdotal - but I once worked for a company where the product line I built for them after acquisition was delayed by 5 months, because that's how long it took to get the hardware ordered and installed in the datacenter. Getting it up on AWS would have been a day's work, maybe two.
Yes, it is death by 1000 cuts. Speccing, negotiating with hardware vendors, data center selection and negotiating, DC engineer/remote hands, managing security cage access, designing your network, network gear, IP address ranges, BGP, secure remote console access, cables, shipping, negotiating with bandwidth providers (multiple, for redundancy), redundant hardware, redundant power sources, UPS. And then you get to plug your server in. Now duplicate other stuff your cloud might provide, like offsite backups, recovery procedures, HA storage, geographic redundancy. And do it again when you outgrown your initial DC. Or build your own DC (power, climate, fire protection, security, fiber, flooring, racks)
Much of this is still required in cloud. Also, I think you're missing the middle ground where 99.99% of companies could happily exist indefinitely: colo. It makes little to no financial or practical sense for most to run their own data centers.
Oh, absolutely, with your own hardware you need planning. Time to deployment is definitely a thing.
Really, the one major thing that bites with cloud providers is their 99.9% margin on egress. The markup is insane.
I'm with you there, with stuff like fly.io, there's really no reason to worry about infrastructure.
AWS, on the other hand, seems about as time consuming and hard as using root servers. You're at a higher level of abstraction, but the complexity is about the same I'd say. At least that's my experience.
I agree with this position and actively avoid AWS complexity.
> every minute messing with and monitoring servers
You're not monitoring your deployments because "cloud"?
> On the other hand, a business of just about any size that has any reasonable amount of hosting is better off with their own systems when it comes purely to cost
From a cost PoV, sure, but when you're taking money out of capex it represents a big hit to the cash flow, while taking out twice that amount from opex has a lower impact on the company finances.
There is a whole ecosystem that pushes cloud onto ignorant/fresh graduates and developers. Just take a look at the sponsors of all the most popular frameworks. When your system is super complex and depends on the cloud, they make more money. Just look at the PHP ecosystem: Laravel needs 4 times the servers to serve something that a pure PHP system would need. Most projects don't need the cloud - only around 10% of projects actually need what the cloud provides. But they were able to brainwash a whole generation of developers/managers into thinking that they do. And so it goes.
I want to see an article like this, but written from a Fortune 500 CTO perspective
It seems like they all abandoned their VMware farms or physical server farms for Azure (they love Microsoft).
Are they actually saving money? Are things faster? How's performance? What was the re-training/hiring like?
In one case I know we got rid of our old database greybeards and replaced them with "DevOps" people that knew nothing about performance etc
And the developers (and many of the admins) we had knew nothing about hardware or anything so keeping the physical hardware around probably wouldn't have made sense anyways
Complicating this analysis is that computers have still been making exponential improvements in capability as clouds became popular (e.g. disks are 1000-10000x faster than they were 15 years ago), so you'd naturally expect things to become easier to manage over time as you need fewer machines, assuming of course that your developers focus on e.g. learning how to use a database well instead of how to scale to use massive clusters.
That is, even if things became cheaper/faster, they might have been even better without cloud infrastructure.
1. People are credulous
2. People therefore repeat talking points which seem in their interest
3. With enough repetition these become their beliefs
4. People will defend their beliefs as theirs against attack
5. Goto 1
The one convincing argument from technical people I've seen, which would be repeated in reply to your comment, is that by now you don't find enough experienced engineers to reliably set up some really big systems. Because so much went to the cloud, a lot of the knowledge is buried there.
That came from technical people who I didn't perceive as being dogmatically pro-cloud.
I think part of it was a way for dev teams to get an infra team that was not empowered to say no. Plus organizational theory, empire building, etc.
Yep. I had someone tell me last week that they didn't want a more rigid schema because other teams rely on it, and anything adding "friction" to using it would be poorly received.
As an industry, we are largely trading correctness and performance for convenience, and this is not seen as a negative by most. What kills me is that at every cloud-native place I've worked at, the infra teams were both responsible for maintaining and fixing the infra that product teams demanded, but were not empowered to push back on unreasonable requests or usage patterns. It's usually not until either the limits of vertical scaling are reached, or a SEV0 occurs where these decisions were the root cause does leadership even begin to consider changes.
They spent time and career points learning cloud things and dammit it's going to matter!
You can't even blame them too much, the amount of cash poured into cloud marketing is astonishing.
The thing that frustrates me is it’s possible to know how to do both. I have worked with multiple people who are quite proficient in both areas.
Cloud has definite advantages in some circumstances, but so does self-hosting; moreover, understanding the latter makes the former much, much easier to reason about. It’s silly to limit your career options.
Being good at both is twice the work, because even if some concepts translate well, IME people won't hire someone based on that. "Oh you have experience with deploying RabbitMQ but not AWS SQS? Sorry, we're looking for someone more qualified."
That's a great filter for places I don't want to work at, then.
As someone who ran a startup with hundreds of hosts: as soon as I start to count the salaries, hiring, desk space, etc. of the people needed to manage the hosts, AWS looks cheap again. Yeah, on hardware costs they are aggressively expensive. But TCO-wise, they're cheap for any decent-sized company.
Add in compliance, auditing, etc., all things that you can set up out of the box (PCI, HIPAA, lawsuit retention). Gets even cheaper.
There was a time when cloud was significantly cheaper than owning.
I'd expect that there are people who moved to the cloud then, and over time started using services offered by their cloud provider (e.g., load balancers, secret management, databases, storage, backup) instead of running those services themselves on virtual machines, and now even if it would be cheaper to run everything on owned servers they find it would be too much effort to add all those services back to their own servers.
The cloud wasn’t about cheap, it was about fast. If you’re VC funded, time is everything, and developer velocity above all else to hyperscale and exit. That time has passed (ZIRP), and the public cloud margin just doesn’t make sense when you can own and operate (their margin is your opportunity) on prem with similar cloud primitives around storage and compute.
Elasticity is a component, but has always been from a batch job bin packing scheduling perspective, not much new there. Before k8s and Nomad, there was Globus.org.
(Infra/DevOps in a previous life at a unicorn, large worker cluster for a physics experiment prior, etc; what is old is a new again, you’re just riding hype cycle waves from junior to retirement [mainframe->COTS on prem->cloud->on prem cloud, and so on])
That was never true except in the case that the required hardware resources were significantly smaller than a typical physical machine.
Also, by the way, I found it interesting that you framed your side of this disagreement as the technically correct one, but then included this:
> a desire to not centralize the Internet
This is an ideological stance! I happen to share this desire. But you should be aware of your own non-technical - "emotional" - biases when dismissing the arguments of others on the grounds that they are "emotional" and "fanatical".
> If I didn't already self-host email, I'd consider using Fastmail.
Same sentiment on all of what you said.
Cloud solves one problem quite well: Geographic redundancy. It's extremely costly with on-prem.
Only if you’re literally running your own datacenters, which is in no way required for the majority of companies. Colo giants like Equinix already have the infrastructure in place, with a proven track record.
If you enable Multi-AZ for RDS, your bill doubles until you cancel. If you set up two servers in two DCs, your initial bill doubles from the CapEx, and then a very small percentage of your OpEx goes up every month for the hosting. You very, very quickly make this back compared to cloud.
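To make that break-even concrete, here's a rough sketch of the math. All figures are illustrative placeholders (a made-up Multi-AZ bill delta, server price, and colo fee), not real quotes:

```python
# Hypothetical break-even: the recurring delta of doubling a managed-DB
# bill for Multi-AZ, vs. buying a second server (one-time CapEx) plus a
# small monthly colo/hosting delta (OpEx). All figures are assumptions.

CLOUD_DELTA_PER_MONTH = 1_500  # assumed "bill doubles" increase
SERVER_CAPEX = 10_000          # assumed one-time cost of the second server
COLO_DELTA_PER_MONTH = 200     # assumed extra hosting/power per month

def breakeven_months(capex: int, colo_delta: int, cloud_delta: int) -> int:
    """First month where cumulative cloud spend exceeds owned spend.
    Assumes cloud_delta > colo_delta, otherwise there is no break-even."""
    month = 0
    while True:
        month += 1
        if cloud_delta * month > capex + colo_delta * month:
            return month

print(breakeven_months(SERVER_CAPEX, COLO_DELTA_PER_MONTH, CLOUD_DELTA_PER_MONTH))  # -> 8
```

With these made-up numbers the owned pair pays for itself in well under a year; plug in your own quotes to see where the crossover actually lands.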
But reliable connectivity between regions/datacenters remains a challenge, right? Compute is only one part of the equation.
Disclaimer: I work on a cloud networking product.
It depends on how deep you want to go. Equinix for one (I'm sure others as well, but I'm most familiar with them) offers managed cross-DC fiber. You will probably need to manage the networking, to be fair, and I will readily admit that's not trivial.
Except, almost nobody, outside of very large players, does cross region redundancy. us-east-1 is like a SPOF for the entire Internet.
Cloud noob here. But if I have a central database what can I distribute across geographic regions? Static assets? Maybe a cache?
Yep. Cross-region RDBMS is a hard problem, even when you're using a managed service – you practically always have to deal with eventual consistency, or increased latency for writes.
Does it? I've seen outages around "Sorry, us-west_carolina-3 is down". AWS is particularly good at keeping you aware of their datacenters.
It can be useful. I run a latency sensitive service with global users. A cloud lets me run it in 35 locations dealing with one company only. Most of those locations only have traffic to justify a single, smallish, instance.
In the locations where there's more traffic, and we need more servers, there are more cost effective providers, but there's value in consistency.
Elasticity is nice too, we doubled our instance count for the holidays, and will return to normal in January. And our deployment style starts a whole new cluster, moves traffic, then shuts down the old cluster. If we were on owned hardware, adding extra capacity for the holidays would be trickier, and we'd have to have a more sensible deployment method. And the minimum service deployment size would probably not be a little quad processor box with 2GB ram.
Using cloud for the lower traffic locations and a cost effective service for the high traffic locations would probably save a bunch of money, but add a lot of deployment pain. And a) it's not my decision and b) the cost difference doesn't seem to be quite enough to justify the pain at our traffic levels. But if someone wants to make a much lower margin, much simpler service with lots of locations and good connectivity, be sure to post about it. But, I think the big clouds have an advantage in geographic expansion, because their other businesses can provide capital and justification to build out, and high margins at other locations help cross subsidize new locations when they start.
I agree it can be useful (latency, availability, using off-peak resources), but running globally should be a default and people should opt-in into fine-grained control and responsibility.
From outside it seems that either AWS picked the wrong default to present their customers, or that it's unreasonably expensive and it drives everyone into the in-depth handling to try to keep cloud costs down.
if you see that, you are doing it wrong :)
AWS has had multiple outages which were caused by a single AZ failing.
My company used to do everything on-prem. Until a literal earthquake and tsunami took down a bunch of systems.
After that, yeah we’ll let AWS do the hard work of enabling redundancy for us.
Cloud is more than instances. If all you need is a bunch of boxes, then cloud is a terrible fit.
I use AWS cloud a lot, and almost never use any VMs or instances. Most instances I use are along the lines of a simple anemic box for a bastion host or some such.
I use higher level abstractions (services) to simplify solutions and outsource maintenance of these services to AWS.
The bottom line > babysitting hardware. Businesses are transitioning to cloud because it's better for business.
In the public sector, cloud solves the procurement problem. You just need to go through the yearlong process once to use a cloud service, instead of for each purchase > 1000€.
> What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue with a writeup like this with silly cloud talking points.
I’m sure I’ll be downvoted to hell for this, but I’m convinced that it’s largely their insecurities being projected.
Running your own hardware isn’t tremendously difficult, as anyone who’s done it can attest, but it does require a much deeper understanding of Linux (and of course, any services which previously would have been XaaS), and that’s a vanishing trait these days. So for someone who may well be quite skilled at K8s administration, serverless (lol) architectures, etc. it probably is seen as an affront to suggest that their skill set is lacking something fundamental.
> So for someone who may well be quite skilled at K8s administration ...
And running your own hardware is not incompatible with Kubernetes: on the contrary. You can fully well have your infra spin up VMs and then do container orchestration if that's your thing.
And part of your hardware monitoring and reporting tooling can work perfectly fine from containers.
Bare metal -> Hypervisor -> VM -> container orchestration -> a container running a "stateless" hardware monitoring service. And VMs themselves are "orchestrated" too. Everything can be automated.
Anyway, say a hard disk begins to show errors? Notifications get sent (email/SMS/Telegram/whatever) by another service in another container, and the dashboard shall show it too (dashboards are cool).
Go to the machine once the spare disk has already been resilvered, move it where the failed disk was, plug in a new disk that becomes the new spare.
Boom, done.
I'm not saying all self-hosted hardware should do container orchestration: there are valid use cases for bare metal too.
But something has to be said for controlling everything on your own infra: from the bare metal to the VMs to container orchestration. To even potentially your own IP address space.
This is all within reach of an individual, both skill-wise and price-wise (including obtaining your own IP address space). People who drank the cloud kool-aid should ponder this and wonder how good their skills truly are if they cannot get this up and working.
Fully agree. And if you want to take it to the next level (and have a large budget), Oxide [0] seems to have neatly packaged this into a single coherent product. They don't quite have K8s fully running, last I checked, but there are of course other container orchestration systems.
> Go to the machine once the spare disk has already been resilvered
Hi, fellow ZFS enthusiast :-)
[0]: https://oxide.computer
> And running your own hardware is not incompatible with Kubernetes: on the contrary
Kubernetes actually makes so much more sense on bare-metal hardware.
On the cloud, I think the value prop is dubious - your cloud provider is already giving you VMs, why would you need to subdivide them further and add yet another layer of orchestration?
Not to mention that you're getting 2010s-era performance on those VMs, so subdividing them is terrible from a performance point of view too.
> All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding, but serve to introduce doubt to non-technical people and to trick people who don't examine what they're told.
This feels like "no true scotsman" to me. I've been building software for close to two decades, but I guess I don't have "any real technical understanding" because I think there's a compelling case for using "cloud" services for many (honestly I would say most) businesses.
Nobody is "afraid to openly discuss how cloud isn't right for many things". This is extremely commonly discussed. We're discussing it right now! I truly cannot stand this modern innovation in discourse of yelling "nobody can talk about XYZ thing!" while noisily talking about XYZ thing on the lowest-friction publishing platforms ever devised by humanity. Nobody is afraid to talk about your thing! People just disagree with you about it! That's ok, differing opinions are normal!
Your comment focuses a lot on cost. But that's just not really what this is all about. Everyone knows that on a long enough timescale with a relatively stable business, the total cost of having your own infrastructure is usually lower than cloud hosting.
But cost is simply not the only thing businesses care about. Many businesses, especially new ones, care more about time to market and flexibility. Questions like "how many servers do we need? with what specs? and where should we put them?" are a giant distraction for a startup, or even for a new product inside a mature firm.
Cloud providers provide the service of "don't worry about all that, figure it out after you have customers and know what you actually need".
It is also true that this (purposefully) creates lock-in that is expensive either to leave in place or unwind later, and it definitely behooves every company to keep that in mind when making architecture decisions, but lots of products never make it to that point, and very few of those teams regret the time they didn't spend building up their own infrastructure in order to save money later.
The problem with your claims here is they can only be right if the entire industry is experiencing mass psychosis. I reject a theory that requires that, because my ego just isn't that large.
I once worked for several years at a publicly traded firm well-known for their return-to-on-prem stance, and honestly it was a complete disaster. The first-party hardware designs didn't work right because they didn't have the hardware-design staffing levels to have de-risked the possibility that AMD would fumble the performance of Zen 1, leaving them with a generation of useless hardware they nonetheless paid for. The OEM hardware didn't work right because they didn't have the chops to qualify it either, leaving them scratching their heads for months over a cohort of servers they eventually discovered were contaminated with metal chips. And, most crucially, for all the years I worked there, the only thing they wanted to accomplish was failover from West Coast to East Coast, which never worked, not even once. When I left that company they were negotiating with the data center owner, who wanted to triple the rent.
These experiences tell me that cloud skeptics are sometimes missing a few terms in their equations.
> The problem with your claims here is they can only be right if the entire industry is experiencing mass psychosis.
Yes. Mass psychosis explains an incredible number of different and apparently unrelated problems with the industry.
"Vendor problems" is a red herring, IMO; you can have those in the cloud, too.
It's been my experience that those who can build good, reliable, high-quality systems, can do so either in the cloud or on-prem, generally with equal ability. It's just another platform to such people, and they will use it appropriately and as needed.
Those who can only make it work in the cloud are either building very simple systems (which is one place where the cloud can be appropriate), or are building a house of cards that will eventually collapse (or just cost them obscene amounts of money to keep on life support).
Engineering is engineering. Not everyone in the business does it, unfortunately.
Like everything, the cloud has its place -- but don't underestimate the number of decisions that get taken out of the hands of technical people by the business people who went golfing with their buddy yesterday. He just switched to Azure, and it made his accountants really happy!
The whole CapEx vs. OpEx issue drives me batty; it's the number one cause of cloud migrations in my career. For someone who feels like spent money should count as spent money regardless of the bucket it comes out of, this twists my brain in knots.
I'm clearly not a finance guy...
> or are building a house of cards that will eventually collapse (or just cost them obscene amounts of money to keep on life support)
Ding ding ding. It's this.
> The whole CapEx vs. OpEx issue drives me batty
Seconded. I can't help but feel like it's not just a "I don't understand money" thing, but more of a "the way Wall Street assigns value is fundamentally broken." Spending $100K now, once, vs. spending $25K/month indefinitely does not take a genius to figure out.
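The arithmetic from the comment above, spelled out (the $100K/$25K figures are the parent's; the function is just a sketch):

```python
# One-time $100K CapEx vs. $25K/month OpEx: count the month in which
# cumulative recurring spend first exceeds the one-time purchase.
def months_until_opex_exceeds(capex: int, opex_per_month: int) -> int:
    month, total = 0, 0
    while total <= capex:
        month += 1
        total += opex_per_month
    return month

print(months_until_opex_exceeds(100_000, 25_000))  # -> 5
```

Five months in, the recurring option has already cost more, and it keeps accruing indefinitely, yet the accounting treatment often makes the second look preferable.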
You forgot COGS.
It's all about painting the right picture for your investors, so you make up shit and classify it as COGS or opex depending on what is most beneficial for you in the moment.
> The problem with your claims here is they can only be right if the entire industry is experiencing mass psychosis.
What's the market share of Windows again? ;)
There's however a middle-ground between run your own colocated hardware and cloud. It's called "dedicated" servers and many hosting providers (from budget bottom-of-the-barrel to "contact us" pricing) offer it.
Those take on the liability of sourcing, managing and maintaining the hardware for a flat monthly fee, and would take on such risk. If they make a bad bet purchasing hardware, you won't be on the hook for it.
This seems like a point many pro-cloud people (intentionally?) overlook.
> All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding ...
And moreover most of the actual interesting things, like having VM templates and stateless containers, orchestration, etc. is very easy to run yourself and gets you 99.9% of the benefits of the cloud.
Just about any service you could want is available as a container definition, already written for you. And if it doesn't exist, it's not hard to plumb up.
A friend of mine runs more than 700 containers (yup, seven hundred), split between his own rack at home (half of them) and dedicated servers (he runs stuff like FlightRadar, AI models, etc.). He'll soon get his own IP address space. Complete "chaos monkey"-ready infra where you can cut any cable and the thing keeps working: everything is duplicated, can be spun up on demand, etc. Someone could steal his entire rack and all his dedicated servers, and he'd still be back operational in no time.
If an individual can do that, a company, no matter its size, can do it too. And arguably 99.9% of all the companies out there don't need an infra as powerful as the one most homelab enthusiasts have.
And another thing: there's even two in-betweens between "cloud" and "our own hardware located at our company". First is colocating your own hardware but in a datacenter. Second is renting dedicated servers from a datacenter.
They're often ready to accept cloud-init directly.
And it's not hard. I'd say learning to configure hypervisors on bare metal, then spin VMs from templates, then running containers inside the VMs is actually much easier than learning all the idiosyncrasies of all the different cloud vendors APIs and whatnots.
Funnily enough, when the pendulum swung way too far to the "cloud all the things" side, those saying we'd at some point read stories about repatriation were being made fun of.
> If an individual can do that, a company, no matter its size, can do it too.
Fully agreed. I don't have physical HA – if someone stole my rack, I would be SOL – but I can easily ride out a power outage for as long as I want to be hauling cans of gasoline to my house. The rack's UPS can keep it up at full load for at least 30 minutes, and I can get my generator running and hooked up in under 10. I've done it multiple times. I can lose a single server without issue. My only SPOF is internet, and that's only by choice, since I can get both AT&T and Spectrum here, and my router supports dual-WAN with auto-failover.
> And arguably 99.9% of all the companies out there don't have the need for an infra as powerful as the one most homelab enthusiast have.
THIS. So many people have no idea how tremendously fast computers are, and how much of an impact latency has on speed. I've benchmarked my 12-year-old Dells against the newest and shiniest RDS and Aurora instances on both MySQL and Postgres, and the only ones that kept up were the ones with local NVMe disks. Mine don't even technically have _local_ disks; they're NVMe via Ceph over InfiniBand.
Does that scale? Of course not; as soon as you want geo-redundant, consistent writes, you _will_ have additional latency. But most smaller and medium companies don't _need_ that.
Plugging https://BareMetalSavings.com
in case you want to ballpark-estimate your move off of the cloud
Bonus points: I'm a Fastmail customer, so it tangentially tracks
----
Quick note about the article: ZFS encryption can be flaky, so be sure you know what you're doing before deploying it in your infrastructure.
Relevant Reddit discussion: https://www.reddit.com/r/zfs/comments/1f59zp6/is_zfs_encrypt...
A spreadsheet of related issues that I can't remember who made:
https://docs.google.com/spreadsheets/d/1OfRSXibZ2nIE9DGK6sww...
Such an awesome article. I like how they didn't just go with the Cloud wave but kept sysadmin'ing, like ol' Unix graybeards. Two interesting things they wrote about their SSDs:
1) "At this rate, we’ll replace these [SSD] drives due to increased drive sizes, or entirely new physical drive formats (such E3.S which appears to finally be gaining traction) long before they get close to their rated write capacity."
and
2) "We’ve also anecdotally found SSDs just to be much more reliable compared to HDDs (..) easily less than one tenth the failure rate we used to have with HDDs."
To avoid sysadmin tasks, and keep costs down, you've got to go so deep in the cloud, that it becomes just another arcane skill set. I run most of my stuff on virtual Linux servers, but some on AWS, and that's hard to learn, and doesn't transfer to GCP or Azure. Unless your needs are extreme, I think sysadmin'ing is the easier route in most cases.
For so many things the cloud isn't really easier or cheaper, and most cloud providers stopped advertising it as such. My assumption is that cloud adoption is mainly driven by 3 forces:
- for small companies: free credits
- for large companies: moving prices as far away as possible from the deploy button, allowing dev and IT to just deploy stuff without purchase orders
- self-perpetuating due to hype, cv-driven development, and ease of hiring
All of these are decent reasons, but none of them may apply to a company like fastmail
Also CYA. If you run your own servers and something goes wrong, it's your fault. If it's an outage at AWS, it's their fault.
Also a huge element of following the crowd, branding non-technical management are familiar with, and so on. I have also found some developers (front-end devs, or back-end devs who do not have sysadmin skills) feel cloud is the safe choice. This is very common for small companies, as they may have limited sysadmin skills (people who know how to keep Windows desktops running are not likely to be who you want deploying servers) and a web GUI looks a lot easier to learn.
> If it's an outage at AWS it's their fault.
Well, still your fault, but easy to judo the risk into clients saying supporting multi-cloud is expensive and not a priority.
Management in many places will not even know what multi-cloud is (or even multi-region).
As CrowdStrike showed, if you follow the crowd and tick the right boxes you will not be blamed.
There are other, if often at least tangentially related, reasons but more than I can give justice to in a comment.
Many people got a lot of things wrong about cloud, which I've been meaning to write about for a while. I'll get to it after the holidays. But probably none more than the idea that massive centralized computing (which was wrongly characterized as a utility like the electric grid) would have economics with which more local computing options could never compete.
In small companies, cloud also provides the ability to work around technical debt and to reduce risk.
For example, I have seen several cases where poorly designed systems that unexpectedly used too much memory, and there was no time to fix it, so the company increased the memory on all instances with a few clicks. When you need to do this immediately to avoid a botched release that has already been called "successful" and announced as such to stakeholders, that is a capability that saves the day.
An example of de-risking is using a cloud filesystem like EFS to provide a pseudo-infinite volume. No risk of an outage due to an unexpectedly full disk.
Another example would be using a managed database system like RDS vs self-managing the same RDBMS: using the managed version saves on labor and reduces risk for things like upgrades. What would ordinarily be a significant effort for a small company becomes automatic, and RDS includes various sanity checks to help prevent you from making mistakes.
The reality of the industry is that many companies are just trying to hit the next milestone of their business by a deadline, and the cloud can help despite the downsides.
> For example, I have seen several cases where poorly designed systems that unexpectedly used too much memory
> using a managed database system like RDS vs self-managing the same RDBMS: using the managed version saves on labor
As a DBRE / SRE, I can confidently assert that belief in the latter is often directly responsible for the former. AWS is quite clear in their shared responsibility model [0] that you are still responsible for making sound decisions, tuning various configurations, etc. Having staff that knows how to do these things often prevents the poor decisions from being made in the first place.
[0]: https://aws.amazon.com/compliance/shared-responsibility-mode...
I'm very interested in approaches that avoid cloud, so please don't read this as me saying cloud is superior. I can think of some other advantages of cloud:
- easy to setup different permissions for users (authorisation considerations).
- able to transfer assets to another owner (e.g., if there's a sale of a business) without needing to move physical hardware.
- other outsiders (consultants, auditors, whatever) can come in and verify the security (or other) of your setup, because it's using a standard well known cloud platform.
I predict a slow but unstoppable comeback of the sysadmin job over the next 5-10 years.
It never disappeared in some places. In my region there's been zero interest in "the cloud" because of physical remoteness from all major GCP/AWS/Azure datacenters (resulting in high latency), for compliance reasons, and because it's easier and faster to solve problems by dealing with a local company than pleading with a global giant that gives zero shits about you because you're less than a rounding error in its books.
> it becomes just another arcane skill set
It's an arcane skill set with a GUI. That makes it look much easier to learn.
The power of Moore's law.
I don't see how point 2 could have come as a surprise to anyone.
The fact that Fastmail work like this, are transparent about what they're up to and how they're storing my email and the fact that they're making logical decisions and have been doing so for quite a long time is exactly the reason I practically trip over myself to pay them for my email. Big fan of Fastmail.
They are also active in contributing to cyrus-imap
Aside: Fastmail was the best email provider I ever used. The interface was intuitive and responsive, both on mobile and web. They have extensive documentation for everything. I was able to set up a custom domain and a catch-all email address in a few minutes. Customer support is great, too. I emailed them about an issue and they responded within the hour (turns out it was my fault). I feel like it's a really mature product/company and they really know what they're doing, and have a plan for where they're going.
I ended up switching to Protonmail, because of privacy (Fastmail is within the Five Eyes (Australia)), which is the only thing I really like about Protonmail. But I'm considering switching back to Fastmail, because I liked it so much.
I was told Fastmail is excellent, and I am not a big fan of Gmail. Once locked out of Gmail for good, your email and the apps associated with it are gone forever. Source? Personal experience.
"A private inbox $60 for 12 months". I assume it is USD, not AU$ (AFAIK, Fastmail is based in Australia.) Still pricey.
At https://www.infomaniak.com/ I can buy email service for an (in my case external) domain for 18 Euro a year and I get 5 inboxes. And it is based in Switzerland, so no EU or US jurisdiction.
I have a few websites, and Fastmail would just be prohibitively expensive for me.
I have seen a common sentiment that self-hosting is almost always better than cloud. What these discussions do not mention is how to effectively run your business applications on this infrastructure.
Things like identity management (AAD/IAM), provisioning and running VMs, deployments. Network-side things like VNets, DNS, securely opening ports, etc. Monitoring setup across the stack. There is so much functionality required to safely expose an application externally that I can't even coherently list it all here. Are people just using SaaS for everything (which I think would defeat the purpose of on-prem infra), or can a competent sysadmin handle all this to give a cloud-like experience to end developers?
Can someone share their experience or share any write ups on this topic?
For more context, I briefly worked at a very large hedge fund which had a small DC's worth of VERY beefy machines but absolutely no platform on top of them. Hosting an application was done by copying the binaries onto a particular well-known machine, running npm commands, and restarting nginx. You'd log a ticket with a sysadmin to reserve a DNS entry and point an internal DNS name at this machine (no load balancer). Deployment was a shell script which rcp'd new binaries and restarted nginx. No monitoring or observability stack. There was a script which would log you into a random machine to run your workloads (be ready for angry IMs from more senior quants running their workloads on that random machine if your development build takes up enough resources to affect their work). I can go on and on, but I think you get the idea.
> identity management (AAD/IAM)
Do you mean for administrative access to the machines (over SSH, etc) or for "normal" access to the hosted applications?
Admin access: an Ansible-managed set of UNIX users & associated SSH public keys, combined with remote logging (so every access is audited and a malicious operator wiping the machine can't cover their tracks), will generally get you pretty far. Beyond that, there are commercial solutions like Teleport which provide integration with an IdP, a management web UI, session logging & replay, etc.
Normal line-of-business access: this would be managed by whatever application you're running, not much different to the cloud. But if your application isn't auth-aware or is unsafe to expose to the wider internet, you can stick it behind various auth proxies such as Pomerium - it will effectively handle auth against an IdP and only pass through traffic to the underlying app once the user is authenticated. This is also useful for isolating potentially vulnerable apps.
> provisioning and running VMs
Provisioning: once a VM (or even a physical server) is up and running enough to be SSH'd into, you should have a configuration management tool (Ansible, etc) apply whatever configuration you want. This would generally involve provisioning users, disabling some stupid defaults (SSH password authentication, etc), installing required packages, etc.
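As a hedged sketch, a baseline playbook of the kind described might look like this (the `ops` user, key path, and play layout are invented for illustration; the modules are standard Ansible ones):

```shell
cat > /tmp/baseline.yml <<'EOF'
- hosts: all
  become: true
  tasks:
    - name: Disable SSH password authentication
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?PasswordAuthentication'
        line: 'PasswordAuthentication no'
      notify: restart sshd

    - name: Provision the operator's SSH key
      ansible.posix.authorized_key:
        user: ops
        key: "{{ lookup('file', 'keys/ops.pub') }}"

  handlers:
    - name: restart sshd
      ansible.builtin.service:
        name: sshd
        state: restarted
EOF
# Apply with: ansible-playbook -i inventory /tmp/baseline.yml
```

The same playbook run against a fresh machine and an old one converges both to the same state, which is the point of using a config management tool instead of ad-hoc SSH sessions.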
To get a VM to an SSH'able state in the first place, you can configure your hypervisor to pass through "user data" which will be picked up by something like cloud-init (integrated by most distros) and interpreted at first boot - this allows you to do things like include an initial SSH key, create a user, etc.
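For example, a tiny cloud-init user-data file of the sort described (the username and key are placeholders); on a local hypervisor, `cloud-localds` from cloud-image-utils can turn it into a seed disk attached to the VM at first boot:

```shell
cat > /tmp/user-data <<'EOF'
#cloud-config
users:
  - name: ops
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-ed25519 AAAAC3...placeholder ops@workstation
EOF
# e.g. cloud-localds /tmp/seed.img /tmp/user-data
```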
To run VMs on self-managed hardware: libvirt, proxmox in the Linux world. bhyve in the BSD world. Unfortunately most of these have rough edges, so commercial solutions there are worth exploring. Alternatively, consider if you actually need VMs or if things like containers (which have much nicer tooling and a better performance profile) would fit your use-case.
> deployments
Depends on your application. But let's assume it can fit in a container - there's nothing wrong with a systemd service that just reads a container image reference in /etc/... and uses `docker run` to run it. Your deployment task can just SSH into the server, update that reference in /etc/ and bounce the service. Evaluate Kamal which is a slightly fancier version of the above. Need more? Explore cluster managers like Hashicorp Nomad or even Kubernetes.
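A minimal sketch of that systemd-wrapper pattern (the unit name, port, and paths are made up for illustration):

```shell
# /etc/myapp/image holds a single line: the container image reference.
cat > /tmp/myapp.service <<'EOF'
[Unit]
Description=myapp container
After=docker.service
Requires=docker.service

[Service]
ExecStart=/bin/sh -c 'exec docker run --rm --name myapp -p 8080:8080 "$(cat /etc/myapp/image)"'
ExecStop=/usr/bin/docker stop myapp
Restart=always

[Install]
WantedBy=multi-user.target
EOF
```

A "deployment" is then just overwriting /etc/myapp/image with the new reference over SSH and running `systemctl restart myapp`.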
> Network side of things like VNet
Wireguard tunnels set up (by your config management tool) between your machines, which will appear as standard network interfaces with their own (typically non-publicly-routable) IP addresses, and anything sent over them will transparently be encrypted.
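For illustration, one end of such a tunnel is just a small config file that a config-management tool would template out (the keys, names, and addresses below are placeholders):

```shell
cat > /tmp/wg0.conf <<'EOF'
[Interface]
# Private, non-publicly-routable address on the overlay network
Address = 10.0.0.1/24
PrivateKey = <this-machine's-private-key>
ListenPort = 51820

[Peer]
PublicKey = <other-machine's-public-key>
AllowedIPs = 10.0.0.2/32
Endpoint = peer2.example.com:51820
PersistentKeepalive = 25
EOF
# Bring it up with: wg-quick up /tmp/wg0.conf
```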
> DNS
Generally very little reason not to outsource that to a cloud provider or even your (reputable!) domain registrar. DNS is mostly static data though, which also means if you do need to do it in-house for whatever reason, it's just a matter of getting a CoreDNS/etc container running on multiple machines (maybe even distributed across the world). But really, there's no reason not to outsource that and hosted offerings are super cheap - so go open an AWS account and configure Route53.
> securely opening ports
To begin with, you shouldn't have anything listening that you don't want to be accessible. Then it's not a matter of "opening" or closing ports - the only ports that actually listen are the ones you want open by definition because it's your application listening for outside traffic. But you can configure iptables/nftables as a second layer of defense, in case you accidentally start something that unexpectedly exposes some control socket you're not aware of.
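A minimal default-drop nftables ruleset of that kind might look like this (the port numbers are just examples):

```shell
cat > /tmp/filter.nft <<'EOF'
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept   # allow replies to our own traffic
    iif lo accept                         # allow loopback
    tcp dport { 22, 443 } accept          # SSH and the app itself
  }
}
EOF
# Load it with: nft -f /tmp/filter.nft
```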
> Monitoring setup across the stack
collectd running on each machine (deployed by your configuration management tool) sending metrics to a central machine. That machine runs Grafana/etc. You can also explore "modern" stuff that the cool kids play with nowadays like VictoriaMetrics, etc, but metrics is mostly a solved problem so there's nothing wrong with using old tools if they work and fit your needs.
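The per-machine side of that can be as small as pointing collectd's network plugin at the central box (the hostname here is a placeholder):

```shell
cat > /tmp/collectd-network.conf <<'EOF'
LoadPlugin network
<Plugin network>
  # Ship metrics to the central collector on collectd's default port
  Server "metrics.internal" "25826"
</Plugin>
EOF
```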
For logs, configure rsyslogd to log to a central machine - on that one, you can have log rotation. Or look into an ELK stack. Or use a hosted service - again nothing prevents you from picking the best of cloud and bare-metal, it's not one or the other.
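The rsyslog forwarding side of that is a one-liner dropped into /etc/rsyslog.d/ (the hostname is a placeholder):

```shell
cat > /tmp/50-forward.conf <<'EOF'
# Forward everything to the central log host; @@ = TCP, a single @ = UDP
*.* @@loghost.internal:514
EOF
```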
> safely expose an application externally
There's a lot of snake oil and fear-mongering around this. First off, you need to differentiate between vulnerabilities of your application and vulnerabilities of the underlying infrastructure/host system/etc.
App vulnerabilities, in your code or dependencies: cloud won't save you. It runs your application just like it's been told. If your app has an SQL injection vuln or one of your dependencies has an RCE, you're screwed either way. To manage this you'd do the same as you do in cloud - code reviews, pentesting, monitoring & keeping dependencies up to date, etc.
Infrastructure-level vulnerabilities: cloud providers are responsible for keeping the host OS and their provided services (load balancers, etc) up to date and secure. You can do the same. Some distros provide unattended updates, which your config management tool can enable. Stuff that doesn't need to be reachable from the internet shouldn't be (bind internal stuff to your Wireguard interfaces). Put admin stuff behind some strong auth - TLS client certificates are the gold standard but have management overheads. Otherwise, use an IdP-aware proxy (like mentioned above). Don't always trust app-level auth. Beyond that, it's the usual - common sense, monitoring for "spooky action at a distance", and luck. Not too much different from your cloud provider, because they won't compensate you either if they do get hacked.
> For more context, I worked at a very large hedge fund briefly which had a small DC worth of VERY beefy machines but absolutely no platform on top of it...
Nomad or Kubernetes.
No, using Ansible to distribute public keys does not get you very far. It's fine for a personal project or even a team of 5-6 with a handful of machines, but beyond that you really need a better way to onboard, offboard, and modify accounts. If you're doing anything but a toy project, you're better off starting with something like IPA for host access controls.
Why do you think that? I did something similar at a previous job for something bordering on 1k employees.
User administration was done by modifying a yaml file in git. Nothing bad to say about it really. It sure beats point-and-click Active Directory any day of the week. Commit log handy for audits.
If there are no externalities demanding anything else, I'd happily do it again.
What's the risk you're trying to protect against, that a "better" (which one?) way would mitigate that this one wouldn't?
> IPA
Do you mean https://en.wikipedia.org/wiki/FreeIPA ? That seems like a huge amalgamation of complexity in a non-memory-safe language that I feel like would introduce a much bigger security liability than the problem it's trying to solve.
I'd rather pony up the money and use Teleport at that point.
It's basically Kerberos and an LDAP server, which are technologies old and reliable as dirt.
This sort of FUD is why people needlessly spend so much money on cloud.
> which are technologies old and reliable as dirt.
Technologies, sure. Implementations? Not so much.
I can trust OpenSSH because it's deployed everywhere, and I can be confident all the low-hanging fruit is gone by now; if not, its ubiquity means I'm unlikely to be the most interesting target, so I am more likely to escape a potential zero-day unscathed.
What's the market share of IPA in comparison? Has it seen any meaningful action in the last decade, and the same attention, from both white-hats (audits, pentesting, etc.) as well as black-hats (trying to break into every exposed service)? I very much doubt it, so the safe thing to assume is that it's nowhere near as bulletproof as OpenSSH and that it's more likely for a dedicated attacker to find a vuln there.
Love this article, and I'm also running some stuff on old enterprise servers in some racks somewhere. Now, over the last year, I've had to dive into Azure cloud as we have customers using it (B2B company), and I finally understood why everyone is doing cloud despite the price:
Global permissions, seamless organization, and IaC. If you are Fastmail or a small startup - go buy some used Dell PowerEdge with Epycs in some colo rack with 10GbE transit and save tons of money.
If you are a company with tons of customers and tons of requirements, it's powerful to put each concern into a landing zone, run some Bicep/Terraform, have a resource group to control costs, get savings on overall core count, and be done with it.
Assign permissions into a namespace for your employee or customer, have some back and forth about requirements, and it's done. No need to sysadmin across servers. No need to check for broken disks.
I also blame the hell of VMware and virtual machines for everything that is a PITA to maintain as a sysadmin but is loved because it's common knowledge. I would only do k8s on bare metal today and skip the whole virtualization thing completely. I guess it's also these pains that are softened in the cloud.
Why is it surprising? It's well known cloud is 3 times the price.
Because the default for companies today is cloud, even though it almost never makes sense. Sure, if you have really spiky load, need to dynamically scale at any point, and don't care about your spend, it might make sense.
I've even worked in companies where the engineering team spent effort and time on building "scalable infrastructure" before the product itself even found product-market fit...
Nobody said it's surprising though; they are well aware of it, having done it for more than two decades. Many newcomers are not aware of it though, as their default is "cloud" and they never even shopped for servers, colocation, or looked around on the dedicated server market.
I don't think it's just that they're not aware. Purely from a scaling and distribution perspective, it'd be wiser to start on cloud while you're still in the product-market-fit phase. Also, "bare metal" requires more on the capex end, and with how our corporate tax system is set up, it's just discouraging to go down this lane first; you'd be better off spending on acquiring clients.
Also, I'd guess a lot of technical founders are more familiar with cloud/server-side work than with handling or delegating sysadmin tasks that might require adding members to the team.
I agree, the cloud definitely has a lot of use cases and when you are building more complicated systems it makes sense to just have to do a few clicks to get a new stack setup vs. having someone evaluate solutions and getting familiar with operating them on a deep level (backups etc.).
Would be interesting to know how files get stored. They don't mention any distributed FS solutions like SeaweedFS, so once a drive is full, does the file get sent to another one via some service? Also, ZFS seems an odd choice since deletions (esp. of small files) on an 80%+ full drive are crazy slow.
Unlike ext4, which locks the directory when unlinking, ZFS is able to scale on parallel unlinking. Specifically, ZFS has range locks that permit directory entries to be removed in parallel from the extendible hash trees that store them. While this is relatively slow for sequential workloads, it is fast on parallel workloads. If you want to delete a large directory subtree fast on ZFS, do the rm operations in parallel. For example, this will run faster on ZFS than a naive rm operation:
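Something along these lines (the paths are illustrative, and this uses find with xargs -P rather than GNU parallel) runs one rm per batch of paths, with as many jobs as there are CPUs:

```shell
# Create a demo directory with many small files...
mkdir -p /tmp/zfs_rm_demo
( cd /tmp/zfs_rm_demo && seq 1 1000 | xargs touch )

# ...then unlink them in parallel: one rm per 100 paths, $(nproc) jobs at once.
find /tmp/zfs_rm_demo -mindepth 1 -maxdepth 1 -print0 \
  | xargs -0 -P "$(nproc)" -n 100 rm -f
```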
A friend had this issue on spinning disks the other day. I suggested he do this, and the remaining files were gone in seconds, when at the rate his naive rm was running, it should have taken minutes. It is a shame that rm does not implement a parallel unlink option internally (e.g. -j), which would be even faster, since it would eliminate the execve overhead and likely some directory lookup overhead too, versus using find and parallel to run many rm processes. For something like Fastmail, which has many users, unlinking should be parallel already, so unlinking on ZFS will not be slow for them.
By the way, that 80% figure has not been true for more than a decade. You are referring to the best fit allocator being used to minimize external fragmentation under low space conditions. The new figure is 96%. It is controlled by metaslab_df_free_pct in metaslab.c:
https://github.com/openzfs/zfs/blob/zfs-2.2.0/module/zfs/met...
Modification operations become slow when you are at/above 96% space filled, but that is to prevent even worse problems from happening. Note that my friend’s pool was below the 96% threshold when he was suffering from a slow rm -r. He just had a directory subtree with a large amount of directory entries he wanted to remove.
For what it is worth, I am the ryao listed here and I was around when the 80% to 96% change was made:
https://github.com/openzfs/zfs/graphs/contributors
I discovered this yesterday! Blew my mind. I had to check 3 times that the files were actually gone and that I specified the correct directory as I couldn't believe how quick it ran. Super cool
Thank you very much for sharing this, very insightful.
The open-source Cyrus IMAP server, which they mention using, has replication built in. ZFS also has built-in replication available.
Deletion of files depends on how they have configured the message store - they may be storing a lot of data into a database, for example.
ZFS replication is quite unreliable when used with ZFS native encryption, in my experience. Didn't lose data but constant bugs.
Keeping enough free space should be much less of a problem with SSDs. They can tune it so the array needs to be 95% full before the slower best-fit allocator kicks in. https://openzfs.readthedocs.io/en/latest/performance-tuning....
I think that 80% figure is from when drives were much smaller and finding free space over that threshold with the first-fit allocator was harder.
if you don't have high bandwidth requirements, like for background / batch processing, the ovh eco family [1] of bare metal servers is incredibly cheap
[1] https://eco.ovhcloud.com/en/
Didn’t see this in the article: do they have multi-AZ redundancy? I.e., if the entire RAID goes up in flames, what’s the recovery process?
Looks like they do mention that elsewhere: https://www.fastmail.com/features/reliability/
> Fastmail has some of the best uptime in the business, plus a comprehensive multi data center backup system. It starts with real-time replication to geographically dispersed data centers, with additional daily backups and checksummed copies of everything. Redundant mirrors allow us to failover a server or even entire rack in the case of hardware failure, keeping your mail running.
Yeah, that makes me feel uneasy as a long time fastmail user.
I absolutely love Fastmail. I moved off of Gmail years ago with zero regrets. Better UI, better apps, better company, and need I say better service? I still maintain and fetch from a Gmail account so it all just works seamlessly for receiving and sending Gmail, so you don’t have to give anything up either.
I moved from my own colocated 1U running Mailcow to Fastmail and don't regret it one bit. This was an interesting read, glad to see they think things through nice and carefully.
The only things I wish FM had are all software:
1. A takeout-style API to let me grab a complete snapshot once a week with one call
2. The ability to be an IdP for Tailscale.
Their UI is definitely faster but I do prefer the gmail UI, for example how new messages are displayed in threads is quite useless in fastmail.
Their android app has always been much snappier than Gmail, it's the little things that drew me to it years ago
I use Fastmail for my personal mail, and I don’t regret it, but I’m not quite as sold as you are, I guess maybe because I still have a few Google work accounts I need to use. Spam filtering in Fastmail is a little worse, and the search is _terrible_. The iOS app is usable but buggy. The easy masked emails are a big win though, and setting up new domains feels like less of a hassle with FM. I don’t regret using Fastmail, and I’d use them again for my personal email, but it doesn’t feel like a slam dunk.
I’ve started to host my own sites and stuff on an old MacBook in a cupboard with a shitty old external hard drive, running microk8s, and it’s great!
Another homelabber joins the ranks!!
"WHY we use our own hardware..."
The why is the interesting part of this article.
I take that back; this is (to me) the most interesting part:
"Although we’ve only ever used datacenter class SSDs and HDDs, failures and replacements every few weeks were a regular occurrence on the old fleet of servers. Over the last 3+ years, we’ve only seen a couple of SSD failures in total across the entire upgraded fleet of servers. This is easily less than one tenth the failure rate we used to have with HDDs."
I am working on a personal project (some would call it a startup, but I have no intention of getting external financing and other Americanisms) where I have set up my own CDN and video encoding, among other things. These days, whenever you have a problem, everyone answers "just use cloud", and that results in people really knowing nothing anymore. It is saddening. But on the other hand, it ensures all my decades of knowledge will be very well paid in the future, if I'd need to get a job.
I'm a little surprised it seems they didn't have some existing compression solution before moving to zfs. With so much repetitive text across emails I would think there would be a LOT to gain, such as from dictionaries, compressing many emails into bigger blobs, and fine-tuning compression options.
They use ZFS with zstd which likely compresses well enough.
Custom compression code can introduce bugs that could kill Fastmail's reputation for reliability.
It's better to use a well-tested solution that costs a bit more.
FYI - Fastmail web client has Offline support in beta right now.
https://www.fastmail.com/blog/offline-in-beta/
Very confused by this. What is in beta? I've had "offline" email access for 25 years. It's called an IMAP client.
[flagged]
Hey, this response makes you look like an adolescent asshole. Parent poster was clearly asking about prioritization.
Better an asshole than a moron in my opinion. If he maneuvers his cursor over the link he can click it and transform from “very confused” to “unconfused” provided he can comprehend English - which is admittedly in question.
> So after the success of our initial testing, we decided to go all in on ZFS for all our large data storage needs. We’ve now been using ZFS for all our email servers for over 3 years and have been very happy with it. We’ve also moved over all our database, log and backup servers to using ZFS on NVMe SSDs as well with equally good results.
If you're looking at ZFS on NVMe you may want to look at Alan Jude's talk on the topic, "Scaling ZFS for the future", from the 2024 OpenZFS User and Developer Summit:
* https://www.youtube.com/watch?v=wA6hL4opG4I
* https://openzfs.org/wiki/OpenZFS_Developer_Summit_2024
There are some bottlenecks that get in the way of getting all the performance that the hardware often is capable of.
gmail does spam filtering very well for me. fastmail, on the other hand, puts lots of legit emails into the spam folder. manually marking "not spam" doesn't help
other than that, i'm happy with fastmail.
If I look at my Gmail spam folder, there is very rarely something genuinely important in it. What there is, though, is a fair bit of random newsletters and announcements that I may have signed up for in some way, shape or form and that I don't really care about or generally look at. I assume they've been reported as spam by enough people, rather than simply unsubscribed from, that Google now labels them as such.
iCloud is just as bad, sends important things to spam constantly and marking as “not spam” has never done anything perceivable.
Anyone know what are some good data centers or providers to host your bare metal servers?
Are those backups geographically distributed?
The biggest win with running your own infra is disk/IO speeds, as noted here and in DHH's series on leaving cloud (https://world.hey.com/dhh/we-have-left-the-cloud-251760fb)
The cloud providers really kill you on IO for your VMs. Even if 'remote' SSDs are available with configurable ($$) IOPs/bandwidth limits, the size of your VM usually dictates a pitiful max IO/BW limit. In Azure, something like a 4-core 16GB RAM VM will be limited to 150MB/s across all attached disks. For most hosting tasks, you're going to hit that limit far before you max out '4 cores' of a modern CPU or 16GB of RAM.
On the other hand, if you buy a server from Dell and run your own hypervisor, you get a massive reserve of IO, especially with modern SSDs. Sure, you have to share it between your VMs, but you own all of the IO of the hardware, not some pathetic slice of it like in the cloud.
As is always said in these discussions, unless you're able to move your workload to PaaS offerings in the cloud (serverless), you're not taking advantage of what large public clouds are good at.
Biggest issue isn't even sequential speed but latency. In the cloud all persistent storage is networked and has significantly more latency than direct-attached disks. This is a physical (speed of light) limit, you can't pay your way out of it, or throw more CPU at it. This has a huge impact for certain workloads like relational databases.
I like this writeup, informative and to-the-point.
Today, the cloud isn’t about other people’s hardware.
It’s about infrastructure being an API call away. Not just virtual machines but also databases, load-balancers, storage, and so on.
The cost isn’t the DC or the hardware, but the hours spent on operations.
And you can abuse developers to do operations on the side :-)
And then come the weird aspects of bad cloud service providers, like IONOS: broken OS images; a provisioning API that is a bottleneck, where what other people do (and how much they do) can slow down your own provisioning; network interfaces that can take minutes to create via their API, with customer service saying "That's how it is, cannot change it."; and a very shitty web user interface that desperately tries to be a single-page app yet has all the default browser functionality, like the back button, broken. Yet they still cost literally 10x what Hetzner Cloud costs, while Hetzner basically does everything better.
And then it is still also about other people's hardware in addition to that.
Yeah, Cloud is a bit of a scam innit? Oxide is looking more and more attractive every day as the industry corrects itself from overspending on capabilities they would never need.
It’s trading time for money
Fake news. I've got my bare metal server deployed and installed with my ansible playbook even before you manage to log into the bazillion layers of abstraction that is AWS.
But can you do that on demand, in minutes, for 1000 application teams that have unique snowflake needs? Because Terraform or Bicep can.
Yes, welcome to business. But frankly, an email provider needs to have their own metal; if they don't, they're not worth doing business with.
longtime FM user here
good on them, understanding infrastructure and cost/benefit is essential in any business you hope to run for the long haul
I would like to know the tech stack behind it.
Host of an online service seems to think they deserve a medal for discovering that S3 buckets from a cloud provider are crap and cost a fortune.
The heading in this space makes you think they're running custom FPGAs, as with Gmail, not just running on metal... As for drive failures: welcome to storage at scale. Build your solution so that replacing 10 disks at a time is a weekly task, not a critical 2am incident when a single disk dies...
Storing/Accessing tonnes of <4kB files is difficult, but other providers are doing this on their own metal with CEPH at the PB scale.
I love ZFS, it's great with per-disk redundancy but CEPH is really the only game in town for inter-rack/DC resilience which I would hope my email provider has.
A mail-cloud provider uses its own hardware? Well, that’s to be expected, it would be a refreshing article if it was written by one of their customers.
But what about the cost and complexity of a room with the racks and the cooling needs of running these machines? And the uninterrupted power setup? The wiring mess behind the racks.
I'm not fastmail but this is not rocket science. Has everyone forgotten how datacentre services work in 2024?
Yes they have, and they feel they deserve credit for discovering a WiFi cable is more reliable than the new shiny kit that was sold to them by a vendor...
There is a very competitive market for colo providers in basically every major metropolitan area in the US, Europe, and Asia. The racks, power, cooling, and network to your machines is generally very robust and clearly documented on how to connect. Deploying servers in house or in a colo is a well understood process with many experts who can help if you don’t have these skills.
Colo offers the ability to ship and deploy and keep latencies down if you're global, but if you're local, yes, you should just get someone on site and the modern equivalent of a T1 line set up to your premises if you're running "online" services.
Even for cloud providers, these are mostly other people's problems, eg: Equinix
Own hardware doesn't mean own data center. Many data centers offer colocation.
Do colocation facilities solve that?
Since I moved from gmail to fastmail, my mailbox is full of spam. I tried setting up rules but there are just too many of them, so I abandoned that strategy after a month. Now I just label mail from senders that are not in my contacts differently. But it's still a mess. I'm at the point that I prefer WhatsApp over email.
So, Fastmail please fix this or tell me what I'm doing wrong. IMHO when uninteresting mail arrives it should take at most two clicks to install a new rule and apply it.
Your comment is confusing because you start this one saying your inbox is full of spam, but respond to a suggestion to mark it as spam by saying it's not actually spam.
If something is not spam but you want it out of your inbox there's a few options:
- click Unsubscribe next to the sender. This should be possible for essentially all promotional email.
- click Actions -> click Block <sender>. Messages from this address will now immediately go to trash.
- click Actions -> click Add rule from message (-> optionally change the suggested conditions) -> check Archive (or if you don't use labels click Move to) -> click Save. Messages matching the conditions will now skip your inbox.
There's not much they could do to make that easier without magically knowing what you care about and what you don't.
This last week, gmail failed to filter as spam an email with subject "#T Anitra", body,
> oF1 d 4440 - 2 B 32677 83
> R Teri E x E q
>
> k 50347733 Safoorabegum
and an attachment "7330757559.pdf". It let through 8 similar emails in the same week, and many more even more egregiously gibberish emails over the years. I'm not pleased with the quality of gmail's spam filter.
I moved to FastMail three years ago, and, for a contrasting experience, found that spam filtering was almost on a par with Gmail. I had feared it would be otherwise.
my inbox at fastmail is near empty from spam. the main spam i see in my inbox is forwarded from my gmail.
That probably says more about the email address that’s out there than anything else.
Fastmail has wildcard email support, so it’s pretty easy to have an email per purchase you make (for example). This makes it easy to see who leaked your email to spammers. Anyway, I have nowhere near the volume of spam with Fastmail that I had with Gmail.
Gmail puts most of my email in the spam folder, including a lot of non-spam. Manually labeling it as non-spam is not helping.
Never had that after the first few years, but I hear other people do have that. Maybe it's because I've used it for 2 decades now? I tried alternatives, including Fastmail, but I always leave them because I get swamped by spam while Gmail works fine.
There is a "Report Spam" function which is two clicks away (it's in the "More" menu).
I don't want to report everything as spam. For example, promotional emails from businesses that I bought something from. I don't want to punish those businesses; and those emails might contain vouchers that I could use later. But I want those emails moved out of the way without any action from my side.
That's like Spotify telling me to "keep disliking" when I complained about songs in a certain language (which I never liked or listened to, and certainly don't speak) filling my home screen, after I had told them in the first complaint that I had been doing exactly that for months.
What can I say, "Report Spam" seems to work for me. I'm just a customer of Fastmail.
If you get 12 spam mails every day and after 3 months of clicking "report spam" it still doesn't filter them, then it's not on par with Gmail.
If you meet someone new at a social event and give them your email address, where do you want your email provider to put the message that this person sent?
I get no spam on Fastmail. I assume this is because I never give out my email to anyone and create new ones for every interaction. This way I keep track of who I'm interacting with and also who's selling my alias emails.
Just wish there was a decent way to do this with mobile numbers!
Same, I religiously create a masked email for every website (just checked, it's now at 163!). I simply don't give my "main" email out.
Oddly enough, simply unsubscribing via the websites themselves has kept things clean; I've yet to notice any true spam from a random source aimed at any of my emails since I joined last year.
Cost isn’t always the most important metric. If that was the case, people would always buy the cheapest option of everything.