IANAL, but naming your product S2 and mentioning in the intro that AWS S3 is the tech you are enhancing is probably asking for a branding/copyright claim from Amazon. Same vertical, and it will definitely cause consumer confusion. I'm sure you've done the research on whether a trademark has been registered.
https://tsdr.uspto.gov/#caseNumber=98324800&caseSearchType=U...
Fun fact: S2 and EC2 sound exactly the same in Spanish - both are "ese dos". Add that to EC2 and S3 already being confusing to tell apart by ear
TBF, if I were building something with the goal of enhancing S3, I would call it S4.
That's short-term thinking. You need to leapfrog everybody and go S∞
That's actually a pretty cool name if you pronounce the first letter with its letter sound rather than as an initial: Sinfinity
Sounds more like a porn website...
Very responsive log porn. ;-)
Too late, name's taken for something else: https://incubator.apache.org/projects/s4.html
And don't forget the other S4: http://www.supersimplestorageservice.com/
It's like S3, except better because, by focusing on being a write-only data store, they can manage much more throughput and efficiency, plus your data is far more secure at rest than it is in S3.
why not s11?
F3 - (Fast Furious Fail-Safe)
S3++ ? T4?
My company is a Fivetran client, and they named that company after a (bad) joke, but it's worth a fortune.
Fivetran is going to zero because they don’t offer anything of actual value and their CEO isn’t a good person.
[1] https://news.ycombinator.com/item?id=42434450
At least Cloudflare's R2 has an argument for the naming (IBM vs HAL in 2001: A Space Odyssey)
Yep, letter S and a number is copyrighted, can't do that
1) we're talking about trademark law, not copyright law.
2) the problem here is that they're in the same business segment, and explicitly reference S3.
s3 (serverless stream store)
What could possibly be better than being sued by Amazon for some nitpicky naming issue?
That’s the kind of David vs. Goliath publicity one could only dream of …
98% of the time, lawsuits are just a money pit. There is zero publicity. A tiny number go viral. I don't think this is likely to be one of those times.
Most people would simply say "Amazon is right." Because Amazon is right. This is an intentional attempt to leverage their product branding to promote a new product. There is very little good here.
If this were open-source, academic, non-profit, or something like that, perhaps. A small venture trying to commercialize on some digital equivalent of Amazon's trade dress? I can't imagine anyone would care....
Even when someone is 100% right, there is usually zero publicity. Right or wrong, in most cases I've seen, the small guy settles with the big guy with the deep legal pockets and moves on, because litigating is too expensive.
In a situation like this one, your marketing spend / press coverage on the existing name is shot, links to your domain are shot, and perhaps you have egg on your face, depending on how things play out.
I'm not sure whether they consulted a bad trademark lawyer or didn't consult one at all, but it wouldn't have cost that much to do so. I say this having just recently started the process of filing a trademark - the cost is about the same as buying, e.g., 's4.dev' according to the domain registry's website.
Having to rebrand your product after launching is a lot more painful than doing it before launching.
Wow, imagine Debezium offering native compatibility with this, capturing the changes from a Postgres database and saving them as Delta or Iceberg in a pure serverless way!
This is a really good idea with a beautiful API, and something that I would like to use for my projects. However, I have zero confidence that this startup would last very long in its current form. If it's successful, AWS will build a better and cheaper in-house version. It's just as likely to fail to get traction.
If this had been released instead as a Papertrail-like end-user product with dashboards, etc. instead of a "cloud primitive" API so closely tied to AWS, it would make a lot more sense. Add the ability to bring my own S3-Compatible backend (such as Digital Ocean Spaces), and boom, you have a fantastic, durable, cloud-agnostic product.
(Founder) We do intend to be multi-cloud; we are just starting with AWS. Our internal architecture is not tied to AWS - it's built on interfaces that we can implement for other cloud systems.
It would be extra ironic if the whole thing already ran on top of AWS.
There's no end of startups that can be described as existing open-source software as a service, marketed as a cheaper alternative to AWS offerings... and which run on AWS.
They just did: https://news.ycombinator.com/item?id=42211280 (Amazon S3 now supports the ability to append data to an object, 30 days ago). Azure has had the same with append blobs for a long time. It's still a bit more raw than S2, without the concept of a record. The step for a cloud provider to offer this natively is very small. And with the concept of a record, isn't this essentially a message queue, where the competitor space is equally big? Likewise if you look into log storage solutions.
(Founder) Both S3 Express _One Zone_ appends and Azure's append blobs charge the regular PUT price for appends. It may work for you, but probably not if you want to do smaller writes.
Blob stores will also not let you do tailing reads, like you can with S2.
In AWS, S2's Express storage class takes care of writing to a quorum of 3 zonal buckets for regional durability.
I doubt object stores will go from operating at the level of blobs and byte ranges, to records and sequence numbers. But I could be wrong.
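For intuition, here is a minimal sketch of what a 2-of-3 quorum write across zonal buckets might look like. Purely illustrative - S2's internals aren't public, and `put_chunk` is a hypothetical stand-in for a zonal PUT:

    use futures::stream::{FuturesUnordered, StreamExt};

    // Hypothetical stand-in for a PUT to one zonal bucket.
    async fn put_chunk(bucket: &str, chunk: &[u8]) -> Result<(), String> {
        let _ = (bucket, chunk); // ... issue the zonal PUT here ...
        Ok(())
    }

    // Acknowledge an append once a majority (2 of 3) of zonal PUTs succeed.
    async fn quorum_write(buckets: [&str; 3], chunk: &[u8]) -> Result<(), String> {
        let mut puts: FuturesUnordered<_> =
            buckets.into_iter().map(|b| put_chunk(b, chunk)).collect();
        let mut oks = 0;
        while let Some(result) = puts.next().await {
            if result.is_ok() {
                oks += 1;
                if oks >= 2 {
                    return Ok(()); // durable in a majority of zones
                }
            }
        }
        Err("quorum not reached".into())
    }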
People keep making the same argument against Aptible (https://aptible.com) and it is still a very successful PaaS over a decade later.
I had never heard of this company, so I took a look. The main pitch was compelling, but then I went to the pricing page and saw that pricing jumps from $0 to $500 a month once you want to go to "production". I'm clearly not the target market, which explains why I've never heard of it.
If you do cloud infra stuff, AWS will try to undercut you on price but will never outdo you on D/UX. So I wouldn't let Beezus hold me back
Amazon doesn't compete for price-sensitive product offerings.
If anything, they normalise an expectation with a budget-aware base.
Help me understand - you build on top of AWS, which charges $0.09/GB for egress to the Internet, yet you're charging $0.05/GB for egress to the Internet? Sounds like you're subsidizing egress from AWS? Or do you have access to non-public egress pricing?
Looks like they changed it to $0.08/GB. AWS list egress starts at $0.09/GB for the first 10 TB/month and $0.085/GB for the next 40 TB, so at 50 TB that loses them at most 10,000 GB x $0.01 + 40,000 GB x $0.005 = $300/month; beyond that AWS's tiers drop to $0.07/GB and then $0.05/GB past 150 TB, so they make money after that.
(Founder) We are not charging in preview. At the scale where it matters, we will work it out. Definitely some assumptions in here.
For what it's worth, there's zero chance I would do business with a company whose business plan is "we'll work it out". It gives one every reason to believe that in a couple years time you guys will either be out of business (because you didn't figure out the numbers to make a profit) or will pull the rug from under customers in the form of surprise price hikes. Obviously you have to do what you think is right, but I think that this approach is going to scare off a lot of customers for you.
(Founder) We are not charging during preview. If anything, I wanted to be transparent about our planned pricing. Our mission is to make streams a cloud storage primitive, and I worked backwards from there in terms of our costs and expected costs looking ahead once we can scale a bit - based on concrete data points about what kind of discounts can be unlocked. I realized it was premature based on the comments here, so the price for internet egress has been updated. Thank you for your feedback.
[flagged]
Just FYI, that doesn't give me confidence in the longevity of your service.
Cloud services offer giant discounts sometimes and the receiving party aren't allowed to talk about it concretely so that's probably what's happening here.
(Founder) I understand the concern. However, cloud discounts at scale can be very large, and we are going to share as much of it as we reasonably can.
Discounts require a multi-year commitment to a minimum (and increasing) spend. Generally you need to be either profitable or a well-funded startup to demonstrate why a vendor would trust your ability to pay (it's literally a debt on your books). How do they know you're good for it?
Plus, multi-cloud means less scale and less marketing incentive (they can't talk about you as an X-cloud customer).
I wish you the best, but would encourage you to not set your prices below your costs.
(Founder) Thank you for the advice. I hope we can offer better when the deals come into play, but for now setting our planned internet egress price to $0.08/GiB.
Do you plan to charge differently for bandwidth depending on whether the customer is in AWS or not? Would be nice if you pass on the cost savings.
(Founder) Yes, we will charge less for private connectivity. Pricing is transparent https://s2.dev/pricing - free during preview.
Doesn't AWS charge $0.01 intra region and $0.02 between regions, even without setting up private links? Can't you pass part of those savings (compared to the $0.05-$0.09 of egress) on? Or is it too difficult to detect if the remote IP qualifies?
(Founder) Unfortunately, if you access over a public IP, it is internet egress. Even if the client is in the same AWS cloud region. PrivateLink is the only option.
List pricing is $0.05 per GB after 150TB and at high volume it’s cheaper than that
Nobody with sufficient scale will be paying retail for data transfer.
They’re probably betting on most users being in AWS and only having to pay 1¢-2¢ transfer.
They're also banking on scale to PPA with a specific amendment for egress.
The strategy is likely: just get users, then offboard AWS if the product works.
(Founder) No, we want to be in the same cloud regions as customers.
I look at the egress costs to the internet and it doesn't check out. It's a premium product dependent on DX, marketed to funded startups.
But if I care about ingress and egress costs, which many stream-heavy infrastructure providers do... this doesn't add up.
I wish them luck, but I feel they would have had a much better chance from the start by getting some funding and having a loss leader start, then organising and passing on wholesale rates from cloud providers once they’d reached critical mass.
Instead they’re going in at retail which is very spicy. I feel like someone will clone the tech and let you self host, before big players copy it natively.
It’s a commodity space and they’re starting with a moat of a very busy 2 weeks from some Staff engineers at AWS.
(Founder) Thanks for sharing your thoughts. We are early and figuring things out. I agree egress cost is going to be a big concern. We want to do the best we can for users as we unlock some scale. During preview, we are focused on getting feedback so the service is free (we will need to talk if the usage is significant though).
So is this basically WarpStream except providing a lower-level API instead of jumping straight to Kafka compatibility?
An S3-level primitive API for streaming seems really valuable in the long-term if adopted
(Founder) That somewhat summarizes yes :) We take a different approach than WarpStream architecturally too, which allows us to offer much lower latencies. No disks in our system, either.
These folks knowingly chose to spend the rest of their careers explaining that they are not, in fact, S3.
(Founder) Well, 50% of our name is different
I like it. I see it as ostensibly a product for engineers, so when I see a name like S2 it's immediately clear that it's a product led and conceived by engineers.
I also see that on your pricing page -
"We are building the S3 experience for streaming data, and that includes pricing transparency"
Love the simple and earnest copy. One can imagine what an LLM would cook up instead; I find the brevity far preferable.
(Founder) Thank you for the kind comment!
Yes, we are not trying to confuse S2 with S3; we just think S3 is the best damn serverless experience out there, and we aspire to that greatness. We borrowed the structure of that name to reflect that aspiration, as have other services inspired by S3, like Cloudflare's object store R2.
I actually thought S2 is a Cloudflare service at first.
You should have gone with S4 tbh. The suits love bigger numbers. Super Simple Stream Store.
http://www.supersimplestorageservice.com/ exists and calls itself S4. It's a decent gag and it immediately came to mind when I heard S2 vs S3.
How do you store a stream? Don’t they just spray around the internet here and there, and if you don’t catch them in the moment, they’re just gone?
(Founder) I thought you were joking but coming back it could well be serious :)
When we say stream, we really mean The Log that Jay Kreps has a famous blog about https://engineering.linkedin.com/distributed-systems/log-wha...
We say stream because we would rather not be confused with "logs" as in application logs, but rather associate with the world of streaming data where this primitive is very relevant. We don't mean stream as in a TCP stream or live stream.
You can, however, stream Star Wars on S2 ;-) https://s2.dev/docs/quickstart#get-started-with-the-cli
(Founder) I have definitely received that advice before :) - to not seem like a regression from S3. But as an abbreviation for Stream Store, it made sense.
Why not just use SS? There can’t possibly be any negative connotations there.
Reserved by GM for the Super Sport
So that's why GM has been asking itself "Are we the baddies?" lately.
SS .. as in nazi?
You could even make the s look kind of like a lightning bolt to emphasize how fast it is
Quite dangerous. Will look almost like the Schutzstaffel runic insignia. I'd better avoid this resemblance.
thatsthejoke.jpg
S3++?
Surely S3++? /s
Disagree. You have a marketing opportunity for a hipster character named "Stu" to be the spokesman.
Disco Stu don't advertise
Props to you for having a sense of humor about it. :D
If I could put in one request...a video which describes what it is and how to use it would make it easier for me to understand.
(Founder) Yes we should create a video, thanks for the feedback.
In the meantime, check out this quickstart, which will have you streaming Star Wars with the S2 CLI and give you a pretty good sense of things: https://s2.dev/docs/quickstart#get-started-with-the-cli
(You will have to apply to join the preview, but we are approving quickly)
You could say that. Or, in binary ASCII, you could say your name is 93.75% the same (only one of its 16 bits flips).
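For the pedants, the arithmetic checks out - a quick sketch, treating "S2" and "S3" as 8-bit ASCII:

    fn main() {
        let (a, b) = ("S2".as_bytes(), "S3".as_bytes());
        // 'S' ^ 'S' = 0 differing bits; '2' (0x32) ^ '3' (0x33) = 1 differing bit.
        let differing: u32 = a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum();
        let total = (a.len() * 8) as u32; // 16 bits across the two characters
        println!("{}% the same", 100.0 * f64::from(total - differing) / f64::from(total));
        // prints: 93.75% the same
    }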
You're 66.66% (2/3) of the way there on the second character too. So I would say you're only 16.66% different across the two characters.
I would look much more into Levenshtein distance ;) if I wanted to be smart-ass funny.
You're 50% of the way closer to 1st!
Or this
https://github.com/google/s2geometry
How many of these letter-number storage services are there now? S3, B2, R2, S2...
S3 isn't the name of the service - that's "Amazon Simple Storage Service". S3 is a nickname, short for "Simple Storage Service".
Nickname implies it's unofficial, but S3 is very much the product name too:
https://aws.amazon.com/s3/faqs/
"Simple storage service" is used once. "S3" is used throughout.
While you’re technically correct, for all intents and purposes it is called S3 even by AWS themselves.
And EC2 stands for "Elastic Compute Cloud". But no one remembers that.
When I was a student we had a Facebook group to share information, and one angry guy ranted that the correct shortening of "Mathematical Analysis" is not, in fact, "anal", as we used to say.
Seems preferable to having to explain you're not a paramilitary organization responsible for unspeakable war crimes. Nothing funny about that.
Including potentially in court / to lawyers? IANAL, but isn't this just inviting Amazon to claim it's deliberately leveraging their 'S3' trademark and sowing confusion in order to lift their own brand? (Correctly, and even somewhat transparently in TFA, IMO.)
My issue is that 2 < 3, and most people will just assume it's an older/shittier S3 lol
It looks neat, but no Java SDK? Every company I've personally worked at is deeply reliant on Spring or the vanilla clients to produce/consume from Kafka 90% of the time. This kind of precludes even a casual PoC.
(S2 Team member) As we move forward, a Java/Kotlin and a Python SDK are on our list. There is a Rust SDK and a CLI available (https://s2.dev/docs/quickstart). Rust felt like a good starting point for us, as our core service is also written in it.
I do like this. The next part I'd like someone to build on top of this is applying the stream 'events' into a point-in-time queryable representation - basically the other part needed to make it a Datomic. It's probably better as a pattern or framework for making specific in-memory queryable data rather than a particular database. There are lots of ways this could work, like applying events to a local SQLite, or basing it on a MySQL binlog that can be applied to a local query instance and rewound to specific points, or more application-specific apply/undo events on local state.
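A minimal sketch of that pattern, assuming a hypothetical `Event` type (in practice decoded from stream records) and using rusqlite for the local store - replay events up to a chosen sequence number to get a point-in-time queryable view:

    use rusqlite::{params, Connection, Result};

    // Hypothetical event shape; would be decoded from a stream record.
    struct Event {
        seq: u64,
        user: String,
        balance: i64,
    }

    // Replay events in order up to `up_to_seq`, producing a point-in-time
    // queryable SQLite view of the stream.
    fn project(events: &[Event], up_to_seq: u64) -> Result<Connection> {
        let conn = Connection::open_in_memory()?;
        conn.execute(
            "CREATE TABLE accounts (user TEXT PRIMARY KEY, balance INTEGER)",
            [],
        )?;
        for e in events.iter().take_while(|e| e.seq <= up_to_seq) {
            conn.execute(
                "INSERT INTO accounts (user, balance) VALUES (?1, ?2)
                 ON CONFLICT(user) DO UPDATE SET balance = excluded.balance",
                params![e.user, e.balance],
            )?;
        }
        Ok(conn)
    }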
This is a very useful service model, but I'm confused about the value proposition given how every write is persisted to S3 before being acknowledged.
I suppose the writers could batch a group of records before writing them out as a larger blob, with background processes performing compaction, but it's still an object-backed streaming service, right?
AWS has shown their willingness to implement mostly-protocol compatible services (RDS -> Aurora), and I could see them doing the same with a Kafka reimplementation.
(S2 team member here)
> I suppose the writers could batch a group of records before writing them out as a larger blob, with background processes performing compaction, but it's still an object-backed streaming service, right?
This is how it works essentially, yes. Architecting the system so that chunks that are written to object storage (before we acknowledge a write) are multi-tenant, and contain records from different streams, lets us write frequently while still targeting ideal (w/r/t price and performance) blob sizes for S3 standard and express puts respectively.
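A toy sketch of that batching logic (names are illustrative, not actual S2 internals): records from many streams accumulate in one multi-tenant chunk, which is flushed as a single object PUT once it reaches a target blob size, and writes are acknowledged only after that PUT is durable.

    struct Record {
        stream_id: u64,     // records from many streams share one chunk
        payload: Vec<u8>,
    }

    struct ChunkBuffer {
        records: Vec<Record>,
        bytes: usize,
        target_bytes: usize, // tuned per storage class (standard vs express)
    }

    impl ChunkBuffer {
        // Returns a full chunk ready for a single object PUT once the target
        // blob size is reached; acks are sent only after that PUT succeeds.
        fn append(&mut self, rec: Record) -> Option<Vec<Record>> {
            self.bytes += rec.payload.len();
            self.records.push(rec);
            if self.bytes >= self.target_bytes {
                self.bytes = 0;
                Some(std::mem::take(&mut self.records))
            } else {
                None
            }
        }
    }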
Wait, data from multiple tenants is stored in the same place? Do you have per-tenant encryption keys, or how else are you ensuring no bugs allow tenants to read other tenants' data?
(Founder) We will be using authenticated encryption with per-basin (our term for bucket) or per-stream keys, but we don't have this yet. This is noted on https://s2.dev/docs/security#encryption
Seems like really cool tech. Such a bummer that it is not source-available. I might be in the minority with this opinion, but I would absolutely consider commercial services where the core tech is all released under something like a FSL with fully supported self-hosting. Otherwise, the lock-in vs something like Kafka is hard to justify.
(Founder) We are happy for the S2 API to have alternate implementations; we are considering open-sourcing an in-memory emulator ourselves. It is not a very complicated API. If you would prefer to stick with the Kafka API but benefit from features like S2's storage classes, a very large number of topics/partitions, or high throughput per partition, we are planning an open source Kafka compatibility layer that can be self-hosted, with features like client-side encryption so you can have even more peace of mind.
Having a Kafka-compatible API and S3 storage would be something I would jump to; the savings over MSK would be huge.
If you had a (paid for) API that sat on top of an S3 API for on-prem, that would be fantastic as well.
Kafka is great, but the whole Java ecosystem, the lack of control over what is in the topics, and coordinating the cluster in ZooKeeper make it a management PITA.
Check out WarpStream (recently acquired by Confluent).
First-class Kafka compatibility could go a long way to making it a justifiable tech choice. When orgs go heavy on event streaming, that code gets _everywhere_, so a vendor off-ramp is needed.
(Founder) That makes sense. We would eventually host the Kafka layer too - and will be able to avoid a hop by inlining our edge service logic in there.
I had an idea like this a few years ago: basically exposing a stream interface over a cloud-based FS to enable random-access seeking on byte streams. I envisioned it being useful for things like loading large files. It would be amazing for enabling things like cloud gaming, image processing, and CAD.
Kudos for sitting down and making it happen!
Just you wait, I am launching S1 next year!
Ok good, my startup S½ (also known as Ç) is still unique, phew
Dibs on S0
I wish more dev-tools startups would focus on clearly explaining the business use cases, targeting a slightly broader audience beyond highly technical users. I visited several pages on the site before eventually giving up.
I can sort of grasp what the S2 team is aiming to achieve, but it feels like I’m forced to perform unnecessary mental gymnastics to connect their platform with the specific problems it can solve for a business or product team.
I consider myself fairly technical and familiar with many of the underlying concepts, but I still couldn’t work out the practical utility without significant effort.
It’s worth noting that much of technology adoption is driven by technical product managers and similar stakeholders. However, I feel this critical audience is often overlooked in the messaging and positioning of developer tools like this.
(Founder) Appreciate the feedback. We will try to do a better job on the messaging. It is geared toward being a building block for data systems. The landing page has a section talking about some of the patterns it enables (Decouple / Buffer / Journal) in a serverless manner, with example use cases. It just may not be something that resonates with you though! We are interested in adoption by developers for now.
I think they're saying that you should provide some example use-cases for how someone would use your service. High-level use-cases that involve solving problems for a business.
For what it's worth, I am already familiar with this design space well enough that I don't need this kind of example in order to understand it. I've worked with Kinesis and other streaming systems before. But for people who haven't, an example might help.
What kind of business problem would someone have that causes them to turn to your service? What are the alternative solutions they might consider and how do those compare to yours? That's the kind of info they're asking for. You might benefit from pitching this such that people will understand it who have never considered streaming solutions before and don't understand the benefits. Pitch it to people who don't even realize they need this.
(Founder) Yes, I understand, and this could definitely do with work. I struggle with it personally because it is so obvious to me that I don't even know where to start. How do you pitch use cases for object storage? Stream storage feels just as universal to me.
If you ever figure it out, LMK. I don't think I've ever looked at logs more than about 24 hours old. Persistence and durability is not something I care about.
Errors, OTOH, I need a week or two of. But I consider these 2 different things. Logs are kind of a last resort when you really can't figure out what's going on in prod.
"Replace our MSK clusters and EBS storage with S3 storage costs."
This is a very interesting abstraction (and service). I can’t help but feature creep and ask for something like Athena, which runs PrestoDB (map reduce) over S3 files. It could be superior in theory because anyone using that pattern must shoehorn their data stream (almost everything is really a stream) into an S3 file system. Fragmentation and file packing become requirements that degrade transactional qualities.
(Founder) There are definitely some interesting possibilities. Pretty hyped about S3 Table (Iceberg) buckets. An S2 stream can buffer small writes so you can flush decent-sized Parquet files into the table and avoid compaction costs.
My first thought: "introducing? The S2 has been out for a while!"
https://www.sunlu.com/products/new-version-sunlu-filadryer-s...
Google had it years ago! http://s2geometry.io/devguide/s2cell_hierarchy
This is cool but I think it overlaps too much with something like Kinesis Data Streams from AWS which has been around for a long time. It’s good that AWS has some competition though
(Founder) We plan to be multi-cloud over time. Kinesis has a pretty low ordered throughput limit (i.e. at the level of a stream shard) of 1 MBps, if you need higher. S2 will be cheaper and faster than Kinesis with the Express storage class. S2 also has a more serverless pricing model - closer to S3 - than paying for stream shard hours.
Thanks. You are right about those points. One thing to consider is whether serverless provides enough cost savings for most streaming ingest use cases, which otherwise need static provisioning since ingest volumes are unpredictable. Better messaging would be that your serverless model handles bursts well. (For context: I used to sell KDA and KDS at AWS as part of AI solutions.)
How does this compare to https://github.com/deuxfleurs-org/garage ?
It seems like there are a lot of lighter-weight self-hosted S3 implementations around nowadays. Why even use S3?
In the long-term, how different do you want to be from Apache Pulsar? At the moment, many differences are obvious, e.g., Pulsar offers transactions, queues and durable timers.
(Founder) We want S2 to be focused on the stream primitive (log, if you prefer). There is a lot that can be built on top, which we mostly want to do as open source layers. For example, Kafka compatibility, or queue semantics.
In terms of a pitch, I'm not sure I understand how this differs from existing solutions. Is the core value proposition a simpler API?
(Founder) Besides the simple API:
- Unlimited streams. Current cloud systems limit you to a few thousand; with dedicated clusters, a few hundred K? If you want a stream per user, you are now dealing with multiple clusters.
- Elastic throughput per stream (i.e. a partition in Kafka) to 125 MiBps append / 500 MiBps realtime read / unlimited in aggregate for catching up. Current systems will have you at tens. And we may grow that limit yet. We are able to live migrate streams in milliseconds while keeping pipelined writes flowing, which gives us a lot of flexibility.
- Concurrency control mechanisms (https://s2.dev/docs/stream#concurrency-control)
Forgot to mention storage classes to tune your latency vs cost tradeoff. That you can even reconfigure - soon we will make that a live migration.
Seems really good for IoT, no? It's been a while since I worked in that space, but having something like this would have been nice at the time.
(Founder) so many possibilities! That's what I love about building building blocks. I think we will create an open source layer for an IoT protocol over time (unless community gets to it first), e.g. MQTT. I have to admit I don't know too much about the space.
Really interesting service and bookmarked.
I'd really love this extending more into the event sourcing space not just the log/event streaming space.
Dealing with problems like replay and log compaction etc.
Plus things like dealing with old events. Under GDPR, removing personal information/isolating it from the data/events themselves in an event sourced system are a PITA.
(Founder) An S2 stream is a durable log and can be replayed! We do want to add compaction support. Event sourcing is a great use case for S2.
So is this a "serverless" named-pipe-as-a-service cloud offering? Or am I misreading?
Yep. Just tack "serverless" onto something that already exists and charge for it
(Founder) A named pipe that operates at the level of records, is regionally durable, lets you read from any sequence number, and lets you do concurrency control for writes if you need to.
How does this compare to Kafka? Is the primary difference that this is a hosted solution?
I really liked the landing page and the service, but it took me a while to realize it wasn't an AWS service with a snazzy landing page.
Definitely a useful API but not super compelling until I could store the data in my own bucket
Is it possible to bring my own cloud account to provide the underlying S3 storage?
(Founder) Not currently! We want to explore this.
Would this be like an alternative to Delta? Am I thinking about that right?
S2 is, in my opinion, the sweet spot of PRS's lineup.
This would sell much better if it was S5 or S6, a next-level thing.
Wow man, are you still stuck on S3?
so the naming convention for 2024-25 products seems to be <letter><number>.
o1, o3, s2, M4, r2, ...
Scribe aaS? ;)
"Making the world a better place through streamable, appendable object streams"
Kafka as a service?
(Founder) Nope! We have a FAQ for this ;)
Can someone tell me what this does? And why it's better?
(Founder) There is a table on the landing page https://s2.dev/ which hopefully gives a nice overview :) It's like S3, but for streams. Cheap appends, and instead of dealing with blocks of data and byte ranges, you work with records. S2 takes care of ordering records, and letting you read from anywhere in the stream.
This is an alternative to systems like Kafka which don't do great at giving a serverless experience.
Could you clarify the Kafka difference further?
Or more generally, when is it better to choose S2 vs services like SQS or Kinesis?
S2 sounds like an ordered queue to me, but those exist?
(Founder here) Managed cloud offerings for streaming limit ordered throughput pretty low, e.g. Kinesis at 1 MiBps, Redpanda serverless at 1 MiBps, Confluent's even higher-end clusters at 10-20 MiBps IIRC. If you really need ordering, this can indeed be a limit. S2 lets you push 125 MiBps currently, and we may grow that.
Another factor is how many ordered streams you can have. Typically a few thousand at most with those systems. We take the serverless spirit of S3 here, when did you have to worry about the number of objects in a bucket?
We are also able to offer latency comparable to disk-based streaming like Confluent's Kora and Kinesis, with our Express storage class (under 50 milliseconds end-to-end latency for client in the same cloud region) - while being backed by S3 with regional durability! Not a disk in the system.
We want people to be able to build safe distributed data systems on top of S2, so we also allow concurrency control mechanisms on the stream like fencing. Kafka or Kinesis won't let you do that. This is the approach AWS takes internally (https://brooker.co.za/blog/2024/04/25/memorydb.html), but they don't have that as a service. We want to democratize the pattern.
ED: on throughputs, to clarify, I am talking about ordered throughput, i.e. per Kafka partition or Kinesis shard. WarpStream also does well here because of their architectural approach to separate ordering, but at a latency cost.
Between your site copy and your early comments on this thread, it was this rundown that made the product click in my mind.
I’m sure that in this early preview you’re trying to reach mainly devs with existing domain expertise, but the way that, in this comment, you laid out existing constraints and what possibilities might lie beyond them—it really helped me situate your S2 product in the constellation of cloud primitives.
Just wanted to offer that feedback in the hope that the spirit of your comment here doesn’t get buried down-thread!
thank you for the feedback!
Hey congrats! Looks like a really cool idea.
Looks like you're pushing the throughput angle - that could be important, but IMO it's not often you come across devs who need this level of throughput without dealing with a large-scale problem. My feedback is that the lack of per-tenant encryption is a big deal breaker here, since you're mixing up tenants' data within one object.
Plus, your security section says very little about how you prevent cross-tenant data contamination - that's probably the first thing that popped up in my mind when I read about your data model. It makes me extremely uneasy, and I can't imagine adopting this for anything serious. I would encourage you to think about how you can communicate that angle to the customer as well, besides supporting per-tenant encryption keys.
(Founder) It's a number of dimensions. I get excited about the ordered throughput angle because I have personally cared about this in the past, and yeah a lot of folks may not need that :)
Simple API, reasonable pricing, latency flexibility, unlimited streams, _and_ elastic to high throughputs. All adding up to a great serverless experience.
Re: the data colocation. This is how most multi-tenant systems - including S3 itself AFAIU - operate. I understand there is a difference in level of trust vs a cloud provider, and the best we can do here while delivering a serverless experience is encrypting every single record at the edge of S2 where it transits in or out, with a tenant-specific key. We may even allow specifying it as part of the request, if clients want to manage the key themselves.
The best data security when leveraging any multi-tenant service is going to be client-side encryption, and we also want to make this super easy. With our planned Kafka layer, we plan on client-side encryption as a value add.
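As an illustration of what per-record, per-tenant authenticated encryption can look like at the edge or client side (this sketch uses the aes-gcm crate; the actual keys and ciphers S2 will use aren't public):

    use aes_gcm::{
        aead::{Aead, AeadCore, KeyInit, OsRng},
        Aes256Gcm,
    };

    fn main() -> Result<(), aes_gcm::Error> {
        // One key per tenant (per basin or per stream).
        let tenant_key = Aes256Gcm::generate_key(OsRng);
        let cipher = Aes256Gcm::new(&tenant_key);

        // Encrypt each record as it transits the edge; store nonce + ciphertext.
        let nonce = Aes256Gcm::generate_nonce(&mut OsRng);
        let ciphertext = cipher.encrypt(&nonce, b"record payload".as_ref())?;

        // Colocated chunks can hold many tenants' records, but a record only
        // decrypts (and authenticates) under its own tenant's key.
        let plaintext = cipher.decrypt(&nonce, ciphertext.as_ref())?;
        assert_eq!(&plaintext, b"record payload");
        Ok(())
    }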
@agallego Yes in aggregate both Confluent and Redpanda can push GiBps throughputs, and I know Redpanda has amazing perf. I was referring to Redpanda Serverless :) And per-partition i.e. ordered throughput.
ED: for some reason I wasn't seeing the reply link before on your comment, do see it now.
cool cool, right on.
Redpanda cloud doesn’t limit tput. Most ppl get a bigger discount at high volumes. We have customers in 10s of GB/s. Confluent has those volumes too.
Sort of serverless Kafka, which natively uses object storage and promises better latencies than things like warpstream.
An interesting difference is the ability to have exclusive write access to the log (the fencing token). This allows you to use the logs as write-ahead logs.
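A hypothetical sketch of the fencing pattern (the real S2 API may differ): a new leader advances the fencing token, and appends carrying a stale token are rejected, which is what makes the stream safe as a write-ahead log with a single logical writer.

    // Server-side view of a stream, reduced to the essentials.
    struct Stream {
        fencing_token: u64,
        records: Vec<Vec<u8>>,
    }

    impl Stream {
        // A new leader fences out older writers by advancing the token.
        fn fence(&mut self) -> u64 {
            self.fencing_token += 1;
            self.fencing_token
        }

        // Appends succeed only with the latest token, so a deposed writer's
        // stale appends are rejected instead of corrupting the log.
        fn append(&mut self, token: u64, record: Vec<u8>) -> Result<u64, &'static str> {
            if token != self.fencing_token {
                return Err("fenced: a newer writer holds the stream");
            }
            self.records.push(record);
            Ok((self.records.len() - 1) as u64) // assigned sequence number
        }
    }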
It's a message queue on the cloud.
https://chatgpt.com/c/676703d4-7bc8-8003-9e5d-d6a402050439
Edit: Keep downvoting, only 5.6k to go!
Thank you
[flagged]
Indeed... we sure wish we could have nabbed that crate name, but it was not to be. Our Rust SDK is here https://lib.rs/crates/streamstore
Replying to this one since you apparently can't reply to a comment that has been flagged. Why was the grandparent flagged? Google's S2 library has been around for more than a decade and is the first thing I think of when I see "S2" in a tech stack.
And the flippant response from the parent here that they don't really care that they're muddying the waters and just want the crate name is irksome.
Serverless pricing to me is exactly like ETH gas pricing!