Sri Raghavan

At Orb, we’re building real-time infrastructure to power billing and monetization for modern companies. Integrating with Orb starts with sending in data as usage events, which represent billable product usage in your application. Orb lets you build metrics (aggregation queries over your events) and attach those metrics to pricing models.
Although Orb provides a direct batch ingestion API, one ingestion solution has become increasingly popular amongst our customers, especially for high-volume use cases: using S3 as a message bus. On Orb’s end, we set up cross-account S3 event notifications on a bucket that you give us read permissions to, and we immediately ingest any files that land in the bucket. It’s also convenient for our customers, since there’s usually an existing pipeline that already sinks to S3 (whether from Kafka, Kinesis, or a periodic dump from a data warehouse). This method has scaled to hundreds of thousands of events a second, which isn’t surprising: S3 is the holy grail of reliability and provides plenty of read and write throughput.
Between the input source (in this case, S3) and our events datastore, a couple of things still need to happen: each event must be validated, and checked against previously ingested events for duplicates (via its idempotency key). Only when an event passes validation and we’ve ensured it’s not a duplicate do we ingest it into our events datastore.
Although Orb has detailed and flexible in-product tooling on top of ingested data, when customers are first integrating with our platform there’s sometimes an upstream problem: not everything in the bucket is ingested; some events end up in a dead letter queue because they fail validation, deduplication, or both. Inspecting the dead letter queue is a start for basic integrations, but understanding the shape of the problem requires more work.
When we started looking for a solution that could help our engineering support team help customers get their data production-ready, we had a few criteria to make it use-case appropriate: it should require little to no standing infrastructure, it should query the data directly where it lives in S3, and it should stay fast and interactive over millions of rows. DuckDB meets all three.
When customers are seeing a mismatch between their metrics and the data they believe they’re sending into Orb, we can now help them debug straight at the source; they don’t need to pull in data science help on their side to inspect the datasets coming into Orb.
We’ve installed a DuckDB client on a multipurpose production EC2 instance, but other than that, there’s no running infrastructure at all.
It’s easy (instant, in fact) to attach a DuckDB instance and query over a remote file in S3, using a custom role we assume temporarily. We can start with simple counts over the timeframe in question, finding duplicate events that would’ve been rejected by Orb’s ingestion API. This helps us understand the pattern of duplicates, and when they happen:
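A query along these lines gives a quick view of which idempotency keys show up more than once and when. The bucket path, region, credentials, and column names below are placeholders rather than a real customer schema:

```sql
-- Load DuckDB's httpfs extension and authenticate with the temporary
-- credentials from the role we assume (placeholder values shown here).
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';
SET s3_access_key_id = '<temporary-access-key-id>';
SET s3_secret_access_key = '<temporary-secret-access-key>';
SET s3_session_token = '<temporary-session-token>';

-- Count how often each idempotency key appears in the files for a given
-- timeframe; anything with more than one occurrence would be deduplicated.
SELECT
    idempotency_key,
    COUNT(*) AS occurrences,
    MIN("timestamp") AS first_seen,
    MAX("timestamp") AS last_seen
FROM read_json_auto('s3://customer-bucket/events/2023-10-*.json')
GROUP BY idempotency_key
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;
```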
Note how interacting with the S3 file like a relational table is seamless here! DuckDB also lets us directly replicate the metrics we support in Orb over the events that land in S3. Suppose, for example, that we discover that the customer’s client is incorrectly assigning idempotency keys, and so we’re deduplicating more data than intended. Ignoring the duplicate constraint and running the metric over the raw events in the bucket (thanks to support for structs) lets us confirm that the rest of the data is correct, avoiding a re-ingestion loop.
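Suppose the metric in question is a simple sum over a numeric property. A sketch of that, assuming a hypothetical compute_usage event with a properties.compute_seconds field (actual event names and properties vary by customer; DuckDB reads the nested properties object as a struct, so dot notation works):

```sql
-- Replicate a hypothetical "total compute seconds" metric over the raw
-- events, ignoring idempotency keys entirely (i.e. no deduplication).
SELECT
    customer_id,
    DATE_TRUNC('day', CAST("timestamp" AS TIMESTAMP)) AS usage_day,
    SUM(CAST(properties.compute_seconds AS DOUBLE)) AS compute_seconds
FROM read_json_auto('s3://customer-bucket/events/2023-10-*.json')
WHERE event_name = 'compute_usage'
GROUP BY customer_id, usage_day
ORDER BY usage_day, customer_id;
```

Comparing the output of a query like this against the metric value in Orb tells us whether the discrepancy is fully explained by deduplication, or whether something else in the data needs attention.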
We’ve found DuckDB to be very snappy for this sort of debugging over millions of rows of data, and we expect even better performance for Parquet files, where the httpfs extension can use the Parquet metadata to selectively download only the portions of the file it needs.
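For example, a filtered count over a Parquet file (the path and column are again hypothetical) only needs the column chunks and row groups the query touches, rather than the whole object:

```sql
-- With projection and filter pushdown, DuckDB can fetch just the relevant
-- column chunks and row groups via HTTP range requests instead of the
-- entire Parquet file.
SELECT COUNT(*) AS matching_events
FROM read_parquet('s3://customer-bucket/events/2023-10-01.parquet')
WHERE event_name = 'compute_usage';
```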
We’re considering expanding our use of DuckDB in a few directions beyond this workflow. Using it as a debugging tool for support may seem like a minor use case, but it’s helped us understand DuckDB’s performance characteristics and build confidence in expanding its footprint in Orb’s technical stack.