Launch HN: Narrator (YC S19) – a data modeling platform built on a single table
143 points by cedricd on Sept 30, 2020 | 59 comments
Hi HN, We’re Ahmed, Cedric, Matt, and Mike from Narrator (https://www.narrator.ai).

We’ve built a data platform that transforms all data in a data warehouse into a single 11-column data model and provides tools for analysts to quickly build any table for BI, reporting, and analysis on top of that model.

Narrator initially grew out of our experience building a data platform for a team of 40 analysts and data scientists. The data warehouse, modeled as a star schema, grew to over 700 data models from 3000+ raw production tables. Every time we wanted to make a change or build a new analysis, it took forever because we had to manage the complexity of these 700 different models. With all these layers of dependencies and stakeholders constantly demanding more data, we ended up making lots of mistakes (e.g. dashboard metrics not matching). These mistakes led to a loss of trust, and soon our stakeholders were off buying tools (Heap, Mixpanel, Amplitude, Wave Analytics, etc.) to do their own analysis.

With a star schema (also core to the recently IPO-ed Snowflake), you build the tables you need for reporting and BI on top of fact tables (what you want to measure, e.g. leads, sales…) and dimension tables (how you want to slice your data, e.g. gender, company, contract size…). Using this approach, the number of fact and dimension tables grows in size and complexity in relation to the number of questions / datasets / metrics the business needs answered. Over time the rate of new questions increases rapidly, and data teams spend more time updating models and debugging mismatched numbers than answering data questions.

What if instead of using the hundreds of fact and dimension tables in a star schema, we could use one table with all your customer data modeled as a collection of core customer actions (each a single source of truth), and combine them together to assemble any table at the moment the data analyst needs that table? Numbers would always match (single source of truth), any new question could be answered immediately without waiting on data engineering to build new fact and dimension tables (assembled when the data analyst needs it), and investigating issues would be easy (no nested dependencies of fact and dimension tables that depend on other tables). After several iterations, Narrator was born.

Narrator uses a single 11-column table called the Activity Stream to represent all the data in your data warehouse. It's built from SQL transformations that turn a set of raw production tables (for example, Zendesk data) into activities (ticket opened, ticket closed, etc.). Each row of the Activity Stream has a customer, a timestamp, an activity name, a unique identifier, and a bit of metadata describing it.
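
For a concrete (and purely illustrative) sense of what one of these transformations looks like, a 'ticket opened' activity built from a hypothetical raw zendesk.tickets table might be something like the sketch below -- the table and column names are made up, but the output columns mirror the Activity Stream schema:

    -- Sketch only: a hypothetical raw Zendesk table mapped into the activity schema
    SELECT
        t.id               AS activity_id,    -- unique identifier for the row
        'opened_ticket'    AS activity,       -- the activity name
        t.created_at       AS ts,             -- when it happened
        t.requester_email  AS customer,       -- customer identifier
        t.subject          AS feature_1,      -- a bit of metadata
        t.priority         AS feature_2,
        t.via_channel      AS feature_3,
        NULL               AS revenue_impact,
        t.url              AS link
    FROM zendesk.tickets t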

It's hard to imagine creating any table from this single model, made up of activities that don't obviously relate to each other. Unlike a star schema, we don't use foreign keys (the direct relationships in relational databases that connect objects, like employee.company_id → company.id) because they don't always exist when you're dealing with data in multiple systems.

Instead, each activity has a customer identifier, which we use along with time to automatically join within the single table and generate datasets.

As an example, imagine you were investigating a single customer who called support. Did they visit the web site before that call? You’d look at that customer’s first web visit, and see if that person called before their next web visit.

Now imagine finding all customers who behaved this way per month -- you’d have to take a drastically different approach with your current data tools. Narrator, by contrast, always joins data in terms of behavior. The same approach you take to investigate a single customer applies to all of them. For the above example you’d ask Narrator’s Dataset tool to show all users who visited the website and called before the next visit, grouped by month.
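
Under the hood, that kind of behavioral question can be answered with window functions over the single table. A rough sketch of the idea (not Narrator's actual generated SQL; the activity names are illustrative):

    -- For each web visit, check whether a support call happened before the next visit
    WITH visits AS (
        SELECT customer, ts,
               LEAD(ts) OVER (PARTITION BY customer ORDER BY ts) AS next_visit_ts
        FROM activity_stream
        WHERE activity = 'visited_website'
    )
    SELECT DATE_TRUNC('month', v.ts) AS month,
           COUNT(DISTINCT v.customer) AS customers_who_called_before_next_visit
    FROM visits v
    JOIN activity_stream c
      ON c.activity = 'called_support'
     AND c.customer = v.customer
     AND c.ts >= v.ts
     AND (v.next_visit_ts IS NULL OR c.ts < v.next_visit_ts)
    GROUP BY 1
    ORDER BY 1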

We started as a consultancy to build out the approach and prove that this was possible. We supported eight companies per Narrator data analyst, and now we’re excited for more data folks to get their hands on it so y’all can experience the same benefits.

We’d love to hear any feedback or answer any questions about our approach. We’ve been using it ourselves in production for three years, but only launched it to the public last week. We’ll answer any comments on this thread and can also set up a video chat for anyone who wants to go more in-depth.




[Co-founder here of a start-up that provided monitoring / metadata analytics for cloud warehouses]

My unsolicited $0.02 - I think your approach is spot on.

As a company, you will never have one consistent data set and metrics if you keep building an individual model for each user / use case / etc. And I've seen the explosion of tables and models in real-time. They just keep growing. And how do you even know that the question you're asking in your dashboard is pulling the information from the correct table? I've yet to see a data team that didn't have to deal with drift. Plus, there's a real cost of storing all these stale tables that nobody is looking at anymore.

What your product is doing is what I see companies already trying to accomplish themselves [somewhat]. For the leading companies when it comes to working with data, the warehouse today is already the source of truth, with one dimension table that points back to the SaaS tool / dashboard via an S3 bucket. So the SaaS tool itself is really only the last mile and visualization layer. Run the model, create the table, offload the table to S3, point the tool to the S3 bucket with the table. Update every 4 hours, etc.

dbt wins in that world. (and I assume you're using something like dbt under the hood of narrator.ai?)

That approach is already commoditizing the SaaS tool down to the visualization layer and the opinionated way of displaying data. But that still means there's at least one model per tool, use case, etc. with one table - and you still don't see the entire journey of the user, that's something you either have to create for a single specific use case, or cobble it together ad-hoc. If instead you have one table that has it all - you can move soooo much faster with data, and take out all the friction that comes from having disparate data sets.

narrator.ai wins in that world.

Blinkist in Berlin is following a very similar approach to what you guys have built. This deck is a few years old, but I think the approach described will resonate with you:

https://www.slideshare.net/SebastianSchleicher/tracking-and-...

If I had to look into my Crystal Ball, I think one of your GTM challenges will be to convince existing data teams that everything they've built is somewhat redundant. On the flipside, I can see the same data teams say "OMG, finally!". I'm curious to hear the customer reactions so far.

I'm very excited about this product! I wouldn't be a direct user with my current role, but FWIW, I can share the bruises I got from working in this market.

Would love to hear more!


This is so great! You see exactly what we see, and clearly you have had similar experiences with dashboards not matching because of the wrong table. (The good old "spent 3 weeks debugging an analysis using sales_data and then finally found that sales_data_v2 was built to solve it".)

Yeah, we do something very similar to dbt for restructuring the data into a single time-series table. We add things like identity resolution, diffing, incremental updates, and computing some cached columns.

Your Crystal Ball is SPOT ON!!! We get 3 kinds of data people: the ones who are like "THIS WILL NEVER WORK", the "Too bad I already built all this", or the "THIS IS THE FUTURE, HOW IS EVERYONE NOT USING IT".

I would love to chat and show you what we have (schedule a demo on our site and it will go to me and we can chat!)

Also, teaser... When you standardize all of your data and create a consistent way of relating that standardized structure, then analyses become very consistent. Imagine a world where your email attribution deep dive can be run by loading a template and pointing it to your "opened email" activity and your "order" activity... coming soon... a Narrative Library.


> restructuring the data into a single time-series table. We add things like identity resolution, diffing, incremental update

So is this where the customer still has to do some work? Defining states and transforming their sources into a series of events with these states?


Yes, the customer would have to define their activities (e.g. 'page view', 'completed order', 'support ticket opened') by writing SQL snippets for each one.

https://docs.narrator.ai/docs/activity-transformations describes these scripts and links to a few examples
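
To give a flavor of what one of these snippets might look like (the table and column names here are hypothetical; the linked docs have real examples), a 'completed order' activity could be as simple as:

    -- Sketch only: a 'completed_order' activity from a hypothetical orders table
    SELECT
        o.id              AS activity_id,
        'completed_order' AS activity,
        o.completed_at    AS ts,
        o.customer_email  AS customer,
        o.discount_code   AS feature_1,
        o.item_count      AS feature_2,
        o.shipping_method AS feature_3,
        o.total_amount    AS revenue_impact,   -- the money tied to this activity
        o.receipt_url     AS link
    FROM shop.orders o
    WHERE o.status = 'completed'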


scheduled!


Question for the narrator folks...what about using dbt to create the activity stream?


We would love that, but Narrator works on any warehouse. To support that we built a query abstraction layer that compiles to the flavor of SQL used by the customer's warehouse.

(We will open source that query abstraction layer later, with a demo where you can translate Redshift queries to Snowflake queries.)
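
To give a rough sense of the kind of dialect differences that layer smooths over (a common example, not necessarily what our compiler emits; the properties column is hypothetical), extracting a key from JSON differs between the two:

    -- Redshift: JSON stored in a string column
    SELECT JSON_EXTRACT_PATH_TEXT(properties, 'utm_source') AS utm_source
    FROM raw_events;

    -- Snowflake: the same lookup against a VARIANT column
    SELECT properties:utm_source::string AS utm_source
    FROM raw_events;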

Maybe in the future we can get that project into dbt so that dbt models can work on any warehouse as well.


Wish you the best of luck but isn't this just a fancier version of EAV?

https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80...

IMO, it doesn't matter what kind of db technology, schema, or query tool you use. A company will always have analysis sprawl regardless of whether those analyses are represented as data lake files, SQL tables, materialized views, regular views, or (as is often the case with EAV and other such schemas) queries saved in some BI tool. There is no silver bullet, and it will always take some work to maintain a single source of truth, traceability, a coherent model that is understandable by average business users, etc.


Yeah, entity modeling was one of the big inspirations for our approach. The main difference is how you reassemble the single time-series table to create any table.

This was quite a challenge, and I think it's what makes the traceability and source-of-truth problems a lot simpler.

In Narrator, the data team writes small SQL snippets to create single customer-centric business concepts that we call activities. These are around 25 lines and designed to be understood by anyone in the company (e.g. "viewed page", "called us", ...).

Now, every question you or a stakeholder has will simply be a rearrangement of these activities. If you can describe what you want, then Narrator can assemble a table that represents it.

Source of truth - whatever is in the activity stream. Traceability - always Dataset (activities and how they relate), then activities (~25 lines of SQL). Coherent model - customers doing actions in time.

Does that make sense? Some of these things are easier to show in a demo than to describe in text.


> activities and how they relate

This is the problem with EAV/nosql/schemaless/etc and ultimately the problem I think you are going to have to solve. Instead of using ETL to model how the activities relate and reifying that model as database objects, EAV just kicks the can down the road to the query/BI tool.

Sprawl - The BI tool will end up containing most of the real business logic sprawled across many reports.

Single source of truth - A lot of the reports will be very similar but they will be based off slightly different activities or slightly different filtering logic. Which report is the correct one?

Traceability - I think this is more of an end-to-end "garbage-in, garbage-out" problem that all ETL/BI tools have that wouldn't be specific to your tool. It's more of an organizational/people problem.

Coherent model - In my experience, EAV isn't enough to cover the breadth of analyses mature businesses need to do and most business users won't be able to wrap their head around it. There will have to be some data person that creates a more coherent, tabular/spreadsheet-like model and in the case of this tool it looks like that model will have to exist in the BI tool. Which brings us back to sprawl/single source of truth issues.

Just some thoughts. But always glad to see more people working on stuff like this!

Edit - one last thing I wanted to mention. I think in reality you are going to find it takes more than ~25 lines of sql to define activities. That may be the case if the source is a schema that gets spit out of something like Stitch, but many other schemas in the wild will take a lot more than 25 loc to massage into your 11 column schema.


Sprawl - YES! I would never put a single time-series table in your BI tool. It is not queryable and you will hate the insane results.

- We actually built our own query layer called Dataset to make sure that the dataset is materialized. This way, if you put it in your BI tool, you can always go back to the dataset, which points directly to the activity stream.

Single source of truth & traceability - 100%. We really aim to have activities be actually different. Each activity is modeled via SQL, often by a data engineer or analyst. You cannot just create 1000 activities; 90% of our customers have between 20 and 40. This keeps your activities unique. Also, unlike tables, activities are building blocks, so they map to something real (e.g. "paid invoice", "sent contract").

So far we haven't seen many people struggling with activities being too similar.

Also the modeling of the activity helps clear up the Garbage in -> Garbage out problem that often happens with CDPs (mixpanel, segment, etc..).

In terms of analysis, we did build a tool called Narrative (actionable analysis in a story format). This is designed to get users to write their analysis with CONTEXT built in vs just numbers on a screen. With context plus the ability to click to see the activities and relationships, people can quickly know what data powers the analysis. Does this solve the problem 100%? Nope, but it does take us huge steps in the right direction.

Coherent Model - I think our tool Dataset helps with this problem. We started as a consultancy and answered 1000s of questions over 3 years until our tool was able to answer any question. I usually demo by asking the customer to ask any question they have, and I try to answer it live. So far we have been able to answer them all, so I am SUPER excited to find the limit of our tools.

Yeah, for data EL'd via Stitch or Fivetran this is easy. Dirty data that is a bunch of JSONs etc. takes a bit more effort, but building the activity is done once. You also don't have to deal with how concepts relate, or identity resolution, or a lot of other things that make SQL complex.

Overall, I love this conversation and would like to continue. I am excited to hear some of your edge cases. Maybe we can even setup some time and talk face to face: https://calendly.com/ahmed-narrator/30min-1


I took a deeper look at the Dataset portion of your product this morning and it definitely piqued my interest. It wasn't clear to me from your original post and my initial scan of your site that there was a way to create queries/models/views (whatever folks want to call them, they're essentially the same concept) on top of the single activity table and then either materialize them or integrate them with other services via webhooks or native API integrations. That's definitely super useful. Also, the "Relationship" concept does a nice job of trying to approach joins/window functions in plain english. Query builders are always a difficult UX problem and I think you're onto something interesting. Finally, the validation, identity resolution and spend features are also nice and I could see you adding value via more features in this vein in the future.

The main thing this product strikes me as is "The ETL tool that understands your business". Whereas the domain language of most ETL tools is at the level of DW technologies (rows, columns, schemas, facts, dimensions, indexes, join algos, views, dags, orchestration schedules), the domain language of Narrator is at the level of the business (activities, customers, relationships, spend, etc). In a way it's sort of similar to the old convention over configuration religious war. I could see companies using Narrator for the 80% of ETL that is just plain table stakes in order to compete nowadays and offloading most of the definition and minor customization of this ETL to less technical folks. And maybe in parallel the data engineers would use plain old code to do the last 20% of ETL that is truly proprietary and specific to the business.

Not sure if my biased initial reading of your pitch was off but it seemed like you were focusing heavily on addressing the pain points of the star schema. I've found that most people fall into two camps: either they don't care at all about the kimball star schema world and they're just loading tables however they see fit into their warehouse or they are willing to go to their grave defending the star schema and its variants. In either case, I don't think you gain much by positioning yourself as the antidote to the star schema. I think you could capture customers in both camps by focusing instead on the fact that your ETL tool has a deep understanding of how companies that rely heavily on a web presence work. I think this would also better align you with the ability to increase your customers' revenue as opposed to optimizing engineering/infrastructure concerns which is an easier sell.

Anyway, sorry for the rant. I'm going to shoot you a short email in case you want to connect.


That is really great to hear! I think you make a great point and we will discuss your recommendations internally to improve our messaging.

I am excited to chat in person.


EAV [...] my thoughts exactly. My last encounter with EAV was, to put it briefly, disastrous.

On the other hand, HW gets better and better, and if you are able to do on the fly what used to be a "cached" entity, then there you go.


Yeah, our first iteration of the activity stream was at WeWork and it was impossible to use. For this reason we built our Dataset tool, and thanks to our innovation of relationships we are now able to make use of this structure to create any table.

Often, people think that just creating a time-series table is enough, but it is so hard to use and so hard to maintain that you will hate yourself. Narrator solves all those problems so the experience becomes truly incredible!


"EAV has three columns, and these go to eleven"

https://www.youtube.com/watch?v=hW008FcKr3Q


How are you solving the problem of scale? Wouldn't everyone else be using this approach if the need to e.g. maintain indexes was overruled by your table join technology?

Or is the idea that you specify the type of dataset you want, and then wait a few hours for the table to be generated from the activity stream?

Good luck with your launch, this is a cool and novel take on data analytics.


Thanks! We've found that this scales fairly well. Tables generated from the activity stream are generally done in seconds or minutes at the worst case.

One reason for this is that most warehouses are column-oriented. This means our table, which is 11 columns wide but pretty deep, is really fast to query.

We had a customer on a 3-node (tiny) DS large Redshift warehouse dealing with billions of rows. Their Looker queries literally took 8 hours. We moved them to the Narrator structure, and the same queries assembled using the activity stream came down to minutes (<5 minutes).

Edit: I should also point out that all queries in Narrator are compiled to SQL that runs directly on the warehouse. We don't use any external data stores or anything like that.


If I understand correctly:

1. you define activities, each with an associated SQL query eg customer opened support ticket: SELECT * FROM tickets WHERE customer_id=$customer_id

2. users use Narrator UI to build a report, similar to Looker

3. Narrator creates a table for that report in the same db and populates it with data, based on 1.

4. Narrator maintains all the reports tables (updates, deletes when report is deleted)

?


Yes, that's basically it.

2. The report (we call it Dataset) you build with the Narrator UI is a table that you can aggregate different ways, plot, and export (including writing back to the warehouse as a materialized view)

3. done optionally as part of 2

4. Yes, we keep anything written back to the warehouse up to date. You can control the cadence.

Because of 4. we work well with BI tools like Looker. Once you have the data you want just point Looker to the right table in the warehouse.


Thanks. Interesting approach!


So cool! Had a weekend tinker a few weeks ago that needed this use case, and a quick search didn't produce any usable lightweight solutions, so I did it by hand.

One tiny data point for y’all.

Good luck!

Edit: assuming eventually your schema transformation unlocks at least partially and you have at least some flexibility outside the default 11 column approach?


Great to hear from someone who also built this themselves!

As far as flexibility beyond 11 columns: I'd love to know your use case.

We do support additional metadata on each activity with what we call enrichment tables.

Some events are going to need more metadata -- a page view would want to have the actual page, the five UTM parameters, referrer, etc, which is more than the 3 fields of metadata we store on the activity stream.

So we also support creating additional tables to add metadata to each activity. Each row requires a unique activity id and its timestamp, and can have an unlimited number of additional columns.

We'll then automatically join that table into the activity stream when queries need it.
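
For illustration (with hypothetical names), an enrichment table for page views might carry the extra columns and join back to the stream on the activity id and timestamp:

    -- Sketch only: enrichment rows keyed by activity id + timestamp
    SELECT
        a.activity_id  AS enriched_activity_id,
        a.ts,
        p.path,            -- the actual page
        p.referrer,
        p.utm_source,      -- the five UTM parameters
        p.utm_medium,
        p.utm_campaign,
        p.utm_term,
        p.utm_content
    FROM activity_stream a
    JOIN segment.pages p
      ON p.event_id = a.activity_id
    WHERE a.activity = 'viewed_page'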


Conversely, isn’t literally any use case that doesn’t feed your schema something you can’t really support?

Went through your docs and the pre-populated "narratives" or query templates are thoughtful and probably capture a wide range of initial analytics (and will get better), however it's a pretty narrow solution.

I guess technically if you have 1 table with support tables of metadata you are closer to a traditional relational DB, so maybe this isn’t as restrictive as it seems.

Excited to see your product iterate!


Yeah, that's true, but you'd be surprised by the number of things that can be modeled with an activity stream.

It's one of the more common objections people have as they understand the model, but in practice we've found that it's not an issue. Our CEO loves asking people to describe their hard data questions and then redefine them in terms of the activity stream.

The metadata support tables are actually an exception -- we don't use them frequently in practice.

Thanks for engaging with us. If you ever want to dive deeper into this we're always happy to chat.


So the concept behind this is that every action is written to a single activity stream across the enterprise. Then you query that table when you want to do analytics. Does this mean I need to modify all my other applications to write to this table when an event happens?


That's a great question. No, you don't have to change any other applications at all. We work from the raw data in a warehouse.

So the typical flow is that your production applications structure their data however they want. From there the data is sent into your warehouse as-is (using an EL tool like Fivetran).

From there you write small SQL scripts in Narrator for each activity you'd like to create. Those scripts are responsible for transforming your format into ours. They're generally fairly short. Our docs describe them here: https://docs.narrator.ai/docs/transformations


I worked at a big, high-growth tech company on a product team and saw first hand how much of a challenge cleaning up / joining data was. This seems like a really novel and useful solution to a big problem (and well timed on the heels of Snowflake's IPO news).


Thank you. Once you start using it, then you will experience the difference and honestly there is no going back. You should try it or schedule a demo for your product team. I would love to showcase Narrator answering questions live for you.


Everyone is talking about the tech, but I'm really curious about the marketing site.

It's very clean and looks great. How long, or how many iterations did it take to get it to this level? Did you do it in-house or contract it out?


Thanks! We actually worked with Superside to help us put it together: https://www.superside.com/

We worked with them over the course of a few weeks. Given how complicated the topic was we came to them with copy, layout and a few loose ideas for what graphics we need. The design and graphic work was all them though.

If you have any other questions about it happy to dive in deeper!


I'm in a similar space, what do you find most compelling?


I love the paradigm and I think the Narrator team has done a great job so far, but I'm unclear about the business model. Are you still operating as a consultancy, or are you providing tooling?


Sorry for not making that clear :). We're a SaaS product.

You can check out our pricing page here https://www.narrator.ai/pricing

The initial consultancy approach helped us build out the product. Once we could show internally that it made us far faster to analyze data we were ready to launch.


This approach sounds very similar to using a single table in DynamoDB: https://www.alexdebrie.com/posts/dynamodb-single-table/


Yeah, DynamoDB is very much aligned with a single table. We do it in the customer's warehouse so they have access to all their data.

The real magic is how you assemble the data into tables that can answer any question when the output table is billions of rows. Unfortunately, that requires a lot of SQL magic that depends on taking advantage of a columnar warehouse.


What are the 11 columns?


Also replying since I wrote this up :)

- activity_id : a unique identifier for the row

- activity : the type of activity (eg 'page_view')

- timestamp : time the activity happened

- customer : the unique customer identifier

Metadata columns: three columns for any info we'd like to add to an activity. E.g. for a 'purchased product' activity it could be product name.

- feature_1
- feature_2
- feature_3

- revenue_impact : the amount of money related to this activity. A 'completed order' activity would have this

- link : a hyperlink related to the activity ('ticket submitted' might have a link to the ticket in Zendesk)

Additional customer identifiers: source and source_id are used when you're not entirely sure who the customer is. For example, a 'page view' activity wouldn't know the actual customer, but might have a unique identifier. So the source could be 'segment.io' and source_id could be their generated uuid.

- source
- source_id
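
Put together as a table definition, that schema looks roughly like this (types are generic and illustrative; the timestamp column is named ts here to avoid the reserved word):

    -- Rough sketch of the Activity Stream table; exact types vary by warehouse
    CREATE TABLE activity_stream (
        activity_id     VARCHAR,        -- unique identifier for the row
        activity        VARCHAR,        -- e.g. 'page_view'
        ts              TIMESTAMP,      -- time the activity happened
        customer        VARCHAR,        -- unique customer identifier
        feature_1       VARCHAR,        -- free-form metadata
        feature_2       VARCHAR,
        feature_3       VARCHAR,
        revenue_impact  DECIMAL(18,2),  -- money related to the activity
        link            VARCHAR,        -- hyperlink related to the activity
        source          VARCHAR,        -- e.g. 'segment.io'
        source_id       VARCHAR         -- anonymous identifier from that source
    );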


Only three metadata fields or can you have more?


We do have the ability to enrich an activity and add as many more features as needed, but that is very rare.

I am sure you are used to a lot of features in a table, but in Narrator you can borrow any feature from any activity, so you often don't even need 3 features.

https://docs.narrator.ai/docs/how-to-borrow-features-from-ot...


Activity ID, activity, customer, timestamp, 3 features, revenue impact, source and source_id (for identity resolution), and a couple of other metadata columns.

You can see some example transformations here: https://docs.narrator.ai/docs/by-data-source


So you offer a free tier, then the next tier is $200 a month? Seems a bit steep does it not? What about a $50 plan for like say 10 million rows?


Thanks for the feedback. Yeah, we can definitely consider it. We haven't totally optimized our pricing yet.

If you (or anyone) uses our free tier and wants to upgrade to something between it and the lowest paid tier just send us a message at support@narrator.ai and we'll set up something for you.


What's the underlying storage engine? Tech stack?


We don't use an underlying storage engine. Everything runs directly on your warehouse. We build the activity stream by running queries against the warehouse and writing to a table that we create inside of it. When people build datasets with Narrator we compile everything down to sql and query the warehouse directly.

The tech stack is Python for the backend scheduling and query engine hosted on AWS. For the frontend it's React. We have some internal data stores for managing our own state and a bit of caching - Postgres, S3, ElasticSearch. We use GraphQL a fair bit.


Are you trying to build a product analytics use case like Mixpanel/Amplitude with this as well?


No, we mainly focus on data modeling on top of a warehouse and deep analyses. Our customers are often data analysts, data scientists, or engineers.

That being said we would love to partner with a CDP like mixpanel and amplitude to have marketers and product people get quick insights using the data that is modeled and cleaned by the data team.


So event sourcing?


In some respects it's similar, in that our table is time-series, but we're fully on the data analysis side. Data flows from production databases (in any format) into a warehouse, and from there we transform it into the activity stream.

Event sourcing is a great way to model production data that would work really well with Narrator, but it's by no means necessary.


Hard to imagine a worse idea.

The problem in data warehousing is not the structure, it's the definitions. It doesn't matter one tiny little bit whether you store your data in a fact/dimension star schema, a normalized OLTP-style schema, arbitrary aggregate/rollup tables, or this one-table monstrosity you've constructed. You still have to do the work of determining what the data means.

This isn't 50% of the work of building a data warehouse. It's 99% of the work.

This "Activity Stream" concept has some merit but most data sources can't support it. Much to my eternal frustration, most systems only store current state, and have no concept of history or events. You can't extract "activities" from a data system that doesn't record them.

Still, maybe I'm wrong and your idea is great. The market will reject it.

Star schema modeling is woven into the very fabric of the data warehousing profession. We use it everywhere, whether it's appropriate or not (and it is often not appropriate). It's what all the BI tools support. It's what analysts know how to write queries against. Your thing may be cool and all but no one knows how to use it.

One last thing: "Numbers would always match (single source of truth)..."

Not even close. This belies just massive ignorance of how analytics actually work. Let's say I have a data point that says the total amount of order #123 is $42.00. Ok... does that include shipping? Sales tax? Oh it does, and finance cares about that, so now I need to decompose that into 34.95 + 5.00 + 2.05 and label appropriately. Now Bob's report can sum the item total (no shipping, no taxes) and Alice's report can include shipping but not taxes, and Jimmy's report can sum the total, and now your wonderful "single source of truth" story goes in the toilet. Because "$42.00" isn't truth it's just a value. What turns it into truth is definition and consensus, which are social problems, not technical problems.


Can you please make your substantive points thoughtfully, rather than in the flamewar style? This is particularly important when people are sharing their work. We don't want a culture in which people get flamed and belittled for doing that.

This is important because the effects that comments like this one produce are much stronger than the people making them assume they are. Worse, they compound. Then, unintentionally, we end up with an asshole culture which no one would want to subject themselves to. On HN, we want the incentives to go exactly the other way. I'm sure you can make your substantive criticisms without putdowns and name-calling if you want to—and that would actually be a quite valuable contribution.

https://news.ycombinator.com/newsguidelines.html


Appreciate the skepticism, since yes, we're a totally different approach.

I'll try to address your points in order.

Yes, we agree that 99% of the work is determining what the data means. Our structure doesn't magically make things better just by being a different structure. It's that once you have it, analysis / aggregation on top of it becomes substantially less work because you don't have to constantly redo models to answer new questions.

Yeah, we've all been there -- production DBs aren't typically architected to store historical data. But we've found in practice that the data sources you most care about do have it. Page views, emails sent / received, completed orders, etc. all have timestamps. And for some things you don't need it. If you wanted to do a query with all customers who are VIPs, you wouldn't need a 'became VIP' activity. Adding is_VIP as a feature to the customer in the activity stream works too. Generally if you can do an analysis the more traditional way then you should already have the data to do it in Narrator too.

Sure, star schemas are the way of doing things and this is a new approach. But the efficiency gains realized by our own data scientists are enough to where they wouldn't go back -- it warrants the investment in learning it. Our challenge as a business will be how to convince others of that as well.

What we mean by single source of truth is that data is internally consistent - each term is defined once. In your scenario you'll have a single 'completed order' activity with the total order amount. If you want to add shipping cost that's fine -- add a 'added shipping to order' activity with the cost in it. Do the same with sales tax. Bob, Alice, and Jimmy can create reports with whatever activities they want. The crucial point is by making those reports they're not defining a new model. They're just combining activities. Since all tables are generated straight from the activity stream a future analyst won't use a materialized view based on Bob's data to build a new report -- they'll build it straight from the original activities.


If I am understanding correctly, if my BI folks want to report on things like order total or order sub total, or order sub total + shipping, or order sub total + tax, or presumably anything about the contents of the order (SKUs, quantities, per item pricing, price adjustments, coupons and promotions, etc...) instead of capturing a "order submitted" event, we'd have to capture every add to cart, every cart pricing recalc operation, shipping address added, shipping method selected, shipping cost added to order, shipping method changed, new shipping cost added to order, etc... as separate events? And have the smarts to generate reports using the final selected shipping method (for example) for calculations?

For an average-ish B2C order that could be dozens of events, and for a complex B2B order that might easily be hundreds of events.


It's not quite that granular. I was more responding to the idea that there can't be a source of truth for these activities.

We actually have several e-commerce companies using our platform (with decently high volume). In practice we tend to see events like 'completed order' 'shipped order' 'product added to cart' 'order delivered'. I.e. they're all very discrete differentiated steps in the process.

There's a bit of an art between when to make a new activity and when to add it as metadata on an existing one. A completed order will more likely have 'discount code' as a feature than 'discount code applied' as an activity for example.

Your order completed event could have total amount along with tax, shipping, etc costs that add up to the total. It depends on the analyses you want to generate.

We do see things like an 'order submitted' event with the total, number of products purchased, and discount code on it, and a separate 'purchased product' event with individual product price, SKU, etc. One can do things like MRR and the other could let you identify best-selling SKUs or product categories.

Happy to chat more offline if you want to dive into the specifics for your use case. We love digging into what sorts of analysis someone wants to do and figuring out which activities make sense https://calendly.com/ahmed-narrator/30min-1


I think the point is that you can choose your own granularity, and that it tries to make things fast no matter that choice.


Definitions are super important. Getting agreement on definitions is really hard and can easily get us stuck in a loop.

In my last job, we spent months talking about "What is a Sale?" is it when someone signs a contract or when they pay or when they move in, etc...

Then you add your sale metric into a table, and as that table is used in different places the sales numbers still don't match, people don't remember what a sale is, and the conversation starts up again.

Why is Narrator different?

In Narrator, you don't define what a sale is; you break up your data into customer actions. So "Signed Contract", "Paid Invoice", "Moved In" - and thus as people ask different questions we can always clearly see what they are referring to.

Step 1 to alignment is guaranteeing consistency and transparency.

How do you deal with states? You are right, that does make it really hard. We see customers leveraging the incremental aspect of the activity stream, diffing the stream using updated_at to pull out changes as the activity stream updates (every 15 minutes, so yes, you will lose changes in between that time). This doesn't solve the problem but does take you much closer. And then when you do add proper timestamps, you can have the historical data in the activity stream merged with cleaned data from your new tables. All the users using that activity are NOT affected.

Not perfect but allows us to have as accurate data as possible.
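
For concreteness, the diffing described above might look something like this sketch (hypothetical table and column names, not our actual implementation):

    -- On each run, turn rows updated since the last processed timestamp
    -- into new activities that capture the state change
    INSERT INTO activity_stream
    SELECT
        o.id::varchar || '-' || o.status  AS activity_id,
        'order_' || o.status              AS activity,   -- e.g. 'order_shipped'
        o.updated_at                      AS ts,
        o.customer_email                  AS customer,
        NULL AS feature_1, NULL AS feature_2, NULL AS feature_3,
        NULL AS revenue_impact, NULL AS link,
        NULL AS source, NULL AS source_id
    FROM shop.orders o
    WHERE o.updated_at > (
        SELECT MAX(ts) FROM activity_stream WHERE activity LIKE 'order_%'
    )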

What about the market? Yes, I agree the market is a challenge since the world has only been using star schemas. We hope that the standardization, speed, and reusability aspects of Narrator are so compelling that people slowly switch.

How do we get numbers to match? Consistency and transparency. Everyone who uses the "Completed Order" activity uses the same revenue. So it is consistent! Then you can add the "Shipped Order" activity which has the shipping amount.

By having clear, CONSISTENT definitions and clear activities you end up in a world where your numbers match. The only way for numbers not to match is if someone is deliberately getting the data not to match, which, thanks to Dataset, is always transparent.

Definitions are a social problem but technology, limitations and consistency can help a lot.


This sounds very similar to the approach used by Salesforce internally. How does it differ?


Hmm. That's interesting. I'm not familiar with what Salesforce does. Do you have any more info about it? I'd love to learn more!


I think this may be referring to Force.com UDD model [PDF]

http://www.developerforce.com/media/ForcedotcomBookLibrary/F...



