We help bring GPU visual analytics & investigation automation to users of all sorts of graph DBs (think Tableau & ServiceNow for graph), so based on our enterprise/big tech/gov/startup interactions:
1. Shortlist (in no order): Neo4j, AWS Neptune, DataStax Graph, TigerGraph, Azure CosmosDB, and JanusGraph (the Titan fork) are the ones we see the most in practice; not in production, but on the rumor mill: Dgraph, RedisGraph, & ArangoDB. The three-and-four-letter types seem to roll their own, for better or worse. There are also some super cool ones that don't get visibility outside of the HPC + DoD world, like STINGER & Gunrock. Interestingly, a ton of our graph users aren't even on graph DBs (think Splunk/ELK/SQL), and data scientists often just do ephemeral Pandas/Spark. Coming from the early days of the end-to-end GPU computing movement, we're incorporating cuGraph (part of NVIDIA's RAPIDS, rapids.ai) into our middle tier, so you transparently benefit from it while looking at data from any of the above.
2. I now slice graph DBs more in terms of OLTP (Neo4j, Janus, Neptune, maybe Tiger) vs. OLAP (Spark GraphX, cuGraph) vs. batch (Janus, Tiger) vs. friendly BI/data science (Neo4j) vs. friendly app dev / multi-modal add-on (CosmosDB, Neo4j, Arango, Redis). Curious to see how this goes -- given the number of contributors, I'm guessing it's doing well in at least one of these. +1 to hearing reports from others!
Thanks, I really appreciate the comprehensive write-up of what your team is seeing. Any chance of a longer blog post that expands on this, especially pros/cons and performance?
For someone who just wants to run some (intensive) OLAP graph queries on the “graph formulation” of a relational or hierarchical dataset every once in a while (maybe batch, maybe user-initiated, but either way <1QPS), but doesn’t yet have a graph DB and doesn’t really want to maintain their data in a canonical graph formulation, which type of graph DB would you recommend as the simplest-to-maintain, simplest-to-scale “adjunct” to their existing infra?
I.e. what’s the graph DB that best fits the use-case equivalent to “having your data in an RDBMS and then running an indexer agent to feed ElasticSearch for searching”?
My default nowadays is to minimize work via "no graph DB": CSV/Parquet extract -> Jupyter notebook of pandas/cuGraph/Graphistry, and if that isn't enough, then a dockerized (= throwaway) Neo4j, or, if the env has it, Spark + Graphistry. The answers to a few questions can easily switch that to "Kafka -> TigerGraph/JanusGraph/Neptune" or some push-button Neo4j/CosmosDB setup (a minimal sketch of the default follows these questions):
* Primary DB: type / scale, and how fresh do the extracts need to be (daily, last minute?)
* Are queries more search-centric ("entities 4 hops out") or analytics ("personalized pagerank")?
* Graph size: 10M relations, or 10B? Document heavy, or mostly ints & short strings?
* Is the client consuming the graph via a graph UI, or API-only?
* Licensing and $ cost restrictions?
* Push-button, or in-house-developer-managed?
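To make the default concrete, here's a minimal sketch of the Parquet -> cuGraph path (the file path and the src/dst column names are assumptions; needs a RAPIDS install):

    # Sketch: Parquet edge list -> cuDF -> cuGraph PageRank.
    import cudf
    import cugraph

    edges = cudf.read_parquet("edges.parquet")   # assumed columns: src, dst

    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source="src", destination="dst")

    scores = cugraph.pagerank(G)                 # cuDF DataFrame: vertex, pagerank
    print(scores.sort_values("pagerank", ascending=False).head(10))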
Valid engineering trade-offs by graph DB dev teams mean that, currently, adding a graph DB as a second system can be tricky. The questions above probe for potential mismatches between source DB, graph stack, workload, and team burden. Feels like this needs a flow chart!
Happy to answer based on the above, and you can see why I'm curious which areas Nebula will help straddle :)
(Sherman here. I'm the founder of Nebula.) Nice to meet you here, Manish. Nebula is actually inspired by the Facebook internal project Dragon (https://engineering.fb.com/data-infrastructure/dragon-a-dist...); I was fortunate to be one of the founding members of that project. The project was started in 2012, and we had never heard of Dgraph at that time. So I'm not sure who was inspired :-)
The goal of Nebula is to be a general graph database, not just a knowledge graph database. There are some fundamental differences between the two.
We welcome any feedback and technical discussion. We would love to learn from the community and to provide a product that truly satisfies customers' needs.
Yes, Nebula Graph supports multiple storage backends by design. So, in theory, you can use whatever storage engine you want for whichever graph space in Nebula Graph.
Thanks for asking! Sorry I missed this question earlier.
Nebula doesn't store the data multiple times for indexing.
And here's how the indexing works in Nebula Graph:
You are allowed to create multiple indexes for one tag or edge type in Nebula Graph. For example, if a vertex has 5 properties attached to it, you can create one index for each property if you need to. Both the indexes and the raw data are stored in the same partition, each with its own data structure for quick scanning during query execution. Whenever a query contains a WHERE clause, the index optimizer decides which index file should be traversed.
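For example, roughly (via the nebula3-python client; host, credentials, and space/tag names below are placeholders, and exact nGQL syntax varies across Nebula versions):

    # Sketch: create a per-property tag index, then run a WHERE-filtered lookup
    # that the index optimizer can serve from the index. Names are placeholders.
    from nebula3.gclient.net import ConnectionPool
    from nebula3.Config import Config

    pool = ConnectionPool()
    pool.init([("127.0.0.1", 9669)], Config())
    session = pool.get_session("root", "nebula")

    session.execute("USE my_space")
    session.execute("CREATE TAG INDEX IF NOT EXISTS person_age_index ON person(age)")
    session.execute("REBUILD TAG INDEX person_age_index")

    result = session.execute(
        "LOOKUP ON person WHERE person.age > 30 YIELD id(vertex) AS vid"
    )
    print(result)

    session.release()
    pool.close()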
One of the most interesting picks: RDF4J (Java-based). It can connect to a lot of different SPARQL servers, but the RDF4J Native Store should be good enough for data sets on the order of 100 million triples, according to the docs.
I don't know much about it, but not long ago they announced integrated support for "federated queries", which means that if your data set can't fit in a single node, they have a solution for querying different servers within the same query [2].
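Federated queries boil down to SPARQL 1.1's SERVICE keyword. A rough sketch with Python's SPARQLWrapper, where both endpoint URLs are made up for illustration:

    # Sketch: a federated SPARQL query. The SERVICE clause fetches part of the
    # answer from a second, remote endpoint. Endpoint URLs are hypothetical.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://localhost:8080/rdf4j-server/repositories/mine")
    sparql.setQuery("""
        SELECT ?person ?colleague WHERE {
            ?person <http://example.org/worksAt> ?org .
            SERVICE <http://other-host:8080/sparql> {
                ?colleague <http://example.org/worksAt> ?org .
            }
        }
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["person"]["value"], row["colleague"]["value"])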
I'm slowly making my way through the forest of related technologies; one of the most useful pieces is SHACL [3], a language for validating and extracting the pieces of the graph that match a pattern (very loosely, think a "schema" for graphs).
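For a taste, here's a tiny sketch using Python's rdflib and pySHACL (the ex: vocabulary is made up; the shape requires every Person to have a name):

    # Sketch: validate a toy data graph against a SHACL shape with pySHACL.
    from rdflib import Graph
    from pyshacl import validate

    data = Graph().parse(format="turtle", data="""
        @prefix ex: <http://example.org/> .
        ex:alice a ex:Person ; ex:name "Alice" .
        ex:bob a ex:Person .
    """)

    shapes = Graph().parse(format="turtle", data="""
        @prefix sh: <http://www.w3.org/ns/shacl#> .
        @prefix ex: <http://example.org/> .
        ex:PersonShape a sh:NodeShape ;
            sh:targetClass ex:Person ;
            sh:property [ sh:path ex:name ; sh:minCount 1 ] .
    """)

    conforms, _, report = validate(data, shacl_graph=shapes)
    print(conforms)   # False: ex:bob has no ex:name
    print(report)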
Before using RDF for graphs, one should read up on the differences between labeled property graphs and triple stores, and choose the model that best fits their use case.
Good point. Funny you mention that article: I remember encountering both that article and another one that provides some counterpoints! [1]
Also, both of those articles are a bit old: RDF* ([2], [3]) is a newer extension to RDF that makes it easier to accomplish the same kinds of things you can do with property graphs. RDF4J has RDF* support on its roadmap [4].
To me, the fact that RDF is 1) a simpler and more general model and 2) an open standard with multiple free and commercial implementations makes RDF a more attractive option than locking into a single proprietary implementation like Neo4j.
RDF is an interoperability mechanism; it has nothing to do with the architecture you use internally for your database. You can have a PostgreSQL database and offer an endpoint for querying it via RDF.
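A rough sketch of that idea in Python with rdflib (SQLite stands in for PostgreSQL; the table, column, and vocabulary names are all made up, and real deployments would use a mapping standard like R2RML):

    # Sketch: expose relational rows as RDF triples, queryable with SPARQL.
    import sqlite3
    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/")

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO people VALUES (1, 'Alice'), (2, 'Bob')")

    g = Graph()
    for row_id, name in conn.execute("SELECT id, name FROM people"):
        person = EX[f"person/{row_id}"]
        g.add((person, RDF.type, EX.Person))
        g.add((person, EX.name, Literal(name)))

    # This graph could back a SPARQL endpoint; here we just query it locally:
    q = "SELECT ?n WHERE { ?p a <http://example.org/Person> ; <http://example.org/name> ?n }"
    for row in g.query(q):
        print(row.n)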
The project is currently deployed at multiple leading internet companies in China, including Tencent, Meituan (the Chinese Yelp), Red (the Chinese Pinterest), Vivo, and others.
That's pretty impressive. I'd love to see some detailed blog posts about setting it up or using it in production (things to watch out for, good practices for provisioning hardware, etc.).