
This is very interesting. Anybody use it? Alternatives?


We help bring gpu visual analytics & investigation automation to users of all sorts of graph DBs (think tableau & servicenow for graph), so based on our enterprise/big tech/gov/startup interactions:

1. Shortlist (in no particular order): Neo4j, AWS Neptune, Datastax Graph, TigerGraph, Azure CosmosDB, and JanusGraph (Titan fork) are the ones we see the most in practice. Not in production but in the rumor mill: Dgraph, RedisGraph, & ArangoDB. The three-and-four-letter types seem to roll their own, for better or worse. There are also some super cool ones that don't get visibility outside of the HPC+DoD world, like Stinger & Gunrock. Interestingly, the reality is a ton of our graph users aren't even on graph DBs (think Splunk/ELK/SQL), and data scientists often just do ephemeral Pandas/Spark. As folks from the early days of the end-to-end GPU computing movement, we're incorporating cuGraph (part of nvidia rapids.ai) into our middle tier, so you get to transparently benefit from it while looking at data in any of the above.

2. I now slice graph DBs more in terms of OLTP (neo4j, janus, neptune, maybe tiger) vs OLAP (spark graphx, cugraph) vs batch (janus, tiger) vs friendly BI/data science (neo4j) vs friendly app dev / multi-modal add-on (CosmosDB, Neo4j, Arango, Redis). Curious to see how this goes -- given the number of contributors, I'm guessing it's doing well in at least one of these. +1 to hearing reports from others!


Thanks, I really appreciate the comprehensive write up of what your team is seeing. Any chance of a longer blog post that expands on this, especially pro-cons and performance?


Yes, that is a great idea!


For someone who just wants to run some (intensive) OLAP graph queries on the “graph formulation” of a relational or hierarchical dataset every once in a while (maybe batch, maybe user-initiated, but either way <1QPS), but doesn’t yet have a graph DB and doesn’t really want to maintain their data in a canonical graph formulation, which type of graph DB would you recommend as the simplest-to-maintain, simplest-to-scale “adjunct” to their existing infra?

I.e. what’s the graph DB that best fits the use-case equivalent to “having your data in an RDBMS and then running an indexer agent to feed ElasticSearch for searching”?


My default nowadays is to minimize work via "no graph db": csv/parquet extract -> jupyter notebook of pandas/cugraph/graphistry, and if that isn't enough, then dockerized (=throwaway) neo4j, or if the env has it, spark+graphistry. The answers to some questions can easily switch the recommendation to, say, "kafka -> tigergraph/janusgraph/neptune", or some push-button neo4j/cosmosdb stuff:

* Primary DB: type / scale, and how fresh do the extracts need to be (daily, last minute?)

* Are queries more search-centric ("entities 4 hops out") or analytics ("personalized pagerank")?

* Graph size: 10M relations, or 10B? Document heavy, or mostly ints & short strings?

* Is the client consuming the graph via a graph UI, or API-only?

* Licensing and $ cost restrictions?

* Push-button or inhouse-developer-managed?

The result of (valid) engineering trade-offs by graph db dev teams means that, currently, adding a graph db as a second system can be tricky. The above represent potential mismatches between source db / graph stack / workload and team burden. Feels like this needs a flow chart!

Happy to answer based on the above, and you can see why I'm curious which areas Nebula will help straddle :)
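To make the "no graph db" default and the search-vs-analytics question above a bit more concrete, here is a minimal stdlib-only sketch. Everything in it is an assumption for illustration: the `src`/`dst` column names, treating edges as undirected, the helper names `k_hop` and `personalized_pagerank`, and the damping factor and iteration count. In a real notebook you would likely reach for pandas/cugraph rather than hand-rolled loops.

```python
import csv
import io
from collections import defaultdict, deque

# Hypothetical extract from the primary DB: one row per relation.
EDGES_CSV = """src,dst
a,b
b,c
c,d
d,e
a,f
"""

def load_adjacency(text):
    """Build an undirected adjacency map from a src,dst CSV extract."""
    adj = defaultdict(set)
    for row in csv.DictReader(io.StringIO(text)):
        adj[row["src"]].add(row["dst"])
        adj[row["dst"]].add(row["src"])
    return adj

def k_hop(adj, start, k):
    """Search-style query: every node within k hops of start (plain BFS)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == k:
            continue  # do not expand past the hop limit
        for nxt in adj[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    seen.pop(start)
    return seen  # node -> hop distance

def personalized_pagerank(adj, seed, alpha=0.85, iters=50):
    """Analytics-style query: power iteration that restarts at seed."""
    nodes = list(adj)
    rank = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            share = alpha * rank[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        nxt[seed] += 1.0 - alpha  # restart mass goes back to the seed
        rank = nxt
    return rank

adj = load_adjacency(EDGES_CSV)
print(k_hop(adj, "a", 2))
print(sorted(personalized_pagerank(adj, "a").items(), key=lambda kv: -kv[1]))
```

The first query only touches a neighborhood, which is the shape OLTP-style graph DBs are tuned for; the second sweeps the whole graph repeatedly, which is where an OLAP engine or a GPU library tends to pay off.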


Very insightful answer! Thanks for sharing your opinions here. Nebula Graph is good at OLTP use cases where high QPS and low latency are required.


I'd say Dgraph (https://dgraph.io/) is the closest competitor, along with Neo4j (https://neo4j.com), which has a longer heritage.

I'd also include redis because of the graph module (https://oss.redislabs.com/redisgraph/).

I've likely missed a bunch of others. Please add them, as I'm interested in graph DBs and have only scratched the surface myself.


There is also decentralized Gun DB: https://gun.eco/

OrientDB is an alternative to Neo: https://orientdb.org/

There are a bunch of DBs that are compatible with TinkerPop and so can be queried with Gremlin: http://tinkerpop.apache.org/


There is also Weaviate, still in development, which has a flavor of GraphQL for querying: https://github.com/semi-technologies/weaviate

And this awesome page has some good entries: https://github.com/jbmusso/awesome-graph/blob/master/README....


Alternative: Dgraph

https://dgraph.io/

In architecture and goals it actually closely resembles Dgraph. Would love to see an (opinionated) comparison by Manish, the CEO of Dgraph.


(Manish here) Don't know much about Nebula. Feels quite inspired by Dgraph.


(Sherman here. I'm the founder of Nebula) Nice to meet you here, Manish. Nebula is actually inspired by the Facebook internal project Dragon (https://engineering.fb.com/data-infrastructure/dragon-a-dist...). Fortunately I was one of the founding members of the project. The project was started in 2012. We never heard of dgraph at that time. So I'm not sure who was inspired :-)

The goal of Nebula is to be a general graph database, not just a knowledge graph database. There are some fundamental differences between the two.

We welcome any positive feedback and technical discussion. We would love to learn from the community and to provide a product which truly satisfies customers' needs.


> The goal of Nebula is to be a general graph database, not just a knowledge graph database. There are some fundamental differences between the two.

I am by no means a Graph expert but what are some of the mentioned fundamental differences?


I had the very same feeling, dgraph is older and has a larger community plus additional features like:

- geospatial features

- good speed, as it is based on the BadgerDB key-value database and the Ristretto cache library.

- http library and other features

One advantage I saw in Nebula Graph is role-based access control, which is not available in Dgraph as of today.

I am very curious about a benchmark between Nebula Graph and Dgraph.

Also, what storage system is used in Nebula Graph?


ACL is an enterprise dgraph feature: https://dgraph.io/support


Geospatial is also available in Nebula Graph. :)

As to the storage system, Nebula Graph is based on multi-group raft and RocksDB.


Can you add support for a module-based storage system, e.g. if someone wants to use BadgerDB, LevelDB, or any other storage system instead of RocksDB?


Yes, Nebula Graph supports multiple backend storages by design. So theoretically you are able to use whatever storage you want for whichever graph space in Nebula Graph.

You may take a look at this article about the design of our storage engine: https://github.com/vesoft-inc/nebula/blob/master/docs/manual...

In 2020 we will be working on more plugins. You may stay tuned if that interests you. :)
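As a rough picture of what "multiple backend storages by design" could look like from the outside, here is a minimal sketch of a pluggable key-value contract with one graph space bound to one backend. This is not Nebula Graph's actual plugin API: `KVBackend`, `InMemoryBackend`, `GraphSpace`, and the `e:src:type:dst` key layout are all hypothetical names made up for illustration, and the in-memory dict merely stands in for RocksDB, BadgerDB, or LevelDB.

```python
from abc import ABC, abstractmethod

class KVBackend(ABC):
    """Hypothetical minimal contract a storage plugin would satisfy."""

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: bytes): ...

    @abstractmethod
    def scan(self, prefix: bytes): ...

class InMemoryBackend(KVBackend):
    """Stand-in for a real engine: a dict scanned in sorted key order."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def scan(self, prefix):
        for key in sorted(self._data):
            if key.startswith(prefix):
                yield key, self._data[key]

class GraphSpace:
    """A graph space bound to whichever backend it was configured with."""

    def __init__(self, name, backend: KVBackend):
        self.name = name
        self.backend = backend

    def add_edge(self, src, edge_type, dst):
        # Made-up key layout: edges sharing a source sort next to each other.
        self.backend.put(f"e:{src}:{edge_type}:{dst}".encode(), b"")

    def out_edges(self, src):
        prefix = f"e:{src}:".encode()
        return [k.decode().split(":")[2:] for k, _ in self.backend.scan(prefix)]

space = GraphSpace("demo", InMemoryBackend())
space.add_edge("alice", "follows", "bob")
space.add_edge("alice", "likes", "post1")
print(space.out_edges("alice"))
```

The graph layer only ever calls `put`, `get`, and `scan`, so swapping the engine for a given space means supplying another class that implements those three methods.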


Geospatial is on the TODO list


Geospatial support has already been merged to the code base. :)


Dgraph has another disadvantage, data redundancy: data associated with multiple indexes is stored multiple times for speed.

Does Nebula also store data multiple times for multiple indexes?


Thanks for asking! Sorry I missed this question earlier.

Nebula doesn't store data multiple times for indexes.

And here's how the indexing works in Nebula Graph:

You are allowed to create multiple indexes for one tag or edge type in Nebula Graph. For example, if a vertex has 5 properties attached to it, then you can create one index for each of them if you need to. Both the indexes and the raw data are stored in the same partition, each with its own data structure, so that query statements can be scanned quickly. Whenever a query contains a "where" clause, the index optimizer decides which index file should be traversed.
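A toy model of the idea described above, in case it helps: rows are kept once, each index holds only vertex IDs, and a trivial "optimizer" picks an index when a "where" condition matches one. This is only an illustration under my own assumptions, not Nebula Graph's implementation; `TagStore`, `create_index`, and `where` are hypothetical names, and only equality conditions are handled.

```python
from collections import defaultdict

class TagStore:
    """Toy store for one tag: rows kept once, indexes hold only vertex IDs."""

    def __init__(self):
        self.rows = {}     # vertex_id -> property dict (the only copy)
        self.indexes = {}  # property name -> {value -> set of vertex_ids}

    def create_index(self, prop):
        index = defaultdict(set)
        for vid, props in self.rows.items():
            index[props[prop]].add(vid)
        self.indexes[prop] = index

    def insert(self, vid, props):
        self.rows[vid] = props
        for prop, index in self.indexes.items():
            index[props[prop]].add(vid)

    def where(self, **conds):
        """Return (plan, matching vertex IDs) for equality conditions."""
        # "Optimizer": use the first condition that has an index, else scan.
        indexed = next((p for p in conds if p in self.indexes), None)
        if indexed is None:
            plan, candidates = "full scan", self.rows.keys()
        else:
            plan = f"index on {indexed}"
            candidates = self.indexes[indexed].get(conds[indexed], set())
        hits = sorted(
            v for v in candidates
            if all(self.rows[v].get(p) == val for p, val in conds.items())
        )
        return plan, hits

players = TagStore()
players.create_index("team")
players.insert("p1", {"name": "Ann", "team": "red", "age": 30})
players.insert("p2", {"name": "Bo", "team": "blue", "age": 30})
players.insert("p3", {"name": "Cy", "team": "red", "age": 25})
print(players.where(team="red", age=30))
print(players.where(age=30))
```

The second query has no usable index, so the toy falls back to scanning every row, which is the case an extra index on `age` would avoid.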


Hi there, out of curiosity, what do you mean by the data being stored multiple times for speed?

We (I work at Dgraph) have data redundancy when you have multiple replicas for a given group - but that's an optional feature.

Thanks!


Check the list of implementations of SPARQL [1].

One of the most interesting picks: RDF4j (Java based). It can connect to a lot of different SPARQL servers, but the RDF4j Native Store should be good enough for data sets on the order of "100 million triples", according to the docs.

I don't know much about it, but not long ago they announced integrated support for "federated queries", which means that if your data set can't fit in a single node, they have a solution to query different servers in the same query [2].

I'm slowly learning through the forest of related technologies, one of the most useful is SHACL [3], which is a language to validate and extract pieces of the graph that match a pattern (very loosely, think a "schema" for graphs).

1: https://en.wikipedia.org/wiki/List_of_SPARQL_implementations

2: https://rdf4j.org/news/2019/10/15/fedx-joins-rdf4j/

3: https://rdf4j.org/documentation/programming/shacl/


Before using RDF for graphs, one should learn the differences between labeled property graphs and triple stores, and choose the model that best fits their use case.

Take your time and be wary of how objective any given article is. Vendors try to lure you in. The following has some good info (but a Neo4j bias): https://dzone.com/articles/rdf-triple-stores-vs-labeled-prop...


Good point. Funny you mention that article: I remember encountering both that article and another one that provides some counterpoints! [1]

Also, both those articles are a bit old: RDF* ([2],[3]) is a new extension for RDF that makes it easier to accomplish the same kind of things you can do with property graphs. RDF4j has support for RDF* in the roadmap! [4].

To me, the fact that RDF is 1) a simpler and more general model and 2) an open standard with multiple free and commercial implementations makes RDF a more attractive option than locking into a single proprietary implementation like Neo4j.

--

1: http://www.snee.com/bobdc.blog/2018/04/reification-is-a-red-...

2: http://olafhartig.de/slides/RDFStarInvitedTalkWSP2018.pdf

3: http://blog.liu.se/olafhartig/2019/01/10/position-statement-...

4: https://github.com/eclipse/rdf4j/issues/1484
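For readers new to the property-graph vs triple-store discussion, a small sketch of the same fact ("Alice knows Bob, since 2015") in three shapes may help: a labeled-property-graph edge, plain RDF using the standard reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`), and an RDF*-style triple whose subject is itself a triple. Plain Python tuples and dicts stand in for real stores here; the `ex:` names and the helper functions are made up for illustration.

```python
# Labeled property graph: the edge itself carries key/value properties.
lpg_edge = {"src": "alice", "label": "KNOWS", "dst": "bob",
            "props": {"since": 2015}}

# Plain RDF: a triple has no slot for properties, so standard reification
# adds a statement node, four triples describing the original triple,
# and one more triple attaching the property to that node.
rdf_triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:stmt1", "rdf:type", "rdf:Statement"),
    ("ex:stmt1", "rdf:subject", "ex:alice"),
    ("ex:stmt1", "rdf:predicate", "ex:knows"),
    ("ex:stmt1", "rdf:object", "ex:bob"),
    ("ex:stmt1", "ex:since", 2015),
]

# RDF*-style: a triple may be used directly as the subject of another triple.
rdf_star_triples = [
    (("ex:alice", "ex:knows", "ex:bob"), "ex:since", 2015),
]

def since_lpg(edge):
    """Read the property straight off the edge."""
    return edge["props"]["since"]

def since_reified(triples, s, p, o):
    """Find the statement node describing (s, p, o), then read ex:since."""
    facts = set(triples)
    stmts = {t[0] for t in triples if t[1:] == ("rdf:type", "rdf:Statement")}
    for stmt in stmts:
        needed = {(stmt, "rdf:subject", s),
                  (stmt, "rdf:predicate", p),
                  (stmt, "rdf:object", o)}
        if needed <= facts:
            return next((v for (x, k, v) in triples
                         if x == stmt and k == "ex:since"), None)
    return None

def since_star(triples, s, p, o):
    """Look up ex:since on the embedded triple (s, p, o)."""
    return next((v for (subj, k, v) in triples
                 if subj == (s, p, o) and k == "ex:since"), None)

print(since_lpg(lpg_edge))
print(since_reified(rdf_triples, "ex:alice", "ex:knows", "ex:bob"))
print(since_star(rdf_star_triples, "ex:alice", "ex:knows", "ex:bob"))
```

All three lookups recover the same value; the difference is how much bookkeeping each model needs to attach a property to a relationship.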


RDF is an interoperability mechanism, it has nothing to do with the architecture you use internally for your database. You can have a PostgreSQL database and offer an endpoint for querying it via RDF.
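A minimal sketch of that point, with the stdlib `sqlite3` module standing in for PostgreSQL: the data stays relational and triples are produced on the fly at the edge. The mapping convention (one subject IRI per row, one predicate per column), the `http://example.org/` base, and the function name `rows_as_triples` are assumptions made up for illustration, not a standard or a real endpoint.

```python
import sqlite3

def rows_as_triples(conn, table, key_col, base="http://example.org/"):
    """Yield (subject, predicate, object) for each non-key, non-NULL cell.

    The table name is interpolated into SQL, so it must come from trusted
    code, never from user input.
    """
    cur = conn.execute(f"SELECT * FROM {table} ORDER BY {key_col}")
    cols = [d[0] for d in cur.description]
    for row in cur:
        record = dict(zip(cols, row))
        subject = f"{base}{table}/{record[key_col]}"
        for col, value in record.items():
            if col != key_col and value is not None:
                yield (subject, f"{base}{table}#{col}", value)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?)",
                 [(1, "Ann", "Oslo"), (2, "Bo", None)])

triples = list(rows_as_triples(conn, "person", "id"))
for triple in triples:
    print(triple)
```

Nothing about the storage changed; only the view handed to the consumer is graph-shaped.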


Currently the project has been deployed in multiple leading internet companies in China, including Tencent, MeiTuan (Chinese Yelp), Red (Chinese Pinterest), Vivo, and so on.


That's pretty impressive. I'd love to see some detailed blog posts about setting it up, or using it in production (things to watch out for, good practices for provisioning hardware, etc.).


There is a Getting Started series in the GitHub wiki page: https://github.com/vesoft-inc/nebula/blob/master/docs/manual...

Check out the Getting started YouTube video here if you prefer video tutorials: https://www.youtube.com/channel/UC73V8q795eSEMxDX4Pvdwmw

Also some FAQs: https://github.com/vesoft-inc/nebula/blob/master/docs/manual...

If you are interested in the architectural design of the project, here are some articles for your reference:

Overview: https://github.com/vesoft-inc/nebula/blob/master/docs/manual...

Storage engine: https://github.com/vesoft-inc/nebula/blob/master/docs/manual...

Query engine: https://github.com/vesoft-inc/nebula/blob/master/docs/manual...

Feel free to contact us if anything is missing. :)


We're using Blazegraph (the DB that Amazon Neptune uses) with great success. We only use the SPARQL API with quads, so we get nested graphs.



