Dataset: Databases for lazy people (readthedocs.org)
204 points by ikuyamada on Nov 12, 2013 | 49 comments



This is excellent; I've always wanted a simple DB library that makes defining a schema upfront optional. After all, a schema is a constraint that can be set in stone later on, just like adding indices for optimisation purposes.

A few questions (a quick sketch of how I've been using it follows the list):

- what's the performance like? is there any overhead from SQLAlchemy, e.g. comparing the schema every time you do an insert?

- no way to specify primary keys as an alternative to the auto-generated id column?

- no table.remove()?

- what about inserting more complicated data structures, e.g. a dict with a nested dict or a list? would be great if those were serialized auto-magically into a blob type (or used to create another table with a foreign key?)

- would be nice to be able to freeze a schema with table.freeze(), for example: from then on new columns don't get created automatically, or get stored in an extra blob column (this is a very common scenario for Python devs, where non-indexable columns just get stuck in a blob)

- let me optionally define a schema and specify defaults with table.schema(name='', price=0.0)

- would love to see table.ensureIndex('column', 'unique'), similar to Mongo, for quickly creating indices

- db = Dataset() should do dataset.connect('sqlite:///:memory:') for me - it would be nice to have that as the default connector, so that Dataset() acts as LINQ for Python by default

- dataset.freeze is nice, but I'd rather have dataset.export() & dataset.import(), letting me easily copy rows from one DB to another (after inspection, for example)

Thanks for creating this!
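For context, the kind of usage that won me over looks roughly like this (a rough sketch against a made-up 'products' table; I'm going off the docs, so details may differ by version):

  import dataset

  # Schema-less by design: the table and its columns appear on first insert.
  db = dataset.connect('sqlite:///:memory:')
  products = db['products']

  products.insert({'name': 'widget', 'price': 9.99})
  products.insert({'name': 'gadget', 'price': 19.99, 'colour': 'red'})  # new column, no migration

  cheap = list(products.find(price=9.99))
  gadget = products.find_one(name='gadget')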


This looks like a really useful library. A few questions:

1. What size limits/practical constraints are there on freezefiles (and accompanying JSON files)?

2. Are there any code samples for consuming freezefiles, or should I just assume it's simple JSON/YML parsing?

3. Has there been any thought given to using this to expose database contents via a static REST API?

Final thought: this seems like a great step towards solving the age-old problem of version-controlling data.


Author here.

As for 1.: CSV files are encoded as a stream, so they can be as large as needed. JSON is dumped as a whole from memory; I'd be keen to see if someone has written a streaming JSON encoder.

2.: Consuming, no. I normally load them in a browser with D3 or jQuery to feed them into a graphic or other interface.

3.: I'd argue this is out of scope for dataset, but simpler REST API makers would definitely be cool. Check https://github.com/okfn/webstore - this is what dataset came out of, and it makes somewhat RESTish APIs.
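To make the export side concrete, a freeze call looks roughly like this (a sketch against a hypothetical 'facts' table):

  import dataset

  db = dataset.connect('sqlite:///factbook.db')
  facts = db['facts']

  # CSV output is written row by row, so large result sets are fine.
  dataset.freeze(facts.all(), format='csv', filename='facts.csv')

  # JSON output is currently built in memory before writing, so keep these smaller.
  dataset.freeze(facts.all(), format='json', filename='facts.json')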


> I'd be keen to see if someone has written a streaming JSON encoder.

This looks interesting: https://gist.github.com/akaihola/1415730
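The core trick, as I understand it, is to emit the array incrementally instead of building the whole document in memory; a minimal sketch of the idea:

  import json

  def iter_json_array(rows):
      # Yield a JSON array chunk by chunk rather than holding it all in memory.
      yield '['
      for i, row in enumerate(rows):
          yield (',' if i else '') + json.dumps(row)
      yield ']'

  with open('dump.json', 'w') as fh:
      for chunk in iter_json_array({'n': n} for n in range(100000)):
          fh.write(chunk)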

Edit: dataset looks like a really interesting library!


I'm curious about the relative advantages/disadvantages over something like SQLAlchemy.


From an end-user point of view, SQLAlchemy relies on you first defining your models in the ORM (object-relational mapping); SQLAlchemy then takes control of issuing the SQL to create, update and drop tables depending on how you interact with your Python ORM models.

From what I can read, it seems that this cool-looking tool allows you to use SQL as a kind of object-free data store, maybe not unlike a NoSQL DB Python wrapper (freeing you from first defining your models and then making sure the SQLAlchemy machinery has updated your DB).
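Roughly, the day-to-day difference looks like this (a sketch with a made-up 'users' table; the dataset calls are as I understand them from its docs):

  # SQLAlchemy ORM: declare the model up front and let it manage the schema.
  from sqlalchemy import create_engine, Column, Integer, String
  from sqlalchemy.ext.declarative import declarative_base
  from sqlalchemy.orm import sessionmaker

  Base = declarative_base()

  class User(Base):
      __tablename__ = 'users'
      id = Column(Integer, primary_key=True)
      name = Column(String)

  engine = create_engine('sqlite:///:memory:')
  Base.metadata.create_all(engine)
  session = sessionmaker(bind=engine)()
  session.add(User(name='Ada'))
  session.commit()

  # dataset: no model and no declared schema; the table appears on insert.
  import dataset

  db = dataset.connect('sqlite:///:memory:')
  db['users'].insert({'name': 'Ada'})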


Since tables are created and modified by insert commands, there doesn't seem to be any way of maintaining integrity at the DB level. That would seem to be the main disadvantage compared to any approach that uses a schema defined in advance. You still get an RDBMS's advantages for ad hoc queries, but not integrity.


Well, it seems like this is heavily based on SQLAlchemy, judging by the "shoulders of giants" comment at the bottom of the page. Whether that's in a philosophical way or a technical way I haven't looked into enough to find out, but it would be nice to know the comparative differences and similarities.


I'm trying to figure out where something like this fits into the python data ecosystem.

For datasets that fit in memory, Pandas seems like the best bet: good I/O functions (JSON, CSV), easy slicing (NumPy array-like syntax), and some SQL-like operations (groupby, join).

For large datasets, you'd need a proper db.

So is Dataset then useful for datasets that cannot fit in memory but aren't too large?


Simple.

Recently I was tasked with mapping all of our clients' addresses to lat/long. I could've read the CSV and appended the results to each line, or used a JSON file that I would have to read/write every time.

Instead, I wrote some pseudo-helper to dump all the CSV data into a SQLite DB. Then I ran my script. Every time I found a lat/long, I could mark the client as "done" and add the lat/long for that client and every client that shared this address. When I had to cut my script because I saw one result from Google Maps was wrong, I could just edit it straight in SQL, mark it as "invalid" and relaunch my script: it started right back at the first undone row. Then I just had to select all the "invalid" results and search them manually or refine them so Google Maps would give me a proper result.

Dataset is useful for small data that is constantly being worked on.

(This answer is from a Ruby POV, and the dataset I was working on had about 4K rows, which explains why a) some Python magic wasn't available to me (maybe it would have been perfect in the Python world) and b) I didn't want to play with streams on my files.)

Of course I still need some automation to use my "DataMiner" (as I called it) to the fullest. I'll use Dataset's API as a basis to rewrite it correctly.
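For anyone wanting the same workflow in Python, with dataset it would look roughly like this (a sketch; geocode() is a hypothetical stand-in for whatever lookup service you call, and 'clients' is a made-up table):

  import dataset

  db = dataset.connect('sqlite:///clients.db')
  clients = db['clients']

  def geocode(address):
      # Hypothetical stand-in for the real geocoding call.
      raise NotImplementedError

  # Resume-friendly: on relaunch, only rows still marked 'todo' get picked up.
  for row in clients.find(status='todo'):
      lat, lng = geocode(row['address'])
      clients.update({'id': row['id'], 'lat': lat, 'lng': lng, 'status': 'done'}, ['id'])

  # Rows flagged 'invalid' by hand in SQL can be pulled back out for review.
  to_review = list(clients.find(status='invalid'))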


I know very little about what's available in Ruby, but I would have used the Pandas library to accomplish this task in Python. Its in-memory data structure, the DataFrame, is more than capable of handling those operations.


I think it's for persistence: there's a lot more to storing mutable data on disk than reading and writing JSON or CSV or pickle files, if you want it to be robust. SQLite is great for that sort of thing.

Also, it looks like it is a proper DB (access layer): point it at Postgres or something, take away its ALTER and CREATE permissions, and you're good to go.


pickle? sqlite?


I don't know that the solution to "programmers are lazy and databases are hard" is for them to learn a niche concept instead of taking the time to actually learn about data storage, or to find someone who knows it well and work with them.


Where I see this being useful isn't as a solution to programmers being "lazy" about learning DBs, but to programmers being "lazy" about dealing with RDBMS schema while rapidly iterating on a new project. Something like this lets you discover the schema as you iterate on the program, which would seem to be a win for agility.


What I don't get is why I need a "stable" data store to iterate over. On one project I didn't get a database until two weeks before it was due. Fortunately I was using Spring, so I just mocked the DAL until I actually got a real one. The whole time I changed the contract on the DAL, changed how I used it, etc. Then, when the DBA finally got around to making my DB, I presented him with a decent, thought-out ERD.

Sure, the mocks had to do some work, but a simple cache allowed me to perform all of the CRUD operations in memory. I can see doing something similar with Mongo/Couch, but having done the DAL with a pure mock set injected via Spring, I don't really see the point. The same goes for HSQLDB or another lightweight in-memory DB + Hibernate/JPA. I assume the model of interaction would work with Python or similar languages too.


Because sometimes you want a product to be available to real users while it is still developing rapidly, especially in an environment using Lean principles.

It's kind of the opposite of the delivery-date-and-it's-done style of project.


I know databases well and have a good grasp of all the normalisation levels. However, quite recently I had a project where I just stored all my data in a big JSON object, and when I was done I found out (because of a new requirement from my boss) that I now needed to query that big JSON object. At that point I really wished I had had my data in a nice SQL database from day one.


I would make a microscopic improvement to your otherwise good comment by pointing out that "databases" are big and interconnected, but not hard at all.

Hard is stuff like some very obscure sorting algorithm that is mysterious, but once you figure it out you can apply the sort.

Big and interconnected is what an RDBMS is: knowing only one or a couple of topics in isolation makes the whole thing appear useless. If all you know about is normalization, or the idea of foreign keys, or the idea of indexes, or the idea of transactions, each individually seems like a waste of time ("let's just use CSV files"). But once you know a critical mass of the (simple) parts, it becomes a valuable tool.

If sorts were like RDBMSs, then even once you understood quicksort you'd still be inherently unable to apply a quicksort unless you also knew radix sort. But they're not like that.


This is by no means meant to replace an understanding of databases. The typical use case is a web scraper, where you download a lot of messy data into an operational data store before you clean it up and load it into something with a proper model. Many people use mongo for this, but I actually like keeping my data around.
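The whole scraper-to-operational-store step ends up being only a few lines (a sketch; the endpoint is made up, and requests is just one way to fetch the data):

  import dataset
  import requests

  db = dataset.connect('sqlite:///scrape.db')
  raw = db['listings']

  # Whatever messy fields come back get stored as-is; new keys simply become new columns.
  for item in requests.get('https://example.com/api/listings').json():
      raw.insert(item)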


I have been working for a while now on an ETL domain-specific language in Scala (DataExpress, for the curious: http://dataexpress.research.chop.edu), trying to address problems similar to this.

Namely, when doing ETL you don't want to have to map all your tables and relationships into models that an ORM likes to have. IMO, there is such a dearth of tools in this space of "quick and dirty" database work. People are either using highly custom scripts on one end or things like Kettle or commercial analogs for "big serious work" on the other. There's almost no in-between.

Having something at a slightly higher level of abstraction than the database driver itself is really, really nice and makes for cleaner, more readable code. Makes me wonder about my continued work on DataExpress!


"for lazy people" should probably be "for lazy individuals not groups" as I've often seen the schema and its DB become a natural demarc point between groups, so changing the demarc on the fly amounts to forcing everyone else's API to change.

Typical example, "Say what, who decided the name column is now two columns first name and last name ?"

And sometimes there's absolutely nothing wrong with that, if the natural demarc point in a project isn't the database and its schema.


Nice. For PHP there is Idiorm, which is a really lightweight ORM wrapper that makes dealing with SQL databases a breeze:

https://idiorm.readthedocs.org/en/latest/


Idiorm is one of my favourite PHP libraries by far. Idiorm, plus Slim for routing, on top of a bunch of classes, is the nicest, most maintainable way I've come across to build small web applications in PHP!


I also enjoy its companion library, Paris, an Active Record implementation on top of Idiorm.

https://paris.readthedocs.org/en/latest/

http://j4mie.github.io/idiormandparis/


RedBean is also an excellent configuration-less ORM for PHP.

http://redbeanphp.com/


What is the most robust PHP ORM out there for large projects?


Doctrine / Eloquent.


Doctrine2


That could be nice, unless `find` is limited to equality relations with a constant. And all examples are equality relations with constants...


Looking at the source[1], `find` supports `==` and `in_()`. Beyond that, it supports custom SQL queries[2].

For more power, drop down to SQLAlchemy.

[1]: https://github.com/pudo/dataset/blob/dc144a27b01ff404a528275... [2]: https://dataset.readthedocs.org/en/latest/quickstart.html#ru...
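In practice that looks something like this (a sketch; passing a list becomes an IN clause, and db.query() takes raw SQL):

  import dataset

  db = dataset.connect('sqlite:///:memory:')
  users = db['users']
  users.insert({'name': 'Ada', 'age': 36})
  users.insert({'name': 'Bob', 'age': 41})

  by_name = list(users.find(name='Ada'))     # equality
  by_ages = list(users.find(age=[36, 41]))   # in_()

  # Anything richer drops down to raw SQL (SQLAlchemy core is also reachable).
  older = list(db.query('SELECT name FROM users WHERE age > 36'))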


> Beyond that, it supports custom sql queries[2].

That's more than a bit unsatisfactory if I'm using a query builder or ORM to avoid writing custom SQL queries.

> For more power, drop down to SQLAlchemy.

It's closer to stepping sideways; even the expression language is at a similar level of abstraction.


It pretty much is SQLAlchemy. Look at the code; there's very little there.


I'm looking for a nice syntax to implement the other filter types in Python that doesn't amount to rebuilding most of SQLAlchemy. Maybe it's just a question of exposing the existing API of SQLA better.


Looks like it doesn't work on Python 3.3 for some reason :( It would be better to put some information on the website about required software/module versions, etc.

  in <module>
    import dataset
  File "C:\Python33\lib\site-packages\dataset\__init__.py", line 7, in <module>
    from dataset.persistence.database import Database
  File "C:\Python33\lib\site-packages\dataset\persistence\database.py", line 3, in <module>
    from urlparse import parse_qs
  ImportError: No module named 'urlparse'
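For what it's worth, the failing line is a Python 2-only import; a 2/3-compatible spelling of that one import would be something like this sketch:

  try:
      from urlparse import parse_qs        # Python 2
  except ImportError:
      from urllib.parse import parse_qs    # Python 3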


This looks wonderful. I have a side project that uses BeautifulSoup to get sports scores, then computes standings based on those results and prints them out to a text file. Honestly, for a simple, personal program like that, SQLAlchemy felt like overkill, but this project looks like it'll do exactly what I need.

I'll have to look into it more in-depth later, but I love the idea behind it.


Interesting. Am I right to say that it turns NoSQL-like queries/inserts into a relational structure? What about joins and the complicated stuff that bogs down writing queries as the complexity grows? How does this library support the *advanced* stuff?


Well, the idea is to make the simple stuff really simple and keep the complicated stuff around. So if you want to write a JOIN, use SQL or SQLAlchemy's core constructs - both are accessible, and neither has been reinvented :)
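Something along these lines, for example (a sketch with made-up table names):

  import dataset

  db = dataset.connect('sqlite:///shop.db')

  # Plain SQL stays available through db.query() for joins and anything fancy.
  rows = db.query(
      'SELECT o.id, c.name '
      'FROM orders o JOIN customers c ON c.id = o.customer_id'
  )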


I actually think it's time for object databases (e.g. the ZODB) to make a comeback.

Sometimes it's useful to persist a mass of crap, without thinking through the format at all. Webscraping is a good example offered by the project author.


This is great. I was just looking for a humane way to play with databases in IPython Notebook the other day. I was able to pull a sample of data from a DB into a Pandas dataframe with just a couple lines of code. Perfect.
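For the curious, the pattern is roughly this (a sketch; 'games' stands in for the real table name):

  import dataset
  import pandas as pd

  db = dataset.connect('sqlite:///scores.db')

  # Query a sample and hand the resulting dicts straight to pandas.
  sample = db.query('SELECT * FROM games LIMIT 1000')
  df = pd.DataFrame(list(sample))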


This looks cool. My main question: why not just use Redis? It's a NoSQL solution that's proven to be fast and reliable. I assume dataset isn't meant for larger datasets than Redis can already handle. The biggest advantage is the `datafreeze` command, which could've been written for Redis instead.


Anything like this for Ruby?


This looks pretty close to a clone of the Sequel gem.

http://sequel.jeremyevans.net/ https://github.com/jeremyevans/sequel


Sequel has much more advanced features than this Python library.


A nice idea, but please change the logo.


Naked mole rats are the coolest creatures in the entire world. They hardly seem to age and they very seldom get cancer. Also, there is a really funny video about them: http://www.youtube.com/watch?v=eHi9FvUPSdQ


thanks ymmd


Are you kidding? I'm literally now looking for a production project just so I can ask Johannes Koch for a weird, unappealing logo.


It reminds me of Ren & Stimpy in the best possible way.



