
What's the difference between feather and parquet in terms of usage? I get the design philosophy, but how would you use them differently?


parquet is optimized for storage and compresses well (=> smaller files)

feather is optimized for fast reading


Given that storage keeps getting cheaper, wouldn't most firms want to use feather for analytic performance? But everyone uses parquet.


You can still gain a lot of performance by doing less I/O.


There's definitely a "everyone uses it because everyone uses it" effect.

Feather might be a better fit for some use cases, but parquet has fantastic support and is still a pretty good choice for the things feather does.

Unless they're really focused on eking out every bit of read performance, people often opt for the well-supported path instead.


What people have done in the face of cheaper storage is store more data.


Storage is cheap, but bandwidth isn't.


Storage getting cheaper never really reached the cloud providers, and for self-hosting it has recently gotten even more expensive due to AI bs.


And now there's Lance! https://lance.org/



I read that. But AFAIK, the feather format is stable now. Hence my confusion. I use parquet at work a lot, where we store a lot of time-series financial data. We like it. Creating the Parquet data is a pain since it's not append-able.


Generally Parquet files are combined in an LSM style, compacting smaller files into larger ones. Parquet isn't really meant for the "journal" of level-0 append-one-record style storage, it's meant for the levels that follow.


So feather for journaling and parquet for long term processing?


I still don't understand what happened to using Apache Avro [1] for row-oriented fast write use cases.

I think by now a lot of people know you can write to Avro and compact to Parquet, and that is a key area of development. I'm not sure there's a great solution yet.

Apache Iceberg tables can sit on top of Avro files as one of the storage engines/formats, in addition to Parquet or even the old ORC format.

Apache Hudi[2] was looking into HTAP capabilities - writing in row store, and compacting or merge on read into column store in the background so you can get the best of both worlds. I don't know where they've ended up.

[1] https://avro.apache.org/

[2] https://hudi.apache.org/


You basically can't do row-by-row appends to any columnar format stored in a single file. You could kludge around it by allocating arenas inside the file, but that still means huge write amplification: instead of writing a row in a single block you'd have to write a block per column.


You can do row by row appends to a Feather (Arrow IPC — the naming is confusing). It works fine. The main problem is that the per-append overhead is kind of silly — it costs over 300 bytes (IIRC) per append.

I wish there was an industry standard format, schema-compatible with Parquet, that was actually optimized for this use case.


Creating a new record batch for a single row is also a huge kludge leading to a lot of write amplification. At that point, you're better off storing rows than pretending it's columnar.

I actually wrote a row storage format reusing Arrow data types (not Feather), just laying them out row-wise not columnar. Validity bits of the different columns collected into a shared per-row bitmap, fixed offsets within a record allow extracting any field in a zerocopy fashion. I store those in RocksDB, for now.

https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...

https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...

https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...


> Creating a new record batch for a single row is also a huge kludge leading to a lot of write amplification.

Sure, except insofar as I didn’t want to pretend to be columnar. There just doesn’t seem to be something out there that met my (experimental) needs better. I wanted to stream out rows, event sourcing style, and snarf them up in batches in a separate process into Parquet. Using Feather like it’s a row store can do this.

> kantodb

Neat project. I would seriously consider using that in a project of mine, especially now that LLMs can help out with the exceedingly tedious parts. (The current stack is regrettable, but a prompt like “keep exactly the same queries but change the API from X to Y” is well within current capabilities.)


Frankly, RocksDB, SQLite or Postgres would be easy choices for that. (Fast) durable writes are actually a nasty problem with lots of little details to get just right, or you end up with corrupted data on restart. For example, blocks may be written out of order, so on a crash you may end up storing <old_data>12_4, and if you trust all content seen in the file, or even a footer in 4, you're screwed.

Speaking as a Rustafarian, there are some libraries out there that "just" implement a WAL, which is all you need, but they're nowhere near as battle-tested as the above.

Also, if KantoDB is not compatible with Postgres in something that isn't utterly stupid, it's automatically considered a bug or a missing feature (but I have plenty of those!). I refuse to do bug-for-bug compatibility, and there's some stuff that's just better left unimplemented in this millennium, but the intent is to make it I Can't Believe It's Not Postgres, and to run integration tests against actual everyday software.

Also, definitely don't use KantoDB for anything real yet. It's very early days.


> Frankly, RocksDB, SQLite or Postgres would be easy choices for that. (Fast) durable writes are actually a nasty problem with lots of little details to get just right, or you end up with corrupted data on restart. For example, blocks may be written out of order, so on a crash you may end up storing <old_data>12_4, and if you trust all content seen in the file, or even a footer in 4, you're screwed.

I have a WAL that works nicely. It surely has some issues on a crash if blocks are written out of order, but this doesn’t matter for my use case.

But none of those other choices actually do what I wanted without quite a bit of pain. First, unless I wire up some kind of CDC system or add extra schema complexity, I can stream in but I can’t stream out. But a byte or record stream streams natively. Second, I kind of like the Parquet schema system, and I wanted something compatible. (This was all an experiment. The production version is just a plain database. Insert is INSERT and queries go straight to the database. Performance and disk space management are not amazing, but it works.)

P.S. The KantoDB website says “I’ve wanted to … have meaningful tests that don’t have multi-gigabyte dependencies and runtime assumptions“. I have a very nice system using a ~100 line Python script that fires up a MySQL database using the distro mysqld, backed by a Unix socket, requiring zero setup or other complication. It’s mildly offensive that it takes mysqld multiple seconds to do this, but it works. I can run a whole bunch of copies in parallel, in the same Python process even, for a nice, parallelized reproducible testing environment. Every now and then I get in a small fight with AppArmor, but I invariably win the fight quickly without requiring any changes that need any privileges. This all predates Docker, too :). I’m sure I could rig up some snapshot system to get startup time down, but that would defeat some of the simplicity of the scheme.


And I have a system that launches Postgres in a container as part of a unit test (a little wrapper around https://crates.io/crates/pgtemp). It's much better than nothing, but the test using Postgres takes 0.5 seconds when the same business logic run against an in-memory implementation takes 0.005s.


Agreed.

There is room still for an open source HTAP storage format to be designed and built. :-)


Have you considered something like iceberg tables?


Yes, but parquet hates small files.


You can't compact? i.e. iceberg maintenance


We might be doing something wrong, but we saw significant ingestion and query performance degradation when running compaction on finance data during trading hours.


Feather (Arrow IPC) is zero copy and an order of magnitude simpler. Parquet has a lot of compatibility issues between readers and writers.

Arrow is also directly usable as the application memory model. It’s pretty common to read Parquet into Arrow for transport.


When you say compatibility issues, do you mean they are more problematic or less?

> It’s pretty common to read Parquet into Arrow for transport.

I'm confused by this. Are you referring to Arrow Flight RPC? Or are you saying distributed analytic engines use Arrow to transport Parquet between queries?


Not the OP, but Parquet compatibility issues are usually due to the varying support of features across implementations. You have to take that into account when writing Parquet data (unless you go with the defaults which can be conservative and suboptimal).

Recently we have started documenting this to better inform choices: https://parquet.apache.org/docs/file-format/implementationst...


Going back how many years? I checked recently and VTI easily outperformed over the last 5 years.


You'd be amazed at what the new breed of engineers is using Redis for. I personally saw an entire backend database using Redis with RDB+AOF on. If you redis-cli into the server, you can't understand anything, because you need to know the schema to make sense of it all.


This belongs to the "shit that never happened" list.


I lost all my college projects due to SourceForge bullshit decades ago. And old pictures from digital/film cameras. Right now, I run both Google and Apple Photos so I have 2 backups of my pics and videos.

Honestly, after 20-some years in technology, I don't think it's possible to back up everything unless you are willing to pay and constantly work at it.


I've seen this type of advice a few times now. Now, I'm not a database expert by any stretch of the imagination, but I have yet to see a UUID as primary key in any of the systems I've touched.

Are there valid reasons to use UUIDs (assuming they're used correctly) as primary keys? I know systems have incorrectly exposed primary keys to the public, but assume that's not the concern. Why use UUID over big-int?


UUIDs also allow generation of the ID to be separated from insertion into the database, which can be useful in distributed systems.


I mean this is the primary reason right here! You can pre-create an entire tree of relationships client side and ship it off to the database with everything all nice and linked up. And since by design each PK is globally unique you’ll never need to worry about constraint violations. It’s pretty damn nice.


About 10 years ago I remember seeing a number of posts saying "don't use int for ids!". Typically the reasons were things like "the id exposes the number of things in the database" and "if you have bad security then users can increment/decrement the id to get more data!". What I then observed was a bunch of developers rushing to use UUIDs for everything.

UUIDv7 looks really promising but I'm not likely to redo all of our tables to use it.


You can use the same techniques except with the smaller int64 space - see e.g. Snowflake ID - https://en.wikipedia.org/wiki/Snowflake_ID
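A hedged sketch of a Snowflake-style generator, following the layout described on the linked Wikipedia page (41-bit millisecond timestamp, 10-bit machine id, 12-bit per-millisecond sequence; the epoch constant is illustrative, not normative):

```python
import threading
import time

EPOCH_MS = 1_288_834_974_657  # custom epoch; any fixed past instant works

class SnowflakeGen:
    """Time-ordered, roughly-sortable 64-bit IDs without a UUID column."""

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024      # 10 bits
        self.machine_id = machine_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF   # 12-bit sequence
                if self.seq == 0:                   # exhausted: spin to next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.machine_id << 12) | self.seq

gen = SnowflakeGen(machine_id=1)
a, b = gen.next_id(), gen.next_id()
```

IDs from the same generator are strictly increasing, so inserts stay index-friendly like a bigint sequence while still being generatable outside the database.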


Note that if you’re using UUID v4 now, switching to v7 does not require a schema migration. You’d get the benefits when working with new records, for example reduced insert latency. The uuid data type supports both.
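For illustration, a hand-rolled UUIDv7 per RFC 9562 (Python 3.14 ships `uuid.uuid7`; this sketch is for older interpreters): the 48-bit millisecond timestamp up front is what buys the reduced insert latency, since new keys land on the right-hand edge of the index instead of at random positions.

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    # RFC 9562 layout: 48-bit unix-ms timestamp | 4-bit version |
    # 12 random bits (rand_a) | 2-bit variant | 62 random bits (rand_b).
    ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = ts_ms << 80                  # timestamp in the top 48 bits
    value |= 0x7 << 76                   # version = 7
    value |= (rand >> 68) << 64          # rand_a: 12 random bits
    value |= 0b10 << 62                  # variant = 10
    value |= rand & ((1 << 62) - 1)      # rand_b: 62 random bits
    return uuid.UUID(int=value)

u = uuid7()
```

Because it is a plain 128-bit UUID, it drops into an existing `uuid` column alongside v4 values with no schema change.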


At my company we only use UUIDs as PKs.

Main reason I use it is the German Tank problem: https://en.wikipedia.org/wiki/German_tank_problem

(tl;dr: prevent someone from counting how many records you have in that table)
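The estimate itself is simple, for concreteness: with k observed serial numbers and sample maximum m, the standard minimum-variance unbiased estimator of the total is m + m/k - 1.

```python
def estimate_total(sample_max: int, k: int) -> float:
    # German tank problem MVUE: N_hat = m + m/k - 1
    return sample_max + sample_max / k - 1

# e.g. 4 leaked sequential IDs, highest seen is 60 -> estimate 74
est = estimate_total(60, 4)
```

With sequential integer PKs an outsider only needs a handful of leaked IDs to get a tight estimate; random UUIDs leak nothing.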


What stops you from having another UUID field as the publicly visible identifier (which is only a concern for a minority of your tables)?

That way you avoid most of the issues highlighted in this article without compromising your confidential data.


I'm new to the security side of things; I can understand that leaking any information about the backend is no bueno, but why specifically is table size an issue?


At my old company, new joiners in tech were assigned a monotonically increasing number as their ID. Their GitHub profile URL reflected it.

Someone may or may not have used that pattern to track the attrition rate by running a simple script every month))


This was a great read, thank you for sharing!


Appreciate it!


At least for the Spanner DB, it's good to have a randomly-distributed primary key since it allows better sharding of the data and avoids "hot shards" when doing a lot of inserts. UUIDv4 is the typical solution, although a bit-reversed incrementing integer would work too.

https://cloud.google.com/blog/products/databases/announcing-...
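The bit-reversal trick is simple enough to sketch (hypothetical helper; Spanner now offers this natively as bit-reversed sequences): reversing the bit pattern of a counter keeps keys unique but scatters consecutive inserts across the whole keyspace.

```python
def bit_reverse64(n: int) -> int:
    # Reverse the 64-bit pattern of n: consecutive counter values map
    # to keys spread across the key range, avoiding a hot tail shard.
    return int(format(n & ((1 << 64) - 1), "064b")[::-1], 2)

keys = [bit_reverse64(i) for i in range(1, 4)]
```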


I've been using Django on and off at work for the past few years. I really like it. That being said, I still find its ORM difficult. I understand now that since it's an opinionated framework, I need to follow the Django way of thinking. The main issue is that at work, I have multiple databases from different business units, so I constantly have to figure out ways to deal with multiple databases and their idiosyncrasies. I ended up doing a lot of hand-holding: turning off managed, running inspectdb, and then manually deleting tables I don't want to show via the website, or for other reasons. For greenfield webapps we have, Django is as good as it gets.


Agreed, and their DB migration workflow leaves much to be desired. By not storing a schema/DB state alongside code, Django depends on the current DB state to try and figure it out from scratch every time you run a DB command. Not to mention defining DB state from model code is inherently flawed, since models are abstractions on top of database tables. I much prefer the Rails way of composing migrations as specific DB instructions, and then getting models 'for free' on top of those tables.


Do you use Django's multiple databases support ? (https://docs.djangoproject.com/en/6.0/topics/db/multi-db/)


Yes, we have to in order to use a lot of the features. The core issue for us is really the way Django assumes code represents database state. In a normal webapp where the application has full control of the database, that's a good idea. But our databases are overloaded: simple transactions, analytics, user management, jobs, and AI. Business uses the databases in various ways such as Power BI, AquaStudio, etc. The Django app is actually a tiny part of the database. As you can imagine, we duct-tape the heck out of the databases, and Django goes bonkers when things aren't matching.


I've used Aldjemy (https://github.com/aldjemy/aldjemy) on a small project and it worked pretty well for allowing me to compose the fairly complex queries needed that the Django ORM couldn't do.


Also, don't underestimate setting up views, or even materialized views, that you can then query through the ORM. It helps a lot and lets you combine fine-tuned SQL with the ease of use of Django, and get a lot of performance out of it. Just remember to create them in the migration scripts.


Any docs? Django migration is a HUUGE pain point for us.


  manage.py makemigrations myapp --empty --name add_some_view
(in the migration file)

  operations = [migrations.RunSQL("CREATE VIEW some_view AS ....", "DROP VIEW IF EXISTS ....")]
(in your models.py)

  class SomeView(models.Model):
       class Meta:
           db_table = 'some_view'
           managed = False

  manage.py makemigrations myapp --name add_some_view_model


An extremely common thing to do. Also great with materialized views. I bet it's documented somewhere in Django's docs.


I've been using Django for the last 10+ years, its ORM is good-ish. At some point there was a trend to use sqlalchemy instead but it was not worth the effort. The Manager interface is also quite confusing at first. What I find really great is the migration tool.


Since Django has gained mature native migrations there is a lot less point to using SQLAlchemy in a Django project, though SQLAlchemy is undeniably the superior and far more capable ORM. That should be unsurprising though - sqlalchemy is more code than the entire django package, and sqlalchemy + alembic is roughly five times as many LOC as django.db, and both are similar "density" code.


Makes sense as sqlalchemy’s docs are also 5x as confusing.


Maybe this shows my data analyst tendencies, but why not use SQL?


That’s what we do now. But it gets repetitive and doesn't leverage Django's core features.


My good friend has been a network engineer and provider in NYC for decades now, a pretty good one at that. Has anyone deployed UniFi in a computer-centric professional environment? To be more specific, a computer-centric professional environment means networks and computers are the primary way of getting the job done. He hasn't seen any UniFi deployments, but his home is all UniFi.


Small businesses? Sure. I've seen Unifi networks with a few hundred MACs often enough.

Enterprises? Thousands, tens of thousands of employees? Generally cost isn't prohibitive at the scale so bigger ecosystems with more support make way more sense. Even their enterprise switches aren't really equivalent to Cisco, Arista, Juniper, etc enterprise offerings. They're inching forward, though.


What kind of small businesses? My friend does consulting and most of his business is with small (100-250 employee) companies and sometimes small startups, typically in the office construction phase, where he comes in and sets up network infrastructure. He never sees anyone asking for UniFi, but then again, maybe it's just the cost? He feels UniFi is price-competitive at that scale, but no one wants it for some reason.

When he was with a larger company, Cisco and Juniper were the only options.


Upmarket electrical trade, engineering offices, MSPs commonly, smallish healthcare, an equine event center, a bank.

I see tons of small businesses (mom & pop, restaurants) with a UI AP or two, of course, but that's not what you meant, I don't think.

The places that COULD use UI often just don't care, and want the cheap toilet paper (netgear, ebay whatever) since cost discipline can be critical. I think there's a niche of biz with enough margin that networking/cameras get sold together and they don't insist on lowest price. My guess would also be the MSP that is quoting the job also heavily influences whether UI is used, plenty of dinosaurs out there.

Bigger places want routing, network virtualization, etc from the big players you mentioned. UI doesn't want to mess with BGP, spine-leaf, sd access/wan, etc. There's also like the 24/7 support options they want, and access to the partner/VAR/contractor networks so you have tons of options. The sales deals and dinners unfortunately factor into this too...


Thank you! He recently did a job on a newly constructed venue hall for about 300-400 people. He originally quoted full UI, which has everything the company was looking for, from network to security cameras. The company didn't want that; I think he ended up doing a combination of Cisco and some odd security system.


I've seen Cisco Meraki APs on the ceilings at more than a handful of tech companies, if that helps.


Thanks!


Ha! Maybe Javascript developers will finally drop memory usage! You need to display the multiplication table? Please allocate 1GB of RAM. Oh, you want alternate row coloring? Here is another 100MB of CSS to do that.

edit: this is a joke


I do sometimes reflect on how 64MB of memory was enough to browse the Web with two or three tabs open, and (if running BeOS) even play MP3s at the same time with no stutters. 128MB felt luxurious at that time, it was like having no (memory-imposed) limits on personal computing tasks at all.

Now you can't even fit a browser doing nothing into that memory...


HN works under Dillo and you don't need JS at all. If a site needs JS, don't waste your time. Use mpv+yt-dlp where possible.


Haha, the amount of downvotes on your very true comment just proves how many web developers there are on HN.


One even responded to an earlier comment of mine saying that we shouldn't be optimizing software for low-end machines because that would be going against technological progress...

https://news.ycombinator.com/item?id=46152275


It's funny because the blog post author makes the same joke.


I don't know about Stanford students' actual disabilities, so I can't say much to that. I went to a shitty high school and a decent middle school in a relatively poor middle-class neighborhood. Now, I live in a wealthy school district. The way parents in the two different neighborhoods treat "learning disability" is mind-blowing.

In my current school district, an IEP (Individual Education Program) is assigned to students who need help, and parents actively and explicitly ask for it, even if the kids are borderline. Please note that this doesn't take away resources from regular kids; in fact, classrooms with IEP students get more teachers, so everyone in that class benefits. IEP students are also assigned to regular classrooms so they are not treated differently, and their identities aren't top secret. Mind you, the parents here can easily afford additional help if needed.

In the other neighborhood, a long-time family friend has two young children; the older one doesn't talk in school, period. Their speech is clearly behind. The parents refuse to have the child assigned an IEP and insist that as long as the child is not disruptive, there is no reason to do so. Why don't the parents want to get help? Because they feel the older child will get labelled, bullied, and treated differently. The older child hates school, and they're only in kindergarten. Teachers don't know what to do with the child.


>Please note that this doesn't take away resources from regular kids

Sure it does; those extra teachers don't work for free. I think kids should get the help they need, but it's silly to pretend it doesn't cost money that could be going towards other things.


My kid hated school in kindergarten as well. As did I. I didn't get any kind of intervention, and I feel like that set me on a terrible course.

My kid, mercifully, was diagnosed and received intervention in the form of tutoring, therapy, that sort of thing. He still has weapons-grade ADHD, and his handwriting is terrible (dysgraphia), but he seems to have beat the dyslexia and loves reading almost as much as his mother and I do. He's happier, healthier, and has a brighter future.

I really, really hope your friend comes to understand, somehow, that their kid needs intervention, and will benefit tremendously from it.


I'm in an upper-middle neighbourhood and my kids go to public school. Not having an individual learning plan is the exception (I think that makes me double-exceptional). Classrooms DO NOT get more education-assistant resources, and combining this with the move to integrate kids who historically wouldn't have attended regular school means teachers spend all their time managing the classroom and the parents.

>> the older one doesn't talk in school, period.

If the kid is completely non-verbal there's no way they should be in a class with regular kids. This is extremely unfair to the class.


> Please note that this doesn't take away resources from regular kids; in fact, classrooms with IEP students get more teachers, so everyone in that class benefits.

There is a limited amount of money in the school system. When resources are assigned to one place they are taken away from somewhere else. The kids in the class without IEP students are getting boned by this policy.


That’s just a callous myth.


You think schools have unlimited money?

