More

mattewong · 2025-12-06T06:20:14 1765002014

The site says privacy-first and also says "we cannot lose your data if we never collect it" but it makes a WHOLE lot of POST calls passing what appear to be encrypted payloads, and refuses to work offline-- so the user has no way to verify that the limited info you claim to be collecting is in fact what is being collected. Worse, if you simply visit and use the site, you never once see any mention of terms of use, and yet those published terms-- which you will only find if you actively scroll way down to the bottom of the SPA and click on a tiny link-- and claim to be binding merely by the use of the site, which could have easily happened without the user having any knowledge or notice whatsoever that they "agreed" to something (in other words, without actually agreeing to anything). The terms also do not say anything about your data collection, though if one looks hard enough one can find it mentioned in the privacy policy, well below the contradictory opening line that says "we cannot lose your data if we never collect it". Sorry, but meta data is still data, so "we never collect [your data]" is simply false.

So, maybe you did not intend it to be so, but to me the site comes off as being very sketchy and untrustworthy.

nighwatch · 2025-12-06T08:57:45 1765011465

Thanks for the thoughtful and critical feedback, this is exactly why I posted here.

You raised fair points about the mixed messaging, and I’ve just pushed updates to address them:

• Privacy Policy & data collection: You're right that the tagline “we never collect data” was too absolute. I do use standard analytics (GA) for anonymous usage metrics and error tracking. The Privacy Policy now clearly separates File Data — which is processed 100% locally and never leaves the browser — from Usage Metadata, which is anonymized and collected only for understanding feature performance.

• Network activity: The POST requests you saw come solely from those analytics libraries. No file contents, pasted text, or conversion results ever hit the network. I’ll also review whether I can reduce or defer analytics calls to make this more transparent.

• Visibility of terms: Agreed. I’ve added a prominent Privacy/Terms link in the header and a first-visit consent banner so users aren’t relying on a tiny footer link or assumptions.

• Offline behavior: The conversion logic runs entirely in Web Workers and doesn’t require a server, but my PWA config wasn’t robust enough to guarantee a clean offline startup. I’m working on tightening that up so users can verify the “local-only” behavior themselves.

None of this was intended to be sketchy — I simply oversimplified the marketing copy and didn’t surface the right information. I really appreciate you calling it out and giving me the chance to improve it.

mattewong · 2025-12-02T23:15:24 1764717324

What is the advantage of this over the parser used by xsv? From the documentation, the only difference I can see is that xsv handles weird CSV better than this crate-- which in some situations is very important! So presumably this one must be faster? If so, how much faster? Or is there some other advantage to this?

Yomguithereal · 2025-12-03T14:24:17 1764771857

There is a section here with a table describing the kind of boost you might observe vs. rust-csv used by xsv: https://docs.rs/simd-csv/latest/simd_csv/#regarding-performa...

mattewong · 2025-12-02T23:08:52 1764716932

Looks interesting and I gave it a whirl-- thank you. Your intro mentions filter + sort, but I couldn't find a way to do that in the web UI (maybe that's just my ineptitude).

Re your question whether it would be useful: hard to answer because I cannot tell right now whether it solves (or intends to solve) any specific problem better than plenty of other alternatives.

mattewong · 2025-10-23T15:56:07 1761234967

Hi HN, I'm the author of zsv.

zsv was built because I needed a library to integrate with my application, and other CSV parsers had one or more of a variety of limitations (couldn't handle "real-world" CSV or malformed UTF8, were too slow, degraded when used on very large files, couldn't compile to web assembly, could not handle multi-row headers (seems like basically none of the other CSV parsers do this) etc-- more details are in the repo README). The closest solution to what I wanted was xsv, but was not designed as an API and I still needed a lot of flexibility that wasn't already built into it.

My first inclination was to use flex/bison but that approach yielded surprisingly slow performance; SIMD had just been shown to be useful in unprecedented performance gains for JSON parsing, so a friend and I took a page from that approach to create what afaik (though I could be wrong) is now the fastest CSV parser (and most customizable as well) that properly handles "real-world" CSV.

When I say "real-world CSV": if you've worked with CSV in the wild, you probably know what I mean, but feel free to check out the README for a more technical explanation.

With parser built, I found that some of the use cases I needed it for were generic, so I wrapped them up in a CLI. Most of the CLI commands are run-of-the-mill stuff: echo, select, count, sql, pretty, 2tsv, stack. Some of the commands are harder to find in other utilities: compare (cell-level comparison with customizable numerical tolerance-- useful when, for example, comparing CSV vs data from a deconstructed XLSX, where the latter may look the same but technically differ by < 0.000001), serialize/flatten, 2json (multiple different JSON schema output choices). A few are not directly CSV-related, but dovetail with others, such as 2db, which converts 2json output to sqlite3 with indexing options, allowing you to run e.g. `zsv 2json my.csv --unique-index mycolumn | zsv 2db -t mytable -o my.db`.

I've been using zsv for years now in commercial software running bare metal and also in the browser (see e.g. https://liquidaty.github.io/zsv/), so I finally got around to tagging v1.0.1 as the first production-ready release.

I'd love for you to try it out and would welcome any feedback, bug reports, or questions.

mattewong · 2025-10-09T17:10:42 1760029842

> even if it's "this isn't useful because X."

OK... this isn't useful to me because I now just only use mermaid and stopped using other diagraming tools, because mermaid can be embedded now in so many places (github, in-browser editing/rendering, shareable URL, python etc), can be easily saved as text and edited (whether manually or programmatically), there are no IP issues, etc. So more useful for me would be, whatever you need to use something other than mermaid for-- do what it takes to make mermaid fill that need, so that there is no need to use anything else.

mattewong · 2025-10-09T17:04:22 1760029462

Maybe better if the website discloses the fact that file names are getting tracked via POST to the server

mattewong · 2025-10-09T16:56:15 1760028975

< Technical details about the data wrangling happy to share in comments if anyone's interested in that nightmare.

I'm interested. That is the nightmare my company obsesses about solving

hermitcrab · 2025-10-09T18:57:52 1760036272

Do you mean solve - as in sort out your own mess?

Or solve - as in make a tool to sort out everyone else's mess?

If it is the former, have you looked at data wrangling tools, like Easy Data Transform (our product), Tableau Prep or Alteryx?

mattewong · on Dec 5, 2024

This is misleading. First, as other comments have noted, it is comparing multi-threaded/parallelized vs single-threaded, and its total CPU time is much longer than wc's. Second, it suggests there is something special going on, when there is not. Just breaking the file into parts and running wc -l on it-- or even, running a CSV parser that is much more versatile than DuckDB's-- I'm pretty confident will perform significantly faster than this showing. Bets anyone?

szarnyasg · on Dec 5, 2024

I am the author of the original post and I also wrote a followup blog post on it yesterday: https://szarnyasg.org/posts/duckdb-vs-coreutils/

Yes, if you break the file into parts with GNU Parallel, you can easily beat DuckDB as I show in the blog post.

That said, I maintain that it's surprising that DuckDB outperforms wc (and grep) on many common setups, e.g., on a MacBook. This is not something many databases can do, and the ones which can usually don't run on a laptop.

mattewong · on Dec 9, 2024

Your follow-up post is helpful and appreciated!

Re the original analysis, my own opinion is that the outcome is only surprising when the critical detail, highlighting how the two are different, is omitted. It seems very unsurprising if it is rephrased to include that detail: "DuckDB, executed multi-threaded + parallelized, is 2.5x faster than wc, single-threaded, even though in doing so, DuckDB used 9.3x more CPU".

In fact, to me, the only thing that seems surprising about that is how poorly DuckDB does compared to WC-- 9x more CPU for only 2.5x more improvement.

But an interesting analysis regardless of the takeaways-- thank you

mattewong · on Dec 5, 2024

I am always a proponent of starting with the end goal and then working backward. What are the end results you are aiming to achieve (or aiming to allow your audience to achieve)? Is marginal precision more important than the speed impact? The optimal database design will depend on that (i.e., on what you are optimizing for...).

It would also be very helpful, imho, to indicate keys and indexes, perhaps by modifying your schema diagram, or simply (and maybe better), just dump the actual SQL schema definition (i.e. the output from sqlite3's ".schema" command)

Rendello · on Dec 5, 2024

Thank you for taking the time, this is exactly the sort of discussion I was looking for :)

I had a clear end goal in mind, a smallish database I can query for municipality data easily, but in the middle of a project I started to miss the forest for the trees thinking about every theoretical use case. I think I achieved my goal, but it would've been better to keep my end goal in mind.

The schema diagram is limited, I agree. The tool I was using has no options, so I might switch to another library and do more manual work, or otherwise indicate it in text as you mentioned. The diagram does show primary and foreign key relations, it just might not be clear.

mattewong · on Dec 5, 2024

Glad to be helpful-- I'm in the business of data process automation, so I appreciate the opportunity to learn about new use cases. If you are willing to share what your end goal was in more detail (even as simple as an SQL query that you would now run want to run against your current schema), I'd be interested to see how an optimal process could be designed to easily generate that, and possibly suggest some tooling you could find useful. You may also want to try posting questions like this in forums such as the Seattle Data Guy's discord channel, and I'm suspect you will get lots of suggestions and advice.

Rendello · on Dec 5, 2024

Here's an example case: getting all municipalities in Canada with more than 1500 native speakers of Inuktitut (the language of the Inuit):

First, I have to find the characteristic I'm looking for:

   SELECT id, description FROM characteristic WHERE description LIKE "%Inuktitut%";

I can see what I'm looking for, id 455, which is "Total - Mother tongue for the total population excluding institutional residents - 100% data; Single responses; Non-official languages; Indigenous languages; Inuktut (Inuit) languages; Inuktitut" (since these characteristics have parents and children, I can build up a graph with a CTE, which could be useful).

To use it, I query the view with:

  SELECT geo_name, c1_count_total, c2_count_men, c3_count_women FROM cview
  WHERE geo_level_name = 'Census subdivision' -- Roughly a municipality
  AND  characteristic_id = 455
  AND c1_count_total > 1500 -- Total count (speakers, in this case)
  ORDER BY c1_count_total DESC;

Which gives:

  ┌─────────────────────────────────────┬────────────────┬──────────────┬────────────────┐
  │              geo_name               │ c1_count_total │ c2_count_men │ c3_count_women │
  ├─────────────────────────────────────┼────────────────┼──────────────┼────────────────┤
  │ 'Arviat, Hamlet (HAM)'              │ 2465.0         │ 1235.0       │ 1235.0         │
  │ 'Iqaluit, City (CY)'                │ 2195.0         │ 985.0        │ 1215.0         │
  │ 'Puvirnituq, Village nordique (VN)' │ 1965.0         │ 940.0        │ 1030.0         │
  │ 'Igloolik, Hamlet (HAM)'            │ 1820.0         │ 975.0        │ 845.0          │
  │ 'Inukjuak, Village nordique (VN)'   │ 1745.0         │ 910.0        │ 835.0          │
  │ 'Kuujjuaq, Village nordique (VN)'   │ 1745.0         │ 795.0        │ 945.0          │
  └─────────────────────────────────────┴────────────────┴──────────────┴────────────────┘

So it appears to work as intended (although the men+women count isn't exactly the same as the total count, I suspect that's some StatCan wierdness that I'll have to look at). It's slow, but I'm not too worried about that, it's not meant for fast real time queries, and I can always add indexes later especially since it's now a read-only database in practice.

My main concerns are regarding the schema and use of REALs and NULLs. Where there are empty strings in the CSV, I have NULLs in the database. I suspect that's the best move, but I suspect having REALs instead of DECIMALs (ie. TEXT processed with the SQLite extension) may be the wrong abstraction. For my usecase, I think the database covers all of my needs except having the areas associated with a province/territory, although I know which dataset I would need for that information and that will likely be in a future iteration of the database.

I say "it appears to work" because I'm not sure the best way to test the data in the database without reimplementing the parsing and rechecking every row in each CSV against the database. I'm wary of blindly trusting the data is correct and all there.

mattewong · on Dec 9, 2024

That is great, thank you. I'd love to continue the conversation-- maybe easier in a separate forum. Can I follow-up via the email address on your profile (gaven...)?

Rendello · on Dec 9, 2024

Of course! :)

mattewong · on Oct 28, 2024

Haven't yet seen any of these beat https://github.com/liquidaty/zsv (of which I'm an author) when real-world constraints are applied (e.g. we no longer assume that line ends are always \n, or that there are no dbl-quote chars, embedded commas/newlines/dbl-quotes). And maybe under the artificial conditions as well.