
Is there a way to keep up with this story? Looks like Orpheusdroid is not on Twitter.

I am not sure who is right and who is wrong here. But I have the feeling that it is too easy for big companies to threaten small companies.

I wonder if there should be some institution where small companies and individuals can go to when threatened by big companies. And if the threat seems unreasonable, the institution would take up the fight on behalf of the small company or individual.


The article talks about processing the CSV file in memory, not about loading it.

What am I missing?


It does load it - mmap doesn't copy the file content into a buffer, it merely allows you to operate on a file as if it were in memory. Memory reads correspond to file read operations.


Sort of. mmap absolutely copies the file contents into the kernel file system cache, which is a buffer; it just lets you map that cache into your address space so you can see it. And memory reads don't translate to file reads unless the data isn't already in the cache.


> mmap absolutely copies the file contents into the kernel file system cache which is a buffer

Isn't this a bit misleading? mmaping a file doesn't cause the kernel to start loading the whole thing into RAM, it just sets things up for the kernel to later transparently load pages of it on demand, possibly with some prefetching.


The GP's comment on the other hand seemed to imply it was completely unbuffered.


"Completely unbuffered" is almost unattainable in practice, so I'm not sure that's a reasonable inference. About the best you can do in general is not do any buffering yourself, and usually explicitly bypass whatever buffering is going on at the next level of abstraction down. Ensuring you've cut out all buffering in the entire IO stack takes real effort.


> "Completely unbuffered" is almost unattainable in practice, so I'm not sure that's a reasonable inference.

While absolutely true, I've found that fact to be very surprising to a lot of engineers.


But I think the important part is that the file starts on disk and ends parsed. The rate of that was NVME limited (per article).


> The rate of that was NVME limited (per article).

The article shows that he's getting half the throughput of parsing a CSV that's already in RAM. But: he's using RAID0 of two SSDs and only getting a little more than half the throughput of one of those SSDs. As currently written, this program might not be giving the SSDs a high enough queue depth to hit their full read throughput. I'd like to see what throughput is like with an explicit attempt to prefetch data into RAM (either with a thread manually touching all the necessary pages, or maybe with a madvise call). That could drastically reduce the number of page faults and context switches affecting the OpenMP worker threads, and yield much better CPU utilization.


I thought queue depth related to supporting outstanding/pending reads. For serial access such as CSV parsing, what would you do, other than some kind of readahead (somehow; see my other question), which would presumably keep the queue depth at about 1?

Put another way: what would you do when reading the CSV serially to increase speed that would push the queue depth above 1?


For sequential accesses, it usually doesn't make a whole lot of difference whether the drive's queue is full of lots of medium-sized requests (e.g. 128 kB) or a few giant requests (multiple MB), so long as the queue always has outstanding requests for enough data to keep the drive(s) busy. Every operating system will have its own preferred IO sizes for prefetching, and if you're lucky you can also tune the size of the prefetch window (either in terms of bytes, or in terms of number of IOs). Different drives will also have different requirements here to achieve maximum throughput; an enterprise drive that stripes data across 16 channels will probably need a bigger/deeper queue than a consumer drive with just 4 channels, if the NAND page size is the same for both.

However, optimal utilization of the drive(s) will always require a queue depth of more than one request, because you don't want the drive to be idle after signalling completion of its only queued command and waiting for the CPU to produce a new read request. In a RAID0 setup like the author describes, you need to also ensure that you're generating enough IO to keep both drives busy, and the minimum prefetch window size that can accomplish this will usually be at least one full stripe.

As for how you accomplish the prefetching: the madvise system call sounds like a good choice, with the MADV_SEQUENTIAL or MADV_WILLNEED options. But how much prefetching that actually causes is up to the OS and the local system's settings. On my system, /sys/block/$DISK/queue/read_ahead_kb defaults to 128, which is definitely insufficient for at least some drives but might only apply to read-ahead triggered by the filesystem's heuristics rather than more explicitly requested by a madvise. So manually touching pages from a userspace thread is probably the safer way to guarantee the OS pages in data ahead of time—as long as it doesn't run so far ahead of the actual use of the data that it creates memory pressure that might get unused pages evicted.


WILLNEED really just reads the entire file asynchronously into the page cache, at least on Linux.


Is that still true if the size_t parameter to the madvise call is less than the entire file size? I would think that madvise hints could be issued at page granularity and not affect the entire mapping as originally allocated.


With no limit? What if the file is huge, will it evict other things in the cache?


Yes, probably. With hindsight, it was probably a mistake to use mmap. I can probably do better by just reading the file myself, since I have to make a mirror buffer later for some data manipulation anyway.


That makes sense, thanks!


Well, it copies into the kernel buffer as you access it as a sort of demand paging that isn’t actually all that bad depending on what you’re doing. It’s dramatically different from a typical “read everything into a buffer” that most programs do.


General question: if mmap pulls in data as you ask it and not before, you're going to have CPU waits on the disk, followed by processing on the CPU but no disk activity, alternating back and forth. I'd assume that to be optimal is to have them both working at once, so to have some kind of readahead request for the disk. How is this done, if at all?

Edit: just seen this which kind of touches on the same https://news.ycombinator.com/item?id=24737186


Generally the OS should see if you’re doing a long sequential access and prefetch this data before you access it.


Not sure if you know how mmap works, but regardless you can't say that memory reads correspond one-to-one to file reads.

Once a page is resident, there is literally no I/O being done on your data access paths. Synchronising dirty mapped pages back to the file contents happens in background write-back threads.


This probably adds a significant number of data bits for tracking people without cookies.

First of all, it tells the page whether you are running the dev version of Chrome, and which version.

Secondly, I would be surprised if memory behavior does not differ between certain setups.


Author here. Fingerprinting is a valid concern. The API explainer has a section about it: https://github.com/WICG/performance-measure-memory#fingerpri...

It is important to keep in mind that the API only accounts for the objects allocated by the web page itself and does not expose the total memory usage of the browser.

The only information that can be extracted using the API is the browser version (because an object representation may change between different versions) and the bitness of the browser (32-bit vs 64-bit). This information is already exposed by other existing APIs (e.g. navigator.userAgent, navigator.deviceMemory).

Thus the API does not add _new_ data bits for tracking. The final spec of the API may include additional protection against fingerprinting. For example, adding a small amount of Gaussian noise would make browser version inference much more difficult.


> Secondly, I would be surprised if memory behavior does not differ between certain setups.

In my experience of profiling various web apps, memory usage will differ between sessions of the same app on the same machine in a series of automated test runs with no changes to the code or setup. Web apps are terrible at managing memory and they leave all manner of things lying around that make it hard to get a deterministic number you could use for fingerprinting a user's browser.


Your first point surely applies to introducing any browser-specific feature -- and features are generally introduced to one or two or all browsers before becoming part of a standard.

"A website, with JS enabled, can tell what version of what browser you have" is a sailed ship, right?


I think UTF-8 was a mistake.

It is a pain in the ass to have a variable number of bytes per char.

In ASCII, you could easily know every character personally. No strange surprises.

Also no surprises while reading black on white text and suddenly being confronted with clors [1].

[1] Also no surprises when writing a comment on HN like this one and having some characters stripped. I put in a smiley as the first "o" in colors, but it was stripped out. Looks like the makers of HN don't like UTF-8 either.


You're conflating code points and some encoding; more importantly, you're conflating "array of encoded objects (bytes)" for "a string of text". They're not — and never have been — the same.


> It is a pain in the ass to have a variable number of bytes per char.

Maybe, but nobody can stomach the wasted space you get with UTF-32 in almost every situation. The encoding time tradeoff was considered less objectionable than making most of your text twice or four times larger.


And as the article points out, even then you might have more than one code point for a character.

> For example, the only way to represent the abstract character ю́ cyrillic small letter yu with acute is by the sequence U+044E cyrillic small letter yu followed by U+0301 combining acute accent.


You can't even write proper English in ASCII. ASCII is an absolute dead end. It's history.

Actually representing human language is HARD. It is also absolutely necessary. Whatever solution you choose is going to be complicated, because it is solving a very complicated problem.

Throwing your hands up and going "oh this is too hard, I don't like it" will get you nowhere.


You can't write proper snooty English in ASCII, with diaereses and whatnot.


1967 ASCII anticipated that, with dual-use character shapes so you could type o BS " → ö

But then people invented video terminals that didn't overstrike.


ASCII doesn't have all the punctuation regularly used in English.


ASCII doesn't have a direct representation of all the punctuation used in English print, like typographic "66/99" quotes and the different kinds of dashes (distinct from minus). For non-print use, it's entirely fine.

Typesetting should be handled by a markup language anyway. Adding a few characters to Notepad doesn't create a typesetting system. A typesetting system needs to be able to do kerning, ligatures, justification. Not to mention bold, italics, and different fonts.


Why would print be different here? A screen is as much "print" as a paper is these days.

Choosing correct punctuation is not typesetting, either.


> It is a pain in the ass to have a variable number of bytes per char.

This stems from API and language design mistakes more than from an issue with UTF-8 itself.

If you actually design your API & system around being UTF-8, like Rust did, then there's really no issue for the programmer. The API enforces the rules, and still gives you things like a simple character iterator (with characters being 32-bit, so that it actually fits: https://doc.rust-lang.org/std/char/index.html). The String class handles all the multi-byte stuff for you, you never "see" it: https://doc.rust-lang.org/std/string/struct.String.html

Retrofitting this into existing languages isn't going to be easy, but that's not an excuse to not do it at all, either.


Character (code point) iterators are useless.

For parsing text-based formats, UTF-8 has the nice property that the encoded byte sequence of a character is not a subsequence of the encoding of any other character or sequence of other characters. This means splitting on byte sequences of UTF-8 works just as well as splitting on code points.

And for text editing you need to deal with grapheme clusters anyway, which can be made up of a variable number of code points, so having these be made up of a variable number of bytes doesn't make anything worse.


> It is a pain in the ass to have a variable number of bytes per char.

In the same vein it's a pain in the ass to write everything in assembler. Which is why we don't do that, we use high-level languages instead.


Certain things such as DNS, email addresses and so on should be restricted to ASCII, it’s a security nightmare otherwise.


I assume you mean a limited subset of 7-bit ASCII? 33-126


    % host -t a $'\015'.
    1 \015:
    19 bytes, 1+0+0+0 records, response, authoritative, nxdomain
    query: 1 \015
    %
It's not as straightforward or sensible as you think. It's case insensitive; it's case preserving; and C0 control characters, SPC, and DEL are allowed. The case differentiating bits for letters are nowadays sometimes used in an attempt to foil attackers. If you want things to look back on and say "I think that X was a mistake." then forget UTF of any stripe. The DNS is full of them.


I thought DNS allowed any arbitrary byte sequence as a label (up to the max length limit).


Flagged because this is a sales page, not a Show HN. Please read the rules:

https://news.ycombinator.com/showhn.html

Regarding the sales page itself: If you had a list of podcasts created this way, I would definitely look at it. Because I am a big podcast fan.

If your product has not been used yet, I would start by contacting youtubers directly and work with them to build an initial portfolio of podcasts created with Podely.


What's wrong with it?

"Show HN is for something you've made that other people can play with. HN users can try it out, give you feedback, and ask questions in the thread."

Looks like you can try it out, so I'm not sure why you think it's not good for Show HN?


All you can do on that page is sign up. Sign-up pages are explicitly listed as a type of page that can't be a Show HN.

This sign-up page especially rubs me the wrong way as it uses a dark pattern: It grabs your email address first and then tells you that you cannot proceed without entering your credit card details.


I'll say that's more incompetence on my part than a "dark pattern". At this time you can only have 1 channel per account, and I need to check that the email is unique before sending the info to Stripe; presenting that form before payment is the way I came up with to solve that.

My product has not been used yet, just put it live yesterday. My idea here was just to get general feedback, like yours.


The title made it sound like the discovery engine is used in games. So I took a look, but it seems it's the music that is used in games.

If you are as music-crazy as me, try this discovery site:

http://www.gnoosic.com

It is my daily driver to discover new music.


Gave it a shot. Put in 3 musicians/groups I love. I got 11 suggestions; I knew 7 of them already. Of all the suggestions, only one had any kind of embedded media (YouTube) sample; no links or anything else for the other ones. I'll give it a shot with more mainstream bands and see if that changes; it could just be the genre/groups I chose.

Pretty neat and on point as far as the suggestions I did recognize, though. Wish there was at least a basic outgoing link for each suggestion.


In a few seconds you got 4 new music recommendations? I gotta try this.


It told me I'd like the music of Women: http://www.gnoosic.com/artist/women

I don't know what to do with this information


Look them up on Discogs.com for more info. I typically preview bands I find from my own sources using Youtube, soundcloud and bandcamp.

Also, Women formed another band called Viet Cong and then, after getting rolling, had some pushback on the name and changed it to Preoccupations. Great band; saw them live at Desert Daze last year.

Also one of the band members recently tweeted something about wanting to hear more Women so they might be reforming.


Jokes on you, Women is a really awesome band, greatly respected and missed by those in the know.


Wow, nice job. Women is one of my favorite bands.

I randomly get these weird urges to build playlists around certain bands or songs, so this sort of service helps a lot. The others I use, which take way more work, are last.fm, rate your music and Pandora.


This is especially vague combined with the Facebook ad: ‘Start reacting today. Connect to the world’.

Listening to the music of: women. Sounds like I’m starting at the very basics.


A link to wiki wouldn't hurt here


Would you say the 7 you already knew are similar to the 3 you entered?


I clicked on the link without much expectation; a broken HTTPS configuration is never a good sign, but I'm actually surprised by the results. It suggests the obvious similar artists, but also some very unknown artists with very few listeners on Spotify but similar music.

The other website https://www.music-map.com/ is also interesting. It appears that I'm not very original in my music tastes.


Seems like this is also a part of Gnod. Thanks for sharing

Does anyone have any insight into how this is built?


Couldn't find a lot of info, but the creator Marek Gibney was featured in an MIT Tech Review article.

“If 90 percent of the readers of Douglas Hofstadter also like [Stephen] Hawking, the distance between these two writers in the Hofstadter-Hawking dimension is 0.1,” Gibney says.

https://www.technologyreview.com/2004/03/19/40243/sketchy-in...


Thank you


It's worked well for me: [Bob Dylan, Yacht, Bjork] => Halfby. Recommendation makes sense, and triggers fond memories of feeding last.fm :) I like the agency in this site.


I didn't enter any lo-fi artists and ended up in a lo-fi rabbit hole. Not complaining though


Same here. I put in Carsick Cars, Spacemen 3 (Gnod didn't have The Men), and Sun City Girls. What I got was Moonbell, Water Fai, etc. Thinking Fellers Union Local 282 was close to what I was looking for, but only in the way a Hot Pocket is like a burrito.


Just tried that and it seems potentially useful (except that it recommended two of the three bands I'd stated I liked in the first place). Thanks for the recommendation.


I tried this and the very first recommendation it gave me is already amazing. Thanks! Gonna add this to my other music discovery tools.


Great. It found out I like retro German dance-pop: Klee


Is the only reason for the iframe so that it is possible to keep a state in the top frame while loading different pages?

Because otherwise - since you use the dev tools to inject the iframe - you don't really need the iframe. You can just run it as a "snippet" in Chromium or from the multi-line code editor in Firefox.

Both have the problem that it all has to be a single file. It would be much nicer if one could import modules.


I think having this on a production website means you can crawl the web using your user's CPUs and network connections, making it harder for people to stop you from harvesting their data.


>Both have the problem that it all has to be a single file. It would be much nicer if one could import modules.

Isn't this a solved problem in javascript land? Just use a compiler/minifier and your module oriented js code is in a single file as a build artifact.


Then every time you want to make a change to your code, you have to go to your original codebase, make the change, start the compiler, copy the output and paste it into your dev tools ...


> It would be much nicer if one could import modules.

Is there some reason ES modules wouldn't work here? Just a snippet that inserts a <script> tag with type=module.


Yes, the reason is that most sites these days serve a content security policy which only allows code from whitelisted origins.


No word on where the search results come from?

Can't find anything about it on the about page.

Doing a few test searches, the results seem similar to those of Bing.


It's usually Bing; AFAIK they are the only ones that offer a white-label/API solution.

There were a bunch of founders in the search category at Startup School 2018 and 2019, and those that I spoke to all used Bing, as that was their only viable option according to them.


For an interactive music map with bands, check out Music-Map:

https://www.music-map.com

You start at a band of your choice and can then travel through all the bands in the world.


Have you tried Gnoosic (http://www.gnoosic.com)? That is the one that works best for me.


I'm thinking of incorporating gnoosic (or something similar--haven't looked yet to see if they have an API) into Lagukan for recommending new artists. Mainly my work has focused on re-recommending songs you already know.


No, I didn't know about it, thank you!

