Is there a way to keep up with this story? Looks like Orpheusdroid is not on Twitter.
I am not sure who is right and who is wrong here. But I have the feeling that it is too easy for big companies to threaten small companies.
I wonder if there should be some institution where small companies and individuals can go to when threatened by big companies. And if the threat seems unreasonable, the institution would take up the fight on behalf of the small company or individual.
It does load it - mmap doesn't copy the file content into a buffer, it merely allows you to operate on a file as if it were in memory. Memory reads correspond to file read operations.
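A quick Python illustration of the mechanics (throwaway temp file, just to show that indexing the mapping stands in for an explicit read):

```python
import mmap
import os
import tempfile

# Write a tiny file, then map it: indexing the mapping triggers page
# faults that the kernel services from the page cache (reading from
# disk only if needed). There is no explicit read() into a user buffer.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello, mmap")
    path = f.name

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    first = m[0:5]  # this "memory read" is backed by the file
os.unlink(path)
```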
Sort of. mmap absolutely copies the file contents into the kernel file system cache, which is a buffer; it just lets you map the filesystem cache into your address space so you can see it. And memory reads don't translate to file reads unless the data isn't already in the cache.
> mmap absolutely copies the file contents into the kernel file system cache which is a buffer
Isn't this a bit misleading? mmaping a file doesn't cause the kernel to start loading the whole thing into RAM, it just sets things up for the kernel to later transparently load pages of it on demand, possibly with some prefetching.
"Completely unbuffered" is almost unattainable in practice, so I'm not sure that's a reasonable inference. About the best you can do in general is not do any buffering yourself, and usually explicitly bypass whatever buffering is going on at the next level of abstraction down. Ensuring you've cut out all buffering in the entire IO stack takes real effort.
> The rate of that was NVME limited (per article).
The article shows that he's getting half the throughput of parsing a CSV that's already in RAM. But: he's using RAID0 of two SSDs and only getting a little more than half the throughput of one of those SSDs. As currently written, this program might not be giving the SSDs a high enough queue depth to hit their full read throughput. I'd like to see what throughput is like with an explicit attempt to prefetch data into RAM (either with a thread manually touching all the necessary pages, or maybe with a madvise call). That could drastically reduce the number of page faults and context switches affecting the OpenMP worker threads, and yield much better CPU utilization.
I thought queue depth related to supporting outstanding/pending reads. For serial access such as CSV parsing, what would you do other than have some kind of readahead (somehow - see my other question), which would presumably keep the queue depth at about 1?
Put another way, what would you do to read in the CSV serially to increase speed that would push the queue depth above 1?
For sequential accesses, it usually doesn't make a whole lot of difference whether the drive's queue is full of lots of medium-sized requests (e.g. 128 kB) or a few giant requests (multiple MB), so long as the queue always has outstanding requests for enough data to keep the drive(s) busy. Every operating system will have its own preferred IO sizes for prefetching, and if you're lucky you can also tune the size of the prefetch window (either in terms of bytes, or in terms of number of IOs). Different drives will also have different requirements here to achieve maximum throughput; an enterprise drive that stripes data across 16 channels will probably need a bigger/deeper queue than a consumer drive with just 4 channels, if the NAND page size is the same for both.
However, optimal utilization of the drive(s) will always require a queue depth of more than one request, because you don't want the drive to be idle after signalling completion of its only queued command and waiting for the CPU to produce a new read request. In a RAID0 setup like the author describes, you need to also ensure that you're generating enough IO to keep both drives busy, and the minimum prefetch window size that can accomplish this will usually be at least one full stripe.
As for how you accomplish the prefetching: the madvise system call sounds like a good choice, with the MADV_SEQUENTIAL or MADV_WILLNEED options. But how much prefetching that actually causes is up to the OS and the local system's settings. On my system, /sys/block/$DISK/queue/read_ahead_kb defaults to 128, which is definitely insufficient for at least some drives but might only apply to read-ahead triggered by the filesystem's heuristics rather than more explicitly requested by a madvise. So manually touching pages from a userspace thread is probably the safer way to guarantee the OS pages in data ahead of time—as long as it doesn't run so far ahead of the actual use of the data that it creates memory pressure that might get unused pages evicted.
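For what it's worth, here's a rough Python sketch of both approaches - the madvise hints and a userspace "toucher" thread. The madvise constants are Linux-specific (Python 3.8+), and the file and sizes are made up for illustration:

```python
import mmap
import os
import tempfile
import threading

# Make a 1 MiB stand-in for a large CSV.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (1 << 20))
    path = f.name

results = []

def touch_pages(mm, start, end, page=4096):
    # Reading one byte per page forces the kernel to fault each page in,
    # keeping the drive busy ahead of the actual parser.
    results.append(sum(mm[off] for off in range(start, end, page)))

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    # Kernel-side hints (skipped where the platform doesn't define them):
    if hasattr(mmap, "MADV_SEQUENTIAL"):
        m.madvise(mmap.MADV_SEQUENTIAL)  # expect strictly sequential access
    if hasattr(mmap, "MADV_WILLNEED"):
        m.madvise(mmap.MADV_WILLNEED)    # start reading ahead now

    # Userspace prefetch thread, running ahead of the (omitted) parser.
    prefetcher = threading.Thread(target=touch_pages, args=(m, 0, len(m)))
    prefetcher.start()
    # ... parse m here, trailing the prefetcher ...
    prefetcher.join()

os.unlink(path)
```

In a real parser you'd keep the toucher only a bounded distance ahead of the consumer instead of joining it immediately, for exactly the memory-pressure reason above.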
Is that still true if the size_t parameter to the madvise call is less than the entire file size? I would think that madvise hints could be issued at page granularity and not affect the entire mapping as originally allocated.
Yes, probably. With hindsight, it was probably a mistake to use mmap. I could probably do better just reading the file myself, since I have to make a mirror buffer later for some data manipulation anyway.
Well, it copies into the kernel buffer as you access it as a sort of demand paging that isn’t actually all that bad depending on what you’re doing. It’s dramatically different from a typical “read everything into a buffer” that most programs do.
General question: if mmap pulls in data as you ask it and not before, you're going to have CPU waits on the disk, followed by processing on the CPU but no disk activity, alternating back and forth. I'd assume that to be optimal is to have them both working at once, so to have some kind of readahead request for the disk. How is this done, if at all?
It is important to keep in mind that the API only accounts for the objects allocated by the web page itself and does not expose the total memory usage of the browser.
The only information that can be extracted using the API is the browser version (because an object representation may change between versions) and the bitness of the browser (32-bit vs 64-bit). This information is already exposed by other existing APIs (e.g. navigator.userAgent, navigator.deviceMemory).
Thus the API does not add _new_ data bits for tracking. The final spec of the API may include additional protection against fingerprinting. For example, adding a small amount of Gaussian noise would make browser version inference much more difficult.
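To illustrate, a hypothetical sketch of that mitigation in Python - the function name and the 2% noise level are my own assumptions, not anything from the spec:

```python
import random

# Hypothetical: report memory with a small amount of multiplicative
# Gaussian noise, so a single reading no longer pins down the exact
# per-version object sizes an attacker would need for fingerprinting.
def noisy_memory_estimate(true_bytes, sigma_fraction=0.02):
    noise = random.gauss(0.0, sigma_fraction)
    return max(0, int(true_bytes * (1.0 + noise)))
```

Repeated sampling could still average the noise away, so a final spec would presumably pair this with rate limiting or per-session noise.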
Secondly, I would be surprised if memory behavior does not differ between certain setups.
In my experience of profiling various web apps, memory usage will differ between sessions of the same app on the same machine in a series of automated test runs with no changes to the code or setup. Web apps are terrible at managing memory and they leave all manner of things lying around that make it hard to get a deterministic number you could use for fingerprinting a user's browser.
Your first point surely applies to introducing any browser-specific feature -- and features are generally introduced to one or two or all browsers before becoming part of a standard.
"A website, with JS enabled, can tell what version of what browser you have" is a sailed ship, right?
It is a pain in the ass to have a variable number of bytes per char.
In Ascii, you could easily know every character personally. No strange surprises.
Also no surprises while reading black-on-white text and suddenly being confronted with colors [1].
[1] Also no surprises when writing a comment on HN like this one and having some characters stripped. I put in a smiley as the first "o" in "colors", but it was stripped out. Looks like the makers of HN don't like UTF-8 either.
You're conflating code points with a particular encoding; more importantly, you're mistaking "an array of encoded objects (bytes)" for "a string of text". They're not, and never have been, the same.
> It is a pain in the ass to have a variable number of bytes per char.
Maybe, but nobody can stomach the wasted space you get with UTF-32 in almost every situation. The encoding time tradeoff was considered less objectionable than making most of your text twice or four times larger.
And as the article points out, even then you might have more than one code point for a character.
> For example, the only way to represent the abstract character ю́ cyrillic small letter yu with acute is by the sequence U+044E cyrillic small letter yu followed by U+0301 combining acute accent.
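Both claims are easy to check in a couple of lines of Python (utf-32-le is used to avoid counting the 4-byte BOM):

```python
import unicodedata

# UTF-32 spends four bytes on every code point; UTF-8 spends one
# byte per character on plain ASCII.
assert len("hello".encode("utf-8")) == 5
assert len("hello".encode("utf-32-le")) == 20

# The yu-with-acute from the article really is two code points, and
# there is no precomposed form for normalization to collapse it into:
yu_acute = "\u044e\u0301"  # ю followed by combining acute accent
assert len(yu_acute) == 2
assert unicodedata.normalize("NFC", yu_acute) == yu_acute
```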
You can't even write proper English in ASCII. ASCII is an absolute dead end. It's history.
Actually representing human language is HARD. It is also absolutely necessary. Whatever solution you choose is going to be complicated, because it is solving a very complicated problem.
Throwing your hands up and going "oh this is too hard, I don't like it" will get you nowhere.
ASCII doesn't have a direct representation of all the punctuation used in English print, like 66 99 quotes, and different kinds of dashes (distinct from minus). For non-print, it's entirely fine.
Typesetting should be handled by a markup language anyway. Adding a few characters to Notepad doesn't create a typesetting system. A typesetting system needs to be able to do kerning, ligatures, justification. Not to mention bold, italics, and different fonts.
> It is a pain in the ass to have a variable number of bytes per char.
This stems from API and language design mistakes more than from any issue with UTF-8 itself.
If you actually design your API & system around being UTF-8, like Rust did, then there's really no issue for the programmer. The API enforces the rules, and still gives you things like a simple character iterator (with characters being 32-bit, so that it actually fits: https://doc.rust-lang.org/std/char/index.html). The String class handles all the multi-byte stuff for you, you never "see" it: https://doc.rust-lang.org/std/string/struct.String.html
Retrofitting this into existing languages isn't going to be easy, but that's not an excuse to not do it at all, either.
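Python 3's str makes a similar choice, for comparison: the program only ever sees code points, and the multi-byte encoding shows up solely at explicit encode()/decode() boundaries.

```python
s = "héllo"
assert len(s) == 5                  # five code points, not six bytes
assert len(s.encode("utf-8")) == 6  # 'é' encodes to two bytes
assert next(iter(s)) == "h"         # iteration yields code points
```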
For parsing text-based formats, UTF-8 has the nice property that the encoded byte sequence of a character is never a subsequence of the encoding of any other character or sequence of characters. This means splitting on byte sequences of UTF-8 works just as well as splitting on code points.
And for text editing you need to deal with grapheme clusters anyway, which can be made up of a variable number of code points - so having those be made up of a variable number of bytes doesn't make anything worse.
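Both properties are easy to demonstrate in Python:

```python
# Self-synchronization: 0x2C (',') can never occur inside a multi-byte
# UTF-8 sequence (continuation bytes are all >= 0x80), so splitting the
# raw bytes gives exactly the same fields as splitting the decoded text.
text = "naïve,café,日本"
raw = text.encode("utf-8")
assert [p.decode("utf-8") for p in raw.split(b",")] == text.split(",")

# And a grapheme cluster spanning several code points: a regional-
# indicator flag is one user-perceived character but two code points.
flag = "\U0001F1FA\U0001F1F8"  # 🇺🇸
assert len(flag) == 2
```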
It's not as straightforward or sensible as you think. It's case insensitive; it's case preserving; and C0 control characters, SPC, and DEL are allowed. The case differentiating bits for letters are nowadays sometimes used in an attempt to foil attackers. If you want things to look back on and say "I think that X was a mistake." then forget UTF of any stripe. The DNS is full of them.
Regarding the sales page itself: If you had a list of podcasts created this way, I would definitely look at it. Because I am a big podcast fan.
If your product has not been used yet, I would start by contacting YouTubers directly and working with them to build an initial portfolio of podcasts created with Podely.
All you can do on that page is sign up. Sign-up pages are explicitly listed as a type of page that can't be a Show HN.
This sign-up page especially rubs me the wrong way as it uses a dark pattern: It grabs your email address first and then tells you that you cannot proceed without entering your credit card details.
I'll say that's more incompetence on my part than a "dark pattern".
At this time you can only have one channel per account, and I need to check that the email is unique before sending the info to Stripe. Presenting that form before payment was the way I came up with to solve that.
My product has not been used yet, just put it live yesterday.
My idea here was just to get general feedback, like yours.
Gave it a shot. Put in 3 musicians/groups I love. I got 11 suggestions, and I knew 7 of them already. Out of all the suggestions, only one had any kind of embedded media (YouTube) sample; that's it, no links or anything else for the others. I'll give it a shot with more mainstream bands and see if that changes; it could just be the genre/groups I chose.
Pretty neat and on point as far as the suggestions I did recognize, though. Wish there was at least a basic outgoing link for each suggestion.
Look them up on Discogs.com for more info. I typically preview bands I find from my own sources using YouTube, SoundCloud, and Bandcamp.
Also, Women formed another band called Viet Cong, and then, after getting rolling, had some pushback on the name and changed it to Preoccupations. Great band; saw them live at Desert Daze last year.
Also one of the band members recently tweeted something about wanting to hear more Women so they might be reforming.
I randomly get these weird urges to build playlists around certain bands or songs, so this sort of service helps a lot. The others I use, which take way more work, are last.fm, rate your music and Pandora.
I clicked on the link without high expectations (a broken HTTPS configuration is never a good sign), but I'm actually surprised by the results. It suggests the obvious similar artists, but also some very obscure artists with very few listeners on Spotify but similar music.
The other website https://www.music-map.com/ is also interesting. It appears that I'm not very original in my music tastes.
Couldn't find a lot of info, but the creator Marek Gibney was featured in an MIT Tech Review article.
“If 90 percent of the readers of Douglas Hofstadter also like [Stephen] Hawking, the distance between these two writers in the Hofstadter-Hawking dimension is 0.1,” Gibney says.
It's worked well for me: [Bob Dylan, Yacht, Bjork] => Halfby. Recommendation makes sense, and triggers fond memories of feeding last.fm :) I like the agency in this site.
Same here. I put in Carsick Cars, Spacemen 3 (Gnod didn't have The Men), and Sun City Girls. What I got was Moonbell, Water Fai, etc. Thinking Fellers Union Local 282 was close to what I was looking for, but only in the way a Hot Pocket is like a burrito.
Just tried that and it seems potentially useful (except that it recommended two of the three bands I'd stated I liked in the first place). Thanks for the recommendation.
Is the only reason for the iframe so that it is possible to keep a state in the top frame while loading different pages?
Because otherwise - since you use the dev tools to inject the iframe - you don't really need the iframe. You can just run it as a "snippet" in Chromium or from the multi-line-code-editor in Firefox.
Both have the problem that it all has to be a single file. It would be much nicer if one could import modules.
I think having this on a production website means you can crawl the web using your user's CPUs and network connections, making it harder for people to stop you from harvesting their data.
Then every time you want to make a change to your code, you have to go to your original codebase, make the change, start the compiler, copy the output and paste it into your dev tools ...
It's usually Bing, AFAIK they are the only one that offer a whitelabel/API solution.
There were a bunch of founders in the search category at startupschool 2018 and 2019, those that I spoke to all used Bing as that was their only viable option according to them.
I'm thinking of incorporating gnoosic (or something similar--haven't looked yet to see if they have an API) into Lagukan for recommending new artists. Mainly my work has focused on re-recommending songs you already know.