
> FerretDB is an open-source proxy, converting the MongoDB wire protocol queries to SQL - using PostgreSQL as a database engine.

Based on this description I'd agree that FerretDB isn't a database itself. However, the conversion from the MongoDB wire protocol to SQL queries could have bugs, data resiliency could be an issue if you need guaranteed writes, there are no guarantees of ongoing support, etc.

New DBs are always welcome, but using a brand new one in production would be very... bold.


Am I understanding this correctly that this exploit uses Edge OR simply having Teams installed (which is the default in Windows)?

Are there any community patches for this, since Microsoft has failed to patch what appears to be a zero-day (especially for Windows 10)?


You would need Teams installed AND an application that opens the malicious link. IE11 and Edge Legacy do that without prompting the user; other browsers display a confirmation dialog. There is a patch addressing the specific exploit path via MS Teams.

The underlying argument injection in LocalBridge.exe (which is the binary processing the JSON payload) is still present and can be exploited to open other Office apps with injected command line arguments. Someone might find another way to run arbitrary code using command line switches other than --gpu-launcher.


>having Teams installed (which is the default in Windows)

Teams is not the default in Windows (at least on my install) - I don't have it, and when I have to do meetings in Teams on my Windows machine I just open the meeting in Chrome.


Teams was installed without my approval on my private unmanaged laptop running Windows 10 Professional.

If you don't have Teams yet, you are either in another rollout wave, you have done something to prevent it, or your PC is managed by someone who has prevented it somehow. I think that covers all the cases.

As for why: I only use Windows now and then, and since I have a habit of supporting others, I keep my personal Windows PCs as plain as possible so I can see what others suffer (obviously I remove nagware like McAfee and make sure spyware like Chrome isn't set as the default browser, but I have gone as far as voluntarily running my PC in Norwegian).


I'm curious what effect running Windows in Norwegian has on security. Any chance at enlightenment?


Sorry for the misunderstanding I created. The link between those two is how far I have gone to be able to help end users.

It is a bit tongue-in-cheek (since I am Norwegian), but only a bit, since it is an extra hassle to mentally translate back to whatever English the translators read when they created the unsearchable phrases that show up in a localized Windows version.


It needs Edge or IE11 and Teams. It doesn't appear to be zero-click without Edge or IE11, so just avoid both of those and you should be okay.


Easier said than done in many corporate environments unfortunately.


Can you elaborate on what you mean by not interrupting the scrape and instead flagging those pages?

Let's say you're scraping product info from a large list of products. I'm assuming you mean handling strange one-off errors that way, and that you'd stop altogether if too many fail? Otherwise you'd just be DoS'ing the site.


> Can you elaborate on what you mean by not interrupting the scrape and instead flagging those pages?

Sure! I can be more concrete about this project than about your hypothetical large list of products, though, so forgive me in advance for pivoting on the scenario here.

I'm scraping job pages. Typically, one job posting == one link. I can go through that link for the job posting and extract data from given HTML elements using CSS selectors or XPath expressions. However, sometimes the data I'm looking for isn't structured in the way I expect. The major area where I notice variations in job ad data is location data. There are a zillion little variations in how you can structure the location of a job ad: city+country, city+state+country, comma separated, space separated, localized states, no states or provinces, and all the permutations thereof.

I've written the extractor to expect a certain format of location data for a given job site - let's say "<city>, <country>", for example. If the scraper comes across an entry that happens to be "<city>, <state>, <country>", it's generally not smart enough to generalize its transform logic to deal with that. So, to handle it, I mark that particular job page link as needing human review, so it pops up as an ERROR in my logs and as an entry in the database with post_status == 5. After that, it gets inserted into the database, but not posted live onto the site.

That way, I can go in and manually fix the posting, approve it to go on the site (if it's relevant), and, ideally, tweak the scraper logic so that it handles transforms of that style of data formatting as well as the "<city>, <country>" format I originally expected.
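
Roughly, that fallback could look something like the sketch below (illustrative Python, with made-up names like NEEDS_REVIEW and extract_location - a simplification, not my actual code):

    import re

    # Status codes: 5 == needs human review (from the database field above).
    NEEDS_REVIEW = 5
    ACTIVE = 1  # hypothetical value for posts that can go live

    # Expected "<city>, <country>" shape for this particular job site.
    CITY_COUNTRY = re.compile(r"^(?P<city>[^,]+),\s*(?P<country>[^,]+)$")

    def extract_location(raw: str) -> dict:
        """Parse the expected format, or flag the record for manual review."""
        match = CITY_COUNTRY.match(raw.strip())
        if match:
            return {
                "city": match.group("city").strip(),
                "country": match.group("country").strip(),
                "post_status": ACTIVE,
            }
        # "<city>, <state>, <country>", localized states, etc. end up here:
        # stored in the database, but held back from the live site.
        return {"raw_location": raw, "post_status": NEEDS_REVIEW}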

Does that make sense?

I suspect I'm just writing logic to deal with the malformed/irregular entries that humans put into job sites XD


I've had a lot of success just saving the data into gzipped tarballs, like a few thousand documents per tarball. That way I can replay the data and tweak the algorithms without causing traffic.
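
A minimal sketch of that pattern, assuming Python and made-up helper names (archive_batch / replay_batch - not what my crawler actually uses):

    import io
    import tarfile
    import time

    def archive_batch(docs: dict[str, bytes], path: str) -> None:
        """Write one batch of fetched documents into a gzipped tarball."""
        with tarfile.open(path, "w:gz") as tar:
            for name, body in docs.items():
                info = tarfile.TarInfo(name=f"{name}.html")
                info.size = len(body)
                info.mtime = int(time.time())
                tar.addfile(info, io.BytesIO(body))

    def replay_batch(path: str):
        """Yield (name, bytes) pairs so extraction logic can be re-run offline."""
        with tarfile.open(path, "r:gz") as tar:
            for member in tar:
                if member.isfile():
                    yield member.name, tar.extractfile(member).read()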


Is that still practical even if you're storing the page text?

The reason I don't do that is that I have a few functions that analyze the job descriptions for relevance but don't store the post text. I mostly did that to save space - I'm just aggregating links to relevant roles, not hosting job posts.

I figured saving ~1000 job descriptions would take up a needlessly large chunk of space, but truth be told I never did the math to check.
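
Rough back-of-envelope, assuming ~300 KB of HTML per page and about 10:1 gzip compression (both numbers are guesses, not measurements):

    # Hypothetical numbers: ~1000 job pages, ~300 KB of HTML each, ~10:1 gzip ratio.
    pages = 1000
    raw_kb_per_page = 300
    compression_ratio = 10

    raw_mb = pages * raw_kb_per_page / 1024        # roughly 290 MB uncompressed
    compressed_mb = raw_mb / compression_ratio     # roughly 29 MB gzipped
    print(f"raw: {raw_mb:.0f} MB, compressed: {compressed_mb:.0f} MB")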

Edit: I understand Scrapy does something similar to what you're describing; I've considered using that as my scraper frontend but haven't gotten around to doing the work for it yet.


Yeah, sure. The text itself is usually at most a few hundred KB, and HTML compresses extremely well. It's pretty slow to unpack and replay the documents, but still a lot faster than downloading them again.


And it's friendlier to the server you're getting the data from.

As a journalist, I have to scrape government sites now and then for datasets they won't hand over via FOIA requests ("It's on our site, that's the bare minimum to comply with the law so we're not going to give you the actual database we store this information in.") They're notoriously slow and often will block any type of systematic scraping. Better to get whatever you can and save it, then run your parsing and analysis on that instead of hoping you can get it from the website again.


First of all, thanks for marginalia.nu.

Have you considered storing compressed blobs in an SQLite file? Works fine for me; you can do indexed searches on your "stored" data and extract single pages if you want.
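
Something like this minimal sketch, using Python's stdlib sqlite3 and zlib (the schema and names are just for illustration):

    import sqlite3
    import zlib

    # Illustrative schema: one row per fetched page, body stored as a compressed blob.
    conn = sqlite3.connect("pages.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " fetched_at INTEGER,"
        " body BLOB)"
    )

    def store_page(url: str, html: str, fetched_at: int) -> None:
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, fetched_at, body) VALUES (?, ?, ?)",
            (url, fetched_at, zlib.compress(html.encode("utf-8"))),
        )
        conn.commit()

    def load_page(url: str):
        row = conn.execute("SELECT body FROM pages WHERE url = ?", (url,)).fetchone()
        return zlib.decompress(row[0]).decode("utf-8") if row else None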


The main reason I'm doing it this way is that I'm saving this stuff to a mechanical drive, and I want consistent write performance and low memory overhead. Since it's essentially just an archive copy, I don't mind if it takes half an hour to chew through looking for some particular set of files. Since this is a format designed for tape drives, it causes very little random access. It's important that it's relatively consistent to write, since my crawler does this while it's crawling, and it can reach speeds of 50-100 documents per second, which would be extremely rough on any sort of database backed by a single mechanical hard drive.

These archives are just an intermediate stage that's used if I need to reconstruct the index to tweak say keyword extraction or something, so random access performance isn't something that is particularly useful.


Have you thought about pushing the links onto a queue and running multiple scrapers off that queue? You'd need to build in some politeness mechanism to make sure you're not hitting the same domain/IP address too often, but it seems like a better option than a serial process.
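
A rough sketch of the idea in Python, with a naive per-domain delay (MIN_DELAY and the worker count are arbitrary, and fetch() is just a stand-in):

    import queue
    import threading
    import time
    from urllib.parse import urlparse

    MIN_DELAY = 5.0  # seconds between hits to the same domain (arbitrary)
    url_queue: "queue.Queue[str]" = queue.Queue()
    last_hit: dict[str, float] = {}
    lock = threading.Lock()

    def fetch(url: str) -> None:
        print("would fetch", url)  # stand-in for the real HTTP request + parsing

    def worker() -> None:
        while True:
            url = url_queue.get()
            domain = urlparse(url).netloc
            with lock:
                ready = time.monotonic() - last_hit.get(domain, float("-inf")) >= MIN_DELAY
                if ready:
                    last_hit[domain] = time.monotonic()
            if ready:
                fetch(url)
            else:
                time.sleep(0.1)      # too soon for this domain: back of the line
                url_queue.put(url)
            url_queue.task_done()

    # start a few workers, enqueue links with url_queue.put(...), then url_queue.join()
    for _ in range(4):
        threading.Thread(target=worker, daemon=True).start()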


Why 5, exactly? This struck me as odd in the article. Perhaps I missed something. Are there other statuses? Why are statuses numeric?


It's arbitrary.

I have a field, post_status, in my backend database that I use to categorize posts. Each category is a numeric code so SQL can filter it relatively quickly. I have statuses for active posts, dead posts, ignored links, links needing review, etc.

It's a way for me to sort through my scraper's results quickly.


I think you have a case of premature optimisation there, as I wrote in a recent comment[0].

[0]: https://news.ycombinator.com/item?id=29430281


Not sure what's premature here. The optimization is to allow me, a human, to find a certain class of database records quickly. I also chose a method that I understand to be snappy on the SQL side.

What would you suggest as a non-optimized alternative? That might make your point about premature optimization clearer.


There is indeed a trade-off, and the direction I would have chosen is meaningful status names as opposed to magic numbers. My reasoning is that a self-explanatory system is cheaper to maintain, which makes more sense to me economically than obscuring the meaning behind some of the code/data for a practically non-existent performance benefit.

After all, hardware is cheap, but developer time isn't.

For a more concrete example, I might have chosen the value `'pending'` (or similar) instead of `5`. Active listings might have status `'active'`. Expired ones might have status `'expired'`, etc.


Integer columns are significantly faster and smaller than strings in a SQL database. It adds up quickly if you have a sufficiently large database.

I use the following scheme:

   1 - exhausted
   0 - alive
  -1 - blocked (by my rules)
  -2 - redirected
  -3 - error


The author is scraping fewer than 1,000 records per day, or roughly 365,000 records per year.

On my own little SaaS project, the difference between querying an integer and a varchar like “active” is imperceptible, and that’s in a table with 7,000,000 rows.

It would take the author 19 years to run into the scale that I’m running at, where this optimisation is meaningless. And that’s assuming they don’t periodically clean their database of stale data, which they should.

So this looks like a premature optimisation to me, which is why it stood out as odd to me in the article.


I'd put it closer to the category of best practices than premature optimization. It's pretty much always a good idea. It's not that skipping it will break things, but the alternative is slower and uses more resources in a way that affects all queries, since larger datatypes elongate the records, and locality is tremendously important in all aspects of software performance.


I disagree. I think a better "best practice" is to make the meaning behind the code as clear as possible. In this case, the code/data is less clear, and there is zero performance benefit.


There is absolutely a performance benefit to reducing your row sizes. It reduces both the amount of disk I/O and the number of CPU cache misses, and in many cases it also increases the amount of data that can be kept in RAM.

You can map meaning onto the column in your code, as most languages have enums that are free in terms of performance. It does not make sense to burden the storage layer with this, as it lacks this feature.
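
For example, in Python an IntEnum keeps the names in code while the column stays a plain integer (the values below are made up, apart from 5 == needs review from the article):

    from enum import IntEnum

    class PostStatus(IntEnum):
        ACTIVE = 1
        DEAD = 2
        IGNORED = 3
        NEEDS_REVIEW = 5

    # The column stores a plain integer; the code gets a self-explanatory name.
    status = PostStatus(5)            # value read back from the database
    assert status is PostStatus.NEEDS_REVIEW
    print(status.name, int(status))   # NEEDS_REVIEW 5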


> You can map meaning onto the column in your code, as most languages have enums that are free in terms of performance. It does not make sense to burden the storage layer with this, as it lacks this feature.

I was just looking at how to do this with an enum today! Read my mind. :)


The performance benefit is negligible at the scale the author of the article is operating at. You already alluded to this point being context-dependent earlier when you said:

> if you have a sufficiently large database

Roughly 360,000 rows per year is not sufficiently large. It's tiny.




It's in a container with Xorg running. It doesn't need kernel-level access, but it does need userland access, which it has inside the container.


> UserLAnd

Hmm, that's cool. I was not aware of this thing; not surprising, but still cool to find out now.


"Userland" is a UNIX term that roughly means "apps", but in an older sense. UserLAnd is just an Android app using that word, stylized, as its name.


Oh okay, so they're not the same. Gotcha.


This is absolutely not the case.

You can find some of the patches here, but there are more in the parent dir - https://github.com/Eloston/ungoogled-chromium/tree/master/pa...

There's a lot of telemetry, and a few other services such as time checking.

The features you're mentioning include the user syncing, translation, and DRM platforms (though you can add Widevine to Chromium, you can't add the others). Those are not the only things that call home to Google.


I doubt it.

Keep in mind this will only work for non-court-gag-ordered instances. If the US subpoenas Apple about an individual, they won't be allowed to notify them.

I have no idea how this applies to other countries.

I think this is more like: "We noticed unusual API usage and we don't have a gag order so whatever it is, it's not likely to be good"


The methods of detecting such attacks are not at all similar to a government data request that contains a non-disclosure clause.

Apple doesn’t need to know the source of the attack to issue the warning, and if the attacker is competent Apple likely wouldn’t know the source, such that a gag would not apply.


To be fair, a subpoena isn't a cyberattack. But yes, this will mostly be of value to people being targeted by governments that are not the USA or best buddies with the USA.


tl;dr: Apple will notify us as long as the attacking state isn't the US - which it very often is.


Disclaimer: I think everyone in this thread agrees we should try to improve the lives of all animals.

> why not just stick with feeding the locals? At least those are free to decide to visit you or not.

This is almost never a good idea (for any animal). Even assuming you're feeding them the right type of food (bread is terrible for ducks, for example), you're training the animals to rely on humans for food. That makes it more difficult for them to survive on their own should you stop feeding them, and some of the potentially aggressive ones (like geese) will become comfortable approaching humans, even the humans who do not want to be near them.

I've read your stance on animal rescue places and I generally agree - most aren't good, or are glorified zoos. But I do think there are genuine ones that are helpful, and I think those, along with rescues where you can give the animal direct attention, are the best ways to humanely assist animals who would otherwise die in the wild.

My problem is with the traders and terrible owners who only use them as a showpiece. A good owner should be providing ample enrichment and attention.


outline.com is good too

https://outline.com/U6wfgV

Protip: They also bypass certain paywalls


It's heavily domain-dependent IMO.

Leetcode only gets you past the first round of interviews, and not every company does them. I consider these useless textbook problems, but you'll need them to get the job. They won't help you much on the job.

More importantly (IMO) he needs domain-specific knowledge. For example, if it's going to be web development, then he needs to start a web project with the backend in Python. If it's stats/analysis, he needs to start on analysis projects. Etc.

My advice is to make small projects that cover topics he'll be working on in the jobs he applies for, to build a small portfolio. And brush up on the Leetcode a month or so before.


If you have control of it you _might_ actually be the one that nabbed it, and somebody else complained they couldn't manage it.

If this is the case, under no circumstances tell your registrar you're okay with losing it, and dig your heels in HARD.

Either way you should get a cert for it with the longest expiration you can ASAP while you control it ;)


I understand your sentiment here, but should the registry decide the domain is not theirs, it will simply not be manageable by the registrar/buyer. All they do is update the management ID to the new registrar and the registrant contact of the domain; no intervention is possible.

I can tell you I've seen legal orders transferring a domain from one registrant to another. If the order names the registry to take action, the domain kind of just goes poof; sure, it's in your system, but you can't do anything about it.


> I can tell you I've seen legal orders transferring a domain from one registrant to another. If the order names the registry to take action, the domain kind of just goes poof; sure, it's in your system, but you can't do anything about it.

Then maybe the original poster should demand such a legal order? I doubt one exists yet, given it's only been a couple of days.


In this context it was court orders, usually over an ownership dispute between corporations. The legal fees etc. make it unlikely to be worth the hassle.


Yeah, I can totally see that. In this case the OP could try to stand strong and refuse to go along without some sort of court order, but even then it's not like the OP can stop the registry from going through with the change. So if the registry decides to do it anyway, the OP could try to pursue legal options. Maybe the domain would be worth it.

