Yup, it's not the greatest solution for teams as a central issue tracker.
The main reason I mentioned small teams is that you can opt out of committing ./ghist, keep it on your machine only, and use it to track your own work across multiple agent sessions. It then becomes more of a disposable tool for getting big chunks of work done with persistent task memory.
Going with sqlite might not have been the best decision either, as it's ultimately a binary file that can't be diffed. JSON might have been a better choice for this.
> Going with sqlite might not have been the best decision
Many have tried out this general idea. I myself evaluated git-bug for a few days in 2018 when it was a novel idea, but I ran into issues I tried to raise in my previous comment.
The data format you chose is not even the main issue here.
Binary data that keeps changing is almost always unfit for source control.
In your use case, you can solve that by committing sql dumps of the database in a text format.
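Python's built-in sqlite3 module can produce such a text dump directly via its iterdump() iterator (the filenames here are illustrative, not from the project):

```python
import sqlite3

# Dump the (hypothetical) ghist.db database to a plain-text SQL file
# that diffs cleanly and can be committed to source control.
con = sqlite3.connect("ghist.db")
with open("ghist.sql", "w") as f:
    for line in con.iterdump():
        f.write(line + "\n")
con.close()
```

Restoring is the reverse: feed ghist.sql to a fresh database with executescript(). The sqlite3 command-line shell's `.dump` command does the same thing outside Python.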
But how do they bypass the paywall? They can't just pretend to be Google by changing the user-agent, this wouldn't work all the time, as some websites also check IPs, and others don't even show the full content to Google.
They also cannot hijack data with a residential botnet or buy subscriptions themselves. Otherwise, the saved page would contain information about the logged-in user. It would be hard to remove this information, as the code changes all the time, and it would be easy for the website owner to add an invisible element that identifies the user. I suppose they could have different subscriptions and remove everything that isn't identical between the two, but that wouldn't be foolproof.
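The two-subscription idea above can be sketched as a line-level diff: keep only what is identical across two snapshots taken under different accounts, and treat everything that differs as user-specific. This is a toy version; real HTML would need DOM-aware comparison, and as noted it still wouldn't be foolproof against deliberate per-user markers.

```python
def strip_user_specific(snap_a: str, snap_b: str) -> str:
    # Keep only lines that appear verbatim in both snapshots;
    # lines that differ between the two accounts are assumed
    # to carry user-identifying data and are dropped.
    common = set(snap_b.splitlines())
    return "\n".join(line for line in snap_a.splitlines() if line in common)
```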
On the network layer, I don't know. But on the WWW layer, archive.today operates accounts that are used to log into websites when they are snapshotted. IIRC, archive.today manipulates the snapshots to hide the fact that someone is logged in, but sometimes fails miserably:
This particular addon is blocked on most western git servers, but can still be installed from Russian git servers. It includes custom paywall-bypassing code for pretty much every news website you could reasonably imagine, or at least those sites that use conditional paywalls (paywalls for humans, no paywalls for big search engines). It won't work on sites like Substack that serve properly authenticated content pages, but those pages don't get picked up by archive.today either.
My guess would be that archive.today loads such an addon with its headless browser and thus bypasses paywalls that way. Even if publishers find a way to detect headless browsers, crawlers can also be written to operate with traditional web browsers where lots of anti-paywall addons can be installed.
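A conditional paywall of the kind described above boils down to branching on who the client claims to be. A minimal sketch (the function name and bot list are illustrative, not any publisher's actual code):

```python
SEARCH_BOT_TOKENS = ("Googlebot", "bingbot", "DuckDuckBot")  # illustrative list

def serve_article(user_agent: str, full_text: str, teaser: str) -> str:
    # Crawlers get the full text so the article ranks in search;
    # ordinary browsers get only a teaser plus the paywall prompt.
    if any(token in user_agent for token in SEARCH_BOT_TOKENS):
        return full_text
    return teaser + "\n[Subscribe to keep reading]"
```

A crawler or addon that spoofs a bot user-agent lands in the first branch, which is exactly the loophole conditional paywalls leave open; sites serving genuinely authenticated content have no such branch to exploit.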
Wow, I did not know about the regional blocking of git servers! Makes me wonder what else is kept from western audiences, and why this blocking is happening.
Thanks for sketching out their approach and for the URI.
Most of them don't check the IP, it would seem. Google acquires new IPs all the time, plus there are a lot of other search engines that news publishers don't want to accidentally miss out on. It's mostly just client-side JS hiding the content after a time delay, or other techniques like that. I think the proportion of the population using these addons is so low that it would cost news publishers more in lost SEO to restrict crawling to a subset of IPs.
The way I (loosely) understand it, when you archive a page they send your IP in the X-Forwarded-For header. Some paywall operators render that into the page content served up, which then causes it to be visible to anyone who clicks your archived link and Views Source.
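To illustrate the mechanism being described: if the origin server echoes X-Forwarded-For into the HTML it serves, the archiving user's IP gets frozen into the snapshot. A hypothetical sketch of such publisher-side rendering (the attribute name and markup are made up for illustration):

```python
def render_page(headers: dict, body: str) -> str:
    # Hypothetical paywall/analytics rendering that embeds the
    # forwarded client IP into the served HTML.
    client_ip = headers.get("X-Forwarded-For", "unknown")
    return f'<body data-client-ip="{client_ip}">{body}</body>'
```

If archive.today forwards the archiving user's IP in that header, the stored snapshot then carries that IP, and anyone who opens the archived link and views source can read it.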
But in the article they talk about manipulating users' devices to run DDoS attacks, not to scrape websites. And a user visiting the archive website is probably not going to have a subscription; besides, I'm not sure that simply visiting archive.today would let it exfiltrate much information from any third-party website, since cookies will not be shared.
I guess if they controlled a residential botnet more extensively they would be able to do that, but it would still be very difficult to remove login information from the page. The fact that they manipulated the scraped data for totally unrelated reasons a few times proves nothing, in my opinion.
They do remove the login information for their own accounts (e.g. the one they use for LinkedIn's sign-up wall). Their implementation is not perfect, though, which is how the aliases were leaked in the first place.
Last week, I asked Gemini to give me episode names and air dates for a TV show, and it proceeded to fabricate two seasons' worth of titles; not a single name or air date was correct. The episode names were even listed on (Swedish) Wikipedia, so they should have been in its training data.
Please don't imagine that I don't fully understand this.
Nevertheless, X11 "server" and "client" have confused very smart and highly technical people. I have had the entertainment of explaining it dozens of times, though rarely recently.
And honestly, still, a server is usually a remote machine in all common usage. When "the server's down", it is usually not a problem on your local machine.
You'd need to commit and push, and everybody else working on the project needs to pull in your commits in order to see the changes.
This breaks feature-branch workflows immediately, and it makes collaboration really hard in all kinds of ways.
It can work, as long as you don't use branches and as long as you are solo, I guess.
The project readme suggests it would be a good fit for small teams. I beg to differ.