Yes definitely, adding database schemas and queries is on our roadmap for capturing "indirect links" across codebases. We are thinking of something similar to Supabase Studio's schema visualizer [1], but integrated with the map of the codebase, so you can see which parts of the code read from or write to each table. Is that what you have in mind? Or do you mean something else by database access patterns?
This is great work! Mechanistic interpretability has tons of use cases; it's great to see open research in that field.
You mentioned you spent your own time and money on it, would you be willing to share how much you spent? It would help others who might be considering independent research.
Thank you, I too am a big believer in and enjoyer of open research. The actual code conveys things with a clarity that complex research papers never could for me.
Regarding the cost, I would sum it up to roughly ~2.5k USD for just the actual execution. Development cost would probably have doubled that sum if I didn't already have a GPU workstation for experiments at home that I take for granted. That cost is made up of:
* ~400 USD for ~2 months of storage and traffic of 7.4 TB (3.2 TB of raw, 3.2 TB of preprocessed training data) on a GCP standard bucket
* ~100 USD for Anthropic Claude requests, covering experiments to find the right system prompt, test runs, and the actual final execution
* The other ~2k USD were used to rent 8x NVIDIA RTX 4090s together with a 5 TB SSD from runpod.io for various stages of the experiments. For the actual SAE training I rented the node for 8 days straight, and I would allocate an additional ~3-4 days of runtime just to the experiments needed to determine the best hyperparameters for training.
> We aren't sure, but the speculation is that in the process of training, GPT-3 found that the best strategy to correctly predicting the continuation of arithmetic expressions was to figure out the rules of basic arithmetic and encode them in some portion of its neural network, then apply them whenever the prompt suggested to do so.
I strongly disagree. GPT-3 has 100% accuracy on 2-digit addition, 80% on 3-digit addition, 25% on 4-digit addition and 9% on 5-digit addition. If it had indeed "understood arithmetic", increasing the number of digits should not affect its accuracy.
My perspective as an ML practitioner is that the cool part of GPT-3 is that it stores information effectively and can decode queries more easily than before to retrieve the information that is required. Yet with things like arithmetic, the most efficient strategy would be to understand the rules of addition, but the internal structure is too rigid to encode those rules at the moment.
The rules seem pretty clear that consent is required from any persons appearing in any external datasets used. The winners scraped data from YouTube videos, so I am not sure what the issue is.
The more worrying takeaway is that the winners scraped videos from people who clearly had no intention of their videos being used for a deepfake detection algorithm. Yet they did not think of the ethical considerations of using that data (did everyone in the video even have a say in the video being uploaded?). I think Kaggle disqualifying the team is the right move (even if it's a painful one for the winners).
The article states the videos used a Creative Commons license that allowed for commercial use. It is an extremely liberal license that does not state "free for commercial use except for when used with facial recognition."
For people in a video you need a model release from them. This is also a mistake many people make: they use Creative Commons licenses and think they are safe. A picture or a video needs model releases for the people in it (several exemptions apply).
If that is true, then basically all the photographs on Wikipedia are illegal, since the only check they do is for copyright, not for model releases. Pretty sure that's not a legal requirement.
Which photos are you thinking of? It most certainly varies from country to country, but public figures, or random people captured when taking a picture of a landscape or a building, are at least in some countries not subject to such rules.
But that Creative Commons licence was issued by the copyright holder of those videos, not the people in them. The people in those videos may not even have agreed to appear in the video if they were in a public place (the relevant legal term, at least here in the UK, is "reasonable expectation of privacy"). So if Kaggle requires people in the videos to consent to taking part, then that consent cannot be inferred from that licence.
What's more, even if that consent is not legally required (there's a heavy "if" in this sentence; IANAL so I don't pretend to know whether it's required e.g. under GDPR, but let's assume for a moment that it's not), Kaggle are still perfectly within their rights to ask for that permission as a condition of the competition. After all, it's their competition, and it's totally reasonable for them to set an ethical bar that's even higher than legally required.
You're right, I missed that part of their rules. Looks like they did probably break them.
"A. If any part of the submission documentation depicts, identifies, or includes any person that is not an individual participant or Team member, you must have all permissions and rights from the individual depicted, identified, or included and you agree to provide Competition Sponsor and PAI with written confirmation of those permissions and rights upon request."
Yeah, with $1 million at stake, I can't believe this team of really smart people made such an incredible blunder.
The whole reason Facebook launched this challenge was to try and bury the bad PR over their data practices. If people in the external datasets had complained about the unauthorized use of their faces in the winning solution, it would've been pretty embarrassing for FB.
Note that isn't part of the rules. It's part of the "Winning submission documentation requirements" which is a separate document and wasn't mentioned at all on the "external data" Kaggle thread, which had Kaggle moderators explaining the rules.
Documentation requirements are pretty standard in Kaggle competitions, and usually cover having to supply your code, and maybe write a blog post about it. I've never seen one that had major rules in it.
I'm with you here. There are ethical concerns, legal concerns for productization, and overall this defeats the purpose of the competition, which was to produce novel algorithms rather than a better-trained model.
For instance, if the same scraped data were used to train the deepfake GAN, would their model be more or less effective than a competitor's?
It seems like they won from a disparity in data, not an innovative technical approach.
I prefer the emacs bindings for the command line, such as C-a and C-u (mostly due to muscle memory), but have set my $EDITOR to vim. This allows me to do C-x C-e, which opens the current command in vim to be edited.
If you are using zsh, you need to add this to your .zshrc:
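(A common form of that snippet, assuming the stock edit-command-line widget is what's meant here:)

    # load the built-in widget that opens the current line in $EDITOR
    autoload -Uz edit-command-line
    zle -N edit-command-line
    # bind it to C-x C-e, matching the bash/emacs behaviour
    bindkey '^X^E' edit-command-line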
How is it false? It literally says the traditional links are mostly sourced from Bing: "We also of course have more traditional links in the search results, which we also source from multiple partners, though most commonly from Bing (and none from Google)."
All the other 400 sources and their own crawler are just used for Instant Answers and widgets.
Does anyone have thoughts about Cassandra and how it compares to MongoDB? There seems to be a big enterprise push with Azure's Cosmos DB, but I've not heard much about it from people who have actually used it.
Unless you can hire Cassandra engineers who have enough experience maintaining clusters, or you can afford to pay DSE to do that for you, working with Cassandra will be quite the burden.
The advancement of PostgreSQL has been amazing, and much credit is indeed due to the NoSQL community's innovations. I cannot comment on MongoDB except to say that it is quite different from Cassandra.
To summarize, be sure you really need Cassandra and are able to dedicate the appropriate resources to it before taking the plunge.
my solution is to use pandoc to generate the diffs. It combines the benefits of Word formatting but allows me to see the changes in git. (I use it mainly for my resume.)
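For anyone wanting to try it, a minimal sketch of one common way to wire this up (the "pandoc" driver name is just a label, and this assumes pandoc is on your PATH):

    # .gitattributes -- route .docx files through a custom diff driver
    *.docx diff=pandoc

    # .gitconfig (or .git/config) -- convert .docx to Markdown before diffing
    [diff "pandoc"]
        textconv = pandoc --from=docx --to=markdown
        cachetextconv = true

After that, a plain `git diff` on a .docx shows readable text changes instead of "Binary files differ".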
Have you seen any options take advantage of the fact that docx files are just zipped xml files? I can see the git repo ballooning if you have a few images and you commit frequently!
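For context, the container structure is easy to inspect (resume.docx here is just a placeholder filename):

    # a .docx is an ordinary zip archive of XML parts
    unzip -l resume.docx
    # word/document.xml holds the text; word/media/ holds any embedded images

Since the whole archive is rewritten and recompressed on every save, git's delta compression gets little traction on it, which is presumably where the ballooning comes from.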
created an account after lurking here for 3(!) years now. happy for this to be my first post, as the HN community has definitely been an inspiration towards starting my own website/blog (and I have in turn convinced a few others)
my website is built with GatsbyJS and hosted with Netlify. I plan to blog regularly about NLP/deep learning topics and eventually add projects + photography, but that's for another day.
1: https://supabase.com/blog/supabase-studio-3-0#schema-visuali...