Yes definitely, adding database schemas and queries is on our roadmap for capturing "indirect links" across codebases. We are thinking of something similar to Supabase Studio's schema visualizer [1], but integrated with the map of the codebase, so you can see which parts of the code read from or write to each table. Is that what you have in mind? Or do you mean something else by database access patterns?
This is great work! Mechanistic interpretability has tons of use cases; it's great to see open research in that field.
You mentioned you spent your own time and money on it, would you be willing to share how much you spent? It would help others who might be considering independent research.
Thank you, I too am a big believer in and enjoyer of open research. The actual code conveys things with a clarity that complex research papers never could for me.
Regarding the cost, I would sum it up to roughly ~2.5k USD for just the actual execution. Development cost would probably have doubled that sum if I didn't already have a GPU workstation for experiments at home that I take for granted. That cost is made up of:
* ~400 USD for ~2 months of storage and traffic of 7.4 TB (3.2 TB of raw, 3.2 TB of preprocessed training data) on a GCP standard bucket
* ~100 USD for Anthropic Claude requests, covering experiments to find the right system prompt, test runs, and the actual final execution
* The other ~2k USD were used to rent 8x NVIDIA RTX 4090s together with a 5 TB SSD from runpod.io for various stages of the experiments. For the actual SAE training I rented the node for 8 days straight, and I would allocate an additional ~3-4 days of runtime just to the experiments needed to determine the best hyperparameters for training.
> We aren't sure, but the speculation is that in the process of training, GPT-3 found that the best strategy to correctly predicting the continuation of arithmetic expressions was to figure out the rules of basic arithmetic and encode them in some portion of its neural network, then apply them whenever the prompt suggested to do so.
I strongly disagree. GPT-3 has 100% accuracy on 2-digit addition, 80% on 3-digit addition, 25% on 4-digit addition and 9% on 5-digit addition. If it had indeed "understood arithmetic", increasing the number of digits should not affect its accuracy.
My perspective as an ML practitioner is that the cool part of GPT-3 is that it stores information effectively and can decode queries more easily than before to retrieve the information that is required. Yet with things like arithmetic, the most efficient strategy would be to understand the rules of addition, but the internal structure is too rigid to encode those rules at the moment.
The rules seem pretty clear that consent is required from any persons appearing in any external datasets used. The winners scraped data from YouTube videos, so I am not sure what the issue is.
The more worrying takeaway is that the winners scraped videos from people who clearly had no intention of their videos being used for a deepfake detection algorithm. Yet they did not think of the ethical considerations of using that data (did everyone in the video even have a say in the video being uploaded?). I think Kaggle disqualifying the team is the right move (even if it's a painful one for the winners).
The article states the videos used a Creative Commons license that allowed for commercial use. It is an extremely liberal license that does not state "free for commercial use except for when used with facial recognition."
For people in a video you need a model release from them. This is also a mistake many people make: they use Creative Commons licenses and think they are safe. A picture or a video needs model releases for the people in it (several exemptions apply).
If that is true, then basically all the photographs on Wikipedia are illegal, since the only check they do is for copyright, not for model releases. Pretty sure that's not a legal requirement.
Which photos are you thinking of? It most certainly varies from country to country, but public figures, or random people captured when taking a picture of a landscape or a building, are at least in some countries not subject to such rules.
But that Creative Commons licence was issued by the copyright holder of those videos, not the people in them. The people in those videos may not even have agreed to appear in the video if they were in a public place (the relevant legal term, at least here in the UK, is "reasonable expectation of privacy"). So if Kaggle requires people in the videos to consent to taking part, then that consent cannot be inferred from that licence.
What's more, even if that consent is not legally required (there's a heavy "if" in this sentence; IANAL so I don't pretend to know whether it's required e.g. under GDPR, but let's assume for a moment that it's not), Kaggle are still perfectly within their rights to ask for that permission as a condition of the competition. After all, it's their competition, and it's totally reasonable for them to set an ethical bar that's even higher than legally required.
You're right, I missed that part of their rules. Looks like they did probably break them.
"A. If any part of the submission documentation depicts, identifies, or includes any person that is not an individual participant or Team member, you must have all permissions and rights from the individual depicted, identified, or included and you agree to provide Competition Sponsor and PAI with written confirmation of those permissions and rights upon request."
Yeah, with $1 million at stake, I can't believe this team of really smart people made such an incredible blunder.
The whole reason Facebook launched this challenge was to try and bury the bad PR over their data practices. If people in the external datasets had complained about the unauthorized use of their faces in the winning solution, it would've been pretty embarrassing for FB.
Note that isn't part of the rules. It's part of the "Winning submission documentation requirements" which is a separate document and wasn't mentioned at all on the "external data" Kaggle thread, which had Kaggle moderators explaining the rules.
Documentation requirements are pretty standard in Kaggle competitions, and usually cover having to supply your code, and maybe write a blog post about it. I've never seen one that had major rules in it.
I'm with you here. There are ethical concerns, legal concerns for productization, and overall this defeats the purpose of the competition, which was to produce novel algorithms rather than a better-trained model.
For instance, if the same scraped data were used to train the deepfake GAN, would their model be more or less effective than a competitor's?
It seems like they won from a disparity in data, not an innovative technical approach.
I prefer the emacs bindings for the command line, such as C-a and C-u (mostly due to muscle memory), but have set my $EDITOR to vim. This allows me to do C-x C-e, which opens the current command in vim to be edited.
If you are using zsh, you need to add this to your .zshrc:
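(A common form of that snippet, assuming the stock edit-command-line widget is what's meant here:)

    # load the built-in widget that opens the current line in $EDITOR
    autoload -Uz edit-command-line
    zle -N edit-command-line
    # bind it to C-x C-e, matching the bash/emacs behaviour
    bindkey '^X^E' edit-command-line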
How is it false? It literally says the traditional links are mostly sourced from Bing: "We also of course have more traditional links in the search results, which we also source from multiple partners, though most commonly from Bing (and none from Google)."
All the other 400 sources and their own crawler are just used for Instant Answers and widgets.
Does anyone have thoughts about Cassandra and how it compares to MongoDB? There seems to be a big enterprise push with Azure's Cosmos DB, but I've not heard much about it from people who have actually used it.
Unless you can hire Cassandra engineers who have enough experience maintaining clusters, or you can afford to pay DSE to do that for you, working with Cassandra will be quite the burden.
The advancement of PostgreSQL has been amazing, and much credit is indeed due to the NoSQL community's innovations. I cannot comment on MongoDB except to say that it is quite different from Cassandra.
To summarize, be sure you really need Cassandra and are able to dedicate the appropriate resources to it before taking the plunge.
my solution is to use pandoc to generate the diffs. It combines the benefits of Word formatting but allows me to see the changes in git. (I use it mainly for my resume.)
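For anyone wanting to try it, a minimal sketch of one common way to wire this up (the "pandoc" driver name is just a label, and this assumes pandoc is on your PATH):

    # .gitattributes -- route .docx files through a custom diff driver
    *.docx diff=pandoc

    # .gitconfig (or .git/config) -- convert .docx to Markdown before diffing
    [diff "pandoc"]
        textconv = pandoc --from=docx --to=markdown
        cachetextconv = true

After that, a plain `git diff` on a .docx shows readable text changes instead of "Binary files differ".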
Have you seen any options take advantage of the fact that docx files are just zipped xml files? I can see the git repo ballooning if you have a few images and you commit frequently!
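For context, the container structure is easy to inspect (resume.docx here is just a placeholder filename):

    # a .docx is an ordinary zip archive of XML parts
    unzip -l resume.docx
    # word/document.xml holds the text; word/media/ holds any embedded images

Since the whole archive is rewritten and recompressed on every save, git's delta compression gets little traction on it, which is presumably where the ballooning comes from.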
created an account after lurking here for 3(!) years now. happy for this to be my first post, as the HN community has definitely been an inspiration towards starting my own website/blog (and I have in turn convinced a few others)
my website is built with GatsbyJS and hosted with Netlify. I plan to blog regularly about NLP/deep learning topics and eventually add projects + photography, but that's for another day.
1: https://supabase.com/blog/supabase-studio-3-0#schema-visuali...