Hello there! Glad to see other people working on this area as well. Ho-Hsiang and I from GitHub have been prototyping this exact same approach and have published / open sourced our work about a month ago: https://towardsdatascience.com/semantic-code-search-3cd6d244...
I think the reverse would be incredibly more useful. Here is a giant block of code. Break it down into snippets and explain what they are doing.
I rarely have trouble finding examples of code using Google. When I do, I'm most likely using obscure languages or it's a very niche programming scenario.
Great! Brings back some old memories.
Straight out of college, I interviewed with some big company for a product management role. The interviewer asked me to pick a website I like and describe what new features I would like to see in it. I picked github.com and said that GitHub has such a big corpus of code, and people have so many questions about how to do this or that in a programming language, that I would like to see a way to search that corpus and see related code examples. Also maybe find common bug patterns and suggest fixes (I didn't know about FindBugs at the time). Sadly I didn't have a good way to implement it or a solid design idea at the time.
I wonder if a similar approach could be used to help discover common programming errors and their solutions, i.e. data mine bug hindsight.
It would work something like:
1. Go through a repo's history and look at commits.
2. Using text classification of commit messages and/or cross-referencing with a bug tracker database (and limiting to issues that are really defects, not feature requests), identify commits that fix bugs.
3. Now you have before and after code. Try to discover the salient part of what changed, perhaps by parsing them both and comparing ASTs, or by diffing the text. Or run the unit test (that you added with the fix) through the old code and the new code while tracing execution to see differences in the dynamic behavior of the code.
4. Correlate this with descriptions of what the method is trying to do.
Perhaps you might be able to generate statements like "when fixing bugs related to creating a daemon, programmers often add calls to close() or add calls to umask()".
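Here's a rough sketch of steps 1 and 2 (plus the raw input for step 3), assuming a local git checkout and a crude keyword heuristic standing in for a real commit-message classifier or bug-tracker cross-reference:

    import re
    import subprocess

    # Crude stand-in for step 2's classifier: a keyword match on the commit subject.
    BUGFIX_PATTERN = re.compile(r"\b(fix(e[sd])?|bug|defect)\b", re.IGNORECASE)

    def bugfix_commits(repo_path):
        """Yield (sha, subject) pairs for commits whose message looks like a bug fix."""
        log = subprocess.run(
            ["git", "-C", repo_path, "log", "--no-merges", "--pretty=format:%H%x00%s"],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in log.splitlines():
            sha, _, subject = line.partition("\x00")
            if BUGFIX_PATTERN.search(subject):
                yield sha, subject

    def before_after(repo_path, sha):
        """Return the textual diff between a bug-fix commit and its parent (input to step 3)."""
        return subprocess.run(
            ["git", "-C", repo_path, "diff", f"{sha}~1", sha],
            capture_output=True, text=True, check=True,
        ).stdout

    if __name__ == "__main__":
        for sha, subject in bugfix_commits("."):
            try:
                patch = before_after(".", sha)
            except subprocess.CalledProcessError:
                continue  # e.g. the root commit, which has no parent to diff against
            # Step 3 proper would parse both versions and compare ASTs here;
            # step 4 would correlate the change with the enclosing function's name/docstring.
            print(sha[:10], subject, f"({len(patch.splitlines())} diff lines)")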
I've been playing around in this area for a while and I think there is a lot of potential - although I'm not sure code search is the ultimate expression of how this should be used.
There's another interesting paper called code2vec[1]. Code2vec uses the AST of a code snippet and builds an embedding from that, rather than treating the code as just a sequence of tokens, as some earlier attempts do.
This paper does the same, which is nice.
Code2vec (at least the demo at [2], which I think is the same thing) is extremely sensitive to variable names. I'm not sure if that is a bug or if they are deliberately including variable names in the embedding.
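As a toy illustration of what an AST-level view can buy you (this is just Python's ast module, not code2vec's actual path-based representation, which as far as I recall keeps the terminal identifier tokens and so would be name-sensitive by design): rename a variable and the token sequence changes, but the tree shape doesn't.

    import ast

    a = "def total(xs):\n    return sum(xs)"
    b = "def total(values):\n    return sum(values)"

    def tree_shape(src):
        """Node-type skeleton of the AST, ignoring identifiers and literals."""
        return [type(node).__name__ for node in ast.walk(ast.parse(src))]

    print(tree_shape(a) == tree_shape(b))  # True: same structure despite the rename
    print(a.split() == b.split())          # False: the token sequences differ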
If people have specific things they'd like to do if they had a tool which had deep understanding of the intent of code I'd be pretty interested to hear about it. Contact details in my profile, or reply here.
Another cool paper along a similar thread is DLPaper2Code: Auto-generation of Code from Deep Learning Research Papers (https://arxiv.org/abs/1711.03543).
I see opportunities with tools like this in the form of apps and/or SaaS, IF a team is willing to put thousands of hours into making the machine learning genuinely user friendly. The only comparable product of that nature I've seen is deepl.com/translator (a translator that leverages deep learning and is better than Google Translate - don't trust me? go try it).
Yeah it is, i've cloned it and i'm trying to run it, but i'm getting the following error:

    $ python codesearcher.py --mode train
    Traceback (most recent call last):
      File "codesearcher.py", line 10, in <module>
        from datashape.coretypes import real
    ImportError: No module named datashape.coretypes

clearly i'm doing something wrong and have no idea how python works :P
It looks like the repo actually provides two different versions, Keras and PyTorch. Whichever one you choose, make sure you install the dependencies listed in their README.
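If it helps, here's a quick (hypothetical) pre-flight check along those lines - the package names are my guesses (the datashape package from PyPI, plus whichever backend you went with), not something pulled from their README:

    # Warn about missing packages before running codesearcher.py.
    import importlib.util
    import sys

    for module in ("datashape", "keras", "torch"):
        if importlib.util.find_spec(module) is None:
            print(f"missing dependency: {module} (try: pip install {module})", file=sys.stderr)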
ok wow that actually seems to work pretty damn credibly[1], at least given the inputs i could think of. i did find it difficult to come up with queries that (1) would clearly rely more on the code embeddings than the description embeddings (which i'd assume are at least pretty decent on their own), while (2) still making sense / having a sensible answer.
the best i could come up with is describing what you want ~operationally, rather than with domain-specific (read: predictive) jargon, while also not underspecifying it to the point where the query is nonsense[2].
but when i try the query "sort the operands in decreasing order", not only does it get a bunch of sorting functions[3], but i'll be damned if the top 2 results weren't: (1) a `swapOperands` function that takes two Comparator<T>s and returns a new one that invokes the comparison with the operands reversed, and (2) a function that sorts a deque into decreasing order.
obviously that's not the perfect query for telling the relative contributions apart because "operands" is kinda "jargon"-y by my earlier standard, but the results did (correctly) address the only reason i'd ever search for something like that: to remember if the default comparison operator is "A - B" or "B - A". if you change "operands" to "elements", the results do get a little "worse", but i'd argue that the query (1) is actually quite a bit more vague in that formulation and (2) is less likely to represent a code snippet that actually exists, if only because most comparison operators on individual "containee" types (~"element[ type]s") are parameterized by the sort direction [citation needed].
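for concreteness, the thing i was actually trying to remember, sketched in python with cmp_to_key rather than the java Comparator the results returned (so take the translation with a grain of salt):

    from functools import cmp_to_key

    xs = [3, 1, 2]
    # with a two-argument comparator, "a - b" sorts ascending and "b - a" descending
    print(sorted(xs, key=cmp_to_key(lambda a, b: a - b)))  # [1, 2, 3]
    print(sorted(xs, key=cmp_to_key(lambda a, b: b - a)))  # [3, 2, 1]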
tl;dr - lmk if anyone thinks of particularly good ways to "fool" this and/or demarcate what kinds of things are and are not well-represented by the code embeddings; to my eye they definitely seem to be doing a decent job of... embedding the code... as code embeddings are wont to do.
[1] not to "damn with faint praise"-- if you had asked me whether something like this paper would work well enough to be useful i... would have guessed "no" :)
[2] e.g. queries like "do some machine learning on the user images" and "... on the training data" don't get very "good" results in terms of topic proximity, but there really isn't a sensible response anyway, so if anything that's a good sign-- garbage in, garbage out
[3] which is obviously to be expected, as P(<can't find sorting functions> | <paper published>) is (hopefully) pretty low...