Hello there! Glad to see other people working on this area as well. Ho-Hsiang and I from GitHub have been prototyping this exact same approach and have published / open sourced our work about a month ago: https://towardsdatascience.com/semantic-code-search-3cd6d244...
I think the reverse would be incredibly more useful. Here is a giant block of code. Break it down into snippets and explain what they are doing.
I rarely have trouble finding examples of code using Google. When I do, I'm most likely using obscure languages or it's a very niche programming scenario.
Great! Brings back some old memories.
Straight out of college, I interviewed with some big company for a product management role. The interviewer asked me to pick a website I like and describe what new features I would like to see in it. I picked github.com and said that GitHub has such a big corpus of code, and people have so many questions about how to do this or that in a programming language, that I would like to see a way to search that corpus and see related code examples. Also maybe find common bug patterns and suggest fixes (I didn't know about FindBugs at the time). Sadly I didn't have a good way to implement it or a solid design idea at the time.
I wonder if a similar approach could be used to help discover common programming errors and their solutions, i.e. data mine bug hindsight.
It would work something like:
1. Go through a repo's history and look at commits.
2. Using text classification of commit messages and/or cross-referencing with a bug tracker database (and limiting to issues that are really defects, not feature requests), identify commits that fix bugs.
3. Now you have before and after code. Try to discover the salient part of what changed, perhaps by parsing them both and comparing ASTs, or by diffing the text. Or run the unit test (that you added with the fix) through the old code and the new code while tracing execution to see differences in the dynamic behavior of the code.
4. Correlate this with descriptions of what the method is trying to do.
Perhaps you might be able to generate statements like "when fixing bugs related to creating a daemon, programmers often add calls to close() or add calls to umask()".
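Here's a rough sketch of steps 1 and 2 (plus the raw input for step 3), assuming a local git checkout and a crude keyword heuristic standing in for a real commit-message classifier or bug-tracker cross-reference:

    import re
    import subprocess

    # Crude stand-in for step 2's classifier: a keyword match on the commit subject.
    BUGFIX_PATTERN = re.compile(r"\b(fix(e[sd])?|bug|defect)\b", re.IGNORECASE)

    def bugfix_commits(repo_path):
        """Yield (sha, subject) pairs for commits whose message looks like a bug fix."""
        log = subprocess.run(
            ["git", "-C", repo_path, "log", "--no-merges", "--pretty=format:%H%x00%s"],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in log.splitlines():
            sha, _, subject = line.partition("\x00")
            if BUGFIX_PATTERN.search(subject):
                yield sha, subject

    def before_after(repo_path, sha):
        """Return the textual diff between a bug-fix commit and its parent (input to step 3)."""
        return subprocess.run(
            ["git", "-C", repo_path, "diff", f"{sha}~1", sha],
            capture_output=True, text=True, check=True,
        ).stdout

    if __name__ == "__main__":
        for sha, subject in bugfix_commits("."):
            try:
                patch = before_after(".", sha)
            except subprocess.CalledProcessError:
                continue  # e.g. the root commit, which has no parent to diff against
            # Step 3 proper would parse both versions and compare ASTs here;
            # step 4 would correlate the change with the enclosing function's name/docstring.
            print(sha[:10], subject, f"({len(patch.splitlines())} diff lines)")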
I've been playing around in this area for a while and I think there is a lot of potential - although I'm not sure code search is the ultimate expression of how this should be used.
There's another interesting paper called code2vec[1]. Code2vec uses the AST of a code snippet and builds an embedding from that, rather than treating the code as just a sequence of tokens, as some earlier attempts do.
This paper does the same, which is nice.
Code2vec (at least the demo at [2], which I think is the same thing) is extremely sensitive to variable names. I'm not sure if that is a bug or if they are deliberately including variable names in the embedding.
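As a toy illustration of what an AST-level view can buy you (this is just Python's ast module, not code2vec's actual path-based representation, which as far as I recall keeps the terminal identifier tokens and so would be name-sensitive by design): rename a variable and the token sequence changes, but the tree shape doesn't.

    import ast

    a = "def total(xs):\n    return sum(xs)"
    b = "def total(values):\n    return sum(values)"

    def tree_shape(src):
        """Node-type skeleton of the AST, ignoring identifiers and literals."""
        return [type(node).__name__ for node in ast.walk(ast.parse(src))]

    print(tree_shape(a) == tree_shape(b))  # True: same structure despite the rename
    print(a.split() == b.split())          # False: the token sequences differ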
If people have specific things they'd like to do if they had a tool which had deep understanding of the intent of code I'd be pretty interested to hear about it. Contact details in my profile, or reply here.
Another cool paper along a similar thread is DLPaper2Code: Auto-generation of Code from Deep Learning Research Papers (https://arxiv.org/abs/1711.03543).
I see opportunities with tools like this in the form of apps and/or SaaS, IF a team is willing to put thousands of hours into making the machine learning genuinely user friendly. The only comparable product of that nature I've seen is deepl.com/translator (a translator that leverages deep learning and is better than Google Translate - don't trust me? go try it).
Yeah it is, i've cloned it and i'm trying to run it, but i'm getting the following error:

    $ python codesearcher.py --mode train
    Traceback (most recent call last):
      File "codesearcher.py", line 10, in <module>
        from datashape.coretypes import real
    ImportError: No module named datashape.coretypes

clearly i'm doing something wrong and have no idea how python works :P
It looks like the repo actually provides two different versions, Keras and PyTorch. Whichever one you choose, make sure you install the dependencies listed in their README.
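If it helps, here's a quick (hypothetical) pre-flight check along those lines - the package names are my guesses (the datashape package from PyPI, plus whichever backend you went with), not something pulled from their README:

    # Warn about missing packages before running codesearcher.py.
    import importlib.util
    import sys

    for module in ("datashape", "keras", "torch"):
        if importlib.util.find_spec(module) is None:
            print(f"missing dependency: {module} (try: pip install {module})", file=sys.stderr)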
ok wow that actually seems to work pretty damn credibly[1], at least given the inputs i could think of. i did find it difficult to come up with queries that (1) would clearly rely more on the code embeddings than the description embeddings (which i'd assume are at least pretty decent on their own), while (2) still making sense / having a sensible answer.
the best i could come up with is describing what you want ~operationally, rather than with domain-specific (read: predictive) jargon, while also not underspecifying it to the point where the query is nonsense[2].
but when i try the query "sort the operands in decreasing order", not only does it get a bunch of sorting functions[3], but i'll be damned if the top 2 results weren't: (1) a `swapOperands` function that takes two Comparator<T>s and returns a new one that invokes the comparison with the operands reversed, and (2) a function that sorts a deque into decreasing order.
obviously that's not the perfect query for telling the relative contributions apart because "operands" is kinda "jargon"-y by my earlier standard, but the results did (correctly) address the only reason i'd ever search for something like that: to remember if the default comparison operator is "A - B" or "B - A". if you change "operands" to "elements", the results do get a little "worse", but i'd argue that the query (1) is actually quite a bit more vague in that formulation and (2) is less likely to represent a code snippet that actually exists, if only because most comparison operators on individual "containee" types (~"element[ type]s") are parameterized by the sort direction [citation needed].
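for concreteness, the thing i was actually trying to remember, sketched in python with cmp_to_key rather than the java Comparator the results returned (so take the translation with a grain of salt):

    from functools import cmp_to_key

    xs = [3, 1, 2]
    # with a two-argument comparator, "a - b" sorts ascending and "b - a" descending
    print(sorted(xs, key=cmp_to_key(lambda a, b: a - b)))  # [1, 2, 3]
    print(sorted(xs, key=cmp_to_key(lambda a, b: b - a)))  # [3, 2, 1]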
tl;dr - lmk if anyone thinks of particularly good ways to "fool" this and/or demarcate what kinds of things are and are not well-represented by the code embeddings; to my eye they definitely seem to be doing a decent job of... embedding the code... as code embeddings are wont to do.
[1] not to "damn with faint praise"-- if you had asked me whether something like this paper would work well enough to be useful i... would have guessed "no" :)
[2] e.g. queries like "do some machine learning on the user images" and "... on the training data" don't get very "good" results in terms of topic proximity, but there really isn't a sensible response anyway, so if anything that's a good sign-- garbage in, garbage out
[3] which is obviously to be expected, as P(<can't find sorting functions> | <paper published>) is (hopefully) pretty low...