We've been working on another code search engine for open source: https://sourcegraph.com/. Our approach is a little different -- we parse and index things at a programming language level. This has the benefit of being able to list usage examples of a function or jump to a definition (i.e., smart IDE-like behavior). Obviously, one drawback is having to implement language-specific support; right now Sourcegraph just supports Python, JavaScript (Node.js) and Go.
Would love to hear people's feedback on either approach!
I have been watching sourcegraph.com for a while. As you say, you do much deeper and more expensive parsing of the code, which is not really feasible at the moment for searchcode since it covers 90 languages. It's also a constraint because I don't have the hardware to support this (searchcode is very lean).
My goal was to provide code search for sources other than GitHub, as there is so much code out there to find. GitHub does an excellent job indexing, and I am certain that Bitbucket will be following with their own implementation soon.
I too would love to know what people prefer: a deeper analysis of the code with IDE support, or more breadth. I can see both being useful.
I'm curious, are there any plans for offline-type functionality? I ask because not everyone stores their code in public/cloud repositories, especially source code for private contracts and government agencies, which have strict rules regarding the source code.
Yes. We've had a number of large and small companies reach out to us (Sourcegraph) about this. It's definitely something we plan to support, after we make an awesome public product for open source code. Email me if you want to learn more (sqs at sourcegraph dot com).
Yes, that's my fault. What's happening under the hood is that searchcode is trying to match "MongoDBObject find" exactly, whereas GitHub is going for a looser but phrase-heavy search.
It's something I am going to change based on feedback, as clearly what I am doing is not what people are expecting. I will however make it an option, so you can choose exact match (current) or loose.
I am modifying it now to be a little more like GitHub's.
I had planned on removing the highlights too; it's going to be modified to just affect the portion where the match occurred. It's mostly there as a hangover from when I was not actually highlighting the actual match and just the line.
Handy! However, this seems to do fuzzy searches by default, and I don't see an obvious way to disable that.
I tried searching for "xcb_connect", and got a pile of results for "xcb_connection_t". However, unlike with other search engines, I can't quote the term to force an exact match, because it'll search for the quotes literally (which I'd otherwise appreciate when trying to find a string).
Hi. That's actually a very good point. I need to have a think on how to allow forcing exact matches, perhaps an advanced option which does this for you. Thanks for the feedback.
As a first pass, a "match whole words" checkbox should suffice. Eventually you might want a syntax that allows more flexibility, whether full regexes or something more scalable, but "match whole words" would solve a large fraction of the problem. (You might even want to make that the default, and have a checkbox for "match partial words" instead; for code search, I'd bet that most of the time you want the whole word matches.)
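To make the "match whole words" idea concrete, here is a minimal Python sketch. It is only an illustration and not how searchcode works: the function name `find_matches` and the sample lines are made up, and it assumes a "word" means a run of identifier characters (letters, digits, underscore).

```python
import re

def find_matches(term, lines, whole_word=True):
    """Return the lines containing term.

    With whole_word=True, an identifier character directly before or
    after the term prevents a match (hypothetical helper, for illustration).
    """
    escaped = re.escape(term)
    # \w covers letters, digits and underscore, i.e. typical identifier characters.
    pattern = rf"(?<!\w){escaped}(?!\w)" if whole_word else escaped
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

lines = [
    "xcb_connection_t *c = xcb_connect(NULL, NULL);",
    "void close_all(xcb_connection_t *c);",
]
print(find_matches("xcb_connect", lines))                    # only the first line
print(find_matches("xcb_connect", lines, whole_word=False))  # both lines
```

Escaping the term first means quotes, parentheses and other punctuation are still searched for literally, which keeps the current behaviour for people looking for strings.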
Yes, I have looked at that. The catch is it would require another copy of the index, and I don't quite want to front the cash for that.
I did write some code some time ago which would turn a regex query into a normal Sphinx query, so I am going to try to implement that again and hopefully get similar results.
I've been working on another tool, http://crossclj.info, focused on cross-referencing a large part of the open-source ecosystem for Clojure and ClojureScript (some 3600+ projects).
You can browse both the source code (jumping to definitions across projects) and the auto-generated docs of the whole codebase.
It cross-references any string in any text file inside a project.
I indexed Linux, the BSD kernel, Android AOSP (JB), OpenStack, golang, raspberrypi_userland, Elasticsearch, Node.js and a few other projects with it on that site.
I use Sourcegraph occasionally and mostly rely on GitHub search. I wish the search had all the advanced refinement options that grep and Sublime Text search have. Some examples would be using regexes, searching for a word within a scope of lines, searching within search results, etc. Additionally, it's very useful to be able to sort the search results by stars/forks. Sometimes I just want to see how popular projects have implemented a certain feature. A keyword-based search isn't enough for that.
I guess these features are very expensive & slow to implement but it would be super useful if it can be achieved. Source code search is for geeks so it is probably fair to say that a truly advanced & complex interface won't turn away users.
Sounds like what I want to achieve with searchcode. I am working on most of what you have listed there. If you want to email me with your ideas on how these features should work, I would be more than happy to add them. My email is in my bio and listed on searchcode.
This is nicely implemented; however, as with most code search engines, I'm wondering what the exact use case is. When I want to search for a function, I usually try Google and find sample code on forums, GitHub or Stack Exchange sites, along with detailed information and discussions about it. What additional feature would searchcode provide?
Personally I use it quite a lot when looking for implementations. I also used code search before searchcode as well to do this, so it fits into my workflow.
However, I also use DuckDuckGo, and !code redirects to searchcode for this, so there is that.
Possibly it depends on how you learn and use things. I find examples useful. For instance, when looking at how to do things in Python's Fabric, I would rather see some examples than read about how someone else is doing it.
I like the search engines and discoverability. My own project - much, much smaller - is for embedding search into the Javadocs themselves (along with some SEO improvements, like iframes and better meta-info).
GWT is not on Google Code anymore (well we were, but it's old code). We're at gwt.googlesource.com.
I imagine you'll want android.googlesource.com and other googlesource.com repos.
Also, ranking is really important. My usual test is to search for java.lang.String and see what comes up. If the first result isn't the String class from some version of the Java SDK, something is wrong :-)
Also: CrossSiteIframeLinker finds the right file on the first page (though not at the top), but CrossSiteIframeLinker.java has no results.
The reason you aren't seeing String from Java is that by default it actually goes for files using the String class, rather than the implementation. It's something I need to work on though, as I agree it should pop up near the top, or possibly appear in the documentation listing at the top.
As for CrossSiteIframeLinker, I don't index the filenames, although I am considering it for these sorts of cases.
Thanks. Yes, all characters are actually indexed, with some logic to split intelligently when required. It will always go for the most exact match first though.
I have started looking at ranking the main implementation over usage, but found the results were less useful generally. This may have just been me though, as I wanted to see usage rather than the implementation, since if I know that I will just go to the source.
Perhaps an option on an advanced search page to request that the original source is given greater weight would solve this.
Does anyone know details on the API - what is the request quota? Will there be API keys, or just anonymous access to the API?
I am interested in how it works with GitHub search - their API allows something like 5 unauthenticated req/min and 20 req/min if authenticated (at least the search API)?
There is no request quota at the moment, although if I discover any abuse I may end up rate limiting per IP address (which I really do not want to do, as I don't currently track IPs). No API key is required either. Neither of these is likely to change.
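For anyone wondering what per-IP rate limiting tends to look like, here is a rough sliding-window sketch in Python. It is purely illustrative: the class name, the limits and the example address are assumptions, and nothing here reflects how searchcode is (or would be) implemented.

```python
import time
from collections import defaultdict, deque

class PerIpRateLimiter:
    """Allow at most max_requests per IP within any window_seconds span
    (hypothetical helper, for illustration only)."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

limiter = PerIpRateLimiter(max_requests=2, window_seconds=60)
for t in (0, 1, 2, 61):
    # "203.0.113.7" is a documentation-only example address.
    print(t, limiter.allow("203.0.113.7", now=t))
```

Because the counters are keyed by address, one heavy client runs into the limit without affecting anyone else, which is why a per-IP limit suits desktop apps better than a single global quota.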
I only ask for a link back to the site and that you don't spam it. I want to operate by the motto "Be excellent to each other".
If you want to pass a referrer in the GET request so I know who's using the API that would be nice, but not required.
searchcode has no integration with GitHub search at all. I am unlikely to add it either, as I believe they do an excellent job, hence the link on the right-hand side to be redirected to GitHub search.
Thanks! Hope you find a nice use for the API. If so, let me know as I am always happy to hear about usage.
Thanks. This is an awesome plan. I might be using the API through a desktop app, so an eventual limit per IP would work far better than an overall one.
About the link back to the site - I don't see anything in the API results to link back to and just a link to the site will probably get a lot lower rate of clicking than a link with context to the search... Perhaps, it will be a nice idea to add a link_back_url (just url?) for every result, so people can navigate to "searchcode.com/codesearch/view/n"?
Ah, that's a good point. I meant just link back to searchcode itself, but that makes sense. Added for you. It should appear on new searches, and on all of them once the cache falls off (I don't want to flush it right now while it's getting so much traffic).
There was a code search engine wave a few years back which I thought was a decent idea, but they apparently never gained much traction (Google Code Search, Krugle, Koders come to mind). One thing I don't recall seeing, and which is a good call, is the real-world examples.
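As a sketch of the per-result link idea, this is roughly what attaching a link-back URL could look like in Python. The `id` and `filename` fields, the `url` field name and the https scheme are assumptions for illustration; only the "searchcode.com/codesearch/view/n" path shape comes from the comment above, and the real API response may differ.

```python
# Assumed base URL; only the /codesearch/view/<n> shape is taken from the thread.
VIEW_BASE = "https://searchcode.com/codesearch/view/"

def add_link_back(results, base=VIEW_BASE):
    """Return copies of the result dicts with a 'url' field added
    (hypothetical helper; the 'id' key is an assumed field name)."""
    return [{**result, "url": f"{base}{result['id']}"} for result in results]

results = [{"id": 42, "filename": "example.py"}]
for result in add_link_back(results):
    print(result["filename"], result["url"])
```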
Congratulations! Excellent usability. The user interface is pretty nice too. It would be nice being able to use it for a regular intranet search engine.
Howdy. That's mostly because searchcode is trying to match "eval($_GET" exactly and currently has no such match in the index. Looking at GitHub, it seems to come back with only 4 results which are exact, and the rest are loose matches.
I am seeing a pattern that most people would prefer the loose match over exact to get back more results, but with the exact matches (if any) at the top.
I will take this on board and see if I can improve things.
I would definitely expect the Google pattern of doing an exact match when in quotes. Throw in a regexp engine as well, because I tried using wildcards and that didn't do anything.
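The "loose results, exact matches on top" behaviour people are asking for could be sketched like this in Python. It is a toy illustration over in-memory strings, not searchcode's or GitHub's actual ranking; the function name and the rule that a loose match must contain every query term are assumptions.

```python
def rank(query, documents):
    """Return matching documents, exact phrase matches first.

    Hypothetical helper: a 'loose' match is assumed to be any document
    that contains every whitespace-separated query term somewhere.
    """
    terms = query.split()
    exact, loose = [], []
    for doc in documents:
        if query in doc:
            exact.append(doc)
        elif all(term in doc for term in terms):
            loose.append(doc)
    return exact + loose

documents = [
    "val q = coll.find(MongoDBObject())",
    "MongoDBObject find usage notes",
    "something unrelated",
]
for doc in rank("MongoDBObject find", documents):
    print(doc)
```

An exact-only mode then just means returning the first list and dropping the second.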
For example, if searching for vulnerabilities I would like to be able to do something like:
It's called Desktop Search [1] or Enterprise Search [2].
If you want to create your own solution, the easiest way is using Sphinx Search [3] (searchcode.com uses it too [4]) and a bit more advanced with Lucene [5] (and its related sub-projects) or Xapian [6].
If you want your personal Google Code Search [7] with its powerful scalable Regex functionality, Russ Cox published the related code [8].
Someone else mentioned OpenGrok, but if you really want I can look into providing a searchcode implementation. Just email me and we should be able to work something out.
I actually think the new koders (code.ohloh.net) is an improvement over the old koders in many ways, but I wanted a leaner implementation.
Also, while ohloh was working on the new version, koders was becoming stale, hence starting my own. Lastly, I thought that this sort of service should provide an API, which ohloh sadly does not (hopefully this will change).
Ooooh, good point. I'm an engineer for Black Duck and in charge of Ohloh.net. We have an API for most everything available on Ohloh.net (Organizations, which is still in Beta, is an exception and we're going to fix that real soon).
We're working on an improved code search and I want to leverage this new infrastructure, which should start being available for our internal development in the next few weeks, to make code search a top-level citizen within Ohloh. For example, searching for key phrases within a project right from the Project page.
I'll add some stories to the back log to see what we can do to make the code search itself API accessible.
You mentioned "leaner implementation"; could you expand on that a bit? I'd love to hear your thoughts.
Sounds good to me. An exposed API is a big boon, and the lack of one is one of the main reasons I started searchcode.
Sure, just email me at bboyte01@gmail.com and I'll be happy to discuss. Mostly it comes down to not overcrowding the UI and running on minimal hardware.
For other examples of code search, check out: Ohloh: http://www.ohloh.net/, Krugle: http://opensearch.krugle.org/, and Google code search (or what remains of it) https://code.google.com/p/chromium/codesearch