Searchcode: A source code search engine

beyang · on June 25, 2014

Neat project, thanks for sharing!

We've been working on another code search engine for open-source: https://sourcegraph.com/. Our approach is a little different -- we parse and index things at a programming language level. This has the benefit of being able to list usage examples of a function or jump to a definition (i.e., smart IDE-like behavior). Obviously, one drawback is having to implement language-specific support; right now Sourcegraph just supports Python, Javascript (node.js) and Go.

Would love to hear people's feedback on either approach!

For other examples of code search, check out: Ohloh: http://www.ohloh.net/, Krugle: http://opensearch.krugle.org/, and Google code search (or what remains of it) https://code.google.com/p/chromium/codesearch

boyter · on June 26, 2014

Thanks.

I have been watching sourcegraph.com for a while. As you say, you do much deeper and more expensive parsing of the code, which is not really feasible at the moment for searchcode since it has 90 languages. It's also a constraint that I don't have the hardware to support this (searchcode is very lean).

My goal was to provide codesearch for sources other than Github as there is so much code out there to find. Github does an excellent job indexing and I am certain that Bitbucket will be following with their own implementation soon.

I too would love to know what people prefer: a deeper analysis of the code with IDE support, or more breadth. I can see both being useful.

pfraze · on June 26, 2014

Maybe a merge?

Reich · on June 26, 2014

Don't --force it

zo1 · on June 26, 2014

I'm curious, are there any plans for having offline-type functionality? I am especially curious because not everyone stores their code on public/cloud repositories. Especially source code that is for private contracts and government agencies which have strict rules regarding the source code.

sqs · on June 26, 2014

Yes. We've had a number of large and small companies reach out to us (Sourcegraph) about this. It's definitely something we plan to support, after we make an awesome public product for open source code. Email me if you want to learn more (sqs at sourcegraph dot com).

rane · on June 26, 2014

If you compare these two search results:

https://searchcode.com/?q=MongoDBObject+find+lang%3AScala

https://github.com/search?q=MongoDBObject+find+language%3Asc...

I find Github's results much, much more digestible and to the point than searchcode's.

boyter · on June 26, 2014

Yes, that's my fault. What's happening under the hood is searchcode is trying to match "MongoDBObject find" exactly, whereas Github is going for a looser but phrase heavy search.

It's something I am going to change based on feedback, as clearly what I am doing is not what people are expecting. I will however make it an option, so you can choose exact match (current) or loose.

rane · on June 27, 2014

Thanks, would be great to have searchcode up to par with Github's search.

I would pay special attention to the yellow highlights. It's only confusing if the technique is overused.

boyter · on June 29, 2014

I am modifying it now to be a little more like github's.

I had planned on removing the highlights too; it's going to be modified to just affect the portion where the match occurred. It's mostly there as a hangover from when I was not highlighting the actual match, just the line.

JoshTriplett · on June 25, 2014

Handy! However, this seems to do fuzzy searches by default, and I don't see an obvious way to disable that.

I tried searching for "xcb_connect", and got a pile of results for "xcb_connection_t". However, unlike with other search engines, I can't quote the term to force an exact match, because it'll search for the quotes literally (which I'd otherwise appreciate when trying to find a string).

boyter · on June 26, 2014

Hi. That's actually a very good point. I need to have a think on how to allow forcing exact matches, perhaps an advanced option which does this for you. Thanks for the feedback.

JoshTriplett · on June 26, 2014

As a first pass, a "match whole words" checkbox should suffice. Eventually you might want a syntax that allows more flexibility, whether full regexes or something more scalable, but "match whole words" would solve a large fraction of the problem. (You might even want to make that the default, and have a checkbox for "match partial words" instead; for code search, I'd bet that most of the time you want the whole word matches.)
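To make the "match whole words" idea concrete, here is a minimal sketch using Python's `re` module. The function name `find_matches` and the sample line are made up for illustration, and this is not how searchcode implements matching.

```python
import re

def find_matches(term, text, whole_word=True):
    """Return the start offset of every occurrence of term in text.

    With whole_word=True the term must not touch another identifier
    character (letter, digit, or underscore) on either side.
    """
    pattern = re.escape(term)
    if whole_word:
        # Lookarounds act like \b but also behave sensibly when the
        # term itself starts or ends with a non-word character.
        pattern = r"(?<!\w)" + pattern + r"(?!\w)"
    return [m.start() for m in re.finditer(pattern, text)]

source = "xcb_connection_t *c = xcb_connect(NULL, NULL);"
print(find_matches("xcb_connect", source, whole_word=True))
print(find_matches("xcb_connect", source, whole_word=False))
```

With whole-word matching on, only the call to `xcb_connect` is reported; with it off, the prefix of `xcb_connection_t` is reported as well.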

boyter · on June 26, 2014

That sounds reasonable. I will probably go with that. Added to the queue and thanks for the feedback.

LukeShu · on June 26, 2014

Perhaps support regex search? Russ Cox did a write up of how he implemented regex search for Google Code, back when that was a thing.

boyter · on June 26, 2014

Yes I have looked at that. The catch is it would require another copy of the index, and I don't quite want to front the cash for that.

I did write some code which would turn a regex query into a normal sphinx query some time ago so I am going to try to implement that again and hopefully get similar results.
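For readers unfamiliar with the idea, the following is a heavily simplified sketch of regex search over a trigram index, loosely in the spirit of the Russ Cox write-up mentioned above. It is not searchcode's or Google's code: all names are hypothetical, and it only extracts required trigrams from patterns made of plain word-character pieces joined by `.*`, falling back to a full scan for anything more complex.

```python
import re
from collections import defaultdict

def trigrams(s):
    """All 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(docs):
    """Map each trigram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tri in trigrams(text):
            index[tri].add(doc_id)
    return index

def required_trigrams(pattern):
    """Trigrams every match must contain, or None if the pattern is
    too complex for this sketch (anything beyond literals and '.*')."""
    required = set()
    for piece in pattern.split(".*"):
        if re.fullmatch(r"\w*", piece) is None:
            return None
        required |= trigrams(piece)
    return required

def search(pattern, docs, index):
    """Use the index to narrow candidates, then confirm with the regex."""
    candidates = set(docs)
    required = required_trigrams(pattern)
    if required:
        for tri in required:
            candidates &= index.get(tri, set())
    rx = re.compile(pattern)
    return sorted(d for d in candidates if rx.search(docs[d]))

docs = {
    "a.c": "xcb_connection_t *c = xcb_connect(NULL, NULL);",
    "b.c": "pthread_t worker;",
    "c.c": "reply = xcb_get_property_reply(c, cookie, NULL);",
}
index = build_index(docs)
print(search(r"xcb_.*_reply", docs, index))
print(search(r"xcb_(connect|get)", docs, index))
```

The index only prunes documents that cannot match; the real regex still runs on the survivors, so the results are the same as a full scan, just cheaper to compute.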

alecthomas · on June 27, 2014

A better solution would probably be to just rank exact matches higher.
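As a rough illustration of that suggestion, here is a small sketch that accepts loose matches but sorts exact phrase matches to the top. The scoring rule and the sample documents are assumptions made up for this example, not anything searchcode or Github is known to do.

```python
def rank(query, docs):
    """Return ids of docs matching any query term, exact phrase first."""
    terms = query.lower().split()
    phrase = " ".join(terms)
    scored = []
    for doc_id, text in docs.items():
        lowered = text.lower()
        hits = sum(1 for t in terms if t in lowered)
        if hits == 0:
            continue  # no term present at all: not a result
        exact = phrase in lowered
        scored.append((exact, hits, doc_id))
    # Exact phrase matches first, then more matching terms, then id
    # so the ordering is deterministic.
    scored.sort(key=lambda s: (not s[0], -s[1], s[2]))
    return [doc_id for _, _, doc_id in scored]

docs = {
    "a": 'coll.find(MongoDBObject("name" -> "x"))',
    "b": "// MongoDBObject find example",
    "c": "import com.mongodb.casbah.commons.MongoDBObject",
    "d": "println(hello)",
}
print(rank("MongoDBObject find", docs))
```

Document `b` contains the phrase verbatim and comes first, `a` contains both terms but not adjacent, `c` contains only one term, and `d` is dropped.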

fbellomi · on June 26, 2014

That's a very interesting project.

I've been working on another tool, http://crossclj.info, focused on cross-referencing a large part of the open-source ecosystem for Clojure and ClojureScript (some 3600+ projects).

You can browse both the source code (jumping to definitions across projects) and the auto-generated docs of the whole codebase.

srcmap · on June 26, 2014

Very neat! Fast search of source code is also a personal itch of mine. I also work on a site/webapp that lets you search source code with a mouse click.

http://www.srcmap.org/s/sl.htm/p=about#c=ABOUT

It cross-references any string in any text file inside a project.

I indexed the linux and bsd kernels, android AOSP (JB), openstack, golang, raspberrypi_userland, elasticsearch, nodejs and a few other projects with it on that site.

chdir · on June 26, 2014

I use sourcegraph occasionally and mostly rely on Github search. I wish the search had all those advanced refinement options that grep & Sublime Text search have. Some examples would be to use regex, search a word within a scope of lines, search within search results etc. Additionally, it's very useful to be able to sort the search results by stars/forks. Sometimes I just want to see how popular projects have implemented a certain feature. A keyword based search isn't enough for that.

I guess these features are very expensive & slow to implement but it would be super useful if it can be achieved. Source code search is for geeks so it is probably fair to say that a truly advanced & complex interface won't turn away users.

boyter · on June 26, 2014

Sounds like what I want to achieve with searchcode. I am working on most of what you have listed there. If you want to email me with your ideas on how these features should work I would be more than happy to add them. My email is in my bio and listed on searchcode.

chdir · on June 26, 2014

That sounds great. If I have something useful to contribute, I'll get in touch.

laurent123456 · on June 26, 2014

This is nicely implemented; however, as with most code search engines, I'm wondering what the exact use case is. When I want to search for a function, I usually try Google and find sample code on forums, GitHub or Stack Exchange sites along with detailed information and discussions about it. What additional feature would searchcode provide?

boyter · on June 26, 2014

Personally I use it quite a lot when looking for implementations. I also used code search before searchcode as well to do this, so it fits into my workflow.

However I also use DuckDuckGo, and a !code redirects to searchcode for this so there is that.

Possibly it depends on how you learn and use things. I find examples useful. For instance, when looking at how to do things in Python's Fabric I would rather see some examples than read about how someone else is doing it.

frik · on June 26, 2014

Sphinx and searchcode: http://www.boyter.org/2014/06/sphinx-searchcode/ (by Ben E. Boyter, searchcode.com)

arafalov · on June 26, 2014

I like the search engines and discoverability. My own project - much, much smaller - is for embedding search into the Javadocs themselves (along with some SEO improvements, like iframes and better meta-info).

The search is driven by Apache Solr and can be seen at: http://www.solr-start.com/javadoc/solr-lucene/index.html . I use custom doclet to generate the actual Javadoc, which is an interesting challenge all by itself.

skybrian · on June 26, 2014

GWT is not on Google Code anymore (well we were, but it's old code). We're at gwt.googlesource.com.

I imagine you'll want android.googlesource.com and other googlesource.com repos.

Also, ranking is really important. My usual test is to search for java.lang.String and see what comes up. If the first result isn't the String class from some version of the Java SDK, something is wrong :-)

Also: CrossSiteIframeLinker finds the right file on the first page (though not at the top), but CrossSiteIframeLinker.java has no results.

boyter · on June 26, 2014

Thanks. I'll update the details.

The reason you aren't seeing String from Java is that by default it actually goes for files using the String class, rather than the implementation. It's something I need to work on though, as I agree it should pop up near the top, or possibly appear in the documentation listing at the top.

As for CrossSiteIframeLinker, I don't index the filenames, although I am considering it for these sorts of cases.

ch · on June 25, 2014

Well. They don't break tokens on '_' which is a plus (tried searching for pthread_t), and they seem to prioritize definitions over declarations.

This could become a useful code search option.

I would request that relevancy should take into account provenance. Meaning a search for pthread_t would return pthread implementations over uses.

This obviously can't make use of traditional tfidf for scoring.

boyter · on June 26, 2014

Thanks. Yes, all characters are actually indexed, with some logic to split intelligently when required. It will always go for the most exact match first though.
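For illustration only, a tokenizer that keeps identifiers such as `pthread_t` whole while still indexing punctuation could look roughly like the sketch below. The splitting rule here is an assumption and is certainly simpler than whatever searchcode actually does.

```python
import re

# An identifier-like run (letters, digits, underscore) is one token;
# every other non-space character becomes a token of its own.
TOKEN = re.compile(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]")

def tokenize(line):
    """Split a line of code into tokens without breaking on '_'."""
    return TOKEN.findall(line)

tokens = tokenize("pthread_t tid; pthread_create(&tid, NULL, run, NULL);")
print(tokens)
```

Because `_` is part of the identifier class, a search for `pthread_t` can be matched against a single token rather than the fragments `pthread` and `t`.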

I have started looking at ranking the main implementation over usage, but found the results were less useful generally. This may have just been me though, as I wanted to see usage rather than the implementation, since if I know that I will just go to the source.

Perhaps an option on an advanced search page to request that the original source is given greater weight would solve this.

ch · on June 26, 2014

Perhaps simply an info box which points to the implementation, set off from the usage results. No reason to give the results equal screen real-estate.

tsenkov · on June 26, 2014

Does anyone know details on the API - what is the request quota? Will there be api-keys, or just anonymous users to the api?

I am interested in how it works with Github search. Their API allows something like 5 unauthenticated req/min and 20 req/min if authenticated (at least the search API)?

Congrats to @boyter.

boyter · on June 26, 2014

Hi tsenkov,

There is no request quota at the moment, although if I discover any abuse I may end up rate limiting per IP address (which I really do not want to do as I don't currently track IPs). No API key is required either. Neither of these is likely to change.

I only ask for a link back to the site and that you don't spam it. I want to operate by the motto "Be excellent to each other".

If you want to pass a referrer in the GET request so I know who's using the API that would be nice, but not required.

searchcode has no integration with Github search at all. I am unlikely to add it either as I believe they do an excellent job, hence the link on the right hand side to be redirected to Github search.

Thanks! Hope you find a nice use for the API. If so, let me know as I am always happy to hear about usage.

tsenkov · on June 26, 2014

Thanks. This is an awesome plan. I might be using the api through a desktop app, so eventual limit per IP would work far better than overall.

About the link back to the site - I don't see anything in the API results to link back to and just a link to the site will probably get a lot lower rate of clicking than a link with context to the search... Perhaps, it will be a nice idea to add a link_back_url (just url?) for every result, so people can navigate to "searchcode.com/codesearch/view/n"?

boyter · on June 26, 2014

Feel free to go nuts.

Ah, that's a good point. I meant just link back to searchcode itself, but that makes sense. Added for you. It should appear on new searches, and on all of them once the cache falls off (I don't want to flush it right now while it's getting so much traffic).

tsenkov · on June 26, 2014

Awesome. Thank you. :)

pbreit · on June 26, 2014

There was a code search engine wave a few years back which I thought was a decent idea but they apparently never gained much traction (Google Code Search, Krugle, Koders come to mind). One thing I don't recall seeing which is a good call are the real world examples.

sevko · on July 2, 2014

A similar (proof of concept) application: http://bitshift.it/ Demo video: https://vimeo.com/98697078

rubycodesearch · on June 26, 2014

I made a site that is dedicated to Ruby developers. It's mostly about searching a RubyGem's source code. RegExp supported.

http://rubycodesearch.com

eng_monkey · on June 26, 2014

Congratulations! Excellent usability. The user interface is pretty nice too. It would be nice being able to use it for a regular intranet search engine.

boyter · on June 26, 2014

Actually you can if you set your default search provider in your browser. All DDG bang searches are supported so it's actually possible to do.

adzicg · on June 25, 2014

just tried searching for a few obscure things I can find through Github and the site couldn't find anything. meh... not much of an index then

zapt02 · on June 25, 2014

I tried searching for: eval($_GET

No results.

Same search on GitHub gives over 100 000 results for PHP. Doh.

https://github.com/search?q=eval%28%24_GET&ref=searchresults...

boyter · on June 26, 2014

Howdy. That's mostly because searchcode is trying to match exactly "eval($_GET" and currently has none in the index. Looking at GitHub it seems to come back with only 4 results which are exact and the rest as loose matches.

I am seeing a pattern that most people would prefer the loose match over exact to get back more results, but with the exact matches (if any) at the top.

I will take this on board and see if I can improve things.

zapt02 · on June 26, 2014

If you search for an exact match you still get almost 600 results, see:

https://github.com/search?q=%22eval%28%24_GET%22&type=Code&r...

I would definitely expect the google pattern of doing exact match when in quotes. Throw in a regexp engine as well, because I tried using wildcards and that didn't do anything.

For example, if searching for vulnerabilities I would like to be able to do something like:

  eval(.*$_[(GET|POST)].*)
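Taken literally, the pattern above would not do what is intended in most regex dialects, because `(`, `)`, `$` and `[` are metacharacters. A sketch of an escaped equivalent in Python's `re` syntax follows; the sample lines are invented for the example, and the pattern is only a rough filter, not a real vulnerability scanner.

```python
import re

# Literal "eval(", then anything, then $_GET or $_POST, then anything
# up to a closing ")". Non-greedy .*? keeps each match as short as possible.
SUSPICIOUS_EVAL = re.compile(r"eval\(.*?\$_(?:GET|POST).*?\)")

lines = [
    'eval($_GET["cmd"]);',
    'eval("1 + " . $_POST["n"]);',
    'echo $_GET["page"];',
    'eval($safe_expression);',
]
flagged = [line for line in lines if SUSPICIOUS_EVAL.search(line)]
print(flagged)
```

Only the first two lines are flagged: the third has no `eval(` and the fourth never touches `$_GET` or `$_POST`.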

boyter · on June 26, 2014

Sorry about that. I don't actually index as much of Github as they have their own index. I mostly focus on Bitbucket and Codeplex.

If you could supply some specific examples that would be very useful though.

splitbrain · on June 26, 2014

A bit off-topic. But are there any open source projects like this that can be self-hosted, working on a largish local directory of projects?

frik · on June 26, 2014

It's called Desktop Search [1] or Enterprise Search [2].

If you want to create your own solution, the easiest way is using Sphinx Search [3] (searchcode.com uses it too [4]) and a bit more advanced with Lucene [5] (and its related sub-projects) or Xapian [6].

If you want your personal Google Code Search [7] with its powerful scalable Regex functionality, Russ Cox published the related code [8].

[1] http://en.wikipedia.org/wiki/Desktop_search

[2] http://en.wikipedia.org/wiki/Enterprise_search

[3] http://en.wikipedia.org/wiki/Sphinx_(search_engine)

[4] http://www.boyter.org/2014/06/sphinx-searchcode/

[5] http://en.wikipedia.org/wiki/Lucene

[6] http://en.wikipedia.org/wiki/Xapian

[7] http://en.wikipedia.org/wiki/Google_Code_Search

[8] http://swtch.com/~rsc/regexp/

ajhai · on June 26, 2014

Take a look at http://opengrok.github.io/OpenGrok/.

boyter · on June 26, 2014

As someone else mentioned, there is OpenGrok, but if you really want I can look into providing a searchcode implementation. Just email me and we should be able to work something out.

slashdotaccount · on June 26, 2014

You can do regex based searches of Debian here:

http://codesearch.debian.net/

talles · on June 25, 2014

Ty so much, I've been waiting for something like this ever since they ruined koders.com!

boyter · on June 26, 2014

No problem at all.

I actually think the new koders (code.ohloh.net) is an improvement over the old koders in many ways, but I wanted a leaner implementation.

Also, while ohloh was working on the new version, koders was becoming stale, hence starting my own. Lastly, I thought that this sort of service should provide an API, which ohloh sadly does not (hopefully this will change).

PDegenPortnoy · on June 27, 2014

Ooooh, good point. I'm an engineer for Black Duck and in charge of Ohloh.net. We have an API for most everything available on Ohloh.net (Organizations, which is still in Beta, is an exception and we're going to fix that real soon).

We're working on an improved code search and I want to leverage this new infrastructure, which should start being available for our internal development in the next few weeks, to make code search a top-level citizen within Ohloh. For example, searching for key phrases within a project right from the Project page.

I'll add some stories to the back log to see what we can do to make the code search itself API accessible.

You mentioned "leaner implementation"; could you expand on that a bit? I'd love to hear your thoughts.

boyter · on June 29, 2014

Sounds good to me. An exposed API is a big boon, and the lack of one is one of the main reasons I started searchcode.

Sure, just email me at bboyte01@gmail.com and I'll be happy to discuss. Mostly it comes down to not overcrowding the UI and running on minimal hardware.

voltagex_ · on June 26, 2014

grepcode.com seems to be good for Java/Android. I'm a little annoyed they have a generic domain for a language-specific search but oh well.

chewxy · on June 26, 2014

Did you abandon your .de domain?

boyter · on June 26, 2014

It's still active, but just as a redirect.

Google does not like .de domains and moving over to .com gave a massive boost in search results.

thejosh · on June 26, 2014

Does this also sort gists?

boyter · on June 26, 2014

No, it doesn't, sorry. If you want to send me an email with some details on how to implement this I would be happy to though.