The present and future of semantic code search

Hello, who are you and what are you working on?

Hi, I'm Hamel Husain. I'm a Machine Learning Engineer at GitHub, where I focus on applications of natural language processing for developer tools. Before this, I was a data scientist at Airbnb where I worked on growth marketing and before that, I was a data scientist at DataRobot. Recently, I led a project that open sourced a large corpus of code and text for the machine learning community called CodeSearchNet.

Why did you start working on CodeSearchNet?

I've always been interested in building tools to help developers and people like myself automate mundane or tedious parts of their workflow. One of the reasons I joined DataRobot was that I wanted to automate some of the mundane aspects of machine learning. Eventually, I was drawn back to the same theme at GitHub where I focus on trying to automate or enhance the software development workflow more generally (which is a superset of the data scientist workflow).

More recently, I've taken a keen interest in how GitHub can automate parts of the data science workflow specifically, so you can say that I've come full circle. One thing that I have realized about myself during this journey is that I love solving problems that frustrate me as a software engineer and data scientist and automating those parts of my job that I do not find as enjoyable.

About a year ago, I began working with very talented colleagues on Natural Language Semantic Code Search, and we realized then that the public dataset we were using would be very valuable to the community if prepared correctly. That is the genesis of CodeSearchNet.
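At its core, natural language semantic code search means embedding a query and a set of code snippets into a shared vector space and ranking snippets by similarity. The toy sketch below is not the CodeSearchNet model (the actual baselines use learned neural encoders); it only illustrates the embed-and-rank idea, with a bag-of-words encoder as a stand-in:

```python
# Toy sketch of embedding-based code search: encode the query and each
# snippet into a shared vector space, then rank snippets by cosine
# similarity. The bag-of-words "encoder" is a stand-in for a learned model.
import math
from collections import Counter

def encode(text):
    """Stand-in encoder: a bag-of-words frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, corpus):
    """Return snippets ranked most-to-least similar to the query."""
    q = encode(query)
    return sorted(corpus, key=lambda code: cosine(q, encode(code)), reverse=True)

corpus = [
    "def read file to string path",
    "def sort list ascending order",
    "def open socket and send bytes",
]
results = search("read a file", corpus)
```

A real system replaces `encode` with a trained model so that semantically related query/code pairs land near each other even without shared tokens, which is exactly what lexical matching like this cannot do.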

What were the early days like?

For the aforementioned project, the GitHub team had the opportunity to partner with some very talented people at Microsoft Research: Miltos Allamanis and Marc Brockschmidt, who are pioneers in the field of code understanding. The project involved coordination between GitHub and Microsoft Research, which was surprisingly painless and which I attribute to their amazing personalities and intelligence.

We first explored techniques to solve Natural Language Semantic Code Search and scale the approach to many languages. However, we soon realized that a tractable solution required a larger effort and that it made sense to leverage the open source and research communities to drive further progress. This started as a grassroots effort inside of GitHub and went through a fair number of iterations. When it was presented to GitHub's CEO, he correctly pointed out that we needed to go back and include GitHub's most popular language (JavaScript) as well as make several modifications to the framing of the problem. Even though it was hard to take a step back and refactor our project, it was the right thing to do and I believe the project ended up being more polished and defensible because of this.

How do you manage the workload among different contributors?

Since this started as a grassroots effort among a team of five people, everyone contributed at their own pace and naturally settled into the aspects of the project where they were strongest.

One new piece of technology that was instrumental in the collaboration process was Weights & Biases, which is an experiment tracking system for machine learning. This helped us maintain our sanity in terms of understanding model progression and reproducing each other's results. I don't think this project would have been possible without Weights & Biases because collaborators were each using their own infrastructure for training models and this system allowed us to see each other's results and track progress in a common forum.

What are the biggest obstacles you've had to overcome?

The biggest obstacles were differences in time zones and taking care to respect, and adhere as closely as possible to, the software licenses covering the data we were using from public repositories. This required multiple iterations with policy and legal experts, which was extremely time consuming but ultimately necessary.

What do you hope the community does with CodeSearchNet?

We hope that the community can use this dataset to improve developer tools generally, which may include semantic code search. We hope that the state of the art in representation learning of code advances now that researchers and practitioners have a common dataset and a forum in which to discuss results. We also hope that the uniqueness of the dataset will inspire the community to uncover new approaches and techniques for code and natural language understanding. Lastly, there is generally a dearth of large, interesting datasets for people entering the field to experiment with, and we hope this dataset helps make that experience more enjoyable.

What advice do you have for other open source projects and maintainers?

Just begin. Don't be afraid if you feel that your work is not good enough as there is always room to improve. Solve your own problems and see if you can generalize the solutions (even if just a little) so that other people can use your tools to help them when they encounter the same situation. I always like to start with the end in mind, so I imagine what the blog post or announcement would look like, think about why I am excited by a creation and then work backward.

Of course, I am still learning myself, so please take all of this advice with a grain of salt.

To find out more about semantic code search, take a look at GitHub's blog post announcing the CodeSearchNet Challenge, or Hamel's previous post on Medium. The GitHub repository is looking for contributors to help with documentation, bug fixing, and improving the data pre-processing pipeline, so consider getting involved if you want to advance the state of the art on code search.

To find out more about Hamel and his work, you can follow him on Twitter and GitHub, read him on Medium, or connect with him on LinkedIn.