Complementing a 15k star open source NLP library with commercial developer tools

Hello, who are you and what are you working on?

Hi, I'm Ines. I'm a developer and the co-founder of Explosion, a software company specializing in developer tools for AI and Natural Language Processing (NLP).

We develop spaCy, which is an open-source library for production NLP in Python. If you're working with lots of text, you eventually want to know more about it and extract information that goes beyond just keyword search. For example: What's the text about? What persons, companies and other concepts are mentioned? What are the relationships between them? spaCy lets you build powerful pipelines to do exactly that, using neural network models and rule-based approaches.

We also develop Prodigy, an annotation tool for creating training data for machine learning models and our first commercial product.

Why did you start working on spaCy?

I met my co-founder Matthew Honnibal in 2014, shortly after he'd left academia to focus on writing spaCy. We quickly realized that our skills were a great fit and that there's a lot we could be doing together. Our first project was an interactive visualizer for syntactic dependencies – so basically, how words in a sentence are related. It's still one of our most popular demos to date.

I've always been interested in language (I did linguistics as part of my degree), and I've always been interested in computers. NLP is a great field with no shortage of interesting problems to solve, and it lets me combine all the different skills I have and the things I'm into.

What were the early days like?

The very early days were mostly just Matt writing the foundations of the library. There was a very long incubation period before the first release, to get the design right. Most of the underlying structures developed during this period still haven't changed, by the way! After the launch, there was a lot of experimentation: we built various demos, tried out different approaches for the documentation and worked on fine-tuning the statistical models.

How have you grown its usage?

Usage of natural language processing in general has grown a lot, which is reflected in spaCy's growth. We try to make it as easy as possible for people to get started, provide good documentation and make sure that we're always communicating things clearly.

How do you manage the workload and the community?

First, we make sure to keep the scope of the project clear and keep a clear vision of what belongs in the core library and what doesn't. For instance, spaCy focuses on analyzing text and provides the data structures and model implementations for a range of NLP tasks. Very specific components or alternative implementations are a better fit for extensions and plugins. Having a clear scope means that it's much easier to know what to work on and pay attention to.

We also focus on making the docs as useful as possible and having the library raise helpful errors. If a project is well-documented, both in terms of usage and API, this also greatly reduces the questions and issues that come up, because users can look things up and find answers themselves. spaCy is a developer tool, and developers typically like helping themselves. If something goes wrong and the tool you use tells you exactly what the problem is and how to fix it, that's great! You can keep working, and the developers of the tool don't have to answer an issue and tell you what to do.
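As a rough illustration of an error that says what went wrong and how to fix it, here is a minimal Python sketch. It is not spaCy's actual code: the component names, the UnknownComponentError class and the get_component function are all hypothetical, made up for this example.

```python
import difflib

# Hypothetical list of registered component names, for illustration only.
KNOWN_COMPONENTS = ("tagger", "parser", "ner")


class UnknownComponentError(ValueError):
    """Raised when a requested component name is not registered."""


def get_component(name):
    """Return the component name if it is known, else raise a helpful error."""
    if name in KNOWN_COMPONENTS:
        return name
    # Say what went wrong and list the valid options.
    message = (
        f"Unknown component '{name}'. "
        f"Available components: {', '.join(KNOWN_COMPONENTS)}."
    )
    # If the name looks like a typo, suggest the closest valid name.
    close = difflib.get_close_matches(name, KNOWN_COMPONENTS, n=1)
    if close:
        message += f" Did you mean '{close[0]}'?"
    raise UnknownComponentError(message)


try:
    get_component("parsr")
except UnknownComponentError as err:
    print(err)
```

The message names the bad input, lists what would have worked and, where possible, points at the likely fix, so the user can keep going without opening an issue.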

You are a co-founder of a company, Explosion, that houses your open source work. How does the company interact with the open source work?

The plan for spaCy has always been to turn it into a business. We decided against the classic "open core" model where you make the core functionality free, and charge money for additional features. We also didn't want to make money by providing commercial support, since this leads to very problematic and misaligned incentives: if the tool is too easy to use and the docs are too good, nobody will need your services. But if it's difficult to use and has bad docs, you'll lose users, who are also your potential customers.

That's why Explosion is a developer tools company. Some of our tools, like spaCy, are free and open-source, and some of our tools, like Prodigy, are commercial. Our free tools like spaCy play a very important role in our business. Companies who are using spaCy in production are also likely training their own models, so they might find Prodigy useful as well to create training data more efficiently. Making spaCy free and open-source also lets us show that we can build good software. Without anything to show, it'd be quite difficult to convince companies to buy other products from us.

How much time do you devote to spaCy?

spaCy is part of my job, and it's difficult to separate it from all the other tasks and estimate how much time I spend on it. I typically divide my time between open-source development (including spaCy), development of our commercial products (Prodigy and our upcoming tools), communications and managing the company. We now also have two more spaCy core team members, Sofie and Adriane.

What are the biggest obstacles you've had to overcome?

Starting a company is always hard, and there's a lot to learn the first time you do it. For the first six months, we did consulting to bootstrap the company, and balancing that with our open-source work and product development was especially tricky. So it was a big relief when we managed to release Prodigy, because that let us continue with funding purely from software sales.

I guess one thing we've had to learn is how useless most advice is: everyone's situation is different, and even people who've been successful aren't necessarily right about why that happened. So even though there's a lot to learn, and other people have done similar things, there's not really a cheat sheet you can look at.

What is your biggest current problem/frustration in your open source work?

I think the most common frustrations for a library like spaCy are the maintenance tasks associated with managing dependencies, supporting lots of operating systems and versions, and release processes. The build matrix for spaCy and the rest of our packages is pretty big: we have to consider all combinations of operating systems, Python versions, GPU vs. CPU, pip vs. conda, etc. We've now got this pretty well sorted out (and open-sourced part of our solution), but it's taken a lot of work. At one point we had five different CI systems. It's helped a lot to move our CI entirely to Azure Pipelines. We also really appreciate the steady improvements to setuptools and pip that the PyPA folks have been making.
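To give a sense of how quickly such a build matrix grows, here is a small Python sketch that enumerates every combination of the four axes mentioned above. The specific operating systems and Python versions are assumptions chosen for illustration; the interview does not list the actual values, and build_matrix is a hypothetical helper.

```python
from itertools import product

# Hypothetical axis values, for illustration only; the real matrix
# used for spaCy is not specified in the interview.
BUILD_AXES = {
    "os": ["linux", "macos", "windows"],
    "python": ["3.6", "3.7", "3.8"],
    "device": ["cpu", "gpu"],
    "installer": ["pip", "conda"],
}


def build_matrix(axes):
    """Return one dict per combination of axis values."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


matrix = build_matrix(BUILD_AXES)
print(len(matrix))  # 3 * 3 * 2 * 2 = 36 configurations
print(matrix[0])
```

Even with only two or three values per axis, that is already 36 configurations to build and test, and each new Python version or platform multiplies the total.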

What are your hopes for the future of spaCy?

We hope the library can keep being useful to developers, especially developers who have a mix of different experiences and who need to get something done with NLP. We also want to make sure we offer solutions that are up-to-date with current NLP technologies. This can be challenging, as we also want to avoid big breaking changes or big redesigns. I think one specific transition that we'll be making is accommodating user-created models better. When the library first started, relatively few users wanted to train and define their own models. Now that's much more expected, so we want to make sure spaCy is a good way to integrate lots of models and components you've defined yourself, using any framework of your choice.

What advice do you have for other open source projects and maintainers?

One question we get asked a lot at conferences, especially by newer developers, is "How can I get started contributing to open source?". I think the primary question to ask is "What are your goals?". There are lots of different types of open-source projects and different ways to contribute. One thing I'd definitely push back against a bit is the idea that contributing to open-source is necessarily a great resume-building strategy. It can be, if done strategically, but this advice can really mislead people into spending a lot of time for pretty unclear benefits.

Overall I'm pretty reluctant to give advice, as situations differ. If I have to though, one thing I'd say is, it really helps everyone if maintainers set expectations around their project. For spaCy, the project was always intended to be commercial, so the message from the start was, "You can count on this". In other words, we made a promise to the community that we would continue developing this, and it would've been a breach of trust to tell people "You know, this no longer sparks joy. Sorry, we're out!". Even though the library was and is free, we do owe something to the community – because we made a promise, and we've benefited from the resulting trust. But that's definitely not all open source projects! A lot of other open-source projects are entirely different. Problems happen when the community thinks they're owed something, and the maintainer disagrees. So it's very helpful to set expectations.

To find out more about spaCy, you can visit the website, follow the project on Twitter or contribute to the code on GitHub. To find out more about the company, Explosion, including their annotation tool, Prodigy, you can visit the website or follow them on Twitter.

To find out more about Ines, you can follow her on Twitter or visit her website, where she writes about her work.