Welcome to Once a Maintainer, where we interview an open source maintainer and tell their story.
This week we’re talking to James Lamb, Staff Machine Learning Engineer at SpotHero and maintainer of LightGBM, a machine learning framework out of Microsoft. James is a prolific contributor to the open source and data science communities and the co-organizer of the MLOps meetup in Chicago.
Once a Maintainer is written by the team at Infield, a platform for managing open source dependency upgrades.
How did you become a software developer?
My high school did not have computer programming, and I didn’t write any code in high school. I really liked hip hop, and I wanted to be in marketing at a record label. So I went to college for marketing. At the school that I went to, you could get a double major in the business school by taking one extra class. And from high school, I already had three credits for Econ. So purely out of laziness, I was like, I guess I'll double major in Econ. So I started taking Econ classes and I really, really liked it. So I went to graduate school. I happened to be at one of the two universities in the US at the time, Marquette, that had a terminal master's degree in Economics, so I was really lucky to get into that.
My thesis project was a time series forecasting project where I was using methods from finance to forecast the sales of frozen dinners at Dominick's grocery stores in Chicago. It was like fit a model to some data, make a couple weeks ahead forecast, save the results to an Excel spreadsheet. That's what I was using. Add another week of data, retrain, repredict. Over and over and over again like hundreds of times. And I was staring, like staring at this project with, you know, like four months to go until graduation. I was like, do I try to learn R? People do R in Econ. Or do I just grind it out in the computer lab and just like point, click, copy, in Excel over and over again. And I actually chose to grind it out. But as soon as I graduated, I was like, I'm never doing that again. I have to learn how to code.
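Editor's note: the fit, forecast, add a week, refit loop described above is often called walk-forward (or expanding-window) forecasting. The sketch below shows roughly what automating it could look like. It is only an illustration under stated assumptions: the sales figures are made up, `naive_forecast` is a hypothetical stand-in (a simple moving average) rather than the finance-derived models used in the thesis, and the function names are invented for this example.

```python
from statistics import mean


def naive_forecast(history, horizon):
    """Hypothetical stand-in model: repeat the mean of the last four weeks."""
    level = mean(history[-4:])
    return [level] * horizon


def walk_forward(series, initial_window, horizon):
    """Refit on a growing window and forecast `horizon` weeks ahead each time."""
    results = []
    for end in range(initial_window, len(series) - horizon + 1):
        history = series[:end]                       # data available so far
        forecast = naive_forecast(history, horizon)  # "retrain" and predict
        actual = series[end:end + horizon]           # what really happened
        results.append(
            {"trained_through": end, "forecast": forecast, "actual": actual}
        )
    return results


# Made-up weekly sales figures, for illustration only.
weekly_sales = [120, 132, 128, 141, 150, 147, 158, 163, 171, 169]
runs = walk_forward(weekly_sales, initial_window=4, horizon=2)

for run in runs:
    print(run["trained_through"], run["forecast"], run["actual"])
```

Each pass through the loop corresponds to one round of the manual point, click, copy process; swapping in a real model would only mean replacing `naive_forecast`.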
My first job was an economics job and while I was there I started learning R on the side from the Johns Hopkins Data Science Coursera courses, which were amazing. Those courses taught me to write my first function. I created my first plot programmatically. I read data into a program for the first time, used git for the first time. But I wanted to write code to be a better economist. So I kept taking online classes, Coursera, Codecademy, DataCamp, whatever I could find.
Eventually I did end up getting a data science job at a startup here in Chicago. And that just put me on a whole other path. I learned a lot more really fast. That was the place I wrote my first CI pipeline, first time I put a machine learning model to production, my first container image, etc.
Did you join the engineering team or did they have a separate data science team?
This is another place in my career where I just was really fortunate, right place, right time. I joined this company called Uptake. They had just gotten a large round of funding, and they were looking to hire like 35 data scientists of varying experience levels, all at once, full time in Chicago. Data science was the product at this company - we took in sensor data from industrial equipment like mining trucks, wind turbines, locomotives, stuff like that. And we built applications that were powered by models that predicted when that stuff would break so that you could do maintenance ahead of time. Maintenance ahead of time is always cheaper than “this thing is broken right now, we need to fix it” type maintenance.
So when I got in there, I joined a data science team, but we had like 70 full time data scientists in a head count of like maybe 700 people. It was a huge part of the business. And so within that team, we did a lot. When I joined, I was the most junior member of the team and I was joining as a data scientist. The expectation was that I would write R and a little bit of Python, and bash scripts to produce reports and statistical models. And then by the time I left there, I really wasn't producing models anymore. I was doing something more like what you'd expect from a staff engineer. I was writing code, but also writing long form architecture proposals. I was translating the data science team's requirements into engineering requirements. You know, that could be understood by engineers who weren't familiar with machine learning.
That feels like a major transition to make, from self taught to “I can do what I want with data”, to “I’m ready to help an entire team do data science engineering in the right way”. I’m curious how you do that.
You know I was fortunate in a lot of ways, a lot of opportunities that just kind of like came to me and I took them. I was one of the more junior members of a team of very, very talented people. I had to learn to keep up with some of those people. Absolutely brilliant. I was lucky to be in this group of really curious, just like nice, friendly, but very, very smart people who were never bothered if you were like, hey, I saw you did this thing. Like, what does that actually mean? What is Docker? How does Docker work?
At the same time I did another master's degree in data science from UC Berkeley.
Did they encourage you to do that degree or did you go to them and say, hey, I want to do this degree, and I would like to do it while I'm still working here?
No one that I worked with said you should go get a graduate degree, but they were very supportive when I did it. I was actually encouraged by a mentor of mine, who I first got in touch with doing my economics degree. At the time there wasn’t a ton of research into the exact topic I was interested in, which was applying these financial models outside of finance to a supply chain setting. There was one set of papers that I cited heavily in my work, and I reached out to one of the authors of the papers, Dr. Shoumen Palit Austin Datta. I just said hey thank you, I’m a student at Marquette, I cited your work, here it is. And amazingly, he emailed me back and had questions for me and we developed this relationship over email. He recommended and really pushed me to UC Berkeley.
It sounds like you had great mentorship on both the commercial side and the academic side, which is really rare.
I feel very fortunate and you know, very privileged to be able to take these risks. I had a strong family support system. I came out of college with less debt than the average person because my parents helped me with money for college. So I was able to take these bets, you know, like I'm going to spend all this time on Coursera courses or take out another loan to go to Berkeley. And I had the time, too. I didn't have kids that I was taking care of or elderly relatives or other obligations. So I was very fortunate in that way too. That’s a really big part of the story.
This idea of having the time to work on open source comes up often, so that’s a bit of a natural transition. What was your first exposure to open source software, and how did you start to get involved in it?
When I was at Uptake, one of my coworkers, Yuan Tang, he was already a prolific open source contributor. He was one of the maintainers of the XGBoost project. He had written the first official guide to TensorFlow in Mandarin. He had worked on just a ton of projects.
He was really passionate about open source, and he organized a little hack night for us coworkers at Uptake one night after work. I still remember, he had a little R package called dml, it was an R package for calculating distance metrics between data sets. And it was on CRAN. You know, it was like a real project. But it was small enough that it was basically only him contributing to it. And so he took the time to write up manageable sort of good first issue types of issues. And then he got the company to buy us a little bit of food and drinks. And one night after work, we just went into a conference room and he gave like a 10 minute presentation that was like, here's how to contribute to open source projects on GitHub. Specifically, here's how it's different from working at your job. Here's what a fork is, for example. You don't really think about that maybe when you work in a company. Here's the etiquette, here's the mechanism, here’s what continuous integration is.
And then for the night, he paired us up and we each picked an issue and we just worked on it. And he sort of walked around the room like a teacher and we could raise our hand and be like, hey, I have a question about this. And we all came out of that night having made our first open source contribution.
Getting that first one was what I needed to break through that mental barrier of “What do I have to contribute?” To hear him say you know, most of my first contributions were writing unit tests, fixing documentation, removing unused imports. I thought OK, I can do that. And so I started doing it. One of the tricks that I use is to take something I learned at work, for example I learned if you combine together multiple RUN statements in a Dockerfile, the resulting image is smaller because there's fewer layers. Once I learned that, I was like I'm going to go look in all the repos in the Apache GitHub project and see if there’s any there that are publishing images that I could go make that change. It’s a super small, well defined thing, and I can go in and they’ll be grateful. I don't have to understand how all of Apache Pulsar works to just like cut 50 megabytes out of their container image. I repeated that pattern a few times, where I would learn something at work, then go look at high profile projects and see if I could apply it.
It’s interesting in the data science world it sounds different from contributing to a JavaScript library or something, where it’s coming from an implementation of an idea that may be in an academic paper, versus here’s a new calendar widget or something. It seems more intimidating to me to jump into a project like that, like LightGBM, where there's a sort of deep academic foundation to it that you feel like you might need to understand. You describing here how you got into some of those projects with these well defined changes is really interesting.
This feeling of needing to understand the entire project to make a contribution, it was really important for me to get over that and to realize that's not true. There are a ton of these helpful contributions that are appreciated by maintainers that do not require you to understand the whole project, and forming pattern recognition of what those are can reduce the amount of time it takes to get contributions out there and start forming a relationship with maintainers. By the time I started contributing to LightGBM I had done maybe dozens but not hundreds of these small open source contributions. I needed to be taught the etiquette for open source, where I was originally changing things that didn't need to be changed, or I was making arbitrary style-related changes that I had no business making in a project. I had to learn those lessons. And so by the time I came to LightGBM, I understood the etiquette and how to contribute.
As you said, it’s an academic project that was then turned into software. The people running it were researchers, they were not software engineers by training. They had PhDs in machine learning and statistics. The original LightGBM paper was a C library with a C API. Shortly after that, a Python package was added on top that takes in Python data structures and then calls the underlying C stuff through that C API. When I started getting involved with the project around 2017, there was an R package that was contributed by an outside contributor, not a maintainer, but it was kind of sitting there.
The maintainers were not R programmers. They knew it a little bit, but it wasn't getting the attention that the rest of the project did. So I came in there and I had just been learning about R packaging and I saw a bunch of things right away that I felt I could help with. R is very particular about how you structure your documentation, how you import third party dependencies, all this stuff. So I started just submitting pull requests related to the R package, and I tried to explain in each of those what I was doing because I knew the maintainers weren’t R programmers. And again, I was very fortunate timing-wise. The project was sort of at an inflection point in its popularity, and the maintainers were feeling the strain of all these feature requests, bug reports. And they saw me showing up, making contributions to the R package. They invited me to the project. I got commit rights on the repo. I joined the private maintainer Slack, this was 2017 or so, and over time I've just gotten more and more involved with the project. At this point I am its most active maintainer. I’ve learned so so much from it.
There’s this reciprocal nature between you doing more academic work that allowed you to contribute to the open source project to begin with, because you had this R expertise that the original creators of the package didn't have. But then on the flip side, you're saying that you learned so much from the project that you didn't know. That’s really interesting.
The other thing this project prepared me for really well is for working remotely in my day to day job. The entire time I've been involved in this project, I have been the only American and I think there are only two of us involved with the project who are native English speakers. Most of the other maintainers are not native English speakers and they are not in Western Hemisphere time zones. They're in Beijing, Moscow. There was one in France, one in Australia and so I had to learn through this project how to communicate effectively in writing asynchronously. I had to learn to anticipate the questions I was going to be asked and be proactive in putting all my assumptions into writing to reduce the number of cycles. As I’ve gotten into more senior engineering positions where more of my job is writing words than writing code, that’s helped me a lot.
The human side of open source is something that comes up again and again as a reason for people feeling like they can’t contribute. How do you think about that?
When I was at Uptake I was sort of their unofficial Head of Open Source which wasn’t real but it meant that I was given a little bit of budget to run a biweekly hack night after work, get some food and drinks. We even ran an event for Hacktoberfest that was open to the public in our space where my coworkers and I walked around and helped people make contributions. I tried to break down that barrier for people feeling like they’re not good enough or not knowing where to start. I show them look, here’s a real PR I made to Apache Spark or something where I changed the word “three” to the word “several” because the list that it was referring to had four things in it. Something where the audience can say “Oh, well I could have done that.” That’s all it takes, you know?
I’ve definitely had some rough experiences with open source, socially. People can get very gruff or defensive. And if I hadn’t had those great first experiences, or been on the other side of it as a maintainer, that might have discouraged me. There’s a book that I want to mention, this book by Nadia Eghbal called Working in Public that saved me from burning out. I know that sounds dramatic, but I mean it. About two years ago I was ready to quit open source, or at least walk away for a few years, until I read her book. Cannot recommend it highly enough.
Why don't we end with what are some other open source projects that you think are really interesting right now?
I have been getting really into the Trino project. Trino was Presto, then Facebook got overly aggressive with the way it was placing its employees into positions of power in the Presto project, and so the creators of Presto forked it to a new project called Trino. It’s an open source query engine that I find really interesting.
I've just started getting into a project called JAX. JAX is a machine learning and deep learning framework that also has some really nice lower level primitives for just in time compiling Python functions, and the coolest thing that it offers is automatic differentiation of functions which is a really important thing in machine learning. This JAX project lets you come with basically arbitrary Python code that describes what is good or bad to a machine learning model and use it to train a model. It's really cool.
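Editor's note: to make the idea of automatic differentiation concrete, here is a minimal sketch in plain Python. It is not JAX and not how JAX is implemented; it uses a small forward-mode technique (dual numbers) purely for illustration. The `Dual` class, the `grad` helper, and the data are all hypothetical and invented for this example. In JAX itself, `jax.grad` plays a similar role to the `grad` helper here.

```python
class Dual:
    """A number that carries its value and its derivative together."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _wrap(other):
        # Treat plain numbers as constants (derivative of zero).
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._wrap(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = self._wrap(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return self._wrap(other) - self

    def __mul__(self, other):
        other = self._wrap(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(
            self.value * other.value,
            self.deriv * other.value + self.value * other.deriv,
        )

    __rmul__ = __mul__


def grad(fn):
    """Return a function that computes d fn / d x for one scalar input."""
    def derivative(x):
        return fn(Dual(x, 1.0)).deriv
    return derivative


# Made-up training data lying on the line y = 2x.
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def loss(w):
    """Ordinary Python describing what is 'bad': mean squared error of y = w * x."""
    total = 0.0
    for x, y in DATA:
        err = w * x - y
        total = total + err * err
    return total * (1.0 / len(DATA))


loss_grad = grad(loss)

# Plain gradient descent: nudge w against the gradient of the loss.
w = 0.0
for _ in range(100):
    w = w - 0.05 * loss_grad(w)

print(round(w, 6))
```

The point of the sketch is that `loss` is written as ordinary Python with no derivative code in it, and the gradient used for training is derived from it mechanically.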
There's a whole other class of projects that I'm really interested in and wish I could spend more time on. Those are all projects around storing machine learning models in a language agnostic way. So that is like the ONNX project, the PFA project, the PMML project. I'm really interested in those. They’re like data interchange formats for machine learning models. I haven’t spent enough time with them recently.