Moesif’s Podcast Network: Providing actionable insights for API product managers and other API professionals
Joining Moesif is Jessica Lam, currently a technical advisor and angel investor in startups, and the original Chief Architect and VP Engineering at LoungeBuddy, which was acquired by American Express. At LoungeBuddy she designed their APIs, many of which continue to be in use today.
As a CTO, architect, and engineering lead at multiple companies, Jessica shares her experience on how to build products to be more resilient, why error handling is so important, how to treat internal APIs vs. external APIs, and much more.
Derric Gilling, Moesif’s CEO, is your host today.
Transcript
Derric Gilling (Moesif): Welcome to another episode from Moesif’s APIs over IPAs podcast network. I’m Derric Gilling, your host today and the CEO of Moesif, the API Observability platform. Joining me is Jessica Lam, who was previously the Chief Architect and VP Engineering at LoungeBuddy. They got acquired by American Express, and she designed many of the APIs there that continue to be used today. I’d love to hear a little bit more on that: how did you architect it, and what were some of the design decisions that went into that planning?
APIs are Concise Units of Value
Think of your APIs as Concise Units of Value, providing features or tasks that others can reuse and derive value from.
Jessica Lam (LoungeBuddy): Cool. So the evolution of those APIs was, in my mind, a natural progression of the company, as well as the product. Initially, the product was just a mobile app, with the usual app backend behind that; there wasn’t any sort of API in addition to that. After that mobile app took off I joined at pre-seed stage. And so, as the first engineer hire, and also the chief architect, one of the first major projects that we were thinking of was how do you enable purchasing for lounges. So some context on LoungeBuddy: it was initially a lounge finder and then later on, it became a lounge management platform.
At the point where we decided to add purchasing, there was sort of the idea that purchasing lounges shouldn’t conceptually be just restricted to the app. And so at that point the insight was that this is a concise unit of value that anyone can derive value from. So in the future, if we’re building a web app, or if we’re building a website for external partners, this concise unit of value can be kind of reused. And so, going back to my calculator programming days, where I offset repetitive tasks, and really thinking about what is the thing that has been repeatedly used that other people can derive value from, that’s where the API idea came from. So let’s not do this as part of the app backend, it would make more sense that we start an API service.
So we thought about what is the centralized way to manage all of those things. Breaking it down into what are the essential steps for making a purchase, first we needed to check availability to see if there’s inventory for a particular lounge. And so, just looking at that sentence you realize that there are certain variables, like which lounge, so you know that would be a parameter you pass to the API. After going through availability you move on to the next thing that you’re interested in: what is the pricing on it? And then, after that, figuring out how to make the purchase. So initially, thinking about conceptually, what are the essential steps, and then just putting it out there, and iterating on it.
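Editor’s illustration: the three steps Jessica lists (check availability, get pricing, make the purchase) can be sketched as three small functions that any client could reuse. This is only a sketch under stated assumptions, not LoungeBuddy’s actual API: the function names, the lounge IDs, the prices and the in-memory inventory are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Lounge:
    lounge_id: str
    price_cents: int
    passes_left: int


# Hypothetical in-memory inventory standing in for a real datastore.
LOUNGES = {
    "sfo-t2": Lounge("sfo-t2", price_cents=4500, passes_left=2),
    "lax-tb": Lounge("lax-tb", price_cents=5000, passes_left=0),
}


def check_availability(lounge_id: str) -> bool:
    """Step 1: is there inventory for this lounge?"""
    lounge = LOUNGES.get(lounge_id)
    return lounge is not None and lounge.passes_left > 0


def get_price(lounge_id: str) -> int:
    """Step 2: what does a pass cost, in cents?"""
    return LOUNGES[lounge_id].price_cents


def purchase(lounge_id: str) -> dict:
    """Step 3: make the purchase, reusing the first two steps."""
    if not check_availability(lounge_id):
        return {"ok": False, "error": "unavailable"}
    lounge = LOUNGES[lounge_id]
    lounge.passes_left -= 1
    return {"ok": True, "lounge_id": lounge_id, "charged_cents": get_price(lounge_id)}
```

The lounge identifier is the one parameter every step takes, which matches the "which lounge" variable mentioned above. A mobile app, a website or a partner integration would all call the same three functions.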
One of my core engineering philosophies is that making something easy to change is better than trying to predict the future. So, after the purchasing API was constructed, we first started to test it using our mobile app. And so in the mobile app we used the purchasing flow which helped us work out the communication, as in: whether or not the API makes sense, whether it’s concise enough, or was it overly complex. After that’s worked out, then we started developing the website to use the same API.
Derric: Really great to hear especially thinking about concise units of value. Whenever you’re looking at an API understanding how do you provide more value to your customers, whether it’s something that can be repeatable. How did you measure that? Was there a way to understand each value that each API brought to the business?
Jessica: I would say that this is something that was lacking on the platform and I really wish that you guys were around when we started many, many years ago. We didn’t really have a way of measuring that. We were focused on error handling and making sure that it was robust and that people weren’t experiencing errors. We had some very preliminary surface-level metrics around how often do APIs have calls, but not necessarily anything more than that. And obviously we would know how many people made purchases through the API. But apart from those surface metrics, we didn’t really have anything else beyond that.
Do Not Treat Internal APIs Differently
Treat every client as if they were an external client, so that if you do open up an internal API it’ll just work out of the box.
Derric: I guess we should have started a few years earlier. Speaking more towards your APIs, some of these were used by your mobile app and some were used by your website, how did you actually organize your internal APIs versus external facing ones? Were they the same or were they different?
Jessica: That’s a really good question and my principle on that is that just because you’re internal, it doesn’t mean that you get special treatment. One of the ways which was really helpful in using that principle for developing the API, was that when we did open it up to external partners, it literally just worked. Because of the way that we were developing, where we assumed every internal application that uses the API is actually an “external partner”, we didn’t let the internal API have some special back end thing to just optimize some small thing in order to do whatever they wanted. And that’s always the case in negotiation between front end and app back end. For anyone that has worked with different developers, people would say “oh, if the API just returned all of this it would be so much easier for the front end” or something like that. So resisting the urge to have those sorts of workarounds will make the API pretty clean and also the separation of concerns and responsibility very clean across the system. So I think my recommendation for developing APIs internally is to treat every client as if it were an external client.
Derric: That’s a really good point, especially when we go to larger companies where you have so many different teams accessing the API, where each team is effectively a customer. This also helps with security, in terms of making sure the same security that you apply externally, you also apply to other teams, so you don’t ever have a case where there’s maybe a vulnerability in one area of your network.
Simplify New APIs With a Traffic Controller
When standing up new API functionality you don’t want to have to refactor your app. Use a gateway-like service, a traffic controller, for access to your APIs, so that migration is simply supported through the controller.
Derric: Speaking of these APIs a little bit more, did you design the APIs up front, or was it more like you start with the service and then organize it around an API? How do you think about API first?
Jessica: So that’s a really good question. I’ve actually done it two different ways. For the purchasing API it was very obvious from the onset that it was something that was going to be reused across the platform, across all of our different services, and externally. But there were other things that kind of sneaked up on us. For example, there was this one core utility on the application, which is looking at lounges based on access rules and access items. An example of an access item would be your credit card, and then an access rule would be something like: if you have this credit card and you’re inbound to a particular airport then you have access to this particular lounge, otherwise if you’re outbound then you don’t have access. That was initially all inside the iOS app. So what we realized later on when we did integration with American Express, was that there was a need to be able to access that functionality. At that point, figuring out how to effectively abstract that out is an interesting process.
There are people who would say “okay let’s just stop everything, completely refactor that application, take out all of those services and put it in a different service”. In my mind that’s actually a very high-risk endeavor and it doesn’t necessarily give you as much benefit as you think it would. One of the ways that typically I do this type of migration, or changing a paradigm within a system, is to have it use not really an API gateway but a service called traffic control. Essentially the expectation is that if you’re accessing any API you would use the traffic control service. Behind the traffic control service sometimes the API is implemented, but other times it’s actually rerouting to the back end of another service that might have what you’re looking for. So to actually facilitate migration, and if you want to move the code from an app backend into the actual service, you can optionally do that with traffic control.
You set the expectation that if you’re accessing APIs, you use traffic control. You don’t care how traffic control gets that information as long as that contract is with traffic control. And so I think that’s also a really good way to preserve velocity, because with a startup velocity is make or break. I feel like that way of operating was really helpful - to not have to predict the future. We didn’t know that we were going to be using those externally, but there’s a way to do it, in a very low risk way.
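Editor’s illustration: a minimal sketch of the traffic control idea, assuming a single entry point that maps route names to whichever handler currently implements them. The class, the route name and both handlers are hypothetical; the transcript does not describe how LoungeBuddy’s service was actually built.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]


def legacy_access_check(request: dict) -> dict:
    # Stand-in for logic that still lives in an older backend.
    return {"source": "legacy", "has_access": request.get("card") == "platinum"}


def service_access_check(request: dict) -> dict:
    # Stand-in for the same logic after it has moved into its own service.
    return {"source": "service", "has_access": request.get("card") == "platinum"}


class TrafficControl:
    """Single entry point; callers never learn where a route is implemented."""

    def __init__(self) -> None:
        self._routes: Dict[str, Handler] = {}

    def register(self, route: str, handler: Handler) -> None:
        # Registering an existing route again swaps its implementation.
        self._routes[route] = handler

    def call(self, route: str, request: dict) -> dict:
        if route not in self._routes:
            return {"error": "unknown route"}
        return self._routes[route](request)
```

A migration then becomes one call: register the route against the legacy handler first, and later register the same route against the new service. Clients keep calling `call("access-check", ...)` throughout, because their contract is with traffic control rather than with either implementation.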
Build in Good Error Handling
When AmEx’s iPad app was launched, an integration partner of AmEx didn’t have error handling, so when multiple 500 errors appeared, it was up to LoungeBuddy to identify the issue and let the partner know.
Derric: That’s a really good point, the ability to iterate as quick as possible, especially for early stage startups. But you also brought up an interesting point around migration. What process did you have in place to make sure you didn’t break your integration with AmEx or other customers? Was there a process in place?
Jessica: So startup and process, that’s kind of an oxymoron. There were a couple of points of integration that we did with American Express. One was that their app used our purchase API. So right now if you log into the American Express app and then go into lounges, the QR code that’s generated at the end is from our purchase flow. The other integration was actually when the iPad was introduced to the AmEx lounges. There was the integration behind the scenes with the process flow, but also integration with their external partner. So the way that it worked was that when someone swiped the card, the card number was actually sent to their service. So there wasn’t a risk in terms of us getting certification and having all of that security stuff in place. After that the external partner returned an access item, which then we did something within our service to figure that out.
We had a lot of error handling in place. The initial integration with that third party actually had a lot of 500 errors. The interesting thing is that they didn’t have error handling, so that they didn’t actually know what errors were happening. But because we had error handling, we were like “we’re getting 500 actually from you” and here are the error messages. And so for the initial launch, not sure I’m supposed to be talking about this, but it was sort of like “so on the front end of the app they will see the 500 errors”, but because their service didn’t have error handling it was up to us. There was a lot of back and forth from us and them. It’s like when this happened we’re getting an error from your service, even though the user is seeing an error on our app. And so that’s kind of the importance of having error handling and very good error handling. Anytime there is a potential for some sort of error, log it, definitely.
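Editor’s illustration: one way to make "we’re getting 500 from you" provable is to log the upstream status and message at the point of the call and to mark the failure’s origin in your own response. This is a sketch under stated assumptions: `call_partner`, `swipe_id`, the logger name and the 502 mapping are all hypothetical and are not taken from the LoungeBuddy integration.

```python
import logging

logger = logging.getLogger("partner_integration")


class UpstreamError(Exception):
    """Raised when the partner service, not our own code, failed."""

    def __init__(self, status: int, body: str) -> None:
        super().__init__(f"partner returned {status}: {body}")
        self.status = status
        self.body = body


def fetch_access_item(call_partner, swipe_id: str) -> dict:
    """Call the partner and record exactly what came back when it fails.

    `call_partner` is a hypothetical callable returning (status, body).
    """
    status, body = call_partner(swipe_id)
    if status >= 500:
        # Log the partner's status and message so the failure can be traced to them.
        logger.error("partner 5xx: swipe=%s status=%s body=%s", swipe_id, status, body)
        raise UpstreamError(status, body)
    return {"access_item": body}


def handle_swipe(call_partner, swipe_id: str) -> dict:
    """Turn an upstream failure into a response that names its origin."""
    try:
        return {"status": 200, **fetch_access_item(call_partner, swipe_id)}
    except UpstreamError as exc:
        return {"status": 502, "origin": "partner", "detail": str(exc)}
```

With the status and message captured in the log, the back and forth with a partner can start from concrete evidence rather than from a user report of an error in the app.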
Derric: Definitely. Have that error handling and monitoring in place, so you can ship with confidence. Without it then you’re almost flying blind. How did you actually set this up and what were some of the challenges that you ran into?
Jessica: Right. I wonder how much of that has changed since I left about maybe a year or two ago. But the way that we had it at that time was that we were using the Heroku backend and they had some tooling right off the bat, so that if you had over a certain percentage of errors, you could set up an alert to track that. And so we had all that in place, and the other thing, the one that calls you: PagerDuty. We had alerting set up with PagerDuty so that over a certain percentage of errors on different services, especially the core external-facing services, someone would get a call. So, have all the error handling in place and then make sure you’re not going to get a call at 3am.
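Editor’s illustration: the "over a certain percentage of errors" rule can be sketched as a sliding window over recent response codes. This does not use or imitate the Heroku or PagerDuty APIs; the class name, the window size and the threshold are hypothetical, and a real setup would hand the alert to a paging tool instead of returning a flag.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the last `window` responses and flags a high 5xx share."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self.threshold = threshold
        self._recent = deque(maxlen=window)  # True means the response was a 5xx

    def record(self, status: int) -> None:
        self._recent.append(status >= 500)

    def error_rate(self) -> float:
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)

    def should_alert(self) -> bool:
        # A real setup would page someone here; this sketch only returns a flag.
        return self.error_rate() > self.threshold
```

Keeping a separate monitor per service allows a stricter threshold on the external-facing services than on the internal ones.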
Follow the 80-20 Rule for Testing
Instead of aiming for 100% code coverage, test the specific API calls that you’re exposing externally, like the GETs and POSTs. Run Artillery for load testing, then get something out there and make it more robust over time.
Derric: No definitely, we see a lot of customers using PagerDuty and so far the service is pretty good. Moesif itself has an integration with PagerDuty. When it comes to testing, did you have any best practices or process there, what were they like and what were some of the challenges for testing the APIs?
Jessica: Yes, that’s a really great question. When it comes to startups and optimizing, we try to go for the 80/20 rule. There are obviously theoretical best practices like “oh, we want 100% code coverage”, or “we want to make sure everything is tested”. I feel like I’m a bit of a contrarian on that. I guess my principle was to test the most important things and then, if there are additional bugs that have surfaced over time, you add tests that way. So, doing the minimum essential thing and then looking at what kind of errors you have and then building on top of that - sort of the anti-fragile approach to development.
So some of the things that we did were obviously having really robust tests on the actual specific API calls themselves, like the GETs and POSTs and all of that that you’re exposing externally. And then after those are completed then sometimes, depending on how often a particular service is called on, we also use Artillery to do load testing just before things get deployed as a sanity test to make sure that things aren’t erroring out randomly because of high load and things like that. Those tests are phased in after initial launch. So it goes back to what I was talking about before, on getting something out there first and then seeing what kind of errors you’re getting back and then making it more robust over time, instead of trying to over engineer right off the bat. So when we started to see load issues, then we started to do load testing just to see how many requests per second we were getting. We ran load testing on 2x the amount just to make sure that we had enough padding there.
Minimize Technical Debt and Move Quickly
Always leave the campsite cleaner than how you found it: when fixing a bug or creating a new feature, you’re going to be touching existing code, and when you touch it, if something is inefficient, have the will to think through it and realign the code base.
Derric: Definitely, making sure you don’t over engineer is always important for startups. In engineering leadership a lot of times you’re balancing customer needs versus engineering needs, and you have technical debt that gets built up. What are some ways that you’re able to iterate quickly and balance customer needs without overburdening your engineering team?
Jessica: That’s a really good question. From all of the different startups that I’m an advisor with, or have worked on in the past, that’s always a problem. As in, things are never moving fast enough and there’s always technical debt. The way to kind of risk mitigate that sort of refactoring is to actually always leave the campsite cleaner than how you found it. For example, instead of taking a sprint or two out to refactor something, because you know it started to have technical debt, the ideal way to go about it is that in the midst of fixing a bug, or in the midst of creating a new feature, you’re going to be touching existing code. And when you touch it, if something is inefficient, have the will to think through it and realign the code base.
The primary reason why technical debt is accumulated is that when you’re developing the system for the first time there are certain business assumptions around the relationship, but over time, perhaps because of different features, those relationships might change. So even if you might notice that change on a design level, or on a feature level, the missing part is that sometimes you don’t do the work to change it on the code level. Again, this goes back to thinking about the system in a holistic way - it’s not that you have design and then you have front end, back end and system, it’s actually all the same thing. Like I said, there should be a consistent conceptual thread that runs through all of it, and design is just like a visual representation of that relationship. And so understanding that relationship deeply, and making sure that that’s reflected in code, in my opinion, is kind of the best we can do to minimize technical debt, and that’s the way to make sure that you can move quickly. If those concepts are properly represented, then to reuse them in the right way is super easy. And so there’s this ideal that when you’re not even designing for a particular method to take on a particular way of being used, it should just work, it’s kind of like a framework. This is actually one of the reasons I really love Apple’s framework - it usually just works if you’re using it in the right way. Whereas a lot of the time I think developers don’t really think about the intention of the specific methods or APIs and just try to do it the way that they want to. And so I think it’s important to just also really take the time to understand and then refactor as you develop, as opposed to refactoring in a vacuum.
Derric: Really great insights, especially around technical debt and how to think about developing. Really great to have you here Jessica, on our podcast, and hopefully we will visit you in LA sometime soon.
Jessica: Yeah, definitely.