Sort, Filter, and Remap API Data in Python

Mark Michon - Sep 30 '20 - Dev Community

Are you taking data from an API in the format the web service gives it to you? You should not let the way an API provider structures their data dictate the structure of data inside your application. Instead, you can take advantage of the power of Python's list manipulation techniques to sort, filter, and reorganize data in ways that best suit your needs.

In this article, we'll explore a few methods of using native Python features to sort, filter, and remap data from an external REST API that you can then use within your application, or pass down to a client as part of your own API. Looking for techniques to do this in JavaScript? We wrote about that in a previous article.

To follow along, you'll need to have Python 3 installed and be familiar with running Python code.

Retrieving data from an API

To get started, we'll need some data to manipulate. For the examples in this article, we can use GitHub's v3 REST API and its search/repositories endpoint. The endpoint takes a few parameters, but we'll use q to pass in the search query (bearer, in our examples), and per_page to limit the results to 10. This makes it easier to follow along if you are printing results to the console, but feel free to increase the returned results per page if you want a larger dataset to work with. You can also change the search query to anything you like, as it won't affect the manipulations that we'll perform.

You can use any HTTP client, but our examples will use Requests. Begin by installing requests and importing it into your project.

To install it, use pip (or pip3, depending on your local setup):

pip install requests

Next, import requests and set up the API call.

import requests

def get_repos():
    res = requests.get('https://api.github.com/search/repositories?q=bearer&per_page=10')
    # Return the value of the 'items' key from the JSON response body
    return res.json()['items']

The code above defines a function, get_repos, that makes the API call and returns only the part of the response that we are interested in: in this case, the items list in the JSON response. From this point forward, whenever we need the items data, we can assign it to a variable with x = get_repos(). This leaves us with a list of dictionaries (the dict data type).
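If you'd rather not call the API repeatedly while experimenting (unauthenticated requests to GitHub are rate limited), a small hand-written stand-in works for the examples in this article. The sketch below is an assumption, not real API output: get_sample_repos is a hypothetical helper, the values are made up, and only the keys used in this article are included.

```python
# Hypothetical stand-in for get_repos(): made-up values, real key names.
def get_sample_repos():
    return [
        {
            'name': 'alpha',
            'html_url': 'https://example.com/octo/alpha',
            'owner': {'login': 'octo'},
            'description': 'First example repo',
            'homepage': 'https://example.com',
            'stargazers_count': 250,
            'open_issues_count': 12,
        },
        {
            'name': 'beta',
            'html_url': 'https://example.com/octo/beta',
            'owner': {'login': 'octo'},
            'description': None,  # JSON null becomes None in Python
            'homepage': None,
            'stargazers_count': 40,
            'open_issues_count': 3,
        },
    ]

repos = get_sample_repos()
print(type(repos).__name__, type(repos[0]).__name__)  # list dict
```

Swapping get_repos() for get_sample_repos() keeps the rest of the code unchanged, since both return a list of dictionaries.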

Now with the setup out of the way, we can begin manipulating the data from the API.

Sorting results in Python

While many APIs allow you to sort the results in some way, rarely do you have access to sort by any property. GitHub, for example, allows sorting by stars, forks, help-wanted-issues, and how recently a repo has been updated. If you want a different sorting method, you'll need to handle this on your own. One thing to keep in mind is that we are only going to sort the results we retrieve. This means that if you capture the first ten results, sort them, then capture the next 10 results, they will not all be in the right order. If working with large datasets, it's best to capture all the data before sorting to avoid repeating the work on the same set of data multiple times.
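To see why page-by-page sorting falls short, here is a minimal sketch using the sorted built-in covered next. PAGES is made-up data standing in for two successive API responses, and by_issues is a hypothetical helper name.

```python
# Hypothetical pages of results, standing in for successive API calls.
PAGES = [
    [{'name': 'a', 'open_issues_count': 9}, {'name': 'b', 'open_issues_count': 2}],
    [{'name': 'c', 'open_issues_count': 5}, {'name': 'd', 'open_issues_count': 1}],
]

def by_issues(repo):
    return repo['open_issues_count']

# Sorting each page separately, then joining: only correct within a page.
per_page_sorted = [repo for page in PAGES for repo in sorted(page, key=by_issues)]

# Collecting everything first, then sorting once: correct overall.
all_repos = [repo for page in PAGES for repo in page]
fully_sorted = sorted(all_repos, key=by_issues)

print([r['name'] for r in per_page_sorted])  # ['b', 'a', 'd', 'c']
print([r['name'] for r in fully_sorted])     # ['d', 'b', 'c', 'a']
```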

Python offers the sorted built-in function to make sorting lists easier. sorted takes the list that you want to sort, and a key function that returns the value to use when sorting the results. Let's take the following example: Sort the repositories by open issue count, from fewest issues to most issues.

# This line can be used in each example, but you only need it once in your file
repos = get_repos()

fewest_issues_sort = sorted(repos, key=lambda repo: repo['open_issues_count'])

The code above passes repos into the sorted function, and then uses a lambda function for the key. Lambda functions in Python are anonymous functions that you can use in-line without defining them elsewhere. They are great for use cases like ours. This lambda is saying: "pass each item in repos to this function as repo, then return its open_issues_count." The sorted function uses the open issues count as the key. The result is a new list of dictionaries sorted from fewest issues to most issues.
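The lambda is one of several equivalent ways to supply the key. As a rough sketch with made-up data, the same sort can be written with a named function (issue_count is a hypothetical name) or with operator.itemgetter from the standard library. It also shows that sorted leaves the original list untouched.

```python
from operator import itemgetter

# Made-up data using the same key name as the GitHub response.
repos = [
    {'name': 'a', 'open_issues_count': 7},
    {'name': 'b', 'open_issues_count': 0},
    {'name': 'c', 'open_issues_count': 3},
]

def issue_count(repo):
    return repo['open_issues_count']

with_lambda = sorted(repos, key=lambda repo: repo['open_issues_count'])
with_function = sorted(repos, key=issue_count)
with_itemgetter = sorted(repos, key=itemgetter('open_issues_count'))

print([r['name'] for r in with_lambda])  # ['b', 'c', 'a']
print(with_lambda == with_function == with_itemgetter)  # True
print([r['name'] for r in repos])  # ['a', 'b', 'c'], the original order
```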

What if you want to reverse the order to show repos with the highest issue count first? You can achieve this with the reverse argument in sorted. By setting reverse to True the same logic will apply, but in reverse.

most_issues_sort = sorted(repos, key=lambda repo: repo['open_issues_count'], reverse=True)

You can use the same technique to sort by any value.
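For instance, the key can normalize a value or combine several values. The following is a sketch with made-up data: it sorts by name without regard to case, then by stars (highest first) with the name as a tie-breaker by returning a tuple from the key function.

```python
# Made-up data using key names from the GitHub response.
repos = [
    {'name': 'Zeta', 'stargazers_count': 90},
    {'name': 'alpha', 'stargazers_count': 50},
    {'name': 'Beta', 'stargazers_count': 50},
]

# Lowercasing the name makes the sort case-insensitive.
by_name = sorted(repos, key=lambda repo: repo['name'].lower())

# Tuples compare element by element. Negating the count sorts stars
# high-to-low while names still sort A-to-Z within equal star counts.
by_stars_then_name = sorted(
    repos,
    key=lambda repo: (-repo['stargazers_count'], repo['name'].lower()),
)

print([r['name'] for r in by_name])             # ['alpha', 'Beta', 'Zeta']
print([r['name'] for r in by_stars_then_name])  # ['Zeta', 'alpha', 'Beta']
```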

Filtering list data with Python

Filters pair well with sorting. They allow you to reduce a list down to only the entries that matter for your needs. There are a variety of ways to filter a list, but they all use the same concept of building a new list from a subset of the original list's entries.

One of the most effective ways to do this is using list comprehension. List comprehension is a method for creating lists, but it can also simplify complex loops into shorter expressions. Let's take an example where we only want repositories in our list that have a description filled out.

# Fetch the repos if you haven't already
repos = get_repos()

repos_with_descriptions = [repo for repo in repos if repo['description'] is not None]

The code above does a lot in a single line, so let's break it down. It assigns a new list, built inside [ ], to repos_with_descriptions. The first repo, before the for, represents what we want to add to the new list. The next part, for repo in repos, should look familiar from any for-in loop. Finally, the end is our condition: if repo['description'] is not None. When a description is not set, the GitHub API returns null in the JSON response. Python interprets this as None, so we check that the description is not None, since we only want repositories with a description.
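Written out as a regular loop, the same logic looks like the sketch below. The data is made up, and the json.loads call is only there to show how a JSON null arrives in Python as None.

```python
import json

# A JSON null is decoded to None.
decoded = json.loads('{"description": null}')
print(decoded['description'] is None)  # True

# Made-up data with the same key name as the GitHub response.
repos = [
    {'name': 'a', 'description': 'Has one'},
    {'name': 'b', 'description': None},
]

# The long form of the list comprehension.
repos_with_descriptions = []
for repo in repos:
    if repo['description'] is not None:
        repos_with_descriptions.append(repo)

same_result = [repo for repo in repos if repo['description'] is not None]
print(repos_with_descriptions == same_result)  # True
```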

What if we want two conditions? For example, only repositories that have a description and a homepage URL set.

repos_with_desc_and_home = [repo for repo in repos if repo['description'] is not None and repo['homepage'] is not None]

As with our previous example, this code says: "Make a new list made up of each repo in repos, but only include those where the description is not None and the homepage is not None."
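One caveat: an is not None check lets empty strings through. If the API you are working with can return "" for an unset field (an assumption worth checking against real responses), a truthiness test treats None and "" the same way. A minimal sketch with made-up data:

```python
# Made-up data: 'b' has an empty-string homepage, 'c' has nothing set.
repos = [
    {'name': 'a', 'description': 'Docs', 'homepage': 'https://example.com'},
    {'name': 'b', 'description': 'Docs', 'homepage': ''},
    {'name': 'c', 'description': None, 'homepage': None},
]

# Only excludes None, so the empty string still counts as "set".
strict = [
    repo for repo in repos
    if repo['description'] is not None and repo['homepage'] is not None
]

# Excludes None and empty strings alike.
truthy = [repo for repo in repos if repo['description'] and repo['homepage']]

print([r['name'] for r in strict])  # ['a', 'b']
print([r['name'] for r in truthy])  # ['a']
```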

In both examples we've used the is not operator. Let's look at a more conventional condition and only include repositories with 100 or more stars.

popular_repos = [repo for repo in repos if repo['stargazers_count'] >= 100]

The code again uses list comprehension, but this time includes a more traditional greater-than-or-equal condition to compare values. You could also implement any of these examples with your preferred looping and list-generation methods.
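As one such alternative, here is a sketch using the built-in filter function with a named predicate. is_popular and its minimum parameter are hypothetical names chosen for illustration, and the data is made up.

```python
# Made-up data using the same key name as the GitHub response.
repos = [
    {'name': 'a', 'stargazers_count': 100},
    {'name': 'b', 'stargazers_count': 99},
    {'name': 'c', 'stargazers_count': 4200},
]

def is_popular(repo, minimum=100):
    """Return True when the repo has at least `minimum` stars."""
    return repo['stargazers_count'] >= minimum

# filter returns a lazy iterator, so wrap it in list() to get a list.
popular_repos = list(filter(is_popular, repos))

print([r['name'] for r in popular_repos])  # ['a', 'c']
```

A named predicate is easier to reuse and test than an inline condition, while the list comprehension stays more compact for one-off filters.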

Remapping or normalizing data

Sometimes an API returns more information than you need. GraphQL solves this by allowing the client to specify exactly the data it needs, but for now most APIs are still REST-like. If you need to reduce the data down to only a few properties before you interact with it or send it on to other parts of your app, you can do so by iterating over the list and building a new one with only the parts you want.

Loops or list comprehension can do this, but we will use Python's map function. map applies a function to every item in an iterable. Let's look at an example where we take the original repos list and simplify it into fewer key/value pairs with map.

def simplify(repo):
    return {
        'url': repo['html_url'],
        'name': repo['name'],
        'owner': repo['owner']['login'],
        'description': repo['description'],
        'stars': repo['stargazers_count']
    }

# In Python 3, map returns a lazy iterator; list() turns it into a list
new_repos = list(map(simplify, repos))

Here, map receives the simplify function which will run over each item in repos. You can use this same method to manipulate the data further, normalize values, and even strip sensitive information.
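One detail to keep in mind: in Python 3, map returns a lazy iterator rather than a list, and it can only be consumed once. The sketch below uses a shortened simplify and made-up data (private_note is a hypothetical key standing in for something you would not want to pass along) to show that behavior next to the equivalent list comprehension.

```python
# Shortened version of simplify: keeps two keys and drops everything else.
def simplify(repo):
    return {'name': repo['name'], 'stars': repo['stargazers_count']}

# Made-up data; 'private_note' is a hypothetical field to strip out.
repos = [
    {'name': 'a', 'stargazers_count': 5, 'private_note': 'internal'},
    {'name': 'b', 'stargazers_count': 8, 'private_note': 'internal'},
]

lazy = map(simplify, repos)  # a map object; nothing is computed yet
as_list = list(lazy)         # consuming it produces the new list
second_pass = list(lazy)     # the iterator is now exhausted

with_comprehension = [simplify(repo) for repo in repos]

print(as_list)      # [{'name': 'a', 'stars': 5}, {'name': 'b', 'stars': 8}]
print(second_pass)  # []
print(as_list == with_comprehension)  # True
```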

Beyond the basics

The techniques mentioned in this article are a great starting point to handle the majority of your use cases. APIs mostly return JSON, which Python can convert into dictionaries. In our examples, we captured the items list from the response dictionary and used it alongside many built-in Python functions to iterate over the data.

Whether the data coming from an API is the core of your product offering, or only a part of the value you provide to your users, you need to be able to manipulate it to best suit your needs. You also need to know that the source of this data is reliable.

How vital is this API data to the success of your app? What happens when the APIs experience performance problems or outages? There are safeguards you can put in place to avoid and manage downtimes and protect your integrations from failure. At Bearer we're building a solution that automatically monitors API performance, notifies you of problems, and can even intelligently react when a problem happens. Learn more about Bearer and give it a try today.
