Occasionally, I find myself teaching some courses on a number of topics. Most recently, I’ve been teaching a course around getting started with Python and Data Science concepts.
Wait. Didn’t you say that already? We get it. You teach classes.
Yeah, I know. I said that before. But then my brain got real hyped about ways of teaching and took me in a different direction. But we’re back! And now we’re gonna talk about what I actually meant to write the first time.
And what’s that?
Well, setting up a nice starter environment for a Python Data Science class of course! I’ve taught classes like these in a few different places and in a few different ways, and it’s always bothered me a little bit that they’re run in a way, and generally with tooling, that I’ve just never seen or heard of anywhere else.
Why Teach What We Don’t Use?
Listen, I get it. We’ve got a lot to cover. We just want to make things easy. We’ll just give students a ready-baked environment to get started in and we’re off to the races. Which is probably okay. They can focus on the topics at hand. But that only holds until either the next course OR, even worse, their first job. How are they going to know how to get started on their next project without ever having set it up themselves?!
I don’t know about you, but speaking for myself, a developer’s tooling is fine-tuned to the developer themselves. We may show someone how to get started, but that environment eventually becomes a reflection of that person’s brain and their way of thinking. With enough time and practice, it’s an extension of their brain itself.
So that’s why I believe in introducing students of all experience levels to tools that they’re actually going to see. Let them get used to and familiar with them now. Make sure they know how to navigate with them, make sure they know how to discover and navigate the community and documentation and most importantly, make sure they know how to make it their own.
Easy Bake Oven, But For Code And Data
Now this isn’t to say it’s best to just toss everyone into the deep end (although sometimes it is; an exception for every rule). My thinking here though is give them the baseplate. Give folks the basic ingredients and let the kids cook, as they say.
So for this particular example, what we’ll get started with is VSCode. I’m not going to be here to argue about what’s best, the value of some other IDE or why I’m not forcing anyone into starting out in VIM. What I’m here to do is to show that you can (and should) get started with a well known IDE and get it set up in a way that shows you how to actually accomplish the goal you want to complete.
Our Goal:
Have a basic environment where we can begin completing Python-based data science exercises, focusing initially on pandas.
So let’s get started
Installing VSCode
This first bit is about as plain as it gets. We need the tool. So let’s go get the tool. You can pick up the VSCode installer for your platform from the official download page. This is a standard installer, so just launch it and follow the standard instructions to get it installed. Once you’ve got it installed, you can launch VSCode and you’ll have a regular, degular install that looks like this:
A very empty, freshly installed VSCode installation
My install will reflect a macOS setup, but for the most part after the install everything should be the same regardless of the platform you’re running on.
The Plugins
That being said, a normal install of VSCode is kind of like buying a thing, and then just leaving it in the box. You know it can do so many things! But you’re just leaving all of that untapped potential sitting there, not serving you. A waste. So let’s make sure we don’t make that mistake.
Plugins can be the key to unlocking a ton of velocity in VSCode. And the nice (but sometimes very sharp) bit of this is that it mostly comes at the click of the button. There are times when installing a plugin can reduce a ton of overhead in your developer workflow. That being said, because of the ease of installation, it can also lead you down a deep dark hole of plugin FOMO, which inevitably results in the bloat of your IDE. So buyer, beware. Start with what you need and build from there!
In our case, let’s remind ourselves of our goal:
- Python Environment
- Data Science Starter Kit
So to begin we’ll find ourselves the following plugins:
- Python
- Pylance
- Python Debugger
This will give us the Python environment that we’re after. A majority of the other tooling that we’re interested in will be handled via Python itself, but one particular set of plugins will be helpful to us as we start some of the DS work. Specifically, we’ll want the following Jupyter plugins:
- Jupyter
- Jupyter Keymap
- Jupyter Slideshow
- Jupyter Cell Tags
- Jupyter Notebook Renderers
To grab all of these we will want to navigate to the plugins tab of our VSCode window and we can search for them by name:
Tracking down our Python plugins
On the left hand side of your window, you’ll see the little building blocks icon, which is where you’ll find your plugins. From there, you can see a search bar where you can type in the name of each of the plugins mentioned above. You’ll see an install button next to each one and you can go ahead and click it to begin the installation. Once it’s finished, you may (or may not) be prompted to restart VSCode. If you see no restart prompt, you’re good to go!
We Have The Technology
Alright, we’ve got the tools and presumably we’ve got the time. So what better time to get started than now! However, this is really just the beginning. Before we can write any code, we still need the packages that will unlock all of our data science needs. To do this, we’ll use a few different tools.
Virtual Environments
The first, and maybe most important, is configuring a virtual environment. Why do we want a virtual environment? To save yourself the pain of forever fumbling between dependencies and potentially causing yourself a ton of grief by interfering with your local system packages. I’ve experienced no greater pain than having one project that works, going to work on another and then coming back to my previously perfect project and finding out nothing works and the house is on fire because suddenly none of my packages know how to work together any longer.
This is exactly what a virtual environment is for. While this of course can be done directly from the command line, since we’re talking about VSCode we’re going to focus just on how to do this through the UI.
The first thing we need to do is get ourselves some space to begin working. We’ll use VSCode to create a new directory to begin creating our python files in. After you’ve launched VSCode, you can just click the Open prompt and it will pull up your file explorer.
Creating a space to work in VSCode
From here, you can go ahead and click the New Folder option (or otherwise navigate to somewhere you would like to create a new folder at) and create a directory of your choosing. For our purposes, we’ll name ours starter-pack . Once the directory has been created, go ahead and click on it and then click Open .
Creating a directory in VSCode
From here, you’ll see your folder appear in the file explorer in VSCode (that’s the icon that looks like multiple pages folded over each other). The directory is empty for now as we’ve just created it, but let’s go ahead and solve that problem.
Our empty directory in the VSCode file explorer
We’ll go ahead and click on the New File icon, which is the first icon next to the name of our directory. You should see your cursor start flashing and you’ll be able to type a filename of your choosing. For our purposes, let’s start with main.py . We’ve got ourselves a directory to work in and we’ve stubbed out an example file. WE’RE ALMOST THERE!
At this point, we’re ready to get our virtual environment up and running. This is where our plugins will begin to come in handy. Once a plugin is installed it can come bundled with any number of pre-set actions. In the case of our Python plugin, it provides a set of actions to allow you to set up a virtual environment. The way that we gain access to these actions is through the Command Palette . You can find this under the View menu, alongside the keyboard shortcut you can use to pop it up directly. However you decide to access it, once it pops up, we can begin typing Python: Create Environment and you’ll begin to see the action that we want to launch. Once we see that, go ahead and click on it to begin.
Creating a Python virtual environment
Once launched, you’ll have two options. You can create a virtual environment with venv OR you can create an environment with conda . What you’ll find is this will really come down to your preference. For our purposes, I’ll stick with venv because it’s what I know. Go ahead and click venv and you’ll then be prompted to choose a Python interpreter. Depending on the number of versions installed on your system, you may have a number to choose from. It’s generally safest to go with the latest to ensure you’re developing with an up to date version of Python. However, if there’s a specific version of Python that you need to target, feel free to proceed with that one as well.
Selecting venv as our preferred virtual environment
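For the curious, the venv option can be roughly approximated with Python’s standard library venv module. This is a minimal sketch of the idea, not what the VSCode plugin actually runs internally; the create_virtual_env helper name is made up for illustration.

```python
import tempfile
import venv
from pathlib import Path


def create_virtual_env(project_dir, with_pip=True):
    """Create a .venv folder inside project_dir and return its path.

    Hypothetical helper for illustration; VSCode may do more than this.
    """
    env_dir = Path(project_dir) / ".venv"
    # with_pip=True bootstraps pip into the new environment via ensurepip.
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


if __name__ == "__main__":
    # Demo in a throwaway directory so no real project is touched.
    with tempfile.TemporaryDirectory() as scratch:
        env_dir = create_virtual_env(scratch, with_pip=False)
        print(sorted(p.name for p in env_dir.iterdir()))
```

In a real project you would point it at your own folder, for example create_virtual_env("starter-pack"), and leave with_pip at its default so pip is available inside the environment.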
After we’ve got that sorted, you should see a small loading bar at the bottom right hand of your editor. But more importantly, we should see a .venv folder pop into existence in our file explorer. We don’t have to worry too much about the specifics of what’s in this folder, but just know that VSCode has activated this virtual environment for you and Python is now using the contents of this directory. It includes the binaries that you’ll be referencing (python, pip, etc.) and any of the packages that you will go on to install.
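If you ever want to double check that Python really is running out of that .venv folder, a small sanity check like the one below can help. It is only a sketch: it relies on the fact that inside a venv sys.prefix and sys.base_prefix differ, and the in_virtual_env name is just something I made up for this example.

```python
import sys


def in_virtual_env():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the Python install it was created from.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtual_env())
```

Drop this into main.py and run it from VSCode. If the interpreter path does not point somewhere inside your .venv folder, the environment is probably not the one selected.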
One additional step (not strictly necessary, but one I highly recommend) is updating your settings to automatically activate your virtual environment when you launch a terminal. This can be done by navigating to Settings . Once this screen pops up, you can begin typing python.terminal and a number of settings should rise to the top. The ones that we’re interested in are Python > Terminal: Activate Env In Current Terminal and Python > Terminal: Activate Environment . In my experience, without these checked you can sometimes find yourself attempting to do something and then realize that your environment actually hasn’t been activated. With these checked, it should take care of this for you. You may need to restart your editor after making these changes.
Configuring venv activation by default in the Settings menu
Note: If you ever find yourself having to manually activate your environment, it’s not a big deal. You can just run source .venv/bin/activate and you should be in good shape.
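That command is for macOS and Linux shells. On Windows, venv puts its activation scripts under a Scripts folder instead of bin. Here is a small sketch that prints the right command for whichever platform you’re on; activate_command is a hypothetical helper, and it assumes the environment folder is named .venv like ours.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath


def activate_command(venv_dir=".venv", os_name=os.name):
    """Return the shell command that activates a venv on the given platform."""
    if os_name == "nt":
        # Windows keeps the activation scripts under Scripts\ instead of bin/.
        return str(PureWindowsPath(venv_dir) / "Scripts" / "activate")
    # macOS and Linux shells load the script with `source`.
    return "source " + str(PurePosixPath(venv_dir) / "bin" / "activate")


print(activate_command())
```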
Packages? What Packages?
Ah, yes. Those pesky packages that we actually need to begin doing our work. What we have at the moment is just a clean environment to begin doing our work in and the promise of not blowing up our laptops. We’re not even at the point of having the tools we need for the job yet! But now we can begin to solve that. The first time through, this takes a bit of additional leg work, and for that we’ll need to drop into a terminal.
Note: It’s going to be okay. I promise. If you’re not familiar with the terminal, it can’t hurt you. We’re going to just start with some basic pip commands to get what we need. We can make sure you two become friends another time.
We’ll first start by launching our terminal. Again, you can find this under the menu Terminal and choose New Terminal . You should also see the shortcut to launch this a bit quicker in the future. You also could use the Command Palette like we did earlier. You’ll find there are usually many ways to do the same thing and you’ll slowly find the way that you’re most comfortable working. Do what works for you!
With our terminal now open, let’s install some packages. As a reminder, the second part of our goal here is to build out a data science toolkit that will allow us to quickly jump into a number of DS exercises. So the packages we’ll jumpstart our project(s) with will be used for that purpose.
With that said, the packages we’ll be interested in are:
- pandas
- seaborn
- numpy
- requests
- matplotlib
Note: This should not be considered an exhaustive list. Again, starter pack. This will give us just enough to get started with some initial exercises. As you begin to do more and more, you’ll find that either a) some of this doesn’t fit your needs or b) it does and actually you want to add more to it. That’s great! Keep what works. Toss the rest! This is most importantly an exercise in building out a toolbox that could work for you. What you fill it with will be completely up to you!
With that rant complete, let’s go ahead and get these packages installed. We’ve got our terminal open, we’ve got our list of packages. Now all we need to do is feed them to pip and we’ll get these installed. We can do this all at the same time with:
pip install pandas seaborn numpy requests matplotlib
You’ll then see a bunch of text pass you by telling you that all of these packages have been installed. You’ll notice we didn’t pass in any particular version. We’ve done that on purpose: at this point, we want the latest. However, there will come a time when you want to pin specific versions so that you keep building against tooling you know works.
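If that wall of text scrolled by too fast, you can ask Python itself what ended up installed. The sketch below uses the standard library’s importlib.metadata; the installed_versions helper and the STARTER_PACKAGES list are names I’ve made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# The starter pack from the list above.
STARTER_PACKAGES = ["pandas", "seaborn", "numpy", "requests", "matplotlib"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, installed in installed_versions(STARTER_PACKAGES).items():
    print(f"{name}: {installed or 'not installed'}")
```

Anything reported as not installed usually means the command ran outside of the virtual environment, so check that your terminal has it activated.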
To make sure we’ve got a list of what exactly it is we pulled down, we can actually use pip to generate ourselves an inventory so that future us can take this same code, in a new environment, and end right back up in the same place as we are today. To do this, we can run:
pip freeze > requirements.txt
If we take a small sample out of that file, we’ll see something like:
requests==2.32.3
seaborn==0.13.2
We can see that it tells us the exact version of each package that has been installed. We could then use the requirements.txt in any other environment (or even the same environment if we trash and recreate our virtual environment) to recreate everything without running those individual pip install commands. We could instead run the following (with our generated file) and get the same result:
pip install -r requirements.txt
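To make the idea of pinned versions a little more concrete, here’s a rough sketch that reads lines in the name==version form shown above into a dictionary. It is deliberately simple and assumes exact pins only; real requirements files can contain other syntax that this skips, and parse_pins is a hypothetical helper, not part of pip.

```python
def parse_pins(requirements_text):
    """Return {package: version} for every `name==version` line."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blanks and anything that is not an exact pin
        name, pinned = line.split("==", 1)
        pins[name.strip()] = pinned.strip()
    return pins


sample = "requests==2.32.3\nseaborn==0.13.2\n"
print(parse_pins(sample))
```

Something like this can be handy for comparing what a project expects against what is actually installed in your environment.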
And just like that, we now know how to set up a space to start developing in and cobble together the tools we need to get started. We also have the ability now to recreate that same environment over and over again should we like to. Almost like it would be cool to use it as a template for any future projects or exercises we might want to get into.
Ah, a template. This was the plan all along.
GitHub Template
And this brings us to the nice, bright pink bow we can slap on all of this. The worst part of any project tends to be getting started. You’ll sit there and think to yourself:
I do this. Every time. Why can’t I just click a button and deal with all of this scaffolding?!
Well, as it turns out, you can. You could keep things simple and just do the ol’ copy paste from this directory into a new directory each time. But... that’s kind of gross and only works from your local machine. What if instead we made this a template on GitHub that we could grab from anywhere?
I’ll save the lecture on the importance of source control for another time. But that aside, being able to get a fresh copy of your template with a few taps of the keyboard can save a ton of time. All we need to do is create a repository in GitHub and commit our main.py and requirements.txt files. We’ll save ourselves the pain of the terminal for the moment and look at this via the UI.
Assuming that we already have a GitHub account and are logged in, we can visit https://github.com/new to create a new repository. For our purposes, we’ll call this example-starter-pack , but you can call it whatever you would like. Click Add a README file as well and then click Create Repository .
Creating a new repository in GitHub
Once we’ve got our empty repository, it should drop you on a screen that looks like this:
A GitHub repository with an empty README
Now all we need to do is add the two files we mentioned above, main.py and requirements.txt . To do that, click the Add file button that we see above and choose either Create new file or Upload files . Since main.py is just going to be an empty file, we can just click Create new file , give it the name main.py and click Commit changes . Since we’re setting this up as our own template repo, it’s okay to commit directly to main . We can talk about the peril of committing to main another day ❤.
At this point, we’ll have something that looks like the following:
Our GitHub repository with our empty main.py file
So now our final piece will be adding the requirements.txt file. Since this is just a handful of lines, let’s go ahead and copy those from VSCode to our clipboard (a little copy pasta never hurt anybody). Once we’ve copied those contents, let’s go ahead and add another file just like we did for main.py . This time we’ll be configuring the name of the file AND pasting the contents into that file.
Adding our inventory of python packages to our repository
With that completed, we’ll commit this file just like we did the last one and we’ll end up with all the contents that we need. You might be asking:
But what about our .venv folder ?!
While that’s important on the machine and during the time you’re working in your virtual environment, once you’re done with it you don’t need it at all! You especially don’t want/need to be committing that into source control as you’ll just be lugging a number of binaries and other packages along that you don’t actually need.
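We’re uploading files by hand here, so .venv never gets the chance to sneak in. If you later start using git from your own machine, though, a .gitignore entry is the usual way to keep it out. Below is a minimal sketch that adds one; ensure_ignored is a hypothetical helper, and it assumes the environment folder is named .venv.

```python
from pathlib import Path


def ensure_ignored(project_dir, entry=".venv/"):
    """Add entry to project_dir/.gitignore unless it is already listed."""
    gitignore = Path(project_dir) / ".gitignore"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    if entry not in lines:
        lines.append(entry)
        gitignore.write_text("\n".join(lines) + "\n")
    return gitignore
```

Calling ensure_ignored(".") from the root of your project is safe to repeat, since it only writes the entry when it is missing.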
But now that we’ve got our repository in the shape that we want, there’s one last step in making it a fully fledged template directory. As it stands, you could just keep cloning this repository, making changes, and then committing it to some other repo. But again, we have better tools to make your developer workflow fly!
What we’ll want to do is click on Settings at the top of our GitHub screen:
Finding the Settings tab in GitHub
Once we’ve clicked through this menu, we should see a checkbox that says Template repository right under our repository name. Go ahead and click that. Once you click it, it will immediately make this repository a template repository!
If we navigate back to the root of our repository, you’ll notice a new button that says Use this template . If you click on that button, you’ll see that you have the option to Create a new repository , which will return you to the repo creation screen that we saw before. This will look almost identical to how we created ours manually, except that it now shows that we’re using our template. It’ll ask us where we want to put it and then we can choose to create a repository as normal from there.
Just like that! No more copy and pasting all over the place. Anytime we want to jump into a new project, we can bust out our newly built toolkit and get right to business.
The End
If you’ve made it this far, you’re probably saying to yourself:
That was a lot of work. And a whole lot of words.
And I completely agree. But like with most things, a little investment up front will save you a whole lot of time in the long run. At this point, we now know how to get our VSCode environment up and running, keep our work confined to just the location we want it in, and set up a toolkit that we can reuse anytime we want to dive into a new project.
Are there better ways? Almost definitely. Are there better tools? I’m sure you’ll continue to refine that as you dig deeper into more complex projects. But for folks just getting started or joining a new team, I’ve found this to be a nice way to remove complexity to allow them to focus on the task(s) at hand and provide them a jumping off point to begin filling out their own toolboxes. My approach has changed over time and I’m sure will continue to change over time and I hope that if you take anything away from this, it’s that yours should too! As I mentioned earlier: keep what works, toss everything else. And then keep doing that, repeatedly.