TDM 30100: Project 8 — 2022

Motivation: Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are typically making a tradeoff between developer speed (the time it takes to write a functioning program) and program speed (how fast your code runs). Unlike programs written in compiled languages, Python code cannot easily be compiled to machine code and shared. Instead, you need to learn how to use virtual environments, and it is good to have an understanding of how to build and push a package to PyPI.

Context: This is the first in a series of 3 projects that focuses on setting up and using virtual environments, and creating a package. This is not intended to teach you everything, but rather, give you some exposure to the topics.

Scope: Python, virtual environments, PyPI

Learning Objectives
  • Explain what a virtual environment is and why it is important.

  • Create, update, and use a virtual environment to run somebody else’s Python code.

  • Create a Python package and publish it on PyPI.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/movies_and_tv/imdb.db

Questions

Question 1

This project will be focused on creating, updating, and understanding Python virtual environments. Since this is The Data Mine, we will pepper in some small data-related tasks, like writing functions to operate on data, but the focus is on virtual environments.

Let’s get started.

Use this article as your reference. First things first. We have a Jupyter notebook that we tend to work in, running our bash code in bash cells. This is very different from your typical environment. For this reason, let’s start by popping open a terminal and working in the terminal.

You can open a terminal in JupyterLab by clicking on the blue "+" button in the upper left-hand corner of the Jupyter interface. Scroll down to the last row and click on the button that says "Terminal".

Start by taking a look at which python3 you are running. Run the following in the terminal.

which python3

Take a look at the available packages as follows.

python3 -m pip list

This doesn’t look right. It doesn’t look like our f2022-s2023 environment, does it? It doesn’t even have pandas installed. This is because we don’t have JupyterLab configured to have our f2022-s2023 version of Python pre-loaded in a fresh terminal session. In fact, with this project, we aren’t going to use that environment!

The f2022-s2023 environment runs inside a container. You will learn more about this later on, but suffice it to say it makes it much more difficult to do what we want to do for this project.

Instead, we are going to use the non-containerized version of Python that is running the JupyterLab instance itself! To load up this environment, run the following.

module load python/jupyterlab

Then, check out how things have changed.

which python3
python3 -m pip list

Looks like we are getting there! Let’s back up a bit and explain some things.

What does which python3 do? which will print out the absolute path to the command which would be executed. In this case, running python3 would be the same as executing /anvil/projects/tdm/apps/python/3.10.5/bin/python3.
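If you would like to see the same idea from inside Python, here is a minimal sketch. It assumes nothing beyond the standard library, and resolve_command is just an illustrative helper name, not something from the article. shutil.which searches your PATH much like the which command does, and sys.executable tells you which interpreter is actually running the code.

```python
import os
import shutil
import sys


def resolve_command(name):
    """Return the absolute path that running `name` would execute, or None."""
    found = shutil.which(name)
    return os.path.abspath(found) if found else None


if __name__ == "__main__":
    print("python3 on PATH:    ", resolve_command("python3"))
    print("running interpreter:", sys.executable)
```

If the two paths differ, the python3 your shell would launch is not the interpreter running this snippet.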

What does the python3 -m pip mean? The -m stands for module-name: it tells python3 to run the named library module (here, pip) as a script. In a nutshell, this ensures that the correct pip (the pip associated with the current python3) is used! This is important because, if you have many versions of Python installed on your system and environment variables aren’t correctly set, you could end up using a completely different pip associated with a completely different version of Python, which could cause all sorts of errors! To prevent this, it is safer to do python3 -m pip instead of just pip.
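The same trick applies when a Python program needs to call pip. The sketch below is only an illustration, and pip_command is a made-up helper name. It builds the command from sys.executable, so the pip that runs can only be the one belonging to the interpreter running the code.

```python
import sys


def pip_command(*args):
    """Build a pip invocation tied to the interpreter running this code."""
    # sys.executable is the path of the current interpreter, so "-m pip"
    # can only pick up the pip installed for that interpreter.
    return [sys.executable, "-m", "pip", *args]


if __name__ == "__main__":
    # Pass the list to subprocess.run(...) to actually execute it.
    print(pip_command("list"))
```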

What does python3 -m pip list do? The python3 -m pip is the same as before. The list command is an argument you can pass to pip that lists the packages installed in the current environment.
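For the curious, a rough Python equivalent of the list command can be sketched with the standard library’s importlib.metadata. This is not how pip itself is implemented, and installed_packages is a hypothetical helper name. It simply reports the distributions the running interpreter can see.

```python
from importlib import metadata


def installed_packages():
    """Return sorted (name, version) pairs visible to this interpreter."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries whose metadata is missing a name
            found.add((name, dist.version))
    return sorted(found, key=lambda pair: pair[0].lower())


if __name__ == "__main__":
    for name, version in installed_packages():
        print(f"{name:<20} {version}")
```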

Perform the following operations.

  1. Use venv to create a new virtual environment called question01.

  2. Confirm that the virtual environment has been created by running the following.

    source question01/bin/activate
  3. This should activate your virtual environment. You will notice that python3 now points to an interpreter in your virtual environment directory.

    which python3
    output
    /path/to/question01/bin/python3
  4. In addition, you can see the blank slate when it comes to installed Python packages.

    python3 -m pip list
    output
    Package    Version
    ---------- -------
    pip        22.0.4
    setuptools 58.1.0
    WARNING: You are using pip version 22.0.4; however, version 22.2.2 is available.
    You should consider upgrading via the '/home/x-kamstut/question01/bin/python3 -m pip install --upgrade pip' command.
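
If you ever want to check from inside Python whether a virtual environment is in use, the following sketch may help. It relies on sys.prefix and sys.base_prefix from the standard library; in_virtual_environment is just an illustrative name.

```python
import sys


def in_virtual_environment():
    """True when the interpreter was started from a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the installation it was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("prefix:     ", sys.prefix)
    print("base prefix:", sys.base_prefix)
    print("in a virtual environment:", in_virtual_environment())
```

Try running it once with question01 active and once after deactivating it.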
Items to submit

See here on how to properly include a screenshot in your Jupyter notebook. If you do not properly submit the screenshot, you will likely lose points, so please take a minute to read it.

  • Screenshot showing source question01/bin/activate output.

  • Screenshot showing which python3 output after activating the virtual environment.

  • Screenshot showing python3 -m pip list output after activating the virtual environment.

Question 2

Okay, in question (1) you ran some commands and supposedly created your own virtual environment. You are possibly still confused about what you did or why; that is okay! Things will hopefully become clearer as you progress.

Read this section of the article provided in question (1). In your own words, explain 2 good reasons why virtual environments are important when using Python. Place these explanations in a markdown cell in your notebook.

We are going to create and modify and destroy environments quite a bit! Don’t be intimidated by messing around with your environment.

Okay, now that you’ve grokked why virtual environments are important, let’s try to see a virtual environment in action.

Activate your empty virtual environment from question (1) (if it is not already active). If you were to try to import the requests package, what do you expect would happen? If you were to deactivate your virtual environment and then try to import the requests package, what do you expect would happen?

Test out both! First activate your virtual environment from question (1), and then run python3 and try to import requests. Next, run deactivate to deactivate your virtual environment. Run python3 and try to import requests. Were the results what you expected? Please include 2 screenshots — 1 for each attempt at importing requests.

As you should hopefully see — the virtual environments do work! When a certain environment is active, only a certain set of packages is made available! Pretty cool!
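The experiment above can also be scripted. Here is a small sketch, assuming only the standard library, with can_import as a hypothetical helper name. It attempts the import and reports whether the module is available to whichever interpreter runs it.

```python
import importlib


def can_import(module_name):
    """Try to import a module and report whether it succeeded."""
    try:
        importlib.import_module(module_name)
    except ModuleNotFoundError:
        return False
    return True


if __name__ == "__main__":
    for name in ("json", "requests"):
        status = "available" if can_import(name) else "not installed here"
        print(f"{name}: {status}")
```

Running the same file with different environments active should give different answers for requests, while json (part of the standard library) is available either way.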

Items to submit
  • 1-2 sentences, per reason, on why virtual environments are important when using Python.

  • 1 screenshot showing the attempt to import the requests library from within your question01 virtual environment.

  • 1 screenshot showing the attempt to import the requests library from outside the question01 virtual environment.

Question 3

Create a Python script called imdb.py that accepts a subcommand, imdb, followed by a single argument, id, and prints out the following.

python3 imdb.py imdb tt4236770
output
Title: Yellowstone
Rating: 8.6

You can use the following as your skeleton.

#!/usr/bin/env python3

import argparse
import sqlite3
import sys
from rich import print

def get_info(iid: str) -> None:
    """
    Given an imdb id, print out some basic info about the title.
    """

    conn = sqlite3.connect("/anvil/projects/tdm/data/movies_and_tv/imdb.db")
    cur = conn.cursor()

    # make a query (fill in code here)

    # print results
    print(f"Title: [bold blue]{title}[/bold blue]\nRating: [bold green]{rating}[/bold green]")


def main():
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(help="possible commands", dest="command")
    some_parser = subparsers.add_parser("imdb", help="")
    some_parser.add_argument("id", help="id to get info about")

    if len(sys.argv) == 1:
        parser.print_help()
        sys.exit(1)

    args = parser.parse_args()

    if args.command == "imdb":
        get_info(args.id)

if __name__ == "__main__":
    main()
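
If you have not used sqlite3 from Python before, the sketch below shows the general shape of a parameterized query. It is not the solution: it runs against a throwaway in-memory database, and the table name demo and its columns are invented for illustration, so the real imdb.db layout will differ.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database with a hypothetical one-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demo (id TEXT PRIMARY KEY, title TEXT, rating REAL)")
    conn.execute("INSERT INTO demo VALUES (?, ?, ?)", ("tt4236770", "Yellowstone", 8.6))
    return conn


def lookup(conn, iid):
    """Return (title, rating) for an id, or None if the id is unknown."""
    cur = conn.cursor()
    # The ? placeholder lets sqlite3 handle quoting of the user-supplied id.
    cur.execute("SELECT title, rating FROM demo WHERE id = ?", (iid,))
    return cur.fetchone()


if __name__ == "__main__":
    conn = make_demo_db()
    print(lookup(conn, "tt4236770"))
    conn.close()
```

Using a placeholder instead of pasting the id into the SQL string keeps odd input from breaking the query.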

Deactivate any environment you may have active.

deactivate

Confirm that the proper python3 is active.

which python3
output
/anvil/projects/tdm/apps/python/3.10.5/bin/python3

Now test out your script by running the following.

python3 imdb.py imdb tt4236770

What happens? The script should fail, because the package rich should not be installed in our current environment. Easy enough to fix, right? After all, we know how to make our own virtual environments now!

Create a virtual environment called question03. This time, when creating your virtual environment, add an additional flag --copies to the very end of the command. Activate your virtual environment and confirm that we are using the correct environment.

source question03/bin/activate
which python3

Immediately trying the script again should fail, since we still don’t have the rich package installed.

python3 imdb.py imdb tt4236770
output
ModuleNotFoundError: No module named 'rich'

Okay! Use pip (using our python3 -m pip trick) to install rich and try to run the script again!

Not only should the script now work, but, if you take a look at the packages installed in your environment, there should be some new additions.

python3 -m pip list
output
Package    Version
---------- -------
commonmark 0.9.1
pip        22.0.4
Pygments   2.13.0
rich       12.6.0
setuptools 58.1.0

That is awesome! You just solved the issue of not being able to run some Python code because a package was not installed for you. You did this by first creating your own custom Python virtual environment, installing the required package to your virtual environment, and then executing the code that wasn’t previously working!

Items to submit
  • Screenshot showing the activation of the question03 virtual environment, the pip install, and successful output of the script.

  • Screenshot showing the resulting set of packages, python3 -m pip list, for the question03 virtual environment.

Question 4

Okay, let’s take a tiny step back to peek at a few underlying details of our question01 and question03 virtual environments.

Specifically, start with the question01 environment. The entire environment lives within that question01 directory, doesn’t it? Or does it!?

ls -la question01/bin

Notice anything about the contents of the question01 bin directory? They are symbolic links! python3 actually points to the same interpreter that was active when we created the virtual environment, the /anvil/projects/tdm/apps/python/3.10.5/bin/python3 interpreter! But wait, how do we have a different set of packages then, if we are using the same Python interpreter? The answer is that your Python interpreter looks in a variety of locations for your packages. Activating your virtual environment puts question01/bin first on your PATH, and when Python is started from there, the environment’s own site-packages directory is added to the list of locations it searches.
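If you prefer to inspect the links from Python rather than with ls, something like this sketch would do. It uses only os from the standard library, link_targets is a made-up name, and the question01/bin path assumes you run it from the directory where you created the environment.

```python
import os


def link_targets(directory):
    """Map each entry in a directory to its symlink target (None if not a link)."""
    targets = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        targets[name] = os.path.realpath(path) if os.path.islink(path) else None
    return targets


if __name__ == "__main__":
    bin_dir = os.path.join("question01", "bin")
    if os.path.isdir(bin_dir):
        for name, target in link_targets(bin_dir).items():
            print(f"{name:<16} -> {target or '(not a symbolic link)'}")
```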

If you run the following, you will see the list of directories that Python searches for packages, when importing.

import sys

sys.path
example output
['', '/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10/site-packages', '/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10', '/anvil/projects/tdm/apps/python/3.10.5/lib/python310.zip', '/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10', '/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10/lib-dynload', '/home/x-kamstut/question01/lib/python3.10/site-packages']

sys.path is initialized from the PYTHONPATH environment variable, plus some additional installation-dependent defaults. If you take a peek in the site-packages directory of the environment where you installed rich in question (3), question03/lib/python3.10/site-packages, you will see where rich is located. So, even if you look in /anvil/projects/tdm/apps/python/3.10.5/lib/python3.10/site-packages and see that rich is not installed in that location, rich will still be successfully imported when that environment is active, because Python searches all of the locations in sys.path and one of them is the environment’s own site-packages directory.
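To see which of those locations a particular module would actually be loaded from, you can ask the import system directly. This is a minimal sketch using importlib.util.find_spec from the standard library, and module_location is an illustrative helper name.

```python
import importlib.util
import sys


def module_location(module_name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec else None


if __name__ == "__main__":
    for entry in sys.path:
        print("search location:", entry)
    print("json resolves to:", module_location("json"))
```

Swap "json" for any package you have installed to see which directory it comes from.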

This raises the question: what if /anvil/projects/tdm/apps/python/3.10.5/lib/python3.10/site-packages has an older version of rich installed? Which version will be imported? Well, let’s test this out!

If you look at plotly in the jupyterlab environment, you will see it is version 5.8.2.

import plotly
plotly.__version__
output
5.8.2

Activate your question03 environment and install plotly==5.10.0. Re-run the following code.

import plotly
plotly.__version__

What is your output? Is that expected?
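
Besides the __version__ attribute, you can ask the packaging metadata which version is installed. The sketch below uses importlib.metadata from the standard library, and installed_version is a hypothetical helper name. It returns None instead of raising when the package is missing.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string for a distribution, or None."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("plotly:", installed_version("plotly"))
```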

We modified this question on Thursday, October 27, due to a mistake by your instructor (Kevin). If you previously did this problem, no worries, you will get credit either way.

If you take a look at the python files in question03/bin, you will notice that they are not symbolic links, but actual copies of the original interpreter! This is what the --copies argument did earlier on! In general, you’ll likely be fine using venv without the --copies flag.

Items to submit
  • Screenshots of your operations performed from start to finish for this question.

  • 1-2 sentences explaining where Python looks for packages.

Question 5

Last, but certainly not least, is the important topic of pinning dependencies. This practice will allow someone else to replicate the exact set of packages needed to run your Python application.

By default, python3 -m pip install numpy will install the newest compatible version of numpy to your current environment. Sometimes, that version could be too new and create issues with old code. This is why pinning is important.

You can choose to install an exact version of a package by specifying the version. For example, you could install numpy version 1.16, even though the newest version is (as of writing) 1.23. Just run python3 -m pip install numpy==1.16.

This is great, but is there an easy way to record the entire list of packages in your current virtual environment, along with their versions? Yes! Yes there is! Try it out.

python3 -m pip freeze > requirements.txt
cat requirements.txt

That’s pretty cool! That is a specially formatted list containing a pinned set of packages. You could do the reverse as well. Create a new file called requirements.txt with the following contents copied and pasted.

requirements.txt contents
commonmark==0.9.1
plotly==5.10.0
Pygments==2.13.0
requests==2.2.1
rich==12.6.0
tenacity==8.1.0
thedatamine==0.1.3

You can use the -r option of pip to install all of those pinned packages to an environment. Test it out! Create another new virtual environment called question05, activate the environment, and use the -r option and the requirements.txt file to install all of the packages, with the exact same versions. Double check that the results are the same, and that the installed packages are identical to the requirements.txt file.
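One way to do the double check is with a short script. The sketch below is only a starting point: parse_pins and find_mismatches are made-up names, and it only understands simple name==version lines like the ones above, not the full requirements file format.

```python
from importlib import metadata


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)} for every pin that is not satisfied."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    sample = "rich==12.6.0\nplotly==5.10.0\n"
    print(find_mismatches(parse_pins(sample)) or "all pins satisfied")
```

To check your real file, read requirements.txt into a string and pass it to parse_pins in place of the sample text.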

Great job! Now, with some Python code and a requirements.txt file, you should be able to set up a virtual environment and run your friend’s or co-worker’s code! Very cool!

Unfortunately, there is more to this mess than meets the eye, and a lot more that can go wrong. But these basics will serve you well and help you solve lots and lots of problems!

Items to submit
  • Screenshots showing the results of running the bash commands from the start of this question to the end.

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.