TDM 30200: Project 6 — 2023
Motivation: In this project we will slowly get familiar with SLURM, the job scheduler installed on Anvil.
Context: This is the second in a series of (now) 4 projects focused on parallel computing using SLURM and Python.
Scope: SLURM, unix, Python
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/coco/unlabeled2017/*.jpg
- /anvil/projects/tdm/data/coco/attempt02/*.jpg
Questions
Interested in being a TA? Please apply: purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE |
Question 1
The more you practice the clearer your understanding will be. So we will be putting our new skills to use to solve a problem.
We begin with a dataset full of images: /anvil/projects/tdm/data/coco/unlabeled2017/*.jpg.
We know a picture of Dr. Ward is (naturally) included in the folder. The problem is, Dr. Ward is sneaky and he has added a duplicate image of himself in our dataset. This duplicate could cause problems and we need a clean dataset.
It is time consuming and not best practice to manually go through the entire dataset to find the duplicate. Thinking back to some of the past work, we remember that a hash algorithm is a good way to identify the duplicate image.
Below is code you could use to produce a hash of an image.
import hashlib
with open("/anvil/projects/tdm/data/coco/unlabeled2017/000000000013.jpg", "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())
In general, a hash function is a function that takes an input and produces a "hash", an alphanumeric string that is effectively unique to that input. This means that if you find two identical hashes, you can most likely assume that the inputs are identical.
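As a small illustration of that property, here is a minimal sketch that hashes byte strings instead of image files. The helper name sha256_hex and the sample byte strings are hypothetical stand-ins, not part of the project files.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hexadecimal text."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-ins for the raw contents of image files.
original = b"example image bytes"
exact_copy = bytes(original)       # byte-for-byte identical content
altered = b"example image bytez"   # differs only in the last byte

# Identical input always gives an identical hash.
print(sha256_hex(original) == sha256_hex(exact_copy))
# Different input gives a different hash (collisions are astronomically unlikely).
print(sha256_hex(original) == sha256_hex(altered))
```

Comparing digests this way is what lets you detect exact copies without ever comparing the images themselves.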
By finding the hash of every image in the first folder, you can then use sets to quickly find the duplicate image. You can write a Python script that outputs a file containing the hash of each image.
For our example, the file |
Parallelizing the file creation and search process will make finding the duplicate faster.
#!/usr/bin/python3
import os
import sys
import hashlib
import argparse


def hash_file_and_save(files, output_directory):
    """
    Given a list of absolute paths to files, generate a hash of each file and
    save it in the output directory with the same name as the original file.
    """
    for file in files:
        file_name = os.path.basename(file)
        # Read the raw bytes inside a with block so the file is closed promptly.
        with open(file, "rb") as input_file:
            file_hash = hashlib.sha256(input_file.read()).hexdigest()
        output_file_path = os.path.join(output_directory, file_name)
        with open(output_file_path, "w") as output_file:
            output_file.write(file_hash)


def main():
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(help="possible commands", dest='command')
    hash_parser = subparsers.add_parser("hash", help="generate and save hash")
    hash_parser.add_argument("files", help="files to hash", nargs="+")
    hash_parser.add_argument("-o", "--output", help="directory to output file to", required=True)

    # With no arguments at all, show the help text instead of failing silently.
    if len(sys.argv) == 1:
        parser.print_help()
        sys.exit(1)

    args = parser.parse_args()

    if args.command == "hash":
        hash_file_and_save(args.files, args.output)


if __name__ == "__main__":
    main()
It is not efficient to have one srun command for each image: you would have to build the job script programmatically, and because the script runs quickly, the overhead of calling srun, allocating resources, etc. would rapidly add up to a lot of wasted time. Instead, for efficiency, create a job script that splits the images into groups of 12500 or fewer. Then, with 10 srun commands, you will be able to use the provided Python script to process the images, up to 12500 per command.
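The grouping step can be sketched as follows. In the job script itself this would be done in Bash; the Python version below only illustrates the arithmetic. The function name split_into_groups and the example file names are hypothetical, and a group size of 10 is used so the result is easy to read.

```python
from typing import List


def split_into_groups(paths: List[str], group_size: int) -> List[List[str]]:
    """Split paths into consecutive groups of at most group_size items."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [paths[i:i + group_size] for i in range(0, len(paths), group_size)]


# Hypothetical file names standing in for the real image paths.
example_paths = [f"image_{i:03d}.jpg" for i in range(25)]
groups = split_into_groups(example_paths, 10)
print([len(group) for group in groups])  # prints [10, 10, 5]
```

Every group except possibly the last is full, which is why "12500 or fewer" is the right way to describe the group sizes.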
The Python script we’ve provided works as follows.
./hash.py hash --output /path/to/outputfiles/ /path/to/image1.jpg /path/to/image2.jpg
The above command will generate a hash of each of the two images (although any number of images could be provided) and save each hash in the output directory, in a file with the same name as the original image. For example, the following command will calculate the hash of the image 000000000013.jpg and save it in a file named 000000000013.jpg in the $SCRATCH directory. This file is not an image; it is a text file containing the hash, 7ad591844b88ee711d1eb60c4ee6bb776c4795e9cb4616560cb26d2799493afe. You can see this by running cat $SCRATCH/000000000013.jpg.
./hash.py hash --output $SCRATCH /anvil/projects/tdm/data/coco/unlabeled2017/000000000013.jpg
You’ll need to give execute permissions to your |
This stackoverflow post shows how to get a Bash array full of absolute paths to files in a folder. |
To pass many arguments (n arguments) to our Python script, you can |
This stackoverflow post shows how to break an array of values into groups of x. |
Don’t forget to clear out the SLURM environment variables in any new terminal session:
|
Create a job script that processes all of the images in the folder and outputs the hash of each image into a file with the same name as the original image. Output these files into a folder in $SCRATCH, for example, $SCRATCH/q1output. You will likely want to create the q1output directory before running your job script.
This job took about 3 minutes and 32 seconds to run. Finding the duplicate image took about 36 seconds. |
Once the images are all hashed, in your Jupyter notebook, write Python code that processes all of the hashes (by reading the files you’ve saved in $SCRATCH/q1output) and prints out the name of one of the duplicate images. Display the image in your notebook using the following code.
from IPython import display
display.Image("/path/to/duplicate_image.jpg")
To answer this question, submit the functioning job script AND the code in the Jupyter notebook that was used to find (and display) the duplicate image.
Using sets will help find the duplicate image. One set can store hashes that have not been seen before. The other set can store duplicates; since there is only 1 duplicate, you can return the filename as soon as it is found! This stackoverflow post shares some ideas to manage this.
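One possible shape for this search is sketched below. It is not the only approach: instead of two sets it uses a single dictionary (hash to first filename), so that both the earlier and the later file can be reported. The function name find_first_duplicate is hypothetical, and the sketch assumes each file in the directory contains one hash as text, as produced by the provided script.

```python
import os
from typing import Optional, Tuple


def find_first_duplicate(hash_dir: str) -> Optional[Tuple[str, str]]:
    """Return (earlier_file, later_file) for the first repeated hash, or None."""
    seen = {}  # maps hash text -> name of the first file that contained it
    for name in sorted(os.listdir(hash_dir)):
        with open(os.path.join(hash_dir, name)) as f:
            file_hash = f.read().strip()
        if file_hash in seen:
            # Stop as soon as a hash repeats; no need to read the rest.
            return seen[file_hash], name
        seen[file_hash] = name
    return None
```

A call might look like find_first_duplicate(os.path.join(os.environ["SCRATCH"], "q1output")), assuming the SCRATCH environment variable is set in your notebook session.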
- Code used to solve this problem.
- Output from running the code.
Question 2
In the previous question, you were able to use the sha256 hash to efficiently find the extra image that the trickster Dr. Ward added to our dataset. Dr. Ward, knowing all about hashing algorithms, thinks he has a simple solution to circumventing your work. In the "new" dataset, /anvil/projects/tdm/data/coco/attempt02, he has modified the value of a single pixel of his duplicate image.
Re-run your SLURM job from the previous question on the new dataset, and process the results to try to find the duplicate image. Was Dr. Ward’s modification successful? Do your best to explain why or why not.
I would start by creating a new folder in
Next, I would update the job script to output files to the new directory, and change the directory of the input files to the new dataset. |
If at this point in time you are wondering "why would we do this when we can just use |
- Code used to solve this problem.
- Output from running the code.
Question 3
Unfortunately, Dr. Ward was right, and our methodology didn’t work. Luckily, there is a cool technique called perceptual hashing that is almost meant just for this! Perceptual hashing is a technique that can be used to know whether or not any two images appear the same, without actually viewing the images. The general idea is this. Given two images that are essentially the same (maybe they have a few different pixels, have been cropped, gone through a filter, etc.), a perceptual hash can give you a very good idea whether the images are the "same" (or close enough). Of course, it is not a perfect tool, but most likely good enough for our purposes.
To be a little more specific, two images are very likely the same if their perceptual hashes are the same. If two perceptual hashes are the same, their Hamming distance is 0. For example, if your hashes were 8f373714acfcf4d0 and 8f373714acfcf4d0, the Hamming distance would be 0, because if you convert the hexadecimal values to binary, the values are identical at each position in the string of 0s and 1s. If exactly one position didn't match after converting to binary, this would be a Hamming distance of 1.
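The definition above can be checked directly with a few lines of plain Python. This is a minimal sketch that works on the hexadecimal strings themselves; the function name hamming_distance is hypothetical, and the second hash in the last call is a made-up value that differs from the example hash in its final bit only.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Count the bit positions at which two equal-length hex strings differ."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two values disagree.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


print(hamming_distance("8f373714acfcf4d0", "8f373714acfcf4d0"))  # prints 0
print(hamming_distance("8f373714acfcf4d0", "8f373714acfcf4d1"))  # prints 1
```

The final hex digits 0 and 1 are 0000 and 0001 in binary, so exactly one bit differs.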
Use the imagehash library, and modify your job script from the previous project to use perceptual hashing instead of the sha256 algorithm. It should produce 1 file for each image, where the filename remains the same as the original image and the contents of the file are the hash.
Make sure to clear out your slurm environment variables before submitting your job to run with
If you are in a bash cell in Jupyter Lab, do the same.
|
In order for the
|
To help get you going using this package, let me demonstrate using the package.
|
Make sure that you pass the hash as a string to the |
Make sure that once you’ve written your script, |
It would be a good idea to make sure you’ve modified your hash script to work properly with the
This should produce a file, |
Make sure your
|
Process the results. Did you find the duplicate image? Explain what you think could have happened.
- Code used to solve this problem.
- Output from running the code.
Question 4
What!?! That is pretty cool! You found the "wrong" duplicate image? Well, I guess it is totally fine to find multiple duplicates. Modify the code you used to find the duplicates so it finds all of the duplicates and originals. In total there should be 50. Display 2-5 of the pairs (or triplets or more). Can you see any of the subtle differences? Hopefully you find the results to be pretty cool! If you look, you will find Dr. Ward's hidden picture, but you do not have to exhaustively display all 50 images.
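One way to collect every group of matching images, rather than stopping at the first match, is to group filenames by hash. The sketch below assumes you have already read the hash files into a dictionary mapping filename to hash string; the function name group_duplicates and the example data are hypothetical. It only groups exact matches (Hamming distance 0).

```python
from collections import defaultdict
from typing import Dict, List


def group_duplicates(hash_by_file: Dict[str, str]) -> List[List[str]]:
    """Return the groups of filenames that share the same hash string."""
    groups = defaultdict(list)
    for name, file_hash in hash_by_file.items():
        groups[file_hash].append(name)
    # Keep only hashes that occur more than once: originals plus duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical filenames and hashes, for illustration only.
example = {
    "a.jpg": "8f373714acfcf4d0",
    "b.jpg": "0000000000000000",
    "c.jpg": "8f373714acfcf4d0",
}
print(group_duplicates(example))  # prints [['a.jpg', 'c.jpg']]
```

Each returned group holds an original together with its duplicates, so pairs, triplets, or larger groups all come out of the same pass.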
Please turn in all 3 job scripts (for questions 1-3). Please turn in both |
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.