Date of Award

Spring 2024

Document Type

Restricted Thesis

Terms of Use

© 2024 Hoang (Tommy) Vu. All rights reserved. Access to this work is restricted to users within the Swarthmore College network and may only be used for non-commercial, educational, and research purposes. Sharing with users outside of the Swarthmore College network is expressly prohibited. For all other uses, including reproduction and distribution, please contact the copyright holder.

Degree Name

Bachelor of Arts


Department

Engineering Department, Computer Science Department

First Advisor

Allan R. Moser

Second Advisor

Christian Murphy

Abstract


With the rapid advancement of computer vision and artificial intelligence, object detection and tracking have become increasingly relevant, with a growing range of practical applications in society. This project simulates these applications by developing a web application for object detection and tracking built on the Streamlit framework. Object detection is implemented with two cutting-edge algorithms: You Only Look Once version 8 (YOLOv8) and Mask Region-based Convolutional Neural Network (Mask R-CNN). YOLOv8 is fast and reasonably accurate, which makes it well suited to real-time applications, but it is less effective on small objects and at object segmentation. Mask R-CNN, by contrast, is more effective at object segmentation and detects objects of varied sizes with higher accuracy, at the cost of slower run time and greater use of computational resources. To enable object tracking, the application uses two popular tracking algorithms, BoT-SORT and ByteTrack. BoT-SORT achieves higher accuracy by using both appearance and motion cues, but it runs more slowly, demands more computational power, and is sensitive to changes in appearance. Conversely, ByteTrack is generally faster, consumes less computational power, and tracks better when intermediate frames are missing, but it has relatively lower accuracy and a higher tendency toward incorrect tracking in crowded scenes.
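
The motion-only association behind ByteTrack-style tracking can be illustrated with a minimal sketch. It is not the thesis implementation or the ByteTrack reference code: it assumes boxes in `(x1, y1, x2, y2)` form, swaps in greedy IoU matching for the Kalman-filter prediction and Hungarian assignment used by the published tracker, and uses hypothetical names (`iou`, `greedy_match`, `two_stage_associate`) and illustrative thresholds (`high`, `low`, `min_iou`). It shows only the two-stage idea: match high-confidence detections first, then try to recover the remaining tracks with low-confidence detections.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou):
    """Greedily pair track ids with detection indices by descending IoU.

    Assumption: greedy matching stands in for Hungarian assignment.
    """
    pairs = sorted(
        ((iou(box, det["box"]), tid, di)
         for tid, box in tracks.items()
         for di, det in enumerate(detections)),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, tid, di in pairs:
        if score < min_iou:
            break  # all remaining pairs overlap even less
        if tid in used_tracks or di in used_dets:
            continue
        matches[tid] = di
        used_tracks.add(tid)
        used_dets.add(di)
    return matches


def two_stage_associate(tracks, detections, high=0.5, low=0.1, min_iou=0.3):
    """Return {track_id: new_box} using a two-stage, ByteTrack-style pass.

    The thresholds are illustrative assumptions, not tuned values.
    """
    high_dets = [d for d in detections if d["score"] >= high]
    low_dets = [d for d in detections if low <= d["score"] < high]

    # Stage 1: match tracks against high-confidence detections.
    first = greedy_match(tracks, high_dets, min_iou)
    assigned = {tid: high_dets[di]["box"] for tid, di in first.items()}

    # Stage 2: try to recover unmatched tracks with low-confidence detections.
    remaining = {tid: box for tid, box in tracks.items() if tid not in first}
    second = greedy_match(remaining, low_dets, min_iou)
    assigned.update({tid: low_dets[di]["box"] for tid, di in second.items()})
    return assigned
```

In this sketch, a track whose object is partly occluded and therefore detected with a low score can still be continued in the second stage, while detections scoring below `low` are ignored entirely.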