Roles: Full-stack development
Timeline: November 2018
Tools: Python (including NLTK, Matplotlib, NetworkX)
Background
This program was my final project in 15-112, Fundamentals of Programming and Computer Science, at Carnegie Mellon University. Over the three-week project period, students are asked to independently design and implement a unique interactive program in Python.
For my project, I chose to build a Natural Language Processing program that doubles as a data visualization and labeling tool for emails. The program generates topic labels and conducts both semantic and sentiment analysis, presenting details and summary insights to help identify important emails.
Problem


Ideation
Wireframes
I first sketched my ideas on paper to map out the flow of the program, which culminates in a "results" screen offering insight into the emails analyzed. Because we learned Tkinter for graphics in Python, I decided to keep the design fairly minimal and focus on the core of the program: the analysis.
Solution
My final program performed relatively well at providing useful insights into the contents of email complaints. I measured this against a large dataset of complaints from data.gov by comparing the program's results to the labels already tagged in that dataset.
The six features I ultimately included in the program were:
     1. Labeling
     2. Word frequency distribution
     3. Network diagram connecting all recipients to senders
     4. Summarization
     5. Sentiment analysis
     6. A downloadable CSV of emails with results attached
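As a rough illustration of features 2 and 3, here is a minimal sketch of a word frequency count and a sender-to-recipient edge list. It uses only the Python standard library rather than NLTK or NetworkX, and the sample emails, field names, and function names are all assumptions made up for this example, not the project's actual code.

```python
import re
from collections import Counter

# Hypothetical sample data; the "sender", "recipients", and "body"
# field names are assumptions for illustration only.
EMAILS = [
    {
        "sender": "ana@example.com",
        "recipients": ["support@example.com"],
        "body": "My card was charged twice. Please refund the duplicate charge.",
    },
    {
        "sender": "raj@example.com",
        "recipients": ["support@example.com", "billing@example.com"],
        "body": "The refund for my card never arrived.",
    },
]


def word_frequencies(emails):
    """Count how often each lowercase word appears across all email bodies."""
    counts = Counter()
    for email in emails:
        counts.update(re.findall(r"[a-z']+", email["body"].lower()))
    return counts


def sender_recipient_edges(emails):
    """Count messages per (sender, recipient) pair, i.e. weighted directed edges."""
    edges = Counter()
    for email in emails:
        for recipient in email["recipients"]:
            edges[(email["sender"], recipient)] += 1
    return edges


if __name__ == "__main__":
    print(word_frequencies(EMAILS).most_common(5))
    for (sender, recipient), weight in sender_recipient_edges(EMAILS).items():
        print(f"{sender} -> {recipient} ({weight})")
```

The edge counts could then be handed to a graph library such as NetworkX for drawing, and the word counts to Matplotlib for plotting.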
A video showcasing my final solution can be found below, along with a link to my GitHub repository, where you can find my code and full work.
Reflection
Labely's features are very difficult to implement reliably in practice, even given today's machine learning and NLP capabilities and standards. Consequently, the program certainly has room for improvement in filtering, labeling, and analyzing the data reliably overall. Given a dataset completely unrelated to company complaints (e.g. my ~6,000 college admissions newsletter and marketing emails), the program will still pull out common labels, though it doesn't always recognize pronouns or other irrelevant words.
However, with no prior experience using NLP or filtering through big data, I am proud of the results and look forward to continuing my education in the application of machine learning in the future.
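One possible way to approach the pronoun and irrelevant-word problem mentioned above is to filter candidate words against explicit exclusion lists before ranking them as labels. The sketch below is only an illustration of that idea: the word lists, the `candidate_labels` name, and the length cutoff are assumptions, not what the project does.

```python
import re
from collections import Counter

# Small, hand-picked exclusion lists (assumptions for illustration;
# a real list would be far longer, e.g. a full stopword corpus).
PRONOUNS = {"i", "me", "my", "we", "us", "our", "you", "your",
            "he", "she", "it", "they", "them", "their"}
FILLER = {"the", "and", "for", "should", "will", "now", "before"}


def candidate_labels(text, top_n=3):
    """Return the top_n most frequent words that are not pronouns or filler."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [
        word for word in words
        if word not in PRONOUNS and word not in FILLER and len(word) > 2
    ]
    return [word for word, _ in Counter(kept).most_common(top_n)]


if __name__ == "__main__":
    sample = (
        "You should apply now. We think you will love our scholarship. "
        "Apply for the scholarship before the scholarship deadline."
    )
    print(candidate_labels(sample, top_n=2))
```

A fixed list like this is easy to reason about but will miss domain-specific noise, which is one reason a dataset such as marketing newsletters can still produce unhelpful labels.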