◆ Portfolio Projects

Selected Project Work

Three projects from the MS Applied Data Science program, each representing a different domain, technique set, and stakeholder context.

◆ IST 707 · Applied Machine Learning · Spring 2026

Predicting NYC Restaurant Inspection Likelihood

Using historical inspection patterns to help restaurant owners prepare proactively for future inspections, shifting the NYC DOHMH from a reactive scheduling model to a risk-based, data-driven approach.

◆ View Project on GitHub →
Dataset Source: NYC DOHMH Open Data
Primary Tools: Python · Scikit-learn · Pandas
Model Type: Classification (B/C Grade Risk)
Core Goal: Risk-based inspection prioritization

Project Overview

The NYC Department of Health and Mental Hygiene currently conducts restaurant inspections on a fixed schedule, typically once per year for all establishments regardless of their compliance history. This one-size-fits-all approach means limited inspector resources are distributed evenly even when risk is not.

This project built a machine learning system that identifies restaurants at high risk of receiving poor inspection grades (B or C) before their scheduled inspections occur. Shifting from a reactive to a proactive model lets inspectors focus where they are most needed, potentially preventing food safety incidents before they happen.

Approach and Methodology

The model analyzed multiple signals: restaurant characteristics such as cuisine type, borough, and establishment age; historical inspection patterns including prior grades and violation counts; temporal trends in violation frequency; and neighborhood-level factors that correlate with compliance patterns.

A key design principle was interpretability. Unlike black-box models that maximize accuracy at the expense of explainability, this system produces predictions that can be explained to restaurant owners and justified to the public — making the choice of model architecture both a technical and ethical decision.
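The interpretability principle above can be illustrated with a linear model whose coefficients read as named risk factors. This is a hedged sketch on synthetic data: the feature names (`prior_violations`, `establishment_age`, `had_prior_bc_grade`) are assumptions for illustration, not the project's actual schema.

```python
# Sketch: an interpretable risk model for B/C-grade prediction.
# Features and data are invented, not the project's actual dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Illustrative signals: prior violation count, establishment age (years),
# and an indicator for a prior B/C grade.
X = np.column_stack([
    rng.poisson(3, n),          # prior_violations
    rng.uniform(0, 30, n),      # establishment_age
    rng.integers(0, 2, n),      # had_prior_bc_grade
])
# Synthetic label: risk rises with violations and prior poor grades.
logit = 0.5 * X[:, 0] + 1.2 * X[:, 2] - 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Each coefficient maps directly to a named, explainable risk factor --
# the property that makes the prediction defensible to owners and the public.
for name, coef in zip(
    ["prior_violations", "establishment_age", "had_prior_bc_grade"],
    model.coef_[0],
):
    print(f"{name}: {coef:+.2f}")
```

A logistic regression is one of several architectures that satisfy this constraint; the trade-off against higher-capacity black-box models is exactly the technical-and-ethical decision the paragraph above describes.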

What Makes This Novel

While prior research has explored predictive inspection models, this project incorporated a richer feature set than earlier studies used, adding temporal patterns and neighborhood characteristics. Framing predictions around NYC's A/B/C grading system made them directly actionable for restaurant owners, who understand exactly what each grade means for their business.

The model was designed with multiple stakeholder audiences in mind: health inspectors need prioritized risk queues; restaurant owners need understandable risk signals; the public needs confidence the system is fair. These competing needs shaped every design decision.

Program Outcomes Demonstrated

02 · Actionable Insight
03 · Predictive Modeling
04 · Python / ML
05 · Communication
06 · Ethics / Fairness

◆ IST 737 · Visual Analytic Dashboards · Spring 2026

Two Decades of Student Aid: Is NY's TAP Keeping Up?

Analyzing New York State's Tuition Assistance Program from 2000 to present to surface inequities in aid distribution across income levels, age groups, and institution types.

◆ Tableau Dashboard — Link Coming Soon
Dataset Source: NY State Open Data Portal
Primary Tool: Tableau
Analysis Type: Time Series · Comparative
Time Range: 2000 to Present

Project Overview

New York's Tuition Assistance Program is the state's largest financial aid initiative, supporting eligible residents in paying tuition at in-state colleges. But as tuition costs have climbed and demographics have shifted, the question of whether TAP has kept pace — and whether it serves all groups equitably — is both a policy and a data question.

This project used publicly available annual records of TAP recipient counts and total award amounts, categorized by income group, age group, and program type, to trace how the program has evolved since 2000. The dataset contains both numerical and categorical variables, making it suitable for time series and comparative breakdowns.
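Although the deliverable was a Tableau dashboard, the comparative breakdown underlying it can be sketched in pandas. The column names (`year`, `income_group`, `total_awards`) and the figures below are invented for illustration; the actual NY State Open Data schema may differ.

```python
# Sketch of a comparative time-series breakdown of TAP awards.
# Data is invented for illustration only.
import pandas as pd

tap = pd.DataFrame({
    "year": [2000, 2000, 2010, 2010, 2020, 2020],
    "income_group": ["under_20k", "over_80k"] * 3,
    "total_awards": [120.0, 15.0, 140.0, 30.0, 110.0, 55.0],  # $ millions
})

# Pivot to one column per income group, so each row is a year
# and trends across groups can be compared side by side.
trend = tap.pivot(index="year", columns="income_group", values="total_awards")
print(trend)
```

This is the same year-by-group shape a dashboard's trend view consumes: one time axis, one series per demographic category.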

Visualization Design

The primary deliverable was a Tableau dashboard built for a non-technical audience: policymakers, students, and administrators who need to understand funding patterns without statistical expertise. The design prioritized clarity and narrative, leading with the big-picture trend before letting users drill into demographic breakdowns.

A key design decision was to frame the data around access and affordability rather than raw award totals — turning a dataset into an argument about equity.

Key Insights

The analysis revealed meaningful shifts in how TAP dollars are distributed across income brackets and age groups over two decades. Certain program types saw significant changes in recipient volume that do not align with broader enrollment trends, suggesting structural shifts in eligibility or program design worth further investigation.

Financial aid access is a gateway to educational opportunity, and data can make visible the patterns that policy debates often treat as abstract.

Program Outcomes Demonstrated

01 · Data Collection
02 · Actionable Insight
03 · Visualization
05 · Communication
06 · Equity / Ethics

◆ IST 652 · Scripting for Data Analysis · Fall 2025

Hazardous Cosmetic Chemical Disclosures in California (2007–2020)

Investigating 114,000+ chemical disclosure records from California's Safe Cosmetics Program to identify usage trends, company behavior, and the relationship between chemical complexity and product discontinuation.

◆ View Project Poster →
◆ View Presentation →
Project Poster Preview
Dataset Source: CA Safe Cosmetics Program
Original Dataset: 114,635 rows × 22 columns
Tools Used: Python · Pandas · Seaborn · Matplotlib
Time Range: 2007 to 2020

Project Overview

California's Safe Cosmetics Program requires manufacturers to report cosmetic ingredients known or suspected to cause cancer, birth defects, or reproductive harm. The result is a rich, policy-relevant public dataset spanning 13 years and over 114,000 product-chemical combinations.

This project examined how chemicals are distributed across cosmetic products, companies, and product categories, and whether chemical count correlates with product discontinuation rates.

Data Cleaning and Engineering

The raw dataset contained 114,635 rows across 22 columns. Through systematic deduplication, removal of records with missing critical fields, and feature engineering, the cleaned dataset was brought to approximately 41,000 high-quality observations across 13 key analytical variables.
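The cleaning steps described above (deduplication, dropping records missing critical fields, light feature engineering) can be sketched as a single pandas chain. The toy frame and column names here are assumptions for illustration, not the actual CA Safe Cosmetics schema.

```python
# Minimal sketch of the cleaning pipeline on a toy frame.
# Column names are illustrative assumptions.
import pandas as pd

raw = pd.DataFrame({
    "product": ["Lip Tint", "Lip Tint", "Shampoo", "Mascara"],
    "chemical": ["Titanium dioxide", "Titanium dioxide", None, "Carbon black"],
    "company": ["Acme", "Acme", "GloCo", "GloCo"],
})

clean = (
    raw.drop_duplicates()                 # systematic deduplication
       .dropna(subset=["chemical"])       # drop rows missing a critical field
       # light feature engineering: normalize chemical names for grouping
       .assign(chemical=lambda d: d["chemical"].str.strip().str.lower())
)
print(len(clean))  # 2
```

Each step encodes an assumption about what counts as a valid observation, which is why documenting the chain, not just its output, mattered for transparency.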

Every decision about what to keep and remove reflects assumptions about what matters — and documenting those decisions transparently was a core part of the project.

Key Findings

Products containing higher chemical counts were significantly more likely to be discontinued, suggesting that product complexity is a marker of both regulatory risk and consumer safety concern. Titanium dioxide appeared in 31,989 records, making it by far the most commonly reported ingredient in the dataset.
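The chemical-count versus discontinuation comparison can be sketched as a simple bucketed rate calculation. The data and bucket thresholds below are invented for illustration and do not reproduce the project's actual findings.

```python
# Sketch: compare discontinuation rates across chemical-count buckets.
# Data and thresholds are invented for illustration.
import pandas as pd

products = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E", "F"],
    "chemical_count": [1, 1, 2, 5, 6, 8],
    "discontinued": [0, 0, 0, 1, 0, 1],
})

# Bucket products by reported chemical count, then compare the mean
# discontinuation rate per bucket.
products["bucket"] = pd.cut(products["chemical_count"],
                            bins=[0, 2, 10], labels=["low", "high"])
rates = products.groupby("bucket", observed=True)["discontinued"].mean()
print(rates)
```

In the actual analysis a significance test across buckets (or a regression on the raw count) would back the claim; the sketch shows only the shape of the comparison.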

Reframing this project for portfolio presentation means centering the human story: chemical transparency in consumer products has measurable consequences for what stays on shelves. This is a public health story, not just a technical one.

Program Outcomes Demonstrated

01 · Data Collection
02 · Actionable Insight
03 · EDA Visualization
04 · Python / Pandas
06 · Transparency