
Supervised Learning and Teenager Spending Decisions

In our last post, we looked at whether teenagers are spending money wisely through the lens of machine learning. The takeaway for students and families is that we should think critically about the categories we use in daily life, and question how we draw the line between short-term consumption and long-term investment.


On the technical side, supervised learning helps uncover patterns between spending choices and future outcomes. But we must also stay mindful of its limits and the context in which we interpret results.


For now, let’s focus on the observable benefits of spending without getting too philosophical too fast. As we mentioned in the last blog post, what counts as “investment” depends on the time horizon and the definition of return we choose.


Data Cleaning, Labeling, and Training Setup


In real life, data cleaning and preprocessing are messy and detail-heavy. In this case, we would expect them to involve reconciling intricate financial and demographic records, with lots of data mapping. To keep the focus on logic, we will work with a clean dataset (something that looks like it came off the “data wishlist”) while preserving the rough structure of the real task.


What We’re Training and Predicting


Goal (prediction target): classify each purchase as Investment or Consumption.


Training signal (label y): a proxy outcome, possibly derived from follow-ups (e.g., did the purchase plausibly produce longer-run benefits within ~2–5 years?). See “Labeling rule” below.


Inputs (features X): transactional details (date, item, barcode that informs category/durability), plus basic student context (age).


Labeling uses weak supervision: we combine quick heuristics (e.g., tutoring = investment) with spot-checked outcomes (e.g., evidence of sustained benefit). We expect there will be noise.


This is similar to spam filtering. For email, we don’t always know whether a message is spam until evidence confirms it: a probabilistic classifier gives it a score, or a human flags it. Likewise, for spending decisions we rarely have perfect ground-truth labels (investment vs. consumption). In both cases, the true label is imperfectly known, and new evidence helps refine it.
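To make the spam analogy concrete, here is a minimal sketch (not our actual pipeline): a probabilistic classifier scores each purchase, and uncertain cases get flagged for human review. The toy features and the 0.3–0.7 review band are made up for illustration.

```python
# Minimal sketch: a probabilistic classifier scores purchases, and
# low-confidence cases are routed to a human -- like spam filtering.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy features [amount_cad, is_durable] with noisy labels
# (1 = Investment, 0 = Consumption).
X = np.array([[60, 0], [12, 0], [95, 1], [15, 0]])
y = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # P(Investment) per purchase

for features, p in zip(X, scores):
    action = "flag for human review" if 0.3 < p < 0.7 else "auto-label"
    print(features, f"p(Investment)={p:.2f} -> {action}")
```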


Assumptions (explicit)

  • Time horizon: “Investment” means a purchase with a reasonable expectation of positive benefit over ~2–5 years (short run) or longer.

  • Barcode mapping: Barcodes or merchant codes map to a category (industry) and a durability class (durable, non-durable, or service).

  • Privacy: Names are pseudonyms in the working set; personally identifiable information is anonymized before modeling.
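
One plausible way to implement the privacy step (an assumption on our part; any standard pseudonymization would do) is salted hashing of raw IDs:

```python
# A minimal sketch of the privacy step, assuming salted hashing is an
# acceptable pseudonymization here (the post doesn't prescribe a method).
import hashlib

SALT = "replace-with-a-secret-salt"  # hypothetical; keep out of version control

def pseudonymize(student_id: str) -> str:
    """Map a raw student ID to a stable pseudonym like 'S' + 8 hex chars."""
    digest = hashlib.sha256((SALT + student_id).encode()).hexdigest()
    return "S" + digest[:8]

print(pseudonymize("jane.doe@school.ca"))  # same input -> same pseudonym
```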


Datasets

We’ll use two tables joined on Student_ID and time.


A) Transactions (training features)


These are the actual spending choices teenagers make.


Columns: Date, Student_ID, Age, Item_Name, Barcode, Category, Durability, Amount (CAD), plus Outcome_Label (produced by the labeling rule below).


Sample (toy, 4 rows):

Date | Student_ID | Age | Item_Name | Barcode | Category | Durability | Amount (CAD) | Outcome_Label
2025-09-01 | S001 | 16 | Algebra Tutoring | 9780000001 | tutoring | service | 60 | Investment
2025-09-03 | S001 | 16 | Protein Snacks | 0123456789 | snacks | non-durable | 12 | Consumption
2025-09-05 | S002 | 15 | Running Shoes | 0888888888 | sports gear | durable | 95 | Investment
2025-09-06 | S003 | 17 | Mobile Game Credits | 0999999999 | entertainment | non-durable | 15 | Consumption


B) Outcomes (for labeling/sanity checks)


Columns: Student_ID, Window (6m, 12m, 24m), Savings_Streak_6m, Study_Hours_Trend, GPA_Trend, Athletic_Training_Consistency.


Real-life interpretation: cheap, observable signals that purchases might be generating benefits (not perfect, but useful proxies).


Sample (toy, 4 rows):

Student_ID | Window | Savings_Streak_6m | Study_Hours_Trend | GPA_Trend | Athletic_Training_Consistency
S001 | 6m | 1 | up | up | high
S002 | 6m | 0 | flat | up | med
S003 | 6m | 0 | flat | flat | low
S004 | 6m | 0 | flat | flat | low
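
Here’s a minimal sketch of the join in pandas, using a few of the toy columns above. We ignore the time alignment for brevity; a real pipeline would match each purchase to the outcome window that follows it (e.g., with pd.merge_asof).

```python
# Minimal sketch of joining the two toy tables on Student_ID.
import pandas as pd

transactions = pd.DataFrame({
    "Date": ["2025-09-01", "2025-09-03", "2025-09-05", "2025-09-06"],
    "Student_ID": ["S001", "S001", "S002", "S003"],
    "Category": ["tutoring", "snacks", "sports gear", "entertainment"],
    "Amount": [60, 12, 95, 15],
})

outcomes = pd.DataFrame({
    "Student_ID": ["S001", "S002", "S003", "S004"],
    "Window": ["6m"] * 4,
    "GPA_Trend": ["up", "up", "flat", "flat"],
})

# Left join: every purchase keeps its row; outcome signals attach by student.
training = transactions.merge(outcomes, on="Student_ID", how="left")
print(training)
```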



Labeling Rule (weak supervision)


Outcome_Label ∈ {Investment, Consumption}

  • If Category ∈ {tutoring, sports coaching, music lessons}, then label it “Investment.”

  • If Category ∈ {snacks, mobile games, slime}, then label it “Consumption.”


Advanced overrides:

  • If within 6–12 months we observe positive trends plausibly linked to the purchase (e.g., GPA up after tutoring, consistent training after coaching), allow Investment override.

  • If an “investment” shows no engagement (unused tutoring, abandoned program), relabel as Consumption.


Integrating the advanced overrides:

  • If Category ∈ {tutoring, sports coaching, music lessons}, label it “Investment” unless it is flagged as a one-off with no follow-through and no observed outcomes.

  • If Category ∈ {snacks, mobile games, slime}, label it “Consumption” unless it is part of a documented project with sustained outcomes.


This yields noisy but directionally useful labels for supervised learning. Rules improve as more outcomes accumulate.
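
Here is the integrated rule as a small function. The override flags (no_follow_through, documented_project) are hypothetical field names standing in for the checks described above, and the default for unmapped categories is our own conservative choice, not part of the rule.

```python
# Minimal sketch of the integrated weak-supervision labeling rule.
INVESTMENT_CATEGORIES = {"tutoring", "sports coaching", "music lessons"}
CONSUMPTION_CATEGORIES = {"snacks", "mobile games", "slime"}

def label_purchase(category: str,
                   no_follow_through: bool = False,
                   documented_project: bool = False) -> str:
    """Heuristic label by category, plus the outcome-based overrides."""
    if category in INVESTMENT_CATEGORIES:
        # Override: an "investment" with no engagement is relabeled.
        return "Consumption" if no_follow_through else "Investment"
    if category in CONSUMPTION_CATEGORIES:
        # Override: consumption inside a sustained project counts as investment.
        return "Investment" if documented_project else "Consumption"
    return "Consumption"  # conservative default for unmapped categories (our assumption)

print(label_purchase("tutoring"))                          # Investment
print(label_purchase("tutoring", no_follow_through=True))  # Consumption
print(label_purchase("snacks", documented_project=True))   # Investment
```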


Feature Sketch (for modeling)

  • Categorical features: Category, Durability, binned Amount, binned Age, weekday/weekend (from Date). Qualitative patterns in how money is spent.

  • Numerical features: Amount, rolling counts (e.g., # tutoring/coaching purchases in the last 60 days), time since last “investment” purchase. Quantitative patterns in how money is spent.
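
A minimal sketch of how these features could be built from the joined training table above; the bin edges and the 60-day window are illustrative choices, not settings from this post.

```python
# Sketch of the feature columns, assuming the joined `training` table
# from the earlier sketch.
import pandas as pd

df = training.copy()
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(["Student_ID", "Date"]).reset_index(drop=True)

# Categorical features: qualitative patterns in how money is spent.
df["is_weekend"] = df["Date"].dt.dayofweek >= 5
df["amount_bin"] = pd.cut(df["Amount"], bins=[0, 20, 50, 100, float("inf")],
                          labels=["small", "medium", "large", "very large"])

# Numerical features: quantitative patterns per student over time,
# e.g. number of investment-type purchases in the last 60 days.
invest_cats = {"tutoring", "sports coaching", "music lessons"}
df["is_invest_cat"] = df["Category"].isin(invest_cats).astype(int)
df["invest_purchases_60d"] = (
    df.groupby("Student_ID")
      .rolling("60D", on="Date")["is_invest_cat"]
      .sum()
      .to_numpy()  # row order already matches after the sort above
)
print(df[["Student_ID", "Date", "is_weekend", "amount_bin", "invest_purchases_60d"]])
```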


Models are in the eye of the beholder


Labels are judgments, not facts, especially once you account for the time horizon and possible externalities (spillover benefits). Outcome overrides can improve them.

Barcodes and merchant codes are starting points. Interviews and follow-ups reduce mislabeling, and institutional knowledge and contextual experience can guide the labeling logic.

Continuous improvement: as you collect more outcomes, retrain with cleaner labels and recalibrated thresholds (e.g., what counts as “sustained” engagement).
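
As a sketch of what that loop could look like, reusing the hypothetical label_purchase() from above (the engagement_score column and its threshold are stand-ins for whatever “sustained engagement” measurement you settle on):

```python
# Sketch of the retraining loop; `engagement_score` is a hypothetical
# column measuring follow-through (e.g., sessions attended per month).
from sklearn.linear_model import LogisticRegression

def retrain(df, engagement_threshold):
    """Relabel with the current 'sustained engagement' threshold, then refit."""
    df["Outcome_Label"] = [
        label_purchase(cat, no_follow_through=(score < engagement_threshold))
        for cat, score in zip(df["Category"], df["engagement_score"])
    ]
    y = (df["Outcome_Label"] == "Investment").astype(int)
    X = df[["Amount", "invest_purchases_60d"]]
    return LogisticRegression().fit(X, y)

# As outcomes accumulate, recalibrate the threshold and retrain:
# model = retrain(df, engagement_threshold=0.5)
```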


Based on the lecture slides for CPSC 340: Machine Learning and Data Mining by Professors Prajeet Bajpai and Mathias Lécuyer.
