Supervised Learning and Teenager Spending Decisions
- Noelle
- Sep 13
- 3 min read
In our last post, we looked at whether teenagers are spending money wisely through the lens of machine learning. The takeaway for students and families is that we should think critically about the categories we use in daily life, and question how we draw the line between short-term consumption and long-term investment.
On the technical side, supervised learning helps uncover patterns between spending choices and future outcomes. But we must also stay mindful of its limits and the context in which we interpret results.
For now, let’s focus on the observable benefits of spending without getting too philosophical too fast. As we mentioned in the last blog post, what counts as “investment” depends on the time horizon and the definition of return we choose.
Data Cleaning, Labeling, and Training Setup
In real life, data cleaning and preprocessing are messy and detail-heavy. In this case, we would expect them to involve reconciling intricate financial and demographic records, with lots of data mapping. To keep the focus on logic, we will work with a clean dataset (something that looks like it came off the “data wishlist”) while preserving the rough structure of the real task.
What We’re Training and Predicting
Goal (prediction target): classify each purchase as Investment or Consumption.
Training signal (label y): a proxy outcome, possibly derived from follow-ups (e.g., did the purchase plausibly produce longer-run benefits within ~2–5 years?). See “Labeling rule” below.
Inputs (features X): transactional details (date, item, barcode that informs category/durability), plus basic student context (age).
Labeling uses weak supervision: we combine quick heuristics (e.g., tutoring = investment) with spot-checked outcomes (e.g., evidence of sustained benefit). We expect there will be noise.
This is similar to spam filtering. For spending decisions, we don’t always have perfect ground-truth labels (investment vs. consumption); for email, we don’t always know whether a message is spam until evidence confirms it. A probabilistic classifier gives a score, or a human flags it. In both cases, the true label is imperfectly known, and new evidence helps refine it.
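To make that concrete, here is a minimal sketch of the supervised setup, assuming scikit-learn. The three encoded features, the toy rows, and the labels are illustrative stand-ins, not a real dataset:

```python
# A minimal sketch of the supervised setup, assuming scikit-learn.
# Features and labels below are toy values, not real data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy encoded features: [amount_cad, is_durable, is_service]
X = np.array([
    [60, 0, 1],   # tutoring session
    [12, 0, 0],   # snacks
    [95, 1, 0],   # running shoes
    [15, 0, 0],   # mobile game credits
])
y = np.array([1, 0, 1, 0])  # 1 = Investment, 0 = Consumption (noisy proxy labels)

clf = LogisticRegression().fit(X, y)

# Like a spam filter, the model returns a score, not a verdict:
# estimated P(Investment) for a new $40 service purchase.
print(clf.predict_proba([[40, 0, 1]])[0, 1])
```

The point of `predict_proba` here is exactly the spam-filter analogy: the output is a probability we can threshold or hand to a human, not a final label.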
Assumptions (explicit)
Time horizon: “Investment” means a purchase with a reasonable expectation of positive benefit over ~2–5 years (short run) or longer.
Barcode mapping: Barcodes or merchant codes map to an industry category and a durability class (durable, non-durable, or service).
Privacy: Names are pseudonyms in the working set; personally identifiable information is anonymized before modeling.
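A minimal sketch of that privacy step, assuming pandas. The `Student_Name` column and the salt are hypothetical stand-ins; real PII handling would use a proper key-management process rather than a hard-coded salt:

```python
# A sketch of pseudonymizing PII before modeling, assuming pandas.
# Student_Name and the salt are hypothetical placeholders.
import hashlib
import pandas as pd

raw = pd.DataFrame({
    "Student_Name": ["Alice", "Bob"],
    "Amount": [60, 12],
})

def pseudonymize(name: str, salt: str = "rotate-me") -> str:
    # Salted hash -> stable pseudonym; keep the salt out of the modeling set.
    return "S" + hashlib.sha256((salt + name).encode()).hexdigest()[:6]

raw["Student_ID"] = raw["Student_Name"].map(pseudonymize)
working = raw.drop(columns=["Student_Name"])  # PII never enters modeling
print(working)
```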
Datasets
We’ll use two tables joined on Student_ID and time; a join sketch follows the two samples below.
A) Transactions (training features)
These are the actual spending choices teenagers make.
Columns: Date, Student_ID, Age, Item_Name, Barcode, Category, Durability, Amount (CAD), plus an Outcome_Label column produced by the labeling rule below.
Sample (toy, 4 rows):
| Date | Student_ID | Age | Item_Name | Barcode | Category | Durability | Amount (CAD) | Outcome_Label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2025-09-01 | S001 | 16 | Algebra Tutoring | 9780000001 | tutoring | service | 60 | Investment |
| 2025-09-03 | S001 | 16 | Protein Snacks | 0123456789 | snacks | non-durable | 12 | Consumption |
| 2025-09-05 | S002 | 15 | Running Shoes | 0888888888 | sports gear | durable | 95 | Investment |
| 2025-09-06 | S003 | 17 | Mobile Game Credits | 0999999999 | entertainment | non-durable | 15 | Consumption |
B) Outcomes (for labeling/sanity checks)
Columns: Student_ID, Window (6m, 12m, 24m), Savings_Streak_6m, Study_Hours_Trend, GPA_Trend, Athletic_Training_Consistency.
Real-life interpretation: cheap, observable signals that purchases might be generating benefits (not perfect, but useful proxies).
Sample (toy, 4 rows):
| Student_ID | Window | Savings_Streak_6m | Study_Hours_Trend | GPA_Trend | Athletic_Training_Consistency |
| --- | --- | --- | --- | --- | --- |
| S001 | 6m | 1 | up | up | high |
| S002 | 6m | 0 | flat | up | med |
| S003 | 6m | 0 | flat | flat | low |
| S004 | 6m | 0 | flat | flat | low |
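Here is the join sketch mentioned above, assuming pandas. Column names follow the two tables; for brevity, aligning purchase dates to outcome windows is reduced to a plain Student_ID join:

```python
# A sketch of joining transactions to outcomes, assuming pandas.
# Aligning dates to outcome windows is simplified to a key join.
import pandas as pd

transactions = pd.DataFrame({
    "Date": ["2025-09-01", "2025-09-03", "2025-09-05", "2025-09-06"],
    "Student_ID": ["S001", "S001", "S002", "S003"],
    "Item_Name": ["Algebra Tutoring", "Protein Snacks",
                  "Running Shoes", "Mobile Game Credits"],
    "Amount": [60, 12, 95, 15],
})
outcomes = pd.DataFrame({
    "Student_ID": ["S001", "S002", "S003", "S004"],
    "Window": ["6m", "6m", "6m", "6m"],
    "GPA_Trend": ["up", "up", "flat", "flat"],
})

# Left join keeps every purchase, even if no outcome row exists yet.
joined = transactions.merge(outcomes, on="Student_ID", how="left")
print(joined)
```

A left join is the safer default here: purchases without outcome data stay in the training pool as unlabeled or heuristic-only rows rather than silently disappearing.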
Labeling Rule (weak supervision)
Outcome_Label ∈ {Investment, Consumption}
If Category ∈ {tutoring, sports coaching, music lessons}, then label “Investment.”
If Category ∈ {snacks, mobile games, slime}, then label “Consumption.”
Advanced overrides:
If within 6–12 months we observe positive trends plausibly linked to the purchase (e.g., GPA up after tutoring, consistent training after coaching), allow Investment override.
If an “investment” shows no engagement (unused tutoring, abandoned program), relabel as Consumption.
Integrating the advanced overrides:
If Category ∈ {tutoring, sports coaching, music lessons}, then label “Investment,” unless flagged as one-off with no follow-through and no outcomes.
If Category ∈ {snacks, mobile games, slime}, then label “Consumption,” unless part of a documented project with sustained outcomes.
This yields noisy but directionally useful labels for supervised learning. Rules improve as more outcomes accumulate.
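Putting the base rules and overrides together, a minimal labeling sketch in plain Python. The flags (`positive_trend`, `no_engagement`, `documented_project`) are hypothetical encodings of the follow-up evidence described above:

```python
# A sketch of the weak-supervision labeling rule, in plain Python.
# Category sets mirror the rules above; flags are illustrative.
INVESTMENT_CATS = {"tutoring", "sports coaching", "music lessons"}
CONSUMPTION_CATS = {"snacks", "mobile games", "slime"}

def outcome_label(category: str,
                  positive_trend: bool = False,    # e.g., GPA up within 6-12 months
                  no_engagement: bool = False,     # e.g., unused tutoring package
                  documented_project: bool = False) -> str:
    if category in INVESTMENT_CATS:
        # Override: an "investment" with no follow-through is relabeled.
        return "Consumption" if no_engagement else "Investment"
    if category in CONSUMPTION_CATS:
        # Override: sustained, documented outcomes can flip the label.
        return "Investment" if documented_project and positive_trend else "Consumption"
    # Categories outside the rule set stay unlabeled for now.
    return "Unlabeled"

print(outcome_label("tutoring"))                      # Investment
print(outcome_label("tutoring", no_engagement=True))  # Consumption
print(outcome_label("slime", documented_project=True,
                    positive_trend=True))             # Investment
```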
Feature Sketch (for modeling)
Categorical features: Category, Durability, binned Amount, binned Age, weekday/weekend (from Date). Qualitative patterns in how money is spent.
Numerical features: Amount, rolling counts (e.g., # tutoring/coaching purchases in the last 60 days), time since last “investment” purchase. Quantitative patterns in how money is spent.
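A minimal feature-construction sketch, assuming pandas. The 60-day window comes from the list above; the bin edges and toy rows are illustrative choices:

```python
# A feature-construction sketch, assuming pandas; bin edges are illustrative.
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2025-09-01", "2025-09-03", "2025-09-05", "2025-09-06"]),
    "Student_ID": ["S001", "S001", "S002", "S003"],
    "Category": ["tutoring", "snacks", "sports gear", "entertainment"],
    "Amount": [60, 12, 95, 15],
})

# Categorical features: weekday/weekend flag and binned Amount
df["is_weekend"] = df["Date"].dt.dayofweek >= 5
df["amount_bin"] = pd.cut(df["Amount"], bins=[0, 20, 50, 200],
                          labels=["low", "mid", "high"])

# Numerical feature: investment-like purchases per student in the last 60 days
df["inv_flag"] = df["Category"].isin(
    {"tutoring", "sports coaching", "music lessons"}).astype(int)
df = df.sort_values(["Student_ID", "Date"]).set_index("Date")
df["inv_count_60d"] = (
    df.groupby("Student_ID")["inv_flag"]
      .transform(lambda s: s.rolling("60D").sum())
)
print(df.reset_index())
```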
Models are in the eye of the beholder
Labels are judgments, not facts, especially once we consider the time horizon and possible externalities (spillover benefits). Outcome overrides can improve them.
Barcodes and merchant codes are starting points; interviews and follow-ups reduce mislabeling. Institutional knowledge and contextual experience can guide the labeling logic.
Continuous improvement: as you collect more outcomes, retrain with cleaner labels and recalibrated thresholds (e.g., what counts as “sustained” engagement).
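As a toy illustration of that retraining loop, assuming scikit-learn: we simulate one round of noisy heuristic labels and one round of outcome-corrected labels, then check how well a small tree trained on each agrees with a (here, known) ground truth. The noise rates are made up for the simulation:

```python
# A toy simulation of retraining with cleaner labels, assuming scikit-learn.
# In practice each round of labels comes from newly accumulated outcomes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 3))                      # toy encoded features
true_y = (X[:, 0] + X[:, 1] > 1).astype(int)  # unknown "ground truth"

# Round 1: heuristic labels with ~30% noise; Round 2: outcomes fix most errors.
noisy_y = np.where(rng.random(200) < 0.3, 1 - true_y, true_y)
cleaner_y = np.where(rng.random(200) < 0.1, 1 - true_y, true_y)

for name, y in [("heuristic labels", noisy_y), ("outcome-corrected", cleaner_y)]:
    model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    agreement = (model.predict(X) == true_y).mean()
    print(f"{name}: agreement with ground truth = {agreement:.2f}")
```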


