
Supervised Learning and Teenager Spending Decisions

In our last post, we looked at whether teenagers are spending money wisely through the lens of machine learning. The takeaway for students and families is that we should think critically about the categories we use in daily life, and question how we draw the line between short-term consumption and long-term investment.


On the technical side, supervised learning helps uncover patterns between spending choices and future outcomes. But we must also stay mindful of its limits and the context in which we interpret results.


For now, let’s focus on the observable benefits of spending without getting too philosophical too fast. As we mentioned in the last blog post, what counts as “investment” depends on the time horizon and the definition of return we choose.


Data Cleaning, Labeling, and Training Setup


In real life, data cleaning and preprocessing are messy and detail-heavy. In this case, we would expect them to involve reconciling intricate financial and demographic records, with lots of data mapping. To keep the focus on logic, we will work with a clean dataset (something that looks like it came off the “data wishlist”) while preserving the rough structure of the real task.


What We’re Training and Predicting


Goal (prediction target): classify each purchase as Investment or Consumption.


Training signal (label y): a proxy outcome, possibly derived from follow-ups (e.g., did the purchase plausibly produce longer-run benefits within ~2–5 years?). See “Labeling rule” below.


Inputs (features X): transactional details (date, item, barcode that informs category/durability), plus basic student context (age).


Labeling uses weak supervision: we combine quick heuristics (e.g., tutoring = investment) with spot-checked outcomes (e.g., evidence of sustained benefit). We expect there will be noise.


This is similar to spam filtering. For email, we don’t always know whether a message is spam until evidence confirms it: a probabilistic classifier gives it a score, or a human flags it. Likewise, for spending decisions we rarely have perfect ground-truth labels (investment vs. consumption). In both cases, the true label is imperfectly known, and new evidence helps refine it.
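To make the spam analogy concrete, here is a minimal sketch (not our actual pipeline): a probabilistic classifier scores each purchase, and uncertain cases get flagged for human review. The toy features and the 0.3–0.7 review band are made up for illustration.

```python
# Minimal sketch: a probabilistic classifier scores purchases, and
# low-confidence cases are routed to a human -- like spam filtering.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy features [amount_cad, is_durable] with noisy labels
# (1 = Investment, 0 = Consumption).
X = np.array([[60, 0], [12, 0], [95, 1], [15, 0]])
y = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # P(Investment) per purchase

for features, p in zip(X, scores):
    action = "flag for human review" if 0.3 < p < 0.7 else "auto-label"
    print(features, f"p(Investment)={p:.2f} -> {action}")
```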


Assumptions (explicit)

  • Time horizon: “Investment” means a purchase with a reasonable expectation of positive benefit over ~2–5 years (short run) or longer.

  • Barcode mapping: Barcodes or merchant codes map to a category (industry) and a durability class (durable, non-durable, or service).

  • Privacy: Names are pseudonyms in the working set; personally identifiable information is anonymized before modeling.
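
One plausible way to implement the privacy step (an assumption on our part; any standard pseudonymization would do) is salted hashing of raw IDs:

```python
# A minimal sketch of the privacy step, assuming salted hashing is an
# acceptable pseudonymization here (the post doesn't prescribe a method).
import hashlib

SALT = "replace-with-a-secret-salt"  # hypothetical; keep out of version control

def pseudonymize(student_id: str) -> str:
    """Map a raw student ID to a stable pseudonym like 'S' + 8 hex chars."""
    digest = hashlib.sha256((SALT + student_id).encode()).hexdigest()
    return "S" + digest[:8]

print(pseudonymize("jane.doe@school.ca"))  # same input -> same pseudonym
```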


Datasets

We’ll use two tables joined on Student_ID and time.


A) Transactions (training features)


These are the actual spending choices teenagers make.


Columns: Date, Student_ID, Age, Item_Name, Barcode, Category, Durability, Amount (CAD), plus Outcome_Label (produced by the labeling rule below).


Sample (toy, 4 rows):

Date | Student_ID | Age | Item_Name | Barcode | Category | Durability | Amount (CAD) | Outcome_Label
2025-09-01 | S001 | 16 | Algebra Tutoring | 9780000001 | tutoring | service | 60 | Investment
2025-09-03 | S001 | 16 | Protein Snacks | 0123456789 | snacks | non-durable | 12 | Consumption
2025-09-05 | S002 | 15 | Running Shoes | 0888888888 | sports gear | durable | 95 | Investment
2025-09-06 | S003 | 17 | Mobile Game Credits | 0999999999 | entertainment | non-durable | 15 | Consumption


B) Outcomes (for labeling/sanity checks)


Columns: Student_ID, Window (6m, 12m, 24m), Savings_Streak_6m, Study_Hours_Trend, GPA_Trend, Athletic_Training_Consistency.


Real-life interpretation: cheap, observable signals that purchases might be generating benefits (not perfect, but useful proxies).


Sample (toy, 4 rows):

Student_ID | Window | Savings_Streak_6m | Study_Hours_Trend | GPA_Trend | Athletic_Training_Consistency
S001 | 6m | 1 | up | up | high
S002 | 6m | 0 | flat | up | med
S003 | 6m | 0 | flat | flat | low
S004 | 6m | 0 | flat | flat | low
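
Here’s a minimal sketch of the join in pandas, using a few of the toy columns above. We ignore the time alignment for brevity; a real pipeline would match each purchase to the outcome window that follows it (e.g., with pd.merge_asof).

```python
# Minimal sketch of joining the two toy tables on Student_ID.
import pandas as pd

transactions = pd.DataFrame({
    "Date": ["2025-09-01", "2025-09-03", "2025-09-05", "2025-09-06"],
    "Student_ID": ["S001", "S001", "S002", "S003"],
    "Category": ["tutoring", "snacks", "sports gear", "entertainment"],
    "Amount": [60, 12, 95, 15],
})

outcomes = pd.DataFrame({
    "Student_ID": ["S001", "S002", "S003", "S004"],
    "Window": ["6m"] * 4,
    "GPA_Trend": ["up", "up", "flat", "flat"],
})

# Left join: every purchase keeps its row; outcome signals attach by student.
training = transactions.merge(outcomes, on="Student_ID", how="left")
print(training)
```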



Labeling Rule (weak supervision)


Outcome_Label ∈ {Investment, Consumption}

  • If Category ∈ {tutoring, sports coaching, music lessons}, then label it “Investment.”

  • If Category ∈ {snacks, mobile games, slime}, then label it “Consumption.”


Advanced overrides:

  • If within 6–12 months we observe positive trends plausibly linked to the purchase (e.g., GPA up after tutoring, consistent training after coaching), allow Investment override.

  • If an “investment” shows no engagement (unused tutoring, abandoned program), relabel as Consumption.


Integrating the advanced overrides:

  • If Category ∈ {tutoring, sports coaching, music lessons}, label it “Investment” unless it is flagged as a one-off with no follow-through and no observed outcomes.

  • If Category ∈ {snacks, mobile games, slime}, label it “Consumption” unless it is part of a documented project with sustained outcomes.


This yields noisy but directionally useful labels for supervised learning. Rules improve as more outcomes accumulate.
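
Here is the integrated rule as a small function. The override flags (no_follow_through, documented_project) are hypothetical field names standing in for the checks described above, and the default for unmapped categories is our own conservative choice, not part of the rule.

```python
# Minimal sketch of the integrated weak-supervision labeling rule.
INVESTMENT_CATEGORIES = {"tutoring", "sports coaching", "music lessons"}
CONSUMPTION_CATEGORIES = {"snacks", "mobile games", "slime"}

def label_purchase(category: str,
                   no_follow_through: bool = False,
                   documented_project: bool = False) -> str:
    """Heuristic label by category, plus the outcome-based overrides."""
    if category in INVESTMENT_CATEGORIES:
        # Override: an "investment" with no engagement is relabeled.
        return "Consumption" if no_follow_through else "Investment"
    if category in CONSUMPTION_CATEGORIES:
        # Override: consumption inside a sustained project counts as investment.
        return "Investment" if documented_project else "Consumption"
    return "Consumption"  # conservative default for unmapped categories (our assumption)

print(label_purchase("tutoring"))                          # Investment
print(label_purchase("tutoring", no_follow_through=True))  # Consumption
print(label_purchase("snacks", documented_project=True))   # Investment
```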


Feature Sketch (for modeling)

  • Categorical features: Category, Durability, binned Amount, binned Age, weekday/weekend (from Date). Qualitative patterns in how money is spent.

  • Numerical features: Amount, rolling counts (e.g., # tutoring/coaching purchases in the last 60 days), time since last “investment” purchase. Quantitative patterns in how money is spent.
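
A minimal sketch of how these features could be built from the joined training table above; the bin edges and the 60-day window are illustrative choices, not settings from this post.

```python
# Sketch of the feature columns, assuming the joined `training` table
# from the earlier sketch.
import pandas as pd

df = training.copy()
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(["Student_ID", "Date"]).reset_index(drop=True)

# Categorical features: qualitative patterns in how money is spent.
df["is_weekend"] = df["Date"].dt.dayofweek >= 5
df["amount_bin"] = pd.cut(df["Amount"], bins=[0, 20, 50, 100, float("inf")],
                          labels=["small", "medium", "large", "very large"])

# Numerical features: quantitative patterns per student over time,
# e.g. number of investment-type purchases in the last 60 days.
invest_cats = {"tutoring", "sports coaching", "music lessons"}
df["is_invest_cat"] = df["Category"].isin(invest_cats).astype(int)
df["invest_purchases_60d"] = (
    df.groupby("Student_ID")
      .rolling("60D", on="Date")["is_invest_cat"]
      .sum()
      .to_numpy()  # row order already matches after the sort above
)
print(df[["Student_ID", "Date", "is_weekend", "amount_bin", "invest_purchases_60d"]])
```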


Models are in the eye of the beholder


Labels are judgments, not facts, especially once you account for the time horizon and possible externalities (spillover benefits). Outcome overrides can improve them.

Barcodes and merchant codes are starting points. Interviews and follow-ups reduce mislabeling, and institutional knowledge and contextual experience can guide the labeling logic.

Continuous improvement: as you collect more outcomes, retrain with cleaner labels and recalibrated thresholds (e.g., what counts as “sustained” engagement).
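
As a sketch of what that loop could look like, reusing the hypothetical label_purchase() from above (the engagement_score column and its threshold are stand-ins for whatever “sustained engagement” measurement you settle on):

```python
# Sketch of the retraining loop; `engagement_score` is a hypothetical
# column measuring follow-through (e.g., sessions attended per month).
from sklearn.linear_model import LogisticRegression

def retrain(df, engagement_threshold):
    """Relabel with the current 'sustained engagement' threshold, then refit."""
    df["Outcome_Label"] = [
        label_purchase(cat, no_follow_through=(score < engagement_threshold))
        for cat, score in zip(df["Category"], df["engagement_score"])
    ]
    y = (df["Outcome_Label"] == "Investment").astype(int)
    X = df[["Amount", "invest_purchases_60d"]]
    return LogisticRegression().fit(X, y)

# As outcomes accumulate, recalibrate the threshold and retrain:
# model = retrain(df, engagement_threshold=0.5)
```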


Based on the lecture slides for CPSC 340: Machine Learning and Data Mining by Professors Prajeet Bajpai and Mathias Lécuyer.
