Decision Tree Classification on Wisconsin Breast Cancer Data
Task 3: Decision Tree Model
1. Setup & Data Loading
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
# Load data (wdbc.data has no header row, so the column names are built here)
column_names = ['ID', 'diagnosis'] + [
    f'{feat}_{stat}'
    for stat in ['mean', 'se', 'worst']
    for feat in ['radius', 'texture', 'perimeter', 'area',
                 'smoothness', 'compactness', 'concavity',
                 'concave_points', 'symmetry', 'fractal_dimension']
]
data = pd.read_csv('wdbc.data', header=None, names=column_names)
X = data.drop(['ID', 'diagnosis'], axis=1)
y = data['diagnosis'].map({'B': 0, 'M': 1})
Data Description
- Features (numeric): 30 measurements of cell nuclei (e.g., radius_mean, texture_mean, …, fractal_dimension_worst)
- Target: diagnosis (0 = Benign, 1 = Malignant)
Task Definition & Model Choice
We perform a binary classification to predict tumor malignancy. Decision trees are chosen because they handle numerical splits automatically, produce interpretable rule-based models, and manage feature interactions without requiring scaling.
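To make "numerical splits" concrete, here is a minimal sketch of how a CART-style tree could pick a threshold for one numeric feature: try the midpoint between each pair of neighbouring values and keep the one with the lowest weighted Gini impurity. The helper names (`gini`, `best_threshold`) and the toy values are hypothetical and are not taken from `wdbc.data`; this is a simplification, not scikit-learn's actual implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a threshold
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((lo + hi) / 2, score)  # midpoint between neighbours
    return best


# Hypothetical toy values for a single feature (0 = benign, 1 = malignant)
toy_values = [0.05, 0.08, 0.10, 0.12, 0.16, 0.18, 0.20, 0.25]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_threshold(toy_values, toy_labels)
print(threshold, impurity)
```

Because only the ordering of the values matters for this search, rescaling a feature moves the threshold but not the resulting partition, which is why no scaling step is needed.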
Summary Visualization
plt.figure(figsize=(10,4))
for i, feature in enumerate(['radius_mean', 'texture_mean']):
    plt.subplot(1, 2, i+1)
    # Overlay one histogram per diagnosis class (B / M)
    for label, group in data.groupby('diagnosis'):
        plt.hist(group[feature], alpha=0.5, label=label)
    plt.title(feature)
    plt.legend()
plt.tight_layout()
plt.show()
Part 1: Decision Tree (Gini)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
# Default (Gini) tree
clf_gini = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_gini.fit(X_train, y_train)
y_pred = clf_gini.predict(X_test)
print('Classification Report (Gini):')
print(classification_report(y_test, y_pred, target_names=['Benign', 'Malignant']))
Classification Report (Gini):
              precision    recall  f1-score   support

      Benign       0.97      0.95      0.96       187
   Malignant       0.90      0.94      0.92        98

    accuracy                           0.94       285
   macro avg       0.93      0.94      0.94       285
weighted avg       0.94      0.94      0.94       285
Tree Visualization (Gini, depth=2)
plt.figure(figsize=(12,8))
plot_tree(clf_gini, max_depth=2, feature_names=X.columns, class_names=['Benign', 'Malignant'], filled=True)
plt.show()
Discussion of Gini-Based Decision Tree Results
Overall Accuracy (94%)
The classifier correctly labels 94% of the test cases, which is strong given the 50/50 train/test split.
Class-Specific Performance
- Benign: precision = 0.97, recall = 0.95, F1 = 0.96
- Malignant: precision = 0.90, recall = 0.94, F1 = 0.92
High recall (0.94) on malignant tumors means few true cancers are missed. A precision of 0.90 means that about 10% of the tumors predicted malignant are actually benign (false positives).
Tree Structure Insights
- Root split on concave_points_worst <= 0.147 separates most benign samples.
- Further splits on area_worst, area_se, and concavity_se refine classification for tougher cases.
Overfitting vs. Underfitting
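- The tree was trained without a max_depth limit (only the plot is truncated at depth 2), so it is free to grow until its leaves are pure and some overfitting is likely. Even so, 94% test accuracy with balanced precision and recall shows that it generalizes well.
- Performance is high enough that there is no clear sign of underfitting either.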
Comparison to Baseline
- A majority-class baseline would achieve ~66% accuracy.
- Our Gini tree reaches 94% accuracy and an F1 of 0.92 on the malignant class, far above that baseline.
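The baseline figure can be checked directly from the support column of the classification report. The short sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Test-set class counts, read from the "support" column of the report
support = {"Benign": 187, "Malignant": 98}

# A majority-class baseline always predicts the most frequent class
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())

print(f"Always predict {majority_class}: accuracy = {baseline_accuracy:.3f}")
```

Such a baseline never predicts malignant, so its malignant recall is zero regardless of its accuracy.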
Next Steps For Future Enhancement
- Prune or cross-validate to confirm split stability.
- Compare to an entropy-based tree or ensemble methods (e.g. random forest) for potential gains without losing interpretability.
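For reference, the two criteria differ only in the impurity function applied to a node's class proportions. The sketch below compares them for a two-class node; the function names are illustrative and the class proportions are arbitrary examples, not values from the fitted trees.

```python
from math import log2


def gini_impurity(probs):
    """Gini impurity: 1 - sum(p_k ** 2)."""
    return 1.0 - sum(p * p for p in probs)


def entropy_impurity(probs):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)), skipping empty classes."""
    return sum(-p * log2(p) for p in probs if p > 0)


# Arbitrary example proportions of the malignant class in a node
for p_malignant in (0.0, 0.1, 0.3, 0.5):
    probs = (1 - p_malignant, p_malignant)
    print(f"p={p_malignant:.1f}  "
          f"gini={gini_impurity(probs):.3f}  "
          f"entropy={entropy_impurity(probs):.3f}")
```

Both functions are zero for a pure node and largest at a 50/50 mix, so they usually rank candidate splits similarly; differences tend to appear only when two splits are nearly tied.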
Decision Tree with Entropy
# Entropy criterion
clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf_entropy.fit(X_train, y_train)
y_pred_e = clf_entropy.predict(X_test)
print('Classification Report (Entropy):')
print(classification_report(y_test, y_pred_e, target_names=['Benign', 'Malignant']))
Classification Report (Entropy):
              precision    recall  f1-score   support

      Benign       0.97      0.90      0.94       187
   Malignant       0.84      0.95      0.89        98

    accuracy                           0.92       285
   macro avg       0.90      0.93      0.91       285
weighted avg       0.93      0.92      0.92       285
plt.figure(figsize=(12,8))
plot_tree(clf_entropy, max_depth=2, feature_names=X.columns, class_names=['Benign', 'Malignant'], filled=True)
plt.show()
Discussion of Entropy-Based Decision Tree Results
Overall Accuracy (92%)
Accuracy drops slightly from 94% (Gini) to 92% with the entropy criterion, reflecting small differences in split selection.
Class-Specific Performance
- Benign: precision = 0.97, recall = 0.90, F1 = 0.94
- Malignant: precision = 0.84, recall = 0.95, F1 = 0.89
Entropy yields higher recall (0.95) on malignant cases—fewer missed cancers—but at the cost of lower precision (0.84), i.e. more benign samples misclassified as malignant.
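As a worked check, the malignant-class metrics can be rebuilt from confusion counts. The counts below are an assumption: they are the integers consistent with the rounded report and its support column, and the notebook itself does not print a confusion matrix.

```python
# Assumed confusion counts for the malignant class (inferred from the
# rounded report; tp + fn = 98 malignant, fp + tn = 187 benign)
tp, fp, fn, tn = 93, 18, 5, 169

precision = tp / (tp + fp)          # share of malignant predictions that are right
recall = tp / (tp + fn)             # share of true malignancies that are caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

Under these counts, the 18 false positives are what pull precision down, while only 5 malignant cases are missed.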
Tree Structure Insights
- Root split remains on concave_points_worst <= 0.147, consistent with Gini.
- Left branch: area_worst <= 957.45 then texture_mean <= 21.26 isolates most benign samples.
- Right branch: smoothness_mean <= 0.081 captures nearly all malignant cases, explaining the boosted recall.
Bias/Variance Trade-off
- The lower malignant precision shows that this tree leans toward predicting malignant: it catches more cancers (high recall) at the price of more false alarms.
- This is a shift in the precision/recall balance rather than evidence of low variance: the tree is grown without a depth limit, so its size is not constrained.
Comparison to Baseline & Gini
- Still far above a 66% majority-class baseline.
- Versus the Gini tree, entropy trades off ~6 pts of malignant precision (0.90 → 0.84) for ~1 pt of recall (0.94 → 0.95), which is useful if missing a malignancy is deemed costlier than a false alarm.
Next Steps For Future Enhancement
- Adjust the decision threshold or class weights to rebalance precision/recall.
- Evaluate via cross-validation to confirm whether Gini or entropy performs more reliably.
- Explore pruning or small random forest ensembles to reduce false positives while maintaining high cancer detection.
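The cross-validation step above could start from a stratified k-fold split, so that every fold keeps roughly the same benign/malignant ratio. Below is a minimal pure-Python sketch of the index bookkeeping; `stratified_kfold` is a hypothetical helper written for illustration, and the label vector only mimics the documented WDBC class sizes (357 benign, 212 malignant) rather than loading the file.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs with class ratios roughly preserved."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)  # deal each class round-robin

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train_idx, test_idx


# Stand-in labels with the WDBC class sizes (0 = benign, 1 = malignant)
labels = [0] * 357 + [1] * 212
for train_idx, test_idx in stratified_kfold(labels, k=5):
    n_malignant = sum(labels[j] for j in test_idx)
    print(len(train_idx), len(test_idx), n_malignant)
```

In the notebook one would fit the Gini and entropy trees on each training fold and compare the mean and spread of the fold scores; scikit-learn's StratifiedKFold and cross_val_score cover the same ground and would be the practical choice.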
Part 2: Impact of max_depth
We vary max_depth to control tree complexity and overfitting. The values were chosen from an initial exploratory sweep of depth against test performance. Deeper trees may overfit by fitting noise, leading to poor generalization; shallow trees may underfit and miss key patterns, which shows up as lower accuracy and recall.
Exploratory Analysis of Decision Tree Depth
Before selecting specific max_depth values for our Decision Tree classifier, we run a quick EDA to see how depth affects test accuracy. The cell below trains trees with depths from 2 to 11, records their accuracy on the held‐out test set, and plots the results. Based on this plot, depths of 3, 5, and 7 emerge as good candidates—striking a balance between underfitting (too shallow) and overfitting (too deep).
depths = range(2, 12)
accs = []
# Fit one entropy tree per depth and record its test accuracy
for d in depths:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=d, random_state=42)
    clf.fit(X_train, y_train)
    accs.append(accuracy_score(y_test, clf.predict(X_test)))
plt.plot(depths, accs, marker='o')
plt.xlabel('max_depth')
plt.ylabel('Test Accuracy')
plt.title('Decision Tree: Depth vs. Accuracy')
plt.grid(True)
plt.show()
depths = [3, 5, 7]
results = []
for d in depths:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=d, random_state=42)
    clf.fit(X_train, y_train)
    y_pred_d = clf.predict(X_test)
    # precision, recall and f1 are reported for the positive class (1 = Malignant)
    results.append({
        'max_depth': d,
        'accuracy': accuracy_score(y_test, y_pred_d),
        'precision': precision_score(y_test, y_pred_d),
        'recall': recall_score(y_test, y_pred_d),
        'f1': f1_score(y_test, y_pred_d)
    })
metrics_df = pd.DataFrame(results)
print(metrics_df)
   max_depth  accuracy  precision    recall        f1
0          3  0.940351   0.917526  0.908163  0.912821
1          5  0.926316   0.866667  0.928571  0.896552
2          7  0.919298   0.837838  0.948980  0.889952
for metric in ['accuracy', 'precision', 'recall', 'f1']:
    plt.figure()
    plt.plot(metrics_df['max_depth'], metrics_df[metric])
    plt.xlabel('max_depth')
    plt.ylabel(metric)
    plt.title(f'{metric} vs. max_depth')
    plt.show()
Interpretation of max_depth Results
As max_depth increases from 3 → 7 on the test set, we observe:
| max_depth | Accuracy ↓ | Precision ↓ | Recall ↑ | F₁ ↓ |
|---|---|---|---|---|
| 3 | 94.04 % | 91.75 % | 90.82 % | 91.28 % |
| 5 | 92.63 % | 86.67 % | 92.86 % | 89.66 % |
| 7 | 91.93 % | 83.78 % | 94.90 % | 88.99 % |
Accuracy drops (94.0 % → 91.9 %)
- Suggests overfitting beyond depth 3: the extra splits fit noise in the training data rather than signal.
Precision–Recall trade-off
- Precision falls (91.8 % → 83.8 %): more false positives at greater depth
- Recall rises (90.8 % → 94.9 %): fewer false negatives
- Implication: choose shallow for fewer false alarms, deep for more true-positive coverage.
F₁-score declines (91.3 % → 89.0 %)
- Best balance of precision & recall at max_depth=3.
Middle-ground at depth 5
- Recall boosts to 92.9 % with a moderate precision drop to 86.7 %
- Good compromise if you need higher sensitivity without drastic precision loss.
Recommendations
- Maximize overall performance: use max_depth=3 (highest accuracy & F₁).
- Maximize recall (minimize false negatives): use max_depth=7, accepting more false positives.
- Balanced trade-off: use max_depth=5 for decent recall improvement with moderate precision/accuracy loss.
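One way to make this choice explicit is to attach a cost to each error type and pick the depth with the lowest total. In the sketch below, the false-positive and false-negative counts are recovered from the precision and recall values in the table (98 malignant test cases); the cost ratios and the `best_depth` helper are purely hypothetical.

```python
# Malignant-class errors per depth on the 285-sample test set, recovered from
# the table: TP = recall * 98, FP = TP / precision - TP, FN = 98 - TP
errors = {
    3: {"fp": 8, "fn": 9},
    5: {"fp": 14, "fn": 7},
    7: {"fp": 18, "fn": 5},
}


def best_depth(cost_fp, cost_fn):
    """Depth with the lowest total misclassification cost."""
    return min(
        errors,
        key=lambda d: cost_fp * errors[d]["fp"] + cost_fn * errors[d]["fn"],
    )


# Hypothetical cost ratios: how many false alarms one missed cancer is worth
for cost_fn in (1, 2, 5):
    print(f"FN cost = {cost_fn} x FP cost -> max_depth={best_depth(1, cost_fn)}")
```

The cost ratio is a clinical judgment rather than something the data can supply, so the output is only as meaningful as the ratio fed in.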
Part 3: Train/Test Split Experiment
Using max_depth = 5 (Middle-ground from Part 2), we vary training set size to observe its effect on performance.
split_results = []
# Training fractions 0.2, 0.3, ..., 0.8
for p in np.arange(0.2, 0.9, 0.1):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=p, random_state=42)
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=42)
    clf.fit(X_tr, y_tr)
    split_results.append({
        'train_pct': p,
        'n_train_samples': len(y_tr),
        'n_test_samples': len(y_te),
        'train_acc': accuracy_score(y_tr, clf.predict(X_tr)),
        'test_acc': accuracy_score(y_te, clf.predict(X_te))
    })
split_df = pd.DataFrame(split_results)
print(split_df)
   train_pct  n_train_samples  n_test_samples  train_acc  test_acc
0        0.2              113             456   1.000000  0.896930
1        0.3              170             399   1.000000  0.909774
2        0.4              227             342   1.000000  0.929825
3        0.5              284             285   0.985915  0.926316
4        0.6              341             228   0.991202  0.960526
5        0.7              398             171   0.992462  0.953216
6        0.8              455             114   0.993407  0.947368
plt.figure()
plt.plot(split_df['train_pct'], split_df['train_acc'])
plt.xlabel('train_pct')
plt.ylabel('Train Accuracy')
plt.title('Train Accuracy vs. Training Set Percentage')
plt.show()
plt.figure()
plt.plot(split_df['train_pct'], split_df['test_acc'])
plt.xlabel('train_pct')
plt.ylabel('Test Accuracy')
plt.title('Test Accuracy vs. Training Set Percentage')
plt.show()
Train/Test Split Analysis
After evaluating splits from 20 – 80 % (training / testing), the 60 % / 40 % split emerges as the best trade-off:
- Peak test accuracy: at 60 % training, test accuracy reaches 96.05%, the highest across all splits.
- Balanced bias–variance: training accuracy is 99.12%, only ~3.1 pp above test, indicating low overfitting.
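- Too little training data at smaller splits: with ≤ 40 % training, test accuracy lags (89.7 %–92.9 %) despite perfect train scores. The tree memorizes the few samples it sees and generalizes poorly.
- Slight dip at larger splits: with ≥ 70 % training, test accuracy falls to 95.32 % and 94.74 % while train accuracy stays near 99.3 %. The test sets shrink to 171 and 114 samples, so these differences amount to only a few cases and may reflect sampling noise as much as overfitting.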
Conclusion: A 60%/40% split maximizes test performance while keeping generalization error minimal.
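A rough way to gauge how much weight the differences between the larger splits can bear is the binomial standard error of each test accuracy. The sketch below uses the test-set sizes and accuracies from the results table; the normal-approximation formula is a simplifying assumption and the helper name is illustrative.

```python
from math import sqrt

# (train fraction, test-set size, test accuracy) copied from the results table
rows = [
    (0.6, 228, 0.960526),
    (0.7, 171, 0.953216),
    (0.8, 114, 0.947368),
]


def accuracy_standard_error(acc, n):
    """Binomial standard error of an accuracy measured on n test samples."""
    return sqrt(acc * (1 - acc) / n)


for frac, n_test, acc in rows:
    se = accuracy_standard_error(acc, n_test)
    print(f"train={frac:.0%}  test_acc={acc:.3f} +/- {se:.3f}")
```

The standard errors grow as the test set shrinks, so repeating the experiment over several random seeds would give a firmer basis for preferring one split.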
Conclusions
Criterion comparison:
The Gini-based tree achieved 94 % overall accuracy (precision = 97 % Benign / 90 % Malignant; recall = 95 % / 94 %), whereas the Entropy-based tree reached 92 % accuracy (precision = 97 % / 84 %; recall = 90 % / 95 %). Gini impurity therefore provided a slightly better balance of precision and recall across both classes, while the Entropy criterion favored higher recall on malignant cases at the expense of precision.
Optimal max_depth:
In the depth-tuning experiment, a shallow tree with max_depth=3 yielded the highest test accuracy (94.04 %) but sacrificed some recall on malignant tumors (90.8 %). Increasing depth to 5 improved malignant recall to 92.9 % (with overall accuracy of 92.63 %), a middle ground between accuracy and sensitivity. Consequently, we selected max_depth=5 as our default.
Train/Test Split:
Varying the training fraction from 20 % to 80 % showed that a 60/40 split produced the best test accuracy (96.05 %), balancing sufficient training data with a reasonably large test set. Test accuracy declined slightly beyond 60 %, although those splits are scored on smaller test sets, so the estimates are correspondingly noisier.
Limitations & Future Work:
Although the decision tree model performs strongly on this dataset, it is prone to overfitting and sensitive to hyperparameters. Future enhancements should include cost-complexity pruning, k-fold cross-validation for robust hyperparameter search, and exploration of ensemble methods (e.g., Random Forests, Gradient Boosting) to improve stability. Additional analyses, such as feature importance ranking and ROC AUC evaluation, would further inform model interpretability and clinical utility.