Boosted Trees for Reddit Comment Analysis
Overview
This project is the first part of my Reddit analysis series. It builds gradient boosted tree models that predict next-day return direction from daily Reddit comment features.
Data Collection
Python: Reddit API Data Collection
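import praw
import pandas as pd
from datetime import datetime, timedelta, timezone

# Credentials are placeholders; substitute your own Reddit app values.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_SECRET",
    user_agent="sentiment_analysis"
)

def collect_daily_comments(subreddit_name, date, limit=1000):
    """Collect comments from a subreddit for a specific date.

    `date` is assumed to be a datetime.date and is compared in UTC.
    """
    subreddit = reddit.subreddit(subreddit_name)
    comments_data = []
    # `limit` caps the number of submissions scanned, not the number of comments.
    for submission in subreddit.new(limit=limit):
        # Drop "load more comments" placeholders instead of fetching them.
        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            created = datetime.fromtimestamp(comment.created_utc, tz=timezone.utc)
            if created.date() != date:
                continue  # keep only comments posted on the requested day
            comments_data.append({
                'body': comment.body,
                'score': comment.score,
                'created_utc': comment.created_utc,
                # extract_tickers is not defined in this snippet and must be supplied separately
                'ticker_mentions': extract_tickers(comment.body)
            })
    return pd.DataFrame(comments_data)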
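The collection function calls extract_tickers, which the snippet does not define. The original implementation is not shown, so the following is only a minimal sketch of one possible version. It assumes tickers are written as cashtags such as "$AAPL"; the regular expression and the 1 to 5 letter limit are assumptions made here, not something taken from the project.

```python
import re

# Hypothetical pattern: "$" followed by 1 to 5 uppercase letters, e.g. "$AAPL".
CASHTAG_PATTERN = re.compile(r"\$([A-Z]{1,5})\b")


def extract_tickers(text):
    """Return the unique cashtag tickers in `text`, in order of first appearance."""
    seen = []
    for symbol in CASHTAG_PATTERN.findall(text):
        if symbol not in seen:
            seen.append(symbol)
    return seen
```

Matching only cashtags misses tickers written without the dollar sign, but it avoids treating every capitalized word as a ticker. A stricter version could also check matches against a list of known symbols.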
Model Training
Python: XGBoost Sentiment Model
import numpy as np
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import roc_auc_score

# Prepare features.
# daily_features and next_day_returns are assumed to be sorted by date and row-aligned.
X = daily_features[['comment_count', 'avg_sentiment', 'ticker_diversity',
                    'weekend_flag', 'vix_level']]
y = (next_day_returns > 0).astype(int)

# Time series cross-validation: each fold trains on earlier rows and tests on later ones.
tscv = TimeSeriesSplit(n_splits=5)
model = xgb.XGBClassifier(
    max_depth=4,
    n_estimators=100,
    learning_rate=0.1,
    objective='binary:logistic'
)

scores = []
for train_idx, test_idx in tscv.split(X):
    # Train on the earlier window, then score the later one.
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict_proba(X.iloc[test_idx])[:, 1]
    scores.append(roc_auc_score(y.iloc[test_idx], preds))

print(f"Average AUC: {np.mean(scores):.3f}")
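The training code takes daily_features as given. As a rough sketch of how part of that table could be derived from the collected comments, the function below aggregates per-comment rows into one row per UTC day. The function name build_daily_features is hypothetical, and only comment_count, ticker_diversity, and weekend_flag are covered; avg_sentiment and vix_level need a sentiment scorer and market data that the write-up does not show, so they are left out rather than guessed.

```python
import pandas as pd


def build_daily_features(comments: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-comment rows into one row per UTC calendar day.

    Expects the columns produced by the collection step:
    'created_utc' (epoch seconds) and 'ticker_mentions' (list of str).
    """
    df = comments.copy()
    # Convert epoch seconds to a UTC timestamp truncated to midnight.
    df["day"] = pd.to_datetime(df["created_utc"], unit="s", utc=True).dt.floor("D")
    grouped = df.groupby("day")

    features = pd.DataFrame({
        # Number of comments posted that day.
        "comment_count": grouped.size(),
        # Number of distinct tickers mentioned across all comments that day.
        "ticker_diversity": grouped["ticker_mentions"].apply(
            lambda lists: len({t for tickers in lists for t in tickers})
        ),
    })
    # Monday is 0, so Saturday and Sunday are 5 and 6.
    features["weekend_flag"] = (features.index.dayofweek >= 5).astype(int)
    return features
```

The result is indexed by day and sorted chronologically, which is the ordering TimeSeriesSplit relies on.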
Results
- Achieved 0.58 AUC in predicting next-day return direction
- Comment volume was the strongest predictive feature
- Model performance degraded significantly during high-volatility periods
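To put the 0.58 figure in context: AUC is the probability that a randomly chosen up day receives a higher score than a randomly chosen down day, so 0.5 corresponds to chance. The sketch below computes AUC directly from that definition and then splits it by volatility regime, which is one way the high-volatility degradation could be checked. The function names and the VIX threshold of 25 are assumptions for illustration, not values from the original analysis.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def auc_by_regime(labels, scores, vix, threshold=25.0):
    """Compute AUC separately for low and high volatility days.

    `threshold` is a hypothetical cutoff; days with VIX at or above it
    are treated as high volatility.
    """
    rows = list(zip(labels, scores, vix))
    low = [(y, s) for y, s, v in rows if v < threshold]
    high = [(y, s) for y, s, v in rows if v >= threshold]
    return {
        "low_vol": pairwise_auc(*zip(*low)),
        "high_vol": pairwise_auc(*zip(*high)),
    }
```

The pairwise loop is quadratic in the number of days, which is fine for daily data; roc_auc_score from scikit-learn is the practical choice for larger samples.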