Using network analytics and logistic regression to identify who actually drives behavior on social platforms — and putting a dollar figure on it.
“Influencer” gets thrown around a lot. But what does it actually mean, quantitatively? Is follower count enough? What about retweets, or how central someone is in their network?
We decided to stop guessing and build a model.
The Problem
Given two Twitter users — A and B — which one is more influential?
That’s the core question. And it turns out answering it rigorously requires rethinking how you represent both people in a dataset.
The raw data includes 11 features per person: follower counts, following counts, retweet activity, and three network centrality metrics. The naive approach would be to throw all 22 features (11 for A, 11 for B) into a classifier and call it a day.
We took a more thoughtful approach.
Feature Engineering: Thinking Relationally
The insight that changed everything: influence is relative, not absolute.
A user with 10,000 followers is influential compared to someone with 500 — but not compared to someone with 10 million. What matters is the difference and ratio between the two users in each pair.
Instead of feeding raw A and B values separately, we engineered A−B difference features and A/B ratio features for each metric. This transformation:
- Captures the comparison directly
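- Reduces dimensionality: the A−B differences alone replace the 22 raw columns with 11 (keeping every ratio as well brings the count back to 22, though each column now encodes a comparison)
- Makes the model easier to interpret (a positive coefficient means that A having more of this metric raises the predicted odds that A is the more influential user)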
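To make the transformation concrete, here is a minimal sketch in plain Python. It is not the project's code: the metric names, the two example users, and the `eps` smoothing constant (added so a zero in B's column cannot cause a division by zero) are all assumptions made for illustration.

```python
# Hypothetical metric names; the real dataset has 11 metrics per user.
METRICS = ["follower_count", "following_count", "retweets_received"]


def pairwise_features(a, b, eps=1.0):
    """Turn two per-user metric dicts into A-B difference and A/B ratio features."""
    row = {}
    for metric in METRICS:
        row[f"{metric}_diff"] = a[metric] - b[metric]
        # eps is an assumed smoothing constant, not something the source specifies.
        row[f"{metric}_ratio"] = (a[metric] + eps) / (b[metric] + eps)
    return row


# Made-up users, echoing the 10,000 vs. 500 follower example above.
user_a = {"follower_count": 10_000, "following_count": 200, "retweets_received": 50}
user_b = {"follower_count": 500, "following_count": 400, "retweets_received": 0}

features = pairwise_features(user_a, user_b)
for name, value in features.items():
    print(f"{name}: {value:.2f}")
```

Each metric yields one difference column and one ratio column, so the width of the feature table depends on which of the two families you keep.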
We also renamed the abstract feature columns — A_network_feature_1 became A_degree, A_network_feature_2 became A_betweenness, and so on — because interpretability matters.
The Three Centrality Metrics (And Why They Matter)
The network features aren’t arbitrary. Each captures something distinct about a person’s position in the social graph:
Degree Centrality — How many direct connections does this person have? High degree = broad reach.
Betweenness Centrality — How often does this person sit on the shortest path between two other users? High betweenness = information broker. Ideas pass through this person.
Closeness Centrality — How quickly can this person reach everyone else in the network? High closeness = fast information spreader.
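The three definitions can be checked by hand on a tiny graph. The sketch below is a dependency-free illustration, assuming an undirected, connected graph stored as an adjacency dict; the five-node graph is made up, and betweenness is computed with a Brandes-style accumulation. In practice a graph library such as NetworkX provides all three metrics directly.

```python
from collections import deque


def bfs(graph, source):
    """Hop distances, shortest-path counts, predecessors, and visit order from source."""
    dist = {source: 0}
    sigma = {v: 0 for v in graph}    # number of shortest paths from source to v
    sigma[source] = 1
    preds = {v: [] for v in graph}   # predecessors of v on shortest paths
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order


def centralities(graph):
    """Normalized degree, closeness, and betweenness (needs at least 3 nodes)."""
    n = len(graph)
    degree = {v: len(graph[v]) / (n - 1) for v in graph}
    closeness = {}
    betweenness = {v: 0.0 for v in graph}
    for s in graph:
        dist, sigma, preds, order = bfs(graph, s)
        total = sum(dist.values())
        closeness[s] = (len(dist) - 1) / total if total else 0.0
        # Walk back from the farthest nodes, crediting nodes on shortest paths.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                betweenness[w] += delta[w]
    # Every unordered pair was counted from both ends, hence (n-1)(n-2) rather than half of it.
    scale = (n - 1) * (n - 2)
    betweenness = {v: b / scale for v, b in betweenness.items()}
    return degree, closeness, betweenness


# Made-up graph: "c" is a hub, "d" is the only bridge to "e".
graph = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
degree, closeness, betweenness = centralities(graph)
```

In this toy graph "c" scores highest on all three metrics, but "d" shows why they differ: it has only two connections, yet every path to "e" runs through it, so its betweenness is well above what its degree alone would suggest.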
Of these three, betweenness centrality emerged as the strongest predictor of influence in our model. Raw follower counts were surprisingly weaker predictors — which challenges the conventional “more followers = more influential” assumption.
The Model
We normalized all features to [0, 1] and fit a logistic regression. The confusion matrix showed strong classification performance on the held-out data. More importantly, the model’s coefficients told a clear story:
- Network position (especially betweenness) matters more than volume metrics
- Retweet activity is a stronger signal than follower count
- The ratio of followers-to-following is a meaningful indicator of influence asymmetry
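For readers who want to see the mechanics, the following is a rough, dependency-free sketch of the same three steps: min-max scaling, logistic regression (fit here with plain batch gradient descent), and a confusion matrix. The eight rows of data, the learning rate, and the epoch count are invented for illustration, and the matrix is computed on the training rows only because the toy set is too small to split. A library such as scikit-learn would normally handle all of this.

```python
import math


def min_max_scale(rows):
    """Rescale every column to [0, 1]; a constant column maps to 0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [[(x - lo) / sp for x, lo, sp in zip(row, lows, spans)] for row in rows]


def predict_proba(w, b, x):
    """Predicted probability that A is the more influential user."""
    return 1.0 / (1.0 + math.exp(-(b + sum(wj * xj for wj, xj in zip(w, x)))))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the log loss (lr and epochs are assumed values)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        errors = [predict_proba(w, b, xi) - yi for xi, yi in zip(X, y)]
        b -= lr * sum(errors) / n
        w = [wj - lr * sum(e * xi[j] for e, xi in zip(errors, X)) / n
             for j, wj in enumerate(w)]
    return w, b


def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        counts[("t" if t == p else "f") + ("p" if p == 1 else "n")] += 1
    return counts


# Made-up pairs: columns are [betweenness_diff, follower_diff]; label 1 means A wins.
raw = [
    [0.30, -2000], [0.20, 5000], [0.15, -500], [0.25, 1000],
    [-0.15, 3000], [-0.30, -1000], [-0.10, 4000], [-0.20, -3000],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = min_max_scale(raw)
w, b = fit_logistic(X, labels)
predictions = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in X]
cm = confusion_matrix(labels, predictions)
```

The toy labels were constructed to depend only on the betweenness difference, so the first weight ends up dominating the second. That mirrors how the coefficients are read in the list above, but it says nothing about the real dataset.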
Putting a Dollar Figure on It
This is where the project goes beyond academic exercise.
Without the model: A retailer pays $5 to every person (both A and B) to tweet a promotion once. No discrimination.
With the model: Pay $10 only to the predicted influencer in each pair to tweet twice.
The math:
- If an influencer tweets once: each follower has a 0.01% chance of buying one unit, and each unit sold earns $10 profit
- If an influencer tweets twice: the chance per follower rises to 0.015%, so expected sales are 1.5 times higher
- Non-influencers generate no sales, so the $5 paid to each of them without the model earns nothing
The lift in expected net profit from using the analytic model instead of paying every user was substantial: the model earned back its complexity many times over by concentrating spending on users who actually move the needle.
A perfect model (theoretical upper bound) provided even higher lift, giving us a ceiling to benchmark against.
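The arithmetic for a single pair can be sketched as follows. The purchase probabilities, the $10 margin, and the payments come from the scenario above. The follower count and the 80% accuracy are placeholders, not results from the project, and the sketch assumes that only the true influencer's followers ever buy.

```python
BUY_PROB_ONE_TWEET = 0.0001     # 0.01% chance per follower
BUY_PROB_TWO_TWEETS = 0.00015   # 0.015% chance per follower
MARGIN = 10.0                   # profit per unit sold, in dollars
COST_PER_PAIR = 10.0            # $5 + $5 without the model, or $10 to one user with it


def profit_without_model(influencer_followers):
    """Both users tweet once; only the true influencer's followers buy (assumption)."""
    return influencer_followers * BUY_PROB_ONE_TWEET * MARGIN - COST_PER_PAIR


def profit_with_model(influencer_followers, prediction_correct):
    """The predicted influencer tweets twice; a wrong pick earns nothing."""
    revenue = 0.0
    if prediction_correct:
        revenue = influencer_followers * BUY_PROB_TWO_TWEETS * MARGIN
    return revenue - COST_PER_PAIR


def expected_profit_with_model(influencer_followers, accuracy):
    """Average over correct and incorrect predictions at a given accuracy."""
    return (accuracy * profit_with_model(influencer_followers, True)
            + (1 - accuracy) * profit_with_model(influencer_followers, False))


followers = 1_000_000   # hypothetical follower count for the true influencer
accuracy = 0.80         # placeholder, not the project's measured accuracy

baseline = profit_without_model(followers)
with_model = expected_profit_with_model(followers, accuracy)
perfect = profit_with_model(followers, True)
lift = with_model - baseline
```

With these made-up inputs the baseline comes to about $990 per pair, the model to about $1,190, and the perfect model to about $1,490. The gap between the last two is the room left for a better classifier.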
From Model to Real Influencer Discovery
Part II of the project applied the trained model to actual Twitter data. We scraped tweets from a domain of our choice, extracted the same 11 network features for real accounts, and used the classifier to rank the top 20 influencers.
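The post does not spell out how a pairwise classifier becomes a ranking, so the sketch below shows one plausible approach rather than the method actually used: compare every account against every other and sort by average predicted win probability. The `toy_scorer` function, its weight of 20, and the three accounts are hypothetical stand-ins for the trained model and the scraped data.

```python
import math
from itertools import permutations


def rank_accounts(accounts, prob_a_beats_b, top_k=20):
    """Rank accounts by their mean predicted win probability over all opponents."""
    scores = {name: 0.0 for name in accounts}
    for a, b in permutations(accounts, 2):
        scores[a] += prob_a_beats_b(accounts[a], accounts[b])
    n_opponents = len(accounts) - 1
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(name, scores[name] / n_opponents) for name in ranked[:top_k]]


def toy_scorer(a, b):
    """Hypothetical stand-in for the classifier: looks only at betweenness difference."""
    return 1.0 / (1.0 + math.exp(-20.0 * (a["betweenness"] - b["betweenness"])))


# Made-up accounts; real input would carry all 11 features per account.
accounts = {
    "acct_1": {"betweenness": 0.02},
    "acct_2": {"betweenness": 0.40},
    "acct_3": {"betweenness": 0.15},
}

top = rank_accounts(accounts, toy_scorer, top_k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

A round-robin like this needs one comparison per ordered pair, which is fine for a few hundred accounts but grows quadratically with the number of accounts.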
This is the full pipeline: train on labeled data → validate with financial analysis → deploy on real-world users.
What This Means for Businesses
The takeaway isn’t just “use ML for influencer marketing.” It’s more specific:
- Follower count is a weak proxy — betweenness and retweet behavior predict influence more reliably
- Pairwise comparison framing works — relative features outperform absolute ones for classification
- Analytics has measurable financial value — not just directional improvement, but quantifiable lift
The next time someone tells you an influencer is valuable because they have a million followers, ask: where do they sit in the network? That’s the question that actually matters.