Social Media Influencer

Using network analytics and logistic regression to identify who actually drives behavior on social platforms — and putting a dollar figure on it.

“Influencer” gets thrown around a lot. But what does it actually mean, quantitatively? Is follower count enough? What about retweets, or how central someone is in their network?

We decided to stop guessing and build a model.

The Problem

Given two Twitter users — A and B — which one is more influential?

That’s the core question. And it turns out answering it rigorously requires rethinking how you represent both people in a dataset.

The raw data includes 11 features per person: follower counts, following counts, retweet activity, and three network centrality metrics. The naive approach would be to throw all 22 features (11 for A, 11 for B) into a classifier and call it a day.

We took a more thoughtful approach.

Feature Engineering: Thinking Relationally

The insight that changed everything: influence is relative, not absolute.

A user with 10,000 followers is influential compared to someone with 500 — but not compared to someone with 10 million. What matters is the difference and ratio between the two users in each pair.

Instead of feeding raw A and B values separately, we engineered A−B difference features and A/B ratio features for each metric. This transformation:

Captures the comparison directly
Reduces dimensionality from 22 raw features to 11 when each metric is represented by a single relative feature
Makes the model easier to interpret (positive coefficient = “A having more of this makes A more influential”)
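As a rough illustration of the transformation described above, here is a minimal sketch in plain Python. The function name `pairwise_features`, the metric names, the sample values, and the `eps` smoothing constant are all assumptions made for the example, not taken from the project code.

```python
def pairwise_features(a, b, eps=1e-9):
    """Build relative features for one (A, B) pair.

    `a` and `b` map metric name -> value for each user. `eps` is an
    assumed smoothing constant that avoids division by zero in ratios.
    """
    features = {}
    for metric in a:
        # Difference: positive when A has more of this metric than B.
        features[f"{metric}_diff"] = a[metric] - b[metric]
        # Ratio: greater than 1 when A has more of this metric than B.
        features[f"{metric}_ratio"] = (a[metric] + eps) / (b[metric] + eps)
    return features


# Hypothetical pair: metric names and values are illustrative only.
user_a = {"followers": 10_000, "following": 200, "retweets": 50}
user_b = {"followers": 500, "following": 400, "retweets": 5}
feats = pairwise_features(user_a, user_b)
```

Keeping only the `_diff` entries gives one feature per metric, and the sign of each value reads directly as "A has more" or "B has more".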

We also renamed the abstract feature columns — A_network_feature_1 became A_degree, A_network_feature_2 became A_betweenness, and so on — because interpretability matters.

The Three Centrality Metrics (And Why They Matter)

The network features aren’t arbitrary. Each captures something distinct about a person’s position in the social graph:

Degree Centrality — How many direct connections does this person have? High degree = broad reach.

Betweenness Centrality — How often does this person sit on the shortest path between two other users? High betweenness = information broker. Ideas pass through this person.

Closeness Centrality — How quickly can this person reach everyone else in the network? High closeness = fast information spreader.
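To make the three definitions concrete, the sketch below computes them on a tiny made-up graph using only the standard library. It assumes an undirected, unweighted, connected graph stored as an adjacency dict, which is a simplification: the source data ships these metrics precomputed, and a real Twitter graph is directed. Betweenness uses Brandes' algorithm and is left unnormalized.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}


def closeness_centrality(graph):
    """(n - 1) divided by the sum of shortest-path distances from each node."""
    n = len(graph)
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (n - 1) / total if total else 0.0
    return result


def betweenness_centrality(graph):
    """Unnormalized betweenness via Brandes' algorithm (unweighted graph)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []                        # nodes in order of discovery
        preds = {v: [] for v in graph}    # predecessors on shortest paths
        sigma = {v: 0 for v in graph}     # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while order:                      # back-propagate from farthest nodes
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in score.items()}


# Toy graph (illustrative): "c" links a and b to d, and d links on to e.
toy = {"a": ["c"], "b": ["c"], "c": ["a", "b", "d"], "d": ["c", "e"], "e": ["d"]}
degree = degree_centrality(toy)
closeness = closeness_centrality(toy)
betweenness = betweenness_centrality(toy)
```

In this toy graph, node c sits on every shortest path between the a/b side and the d/e side, so it has the highest betweenness even though no node has more than three connections.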

Of these three, betweenness centrality emerged as the strongest predictor of influence in our model. Raw follower counts were surprisingly weaker predictors — which challenges the conventional “more followers = more influential” assumption.

The Model

After normalizing all features to [0, 1] and fitting a logistic regression, the confusion matrix showed strong classification performance on the held-out data. More importantly, because the features share a common scale, the model’s coefficients can be compared with one another, and they told a clear story:

Network position (especially betweenness) matters more than volume metrics
Retweet activity is a stronger signal than follower count
The ratio of followers-to-following is a meaningful indicator of influence asymmetry
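The normalize-then-fit step can be sketched without any external libraries. What follows is a minimal illustration, assuming min-max scaling as the normalization and plain batch gradient descent as the optimizer; the project itself may well have used a library implementation. The two feature columns and all values are made-up toy data, and `lr` and `epochs` are arbitrary choices.

```python
import math


def min_max_scale(column):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, row))
            error = sigmoid(z) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict(weights, bias, row):
    """Return 1 if A is predicted to be the more influential user."""
    z = bias + sum(w * x for w, x in zip(weights, row))
    return 1 if sigmoid(z) >= 0.5 else 0


# Toy data (made up): columns are betweenness_diff and followers_diff.
raw = [(-0.8, 300), (-0.5, -1200), (-0.2, 900),
       (0.1, -400), (0.4, 1500), (0.9, -700)]
labels = [0, 0, 0, 1, 1, 1]  # 1 = A is the more influential user

columns = [min_max_scale(list(col)) for col in zip(*raw)]
X = [list(row) for row in zip(*columns)]
weights, bias = fit_logistic(X, labels)
```

Because every column is on the same [0, 1] scale, the fitted `weights` can be read side by side, which is the kind of comparison the coefficient findings above rely on.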

Putting a Dollar Figure on It

This is where the project goes beyond academic exercise.

Without the model: A retailer pays $5 to every person (both A and B) to tweet a promotion once, with no attempt to tell influencers from non-influencers.

With the model: Pay $10 only to the predicted influencer in each pair to tweet twice.

The math:

If an influencer tweets once: each follower has a 0.01% chance of buying, and each unit sold yields $10 profit
If an influencer tweets twice: the chance rises to 0.015% per follower, for a higher expected return
Non-influencers: zero return, so the $5 paid to them without analytics is lost

The lift in expected net profit from using the analytic model vs. random targeting was substantial — the model earned back its complexity many times over by concentrating spending on users who actually move the needle.

A perfect model (theoretical upper bound) provided even higher lift, giving us a ceiling to benchmark against.

From Model to Real Influencer Discovery

Part II of the project applied the trained model to actual Twitter data. We scraped tweets from a domain of our choice, extracted the same 11 features for real accounts, and used the classifier to rank the top 20 influencers.
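The write-up does not say how a pairwise classifier was turned into a ranking. One plausible approach, sketched here purely as an assumption, is a round-robin: compare every pair of accounts and rank by number of wins. The `predict_a_wins` callable stands in for the trained model, and the account names and values are invented.

```python
from itertools import combinations


def rank_by_pairwise_wins(accounts, predict_a_wins, top_k=20):
    """Rank accounts by how many head-to-head comparisons they win.

    `accounts` maps account name -> feature dict.
    `predict_a_wins(a, b)` is a hypothetical wrapper around the trained
    classifier that returns True when the first account is predicted
    to be the more influential one.
    """
    wins = {name: 0 for name in accounts}
    for a, b in combinations(accounts, 2):
        if predict_a_wins(accounts[a], accounts[b]):
            wins[a] += 1
        else:
            wins[b] += 1
    return sorted(wins, key=lambda name: wins[name], reverse=True)[:top_k]


# Illustrative stand-in for the model: compares a single made-up feature.
def stand_in_model(a, b):
    return a["betweenness"] > b["betweenness"]


accounts = {
    "u1": {"betweenness": 0.10},
    "u2": {"betweenness": 0.40},
    "u3": {"betweenness": 0.25},
}
top_two = rank_by_pairwise_wins(accounts, stand_in_model, top_k=2)
```

A round-robin needs a number of comparisons that grows quadratically with the number of accounts, so for large account sets a score-based ranking may be more practical.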

This is the full pipeline: train on labeled data → validate with financial analysis → deploy on real-world users.

What This Means for Businesses

The takeaway isn’t just “use ML for influencer marketing.” It’s more specific:

Follower count is a weak proxy — betweenness and retweet behavior predict influence more reliably
Pairwise comparison framing works — relative features outperform absolute ones for classification
Analytics has measurable financial value — not just directional improvement, but quantifiable lift

The next time someone tells you an influencer is valuable because they have a million followers, ask: where do they sit in the network? That’s the question that actually matters.

→ View on GitHub: Social-Influencer-Detection