How I Combined NLP, Freshness & Wilson Score to Build a Better Ranking System
Building an intelligent ranking system for rental properties using sentiment analysis, time decay, Wilson score confidence intervals, and anomaly detection
We live in a world where nearly everything is ranked, from Amazon products and Airbnb listings to Google search results. These rankings shape what we watch, buy, trust or avoid. But I’ve always carried a quiet skepticism about them. They appear objective, presented as stars, numbers and charts, but look a little closer and you’ll often find they’re built on surprisingly simplistic assumptions.
This became crystal clear to me while working on a project involving rental listings. I needed to sort them in a way that felt fair, intelligent and reflective of real quality — not just a gameable average rating.
Sounds simple, right? It wasn’t.
The Problem: Rankings That Lie
The issue appeared subtle at first. Some listings with only a few glowing reviews were sitting at the top, while others with hundreds of solid ratings were buried below. Why?
Because the naive approach relies on metrics like:
- Average rating (which ignores volume and recency)
- Number of reviews (which ignores sentiment)
- Star counts (which ignore textual nuance)
And none of them consider how recent or reliable those reviews are.
I realized: if I wanted users to trust the ranking, it had to account for much more than just stars.
Starting with the Obvious: Sentiment in the Text
The first thing I looked at was the actual text of the reviews.
Instead of just counting stars, I wanted to understand how people felt. Were they complaining about cleanliness? Praising the host? Subtly disappointed?
For this, I reached for VADER, a lightweight, MIT-licensed sentiment analysis model tailored for short, informal text like reviews. It gives each comment a compound score between -1 (negative) and +1 (positive). I simplified it further: anything above 0.5 counted as positive, everything else as negative.
For example:
- “The landlord was very responsive and kind.” → ✅ Positive
- “The flat smelled awful and nobody responded to my calls.” → ❌ Negative
I used this to compute a positive sentiment ratio per listing. This gave me a much more nuanced picture than stars alone.
import numpy as np

def predict_sentiment(comments):
    """Return an array of binary sentiment labels (1 = positive, 0 = negative)."""
    model = getModel()  # singleton loader, so the model is only built once
    return np.array(model.predict(comments))
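The snippet above leans on a getModel() singleton and, later, on a compute_sentiment_score helper that the final ranking function expects. Here’s a minimal sketch of both, assuming NLTK’s VADER implementation and the 0.5 threshold from above; the wrapper class and names are illustrative, not the exact ones from my codebase:

from nltk.sentiment import SentimentIntensityAnalyzer  # requires nltk.download("vader_lexicon")

_model = None

class _VaderBinaryModel:
    def __init__(self):
        self.analyzer = SentimentIntensityAnalyzer()

    def predict(self, comments):
        # Compound score above 0.5 counts as positive (1), everything else negative (0)
        return [1 if self.analyzer.polarity_scores(c)["compound"] > 0.5 else 0
                for c in comments]

def getModel():
    global _model
    if _model is None:  # lazy singleton: build the analyzer only once
        _model = _VaderBinaryModel()
    return _model

def compute_sentiment_score(reviews):
    preds = predict_sentiment([r["reviewComment"] for r in reviews])
    return preds.mean(), preds  # (positive ratio, raw 0/1 predictions)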
Then I Asked: Does Time Matter?
Yes. A lot.
What’s the point of a glowing review from two years ago if the property has gone downhill since?
To handle this, I borrowed an idea from physics and marketing models: exponential decay. Essentially, every review has a “half-life” — its influence fades over time.
I set the half-life to 30 days, meaning a review’s impact is halved after a month. A fresh review is gold. An old one? It’s just a whisper.
from datetime import datetime

def compute_freshness_score(reviews):
    """Average exponential-decay weight across a listing's reviews (half-life: 30 days)."""
    now = datetime.now()
    scores = []
    for r in reviews:
        # "createdAt" is an ISO-8601 timestamp; strip the trailing "Z" so fromisoformat accepts it
        created_at = datetime.fromisoformat(r["createdAt"].replace("Z", ""))
        days_old = (now - created_at).days
        decay = 0.5 ** (days_old / 30)  # weight halves every 30 days
        scores.append(decay)
    return np.mean(scores) if scores else 0
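A quick sanity check of the decay curve, just the formula above evaluated at a few ages:

# Half-life of 30 days: a review's weight halves every month
for days_old in (0, 30, 60, 90):
    print(days_old, 0.5 ** (days_old / 30))
# -> 1.0, 0.5, 0.25, 0.125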
Confidence Isn’t Optional
At this point, the system was starting to feel smarter — but it still had a bias.
Some listings with just 3 or 4 reviews were getting ranked above others with 100+. Why? Because of perfect sentiment ratios — but based on tiny sample sizes.
This was a statistical illusion.
So I turned to a more robust method: the Wilson Score Interval. It’s a formula that says: “Even if this listing has 100% positive reviews, how confident are we in that number?”
It gives us a lower-bound estimate, adjusted for small sample sizes. This helped penalize listings with too little data — not harshly, but enough to avoid overconfidence.
import math

def wilson_score(pos, total, z=1.96):
    """Lower bound of the Wilson score interval for a proportion (z = 1.96 -> 95% confidence)."""
    if total == 0:
        return 0
    phat = pos / total
    return (
        (phat + z*z/(2*total) - z * math.sqrt((phat*(1-phat) + z*z/(4*total)) / total))
        / (1 + z*z/total)
    )
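To see why this matters, compare a tiny perfect sample with a large merely-good one. The small helper below is also a minimal sketch of the confidence score the final ranking function expects later (the name compute_confidence_score comes from that function; wiring it straight to the Wilson lower bound is my assumption):

def compute_confidence_score(sentiment_preds, total):
    pos = int(sum(sentiment_preds))  # predictions are 1 = positive, 0 = negative
    return wilson_score(pos, total)

print(wilson_score(4, 4))     # 100% positive from 4 reviews -> ~0.510
print(wilson_score(90, 100))  # 90% positive from 100 reviews -> ~0.826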
Sometimes, Quality Drops Fast
Here’s something the data doesn’t always tell you unless you zoom in: quality can suddenly crash.
One property had stellar reviews until the last two weeks. Then, suddenly: noise complaints, unclean bathrooms, rude host behavior.
That’s when I realized I needed a time-sensitive penalty. So I added a simple rule:
If more than 60% of reviews in the last 30 days are negative, apply a 30% penalty to the final score.
This helped flag sudden issues early — even if the overall score still looked decent.
from datetime import datetime, UTC  # the UTC constant requires Python 3.11+

def compute_recent_negative_penalty(reviews):
    """Apply a 30% penalty when more than 60% of the last 30 days' reviews are negative."""
    now = datetime.now(UTC)
    recent_reviews = [
        r for r in reviews
        if (now - datetime.fromisoformat(r["createdAt"].replace("Z", "")).replace(tzinfo=UTC)).days <= 30
    ]
    if not recent_reviews:
        return 1.0  # no recent signal, so no penalty
    # Batch all recent comments into one model call instead of one call per review
    preds = predict_sentiment([r["reviewComment"] for r in recent_reviews])
    neg_recent = int(np.sum(preds == 0))
    return 0.7 if neg_recent / len(recent_reviews) > 0.6 else 1.0
Stitching It All Together
Now I had multiple signals:
- Sentiment Score → How users feel
- Freshness Score → How recent the feedback is
- Wilson Confidence → How reliable the sentiment is
- Average Rating → How people rated the listing
- Recent Negative Penalty → Is something wrong right now?
I assigned weights based on what I felt mattered most:
- Sentiment Score: 0.4
- Average Rating: 0.3
- Freshness Score: 0.2
- Wilson Confidence: 0.1
- Recent Negative Penalty: applied as a multiplier (×0.7 when triggered)
These weights can be tuned to your needs.
Finally, I combined everything into a single score:
def rank_property(property_data):
    reviews = property_data.get("reviews", [])
    if not reviews:
        return 0
    sentiment_score, sentiment_preds = compute_sentiment_score(reviews)
    avg_rating = compute_average_rating(reviews)  # assumed normalized to [0, 1]
    freshness_score = compute_freshness_score(reviews)
    conf_score = compute_confidence_score(sentiment_preds, len(reviews))
    penalty = compute_recent_negative_penalty(reviews)
    # Weighted blend of the four signals, scaled by the recent-negative penalty
    final_score = (
        0.4 * sentiment_score +
        0.3 * avg_rating +
        0.2 * freshness_score +
        0.1 * conf_score
    ) * penalty
    return round(final_score, 3)
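One helper here, compute_average_rating, hasn’t appeared yet. A minimal sketch, assuming each review carries a 1-to-5 star "rating" field (the field name is my assumption) and normalizing the mean to [0, 1] so it shares the same scale as the other terms in the weighted sum:

def compute_average_rating(reviews):
    ratings = [r["rating"] for r in reviews if "rating" in r]
    if not ratings:
        return 0
    return np.mean(ratings) / 5.0  # map 1-5 stars into [0, 1]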
Sorting properties was now just a matter of computing scores and ranking them:
def rank_properties(property_list):
    scored = [(prop, rank_property(prop)) for prop in property_list]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored
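As a quick smoke test, here’s the whole pipeline run end to end on toy data. The listings below are made up for illustration, matching the review shape assumed throughout ("createdAt" ISO timestamps, "reviewComment" text, "rating" stars):

sample_listings = [
    {"id": "loft-a", "reviews": [
        {"createdAt": "2025-01-05T10:00:00Z", "reviewComment": "Spotless flat, host replied within minutes.", "rating": 5},
        {"createdAt": "2025-01-20T18:30:00Z", "reviewComment": "Great location, would stay again.", "rating": 4},
    ]},
    {"id": "flat-b", "reviews": [
        {"createdAt": "2025-01-18T09:00:00Z", "reviewComment": "The flat smelled awful and nobody responded.", "rating": 2},
    ]},
]

for prop, score in rank_properties(sample_listings):
    print(prop["id"], score)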
🚀 What’s Next?
There’s still more to explore. Some ideas on my list:
- Extracting topics from reviews (e.g. noise, location, host responsiveness)
- Adding personalization based on user preferences
- Prioritizing listings with verified photos or host replies
But for now, this hybrid model already feels dramatically more robust than the average-rating-only world most platforms live in.
🧾 Final Thoughts
Ranking systems shape the decisions people make: what to trust, what to skip, where to stay. But stars alone are just the surface. Underneath lies nuance: tone, recency, trustworthiness and even warning signs.
By blending NLP, time decay, statistical rigor and anomaly detection, I built a system that doesn’t just sort listings: it understands them.
If you’re working on something similar — or thinking of improving how things are ranked — feel free to borrow ideas from this. The system’s modular, explainable and surprisingly adaptable to almost any domain where trust and timing matter.
🧪 Try It Yourself
The full implementation is live and open-source — built as a complete stack with:
- 🧠 Python for sentiment analysis and scoring
- 🧰 Spring Boot as the backend API
- ⚛️ React for the frontend
- ⚡ Redis for fast in-memory ranking cache
👉 Check it out on GitHub:
🔗 Room Bridge — Intelligent Ranking & Chat System for Property Rentals
Feel free to explore the code, fork it, or contribute.
🙌 Thanks for Reading
Hope this post gave you some ideas on how to rethink review-based rankings. If you enjoyed this, feel free to connect, share feedback, or star the repo!