import polars as pl
import pandas as pd
import numpy as np
from datetime import datetime, date
import pickle
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from sentence_transformers import SentenceTransformer
import faiss

Le Monde, 1 y. of comments - Ukraine War
As a (partial) proxy to measure reader engagement
As a reader of Le Monde, and of the comments section ;), I would regularly encounter familiar subscribers’ names. One in particular (more on that later) would manually keep track of, count & cite “pro-Russian” contributors, directly in the comments. That triggered my need to collect the data and perform some analysis in a more data-science-oriented fashion.
After our initial data collection1, we use Polars as an alternative to Pandas (which is still used in parts where we had to code faster) to perform aggregations, and Plotly to visualize2.
1 Custom API, dataset & scope on my other project
2 This article itself = ipynb (source notebook) -> qmd -> html , via Quarto
The analysis focuses on comments/authors (big numbers, activity over time, cohort analysis…) rather than on articles & titles. We also lay the foundations for deeper semantic analysis through semantic search on comments, via SBERT embeddings + a Faiss index.
Some Polars / Plotly config. to better render our data.
Code
# Polars, render text columns nicer when printing / displaying df
pl.Config.set_fmt_str_lengths(50)
pl.Config.set_tbl_cols(10)
pl.Config.set_tbl_width_chars(120)
pl.Config.set_tbl_rows(10)
pl.Config.set_tbl_hide_dataframe_shape(True) # prevents systematic display of df/table shape
# change default plotly express theme
import plotly.io as pio
print(f" reminder plotly quick templates : {pio.templates}")
template = "simple_white"Load data
236k comments collected, with associated articles & titles | Feb 24th 2022 - Feb 24th 2023
Reminder : the conflict started on February 24th, 2022, if we exclude the prior Donbas “events”.
Load our .parquet dataset3 using Polars.
3 Used keywords, scope and limitations of our dataset
# Read parquet using Polars. Could also use scan + collect syntax for lazy execution
# If interested, I did some speed benchmark in the dataset project (lmd_ukr).
filepath = "data/lmd_ukraine.parquet"
coms = pl.read_parquet(filepath)

shape: (236643, 12)
┌──────────┬───────────┬───────────┬──────────┬──────────┬───┬────────────┬────────────┬─────────┬──────────┬──────────┐
│ article_ ┆ url ┆ title ┆ desc ┆ content ┆ … ┆ article_ty ┆ allow_comm ┆ premium ┆ author ┆ comment │
│ id ┆ --- ┆ --- ┆ --- ┆ --- ┆ ┆ pe ┆ ents ┆ --- ┆ --- ┆ --- │
│ --- ┆ str ┆ str ┆ str ┆ str ┆ ┆ --- ┆ --- ┆ bool ┆ str ┆ str │
│ i64 ┆ ┆ ┆ ┆ ┆ ┆ cat ┆ bool ┆ ┆ ┆ │
╞══════════╪═══════════╪═══════════╪══════════╪══════════╪═══╪════════════╪════════════╪═════════╪══════════╪══════════╡
│ 3259703 ┆ https://w ┆ Le ┆ Au ┆ Parce ┆ … ┆ Factuel ┆ true ┆ false ┆ Ricardo ┆ La │
│ ┆ ww.lemond ┆ conflit ┆ Festival ┆ qu’elle ┆ ┆ ┆ ┆ ┆ Uztarroz ┆ question │
│ ┆ e.fr/actu ┆ russo-ukr ┆ de journ ┆ est ┆ ┆ ┆ ┆ ┆ ┆ qui │
│ ┆ alite-med ┆ ainien, ┆ alisme ┆ revenue ┆ ┆ ┆ ┆ ┆ ┆ vaille │
│ ┆ ias/artic ┆ qui ┆ de Couth ┆ frapper ┆ ┆ ┆ ┆ ┆ ┆ et qui │
│ ┆ le/20… ┆ mobilise ┆ ures : ┆ à nos ┆ ┆ ┆ ┆ ┆ ┆ n'est │
│ ┆ ┆ les médi… ┆ la ┆ portes, ┆ ┆ ┆ ┆ ┆ ┆ pas │
│ ┆ ┆ ┆ guerr… ┆ q… ┆ ┆ ┆ ┆ ┆ ┆ posée │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ dan… │
└──────────┴───────────┴───────────┴──────────┴──────────┴───┴────────────┴────────────┴─────────┴──────────┴──────────┘
Order of magnitude
One always likes to know “how big”. For instance, Le Monde regularly gets asked how many questions readers send during live sessions, etc. We do not have those numbers, but article / comment counts are still nice to have. I am curious about which comment metrics are available to Le Monde behind the scenes. Probably a lot.
Editorial & comment activity, elements of comparison
Our dataset excludes the “Lives”, which represent a substantial share of Le Monde’s coverage effort. But just by reading the newspaper (or any, really) we know that they have been mobilizing a lot of resources. Also, from regularly lurking in the comments section, I know the topic is rather engaging. Now we have accurate numbers, at least.
Code
# Activity avg posts or comments per day
days = 365
total_articles = 2590
total_comments = 236643
# articles, (excluding lives/blogs posts) per day
print(f"Theme, Ukraine conflict:")
print(f" - avg articles per day: {total_articles/days:.2f}")
# comments per day
print(f" - avg comments per day: {total_comments/days:.2f}")
# avg n comments per article
print(f" - avg comments per article: {total_comments/total_articles:.2f}")Theme, Ukraine conflict:
- avg articles per day: 7.10
- avg comments per day: 648.34
- avg comments per article: 91.37
Imagine publishing 7 articles a day on a topic, for one year. To put our data in perspective, I performed a quick side scraping on two additional topics. Article counts are exhaustive (over a given one-month period), whereas comment activity was sampled from a random selection of articles for each topic, to move quickly.
- “Réforme retraites” : a very hot topic, whose “hottest” month of coverage (demonstrations, strikes) happened within the same time span as the conflict.
- “Trump” : an always hot/engaging topic in any media, though a bit out of fashion nowadays.
Code
# Collected benchmark data
themes = ["réforme retraites", "Donald trump"]
n_articles = [
374,
66,
] # obtained on 1 month data (jan/febr 2023, exhaustive/no sampling)
n_days = 31
from_sample_avg_comments_per_articles = [
124,
40,
] # obtained from a sample of 20 articles for each theme.
for idx, theme in enumerate(themes):
print(f"Theme, {theme}:")
print(f" - avg articles per day: {n_articles[idx]/n_days:.2f}")
print(
f" - avg comments per article: {from_sample_avg_comments_per_articles[idx]:.2f}\n"
)

Theme, réforme retraites:
- avg articles per day: 12.06
- avg comments per article: 124.00
Theme, Donald trump:
- avg articles per day: 2.13
- avg comments per article: 40.00
Ukraine coverage has been a continuous, long-term effort by Le Monde, with high interest from the public. Even though I selected the most active month as the benchmark for Retraites, its coverage/engagement is very similar to Ukraine’s, whose numbers span a full year. As for Trump, hardly the least engaging topic, subscriber engagement is less than half that of Ukraine.
People engagement on war over time, subscribers’ comment activity as a proxy
TL;DR : reader activity stays high regardless of how frequently Le Monde publishes. Some activity peaks would need further investigation but are probably tied to the usual notable events (offensives, nuclear threats…). Before this analysis I expected engagement to decrease over time, even slightly, but it seems not. We are also aware that comments as a proxy carry biases4.
4 Le Monde’s 500k subscribers are a particular demographic, and only a handful of them are active comment authors.
Code-wise, our general workflow revolves around time-series groupbys / various aggregations with Polars, then converting the results back to Pandas for a quicker viz via Plotly. We will experiment with diverse metrics / windows to better render subscriber activity over this first year : daily comment count, weekly / monthly averages, lines vs. bars, comments per article (daily / weekly), and a moving (rolling) mean over a 30-day window.
Daily number of comments, weekly, monthly averages
# 1. first thing first, number of comms per day
coms_daily_count = coms.groupby("date").count().sort("date", descending=False)
# 2. average number of comms per week (groupby window, using groupby_dynamic method in Polars)
weekly_avg = coms_daily_count.groupby_dynamic("date", every="1w").agg(
[pl.col("count").mean()]
)
# 3.same as above but per month
monthly_avg = coms_daily_count.groupby_dynamic("date", every="1mo").agg(
[pl.col("count").mean()]
)
# from left to right - average number of comments :
# weekly avg (line), weekly avg (bars), monthly avg (line), monthly avg (bar)

┌────────────┬─────────────┐
│ date ┆ count │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪═════════════╡
│ 2022-02-24 ┆ 1520.0 │
│ 2022-03-03 ┆ 1743.666667 │
│ 2022-03-10 ┆ 1363.0 │
└────────────┴─────────────┘
┌────────────┬─────────────┐
│ date ┆ count │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪═════════════╡
│ 2022-02-01 ┆ 2427.333333 │
│ 2022-03-01 ┆ 1431.833333 │
│ 2022-04-01 ┆ 837.4 │
└────────────┴─────────────┘
Code
fig1 = px.line(
weekly_avg.to_pandas(),
x="date",
y="count",
#width=200,
height=300,
template=template,
)
fig2 = px.bar(
weekly_avg.to_pandas(),
x="date",
y="count",
#width=200,
height=300,
template=template,
)
fig3 = px.line(
monthly_avg.to_pandas(),
x="date",
y="count",
#width=600,
height=300,
template=template,
)
fig4 = px.bar(
monthly_avg.to_pandas(),
x="date",
y="count",
#width=600,
height=300,
template=template,
)
fig1.show()
fig2.show()
fig3.show()
fig4.show()

Lowering the impact of article publication frequency : ratio comments / articles, rolling mean
When plotting the weekly / monthly average of comments (above), we clearly distinguish 3 periods of high activity (the start of the conflict + 2 others), on top of a sustained, constant reader involvement.
But since the number of comments is probably tied to how many articles Le Monde published over the same period, let’s visualize comment activity with normalization : comments per article (removes the article-frequency effect) and a rolling mean (smooths things out).
# daily ratio comms per article. Still using Polars syntax >.<
# 1. group by dates (daily), agg count articles, count comments
daily_coms_per_articles = (
coms.groupby(by="date")
.agg(
[
pl.col("article_id").n_unique().alias("count_articles"),
pl.col("comment").count().alias("count_comments"),
]
)
.sort("date", descending=False)
)
# 2. then calculate coms per articles
daily_coms_per_articles = daily_coms_per_articles.with_columns(
(pl.col("count_comments") / pl.col("count_articles")).alias("coms_per_article")
)

┌────────────┬────────────────┬────────────────┬──────────────────┐
│ date ┆ count_articles ┆ count_comments ┆ coms_per_article │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ u32 ┆ u32 ┆ f64 │
╞════════════╪════════════════╪════════════════╪══════════════════╡
│ 2022-02-24 ┆ 36 ┆ 3762 ┆ 104.5 │
│ 2022-02-25 ┆ 30 ┆ 2735 ┆ 91.166667 │
│ 2022-02-26 ┆ 7 ┆ 785 ┆ 112.142857 │
└────────────┴────────────────┴────────────────┴──────────────────┘
# weekly ratio coms per article. Polars method is .groupby_dynamic()
weekly_coms_per_articles = (
coms.sort("date", descending=False)
.groupby_dynamic("date", every="1w")
.agg(
[
pl.col("article_id").n_unique().alias("count_articles"),
pl.col("comment").count().alias("count_comments"),
]
)
.sort("date", descending=False)
)
weekly_coms_per_articles = weekly_coms_per_articles.with_columns(
(pl.col("count_comments") / pl.col("count_articles")).alias("coms_per_article")
)

┌────────────┬────────────────┬────────────────┬──────────────────┐
│ date ┆ count_articles ┆ count_comments ┆ coms_per_article │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ u32 ┆ u32 ┆ f64 │
╞════════════╪════════════════╪════════════════╪══════════════════╡
│ 2022-02-24 ┆ 78 ┆ 7600 ┆ 97.435897 │
│ 2022-03-03 ┆ 130 ┆ 10462 ┆ 80.476923 │
│ 2022-03-10 ┆ 125 ┆ 9541 ┆ 76.328 │
└────────────┴────────────────┴────────────────┴──────────────────┘
Comment activity stays high throughout this “first” year of conflict, whatever the article publication rhythm, with even more comments per article at the end of the period than at the very start.
Some context : first two weeks of September : Ukrainian counter-offensive in Kharkiv & Russian mobilization. January 2023 : battle tanks ?
Code
px.bar(
weekly_coms_per_articles.to_pandas(),
x="date",
y="coms_per_article",
#width=600,
height=400,
template=template,
)

Moving (rolling) mean, another way to (kind of) smooth out article frequency, without the hassle above.
moving_mean = coms_daily_count.with_columns(
pl.col("count").rolling_mean(window_size=30).alias("moving_mean")
)

┌────────────┬───────┬─────────────┐
│ date ┆ count ┆ moving_mean │
│ --- ┆ --- ┆ --- │
│ date ┆ u32 ┆ f64 │
╞════════════╪═══════╪═════════════╡
│ 2023-02-24 ┆ 2322 ┆ 819.466667 │
│ 2023-02-25 ┆ 899 ┆ 777.233333 │
│ 2023-02-26 ┆ 288 ┆ 775.3 │
└────────────┴───────┴─────────────┘
Code
px.bar(
moving_mean.to_pandas(),
x="date",
y="moving_mean",
#width=600,
height=400,
template=template,
)

Who are the most active contributors ? Hardcore posters vs. the silent crowd
Fun fact : goupil_hardi acts as a true “sentinel” of the comments section, to the point where he manually counts & regularly cites the pro-Russian contributions under the articles. He is the one who made me decide to collect the dataset and build this notebook.
We could also do a lot of interesting work on troll detection (if any get through, as access to comments is pretty restricted), but we focused our efforts elsewhere.
Contribution shape, as expected, hardcore posters vs. the rest
10,700 authors, an average of 22 comments over the year, but a median of only 4. Two authors have more than 2K comments. See how the top authors skew the distribution below.
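The authors dataframe used just below is not built in this section ; a minimal sketch, assuming it is simply the per-author comment count computed with Polars :
# Hypothetical reconstruction of `authors` : number of comments per author, most active first
authors = (
    coms.groupby("author")
    .count()
    .sort("count", descending=True)
)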
authors.describe()

┌────────────┬────────┬───────────┐
│ describe ┆ author ┆ count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞════════════╪════════╪═══════════╡
│ count ┆ 10700 ┆ 10700.0 │
│ null_count ┆ 0 ┆ 0.0 │
│ mean ┆ null ┆ 22.116168 │
│ std ┆ null ┆ 77.995757 │
│ min ┆ ┆ 1.0 │
│ max ┆ ㅤㅤ ┆ 2087.0 │
│ median ┆ null ┆ 4.0 │
│ 25% ┆ null ┆ 1.0 │
│ 75% ┆ null ┆ 12.0 │
└────────────┴────────┴───────────┘
Code
fig_violin = px.violin(
authors.to_pandas(),
y="count",
box=True,
points="all", # add data points
#width=600,
height=400,
template=template,
)
fig_violin.show()

Names that ring a bell
With time to spare, one could do some semantic search / analysis on the arguments of each side, e.g. the dissemination of pro-Russia arguments. But here is a simple overview of selected authors & comments.
“Goupil Hardi”, the second top poster, with 2034 comments in 365 days (more than 5 a day, on Ukraine alone). Also note that the comment section is limited to one comment per author per article, plus 3 replies-to-comment.
# In Polars (unless I'm doing it wrong), extracting a column's values is less direct than with Pandas.
selected_coms = coms.select(["author", "date", "comment"]).filter(
pl.col("author") == "goupil hardi").sample(n=3, seed=42).get_column("comment").to_list()
for i, com in enumerate(selected_coms):
print(f"({i+1}) {com[0:90]}...")- les PERLES de DENIS MONOD-BROCA (ci-dessous à 22h39 – heure de Moscou 🦊) 12/3/22 «L’Europ…
- J’attends toujours vos excuses, Benjamin Valberg, pour avoir falsifié mes propos et m’avo…
- Bien sûr qu’on finira par négocier, mais l’objectif poursuivi par tous les ennemis de la…
What about “Lux”, the top poster ? I don’t remember seeing the name, but it is a similar profile, with less dedication ;)
- @Eric j’ai surpris plusieurs fois ce citoyen qui falsifiait ses sources soit en inventant …
- Non j’ai rien modifié, les chiffres sont issus de la propagande d’Alexandre Latsa sur son…
- Plf on vous a reconnu. Mais il faut vous rendre à l’évidence…
“Monod-Broca”. Well, to each their own.
- En l’affaire je vois plutôt un David russe face à un Goliath otanien si sûr de sa supre…
- @ kaiwin. Durant l’Antiquité le Kosovo fait partie de l’empire romain puis de l’Empire Se…
- Ce n’est pas un jeu, nous ne l’avons seulement « énervé », nous l’avons délibérément,…
Engagement through cohort analysis : what about the retention rate ?
A fancier way to analyze people’s engagement with the topic over time.
It would be interesting to benchmark this against other topics.
Steps
1. add/get the comment month -> the month of the comment, for each author
2. add/get the cohort month (first month the author posted a comment) -> cohort creation
3. add/get the cohort index for each row
# clone data to avoid modifying our original dataset in place
cohort = coms.clone()

A reminder of what our original data looks like :
┌──────────┬───────────┬───────────┬──────────┬──────────┬───┬────────────┬────────────┬─────────┬──────────┬──────────┐
│ article_ ┆ url ┆ title ┆ desc ┆ content ┆ … ┆ article_ty ┆ allow_comm ┆ premium ┆ author ┆ comment │
│ id ┆ --- ┆ --- ┆ --- ┆ --- ┆ ┆ pe ┆ ents ┆ --- ┆ --- ┆ --- │
│ --- ┆ str ┆ str ┆ str ┆ str ┆ ┆ --- ┆ --- ┆ bool ┆ str ┆ str │
│ i64 ┆ ┆ ┆ ┆ ┆ ┆ cat ┆ bool ┆ ┆ ┆ │
╞══════════╪═══════════╪═══════════╪══════════╪══════════╪═══╪════════════╪════════════╪═════════╪══════════╪══════════╡
│ 3259703 ┆ https://w ┆ Le ┆ Au ┆ Parce ┆ … ┆ Factuel ┆ true ┆ false ┆ Ricardo ┆ La │
│ ┆ ww.lemond ┆ conflit ┆ Festival ┆ qu’elle ┆ ┆ ┆ ┆ ┆ Uztarroz ┆ question │
│ ┆ e.fr/actu ┆ russo-ukr ┆ de journ ┆ est ┆ ┆ ┆ ┆ ┆ ┆ qui │
│ ┆ alite-med ┆ ainien, ┆ alisme ┆ revenue ┆ ┆ ┆ ┆ ┆ ┆ vaille │
│ ┆ ias/artic ┆ qui ┆ de Couth ┆ frapper ┆ ┆ ┆ ┆ ┆ ┆ et qui │
│ ┆ le/20… ┆ mobilise ┆ ures : ┆ à nos ┆ ┆ ┆ ┆ ┆ ┆ n'est │
│ ┆ ┆ les médi… ┆ la ┆ portes, ┆ ┆ ┆ ┆ ┆ ┆ pas │
│ ┆ ┆ ┆ guerr… ┆ q… ┆ ┆ ┆ ┆ ┆ ┆ posée │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ dan… │
└──────────┴───────────┴───────────┴──────────┴──────────┴───┴────────────┴────────────┴─────────┴──────────┴──────────┘
# We will only use author, date and article_id to build our cohort
# Also, switch to Pandas here, as I am more familiar with it for the following operations
relevant_columns = ["author", "date", "article_id"]
cohort = cohort.select(relevant_columns)
cohort = cohort.to_pandas()
cohort.head(2)

|   | author | date | article_id |
|---|---|---|---|
| 0 | Ricardo Uztarroz | 2022-07-16 | 3259703 |
| 1 | Ricardo Uztarroz | 2022-07-16 | 3259703 |
Shape to cohort (Pandas)
# 1. comment month
# tip : map is faster than apply ; we can use it because we're dealing with one column at a time
cohort["comment_month"] = cohort["date"].map(lambda x: datetime(x.year, x.month, 1))
display(cohort.head(2))
# 2. cohort month
# tip : transform after a groupby returns a result with the same length as the original,
# here the min comment_month for each author
cohort["cohort_month"] = cohort.groupby("author")["comment_month"].transform("min")
display(cohort.head(2))

|   | author | date | article_id | comment_month |
|---|---|---|---|---|
| 0 | Ricardo Uztarroz | 2022-07-16 | 3259703 | 2022-07-01 |
| 1 | Ricardo Uztarroz | 2022-07-16 | 3259703 | 2022-07-01 |
|   | author | date | article_id | comment_month | cohort_month |
|---|---|---|---|---|---|
| 0 | Ricardo Uztarroz | 2022-07-16 | 3259703 | 2022-07-01 | 2022-02-01 |
| 1 | Ricardo Uztarroz | 2022-07-16 | 3259703 | 2022-07-01 | 2022-02-01 |
# 3. cohort index : for each row, difference in months
# between the comment month and the cohort month (+1 so the first month is 1)
def get_date(df, column):
year = df[column].dt.year
month = df[column].dt.month
day = df[column].dt.day
return year, month, day
comment_year, comment_month, _ = get_date(cohort, "comment_month")
cohort_year, cohort_month, _ = get_date(cohort, "cohort_month")
year_diff = comment_year - cohort_year
month_diff = comment_month - cohort_month
cohort["cohort_index"] = year_diff * 12 + month_diff + 1
display(cohort.head(4))

|   | author | date | article_id | comment_month | cohort_month | cohort_index |
|---|---|---|---|---|---|---|
| 0 | Ricardo Uztarroz | 2022-07-16 | 3259703 | 2022-07-01 | 2022-02-01 | 6 |
| 1 | Ricardo Uztarroz | 2022-07-16 | 3259703 | 2022-07-01 | 2022-02-01 | 6 |
| 2 | Correcteur | 2022-07-16 | 3259703 | 2022-07-01 | 2022-02-01 | 6 |
| 3 | Jean-Doute | 2022-07-16 | 3259703 | 2022-07-01 | 2022-02-01 | 6 |
Cohort percentage active users (retention rate, in %)
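The active_authors dataframe (count of active authors per cohort) is not built in this section either ; a minimal sketch of how it could be reconstructed from the cohort dataframe with Pandas, assuming a simple pivot of distinct authors per cohort month and cohort index :
# Hypothetical reconstruction of `active_authors` :
# distinct authors per (cohort_month, cohort_index), pivoted into a cohort table
active_authors = (
    cohort.groupby(["cohort_month", "cohort_index"])["author"]
    .nunique()
    .reset_index(name="n_authors")
    .pivot(index="cohort_month", columns="cohort_index", values="n_authors")
)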
Code
# Clone previous dataframe (cohort users, count)
active_authors_pct = active_authors.copy(deep=True)
# get %
for col in active_authors_pct.columns[1:]:
active_authors_pct[col] = round(
active_authors_pct[col] / active_authors_pct[1] * 100, 2
)
active_authors_pct[1] = 100
# Generate heatmap (cohort users, %)
labels = {"x": "n months", "y": "cohort (by month)", "color": "% author"}
fig_pct = px.imshow(active_authors_pct, text_auto=True, labels=labels)
fig_pct= fig_pct.update_xaxes(side="top", ticks="outside", tickson="boundaries", ticklen=5)
fig_pct = fig_pct.update_yaxes(showgrid=False)
fig_pct = fig_pct.update_layout(
{
"xaxis": {"tickmode": "linear", "showgrid": False},
#"width": 800,
"height": 500,
"plot_bgcolor": "rgba(0, 0, 0, 0)",
"paper_bgcolor": "rgba(0, 2, 0, 0)",
}
)

| cohort_index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cohort_month | |||||||||||||
| 2022-02-01 | 100 | 76.79 | 64.65 | 56.32 | 48.12 | 43.41 | 49.45 | 55.04 | 49.18 | 38.83 | 38.51 | 45.38 | 48.08 |
| 2022-03-01 | 100 | 37.87 | 29.88 | 22.93 | 19.19 | 24.85 | 28.97 | 25.49 | 16.68 | 17.99 | 21.46 | 24.05 | NaN |
| 2022-04-01 | 100 | 20.39 | 13.95 | 12.88 | 15.45 | 20.60 | 17.70 | 11.70 | 13.09 | 17.38 | 16.09 | NaN | NaN |
| 2022-05-01 | 100 | 12.06 | 9.58 | 14.99 | 18.24 | 13.91 | 9.74 | 9.27 | 13.29 | 15.15 | NaN | NaN | NaN |
| 2022-06-01 | 100 | 16.57 | 21.60 | 21.60 | 17.16 | 12.72 | 13.91 | 15.09 | 20.41 | NaN | NaN | NaN | NaN |
| 2022-07-01 | 100 | 20.97 | 23.22 | 17.23 | 11.24 | 13.11 | 17.98 | 18.35 | NaN | NaN | NaN | NaN | NaN |
| 2022-08-01 | 100 | 21.05 | 14.17 | 9.11 | 7.69 | 12.35 | 11.94 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2022-09-01 | 100 | 12.09 | 10.53 | 9.55 | 14.42 | 12.28 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2022-10-01 | 100 | 8.10 | 7.26 | 13.13 | 15.36 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2022-11-01 | 100 | 6.34 | 11.27 | 14.79 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2022-12-01 | 100 | 18.22 | 16.36 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2023-01-01 | 100 | 20.29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2023-02-01 | 100 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Comments embedding & fast retrieval using SBERT, Faiss
Example of use : retrieving similar arguments / checking for propaganda in an efficient way, with Faiss.
Semantic search : resources & performance overview of our curated models
Notes and resources I found to be useful
Model choice, must read : symmetric vs. asymmetric semantic search, language, tuning : Sbert is all u need
Models available and metrics: Sbert doc
Misc
Models trained for cosine similarity tend to retrieve shorter documents, vs. dot product (longer ones).
Faiss can use the inner product (= dot product ; equivalent to the cosine score if vectors are normalized, e.g. using faiss.normalize_L2) or L2 to measure distances (more here).
In the first place, we were not sure of our typical use case : short query => long comment (asymmetric), or comment => similar comment (symmetric).
Candidate models we curated & tested :
Example here (pytorch)
Post run evaluation and remarks :
- mpnet-base-v2, distiluse, quora : fast encoding (20k documents < 1 mn), results quite similar between models, each one finds our test query and pertinent results. A very good baseline.
- mpnet-base-v2, distiluse, quora : with a flat inner-product Faiss index, no difference whether we normalize the vectors or not, maybe because they are optimized for cosine already ?
- camembert is a bigger model (1024 dimensions), slower encoding (20k docs = 5 mn), nice (better ?) results (spoiler alert : it is optimized for French). With a flat IP index, if normalize = False, it retrieves similar, short documents. If we normalize our embeddings, it retrieves our initial query + longer, similar documents.

Speed up our model evaluation through prior random sampling, notes on speed
To speed our experiments up, we will work with a sample of comments (around 10% : 236k -> 20k)
FYI, embedding all comments (all 236k) takes approx. 10 mn on a 1080 Ti / i7 7700K / 32 GB RAM with the curated models ; ×1.5 to ×2 when using the biggest model (Camembert). Encoding our sample (10k) takes < 1 mn.
No detailed measure on inference speed, but it is very fast with Faiss. We might want to try different, optimized index types with a bigger dataset.
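A minimal sketch of how such a random sample could be drawn with Polars (sample size and seed are illustrative, not necessarily the ones used in the project) :
# Illustrative sampling of comments ; the actual size/seed may differ
coms_sample = coms.sample(n=20_000, seed=42)
sample_texts = coms_sample.get_column("comment").to_list()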
Convenience functions to repeat our experiments with different indexes / models
Load model, encode comments
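A minimal sketch of this step with sentence-transformers, assuming one of the multilingual candidates mentioned above and reusing sample_texts from the sampling sketch :
from sentence_transformers import SentenceTransformer

# Example model choice (an assumption) ; any of the curated candidates would fit here
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
# Encode the sampled comments into dense float32 vectors (shape : n_docs x embedding_dim)
embeddings = model.encode(sample_texts, convert_to_numpy=True, show_progress_bar=True)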
Create (or load) our Faiss index, here a flat index
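A minimal sketch of a flat inner-product index, with L2 normalization so that the inner product matches the cosine score :
import faiss
import numpy as np

emb = np.asarray(embeddings, dtype="float32")
faiss.normalize_L2(emb)                    # in-place normalization -> inner product == cosine
index = faiss.IndexFlatIP(emb.shape[1])    # exact (flat) inner-product index
index.add(emb)
print(index.ntotal, "vectors indexed")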
Let’s find similar comments
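A minimal retrieval sketch, reusing model, index and sample_texts from the sketches above ; the query string is made up for illustration :
query = "livraison de chars de combat à l'Ukraine"   # hypothetical example query
q_emb = model.encode([query], convert_to_numpy=True).astype("float32")
faiss.normalize_L2(q_emb)
scores, ids = index.search(q_emb, 5)                  # top-5 most similar comments
for score, idx in zip(scores[0], ids[0]):
    print(f"{score:.3f} | {sample_texts[idx][:90]}…")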
Maybe later, just for fun : zero-shot “tone” classification tests using the OpenAI API.