import polars as pl
import pandas as pd
import numpy as np
from datetime import datetime, date
import pickle
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from sentence_transformers import SentenceTransformer
import faiss
Le Monde, 1 y. of comments - Ukraine War
As a (partial) proxy to measure people's engagement
Inside
As a reader of Le Monde (and of its comments section ;) I regularly encounter familiar subscribers’ names. One in particular (more on that later) manually keeps track of, counts & cites “pro-Russian” contributors, directly in the comments. That triggered my need to collect the data and analyze it in a more data-science-oriented fashion.
After our initial data collection [1], we use Polars as an alternative to Pandas (still used in some parts when we had to code faster) to perform aggregations, and Plotly to visualize [2].
[1] Custom API, dataset & scope on my other project
[2] This article itself = ipynb (source notebook) -> qmd -> html, via Quarto
The analysis focuses on comments/authors (big numbers, activity over time, cohort analysis…) rather than on articles & titles. We also lay the foundations for a deeper semantic analysis, through semantic search on comments via Sbert embeddings + a Faiss index.
Some Polars / Plotly config. to better render our data.
Code
# Polars, render text columns nicer when printing / displaying df
pl.Config.set_fmt_str_lengths(50)
pl.Config.set_tbl_cols(10)
pl.Config.set_tbl_width_chars(120)
pl.Config.set_tbl_rows(10)
pl.Config.set_tbl_hide_dataframe_shape(True)  # prevents systematic display of df/table shape

# change default plotly express theme
import plotly.io as pio

print(f" reminder plotly quick templates : {pio.templates}")
template = "simple_white"
Load data
236k comments collected, with associated articles & titles | 24 Feb 2022 - 24 Feb 2023
Reminder: the conflict starts on 24 February 2022, if we exclude the prior Donbas “events”.
Load our .parquet dataset [3] using Polars.
[3] Used keywords, scope and limitations of our dataset
# Read parquet using Polars. Could also use scan + collect syntax for lazy execution
# If interested, I did some speed benchmark in the dataset project (lmd_ukr).
filepath = "data/lmd_ukraine.parquet"
coms = pl.read_parquet(filepath)
shape: (236643, 12)
┌──────────┬───────────┬───────────┬──────────┬──────────┬───┬────────────┬────────────┬─────────┬──────────┬──────────┐
│ article_ ┆ url ┆ title ┆ desc ┆ content ┆ … ┆ article_ty ┆ allow_comm ┆ premium ┆ author ┆ comment │
│ id ┆ --- ┆ --- ┆ --- ┆ --- ┆ ┆ pe ┆ ents ┆ --- ┆ --- ┆ --- │
│ --- ┆ str ┆ str ┆ str ┆ str ┆ ┆ --- ┆ --- ┆ bool ┆ str ┆ str │
│ i64 ┆ ┆ ┆ ┆ ┆ ┆ cat ┆ bool ┆ ┆ ┆ │
╞══════════╪═══════════╪═══════════╪══════════╪══════════╪═══╪════════════╪════════════╪═════════╪══════════╪══════════╡
│ 3259703 ┆ https://w ┆ Le ┆ Au ┆ Parce ┆ … ┆ Factuel ┆ true ┆ false ┆ Ricardo ┆ La │
│ ┆ ww.lemond ┆ conflit ┆ Festival ┆ qu’elle ┆ ┆ ┆ ┆ ┆ Uztarroz ┆ question │
│ ┆ e.fr/actu ┆ russo-ukr ┆ de journ ┆ est ┆ ┆ ┆ ┆ ┆ ┆ qui │
│ ┆ alite-med ┆ ainien, ┆ alisme ┆ revenue ┆ ┆ ┆ ┆ ┆ ┆ vaille │
│ ┆ ias/artic ┆ qui ┆ de Couth ┆ frapper ┆ ┆ ┆ ┆ ┆ ┆ et qui │
│ ┆ le/20… ┆ mobilise ┆ ures : ┆ à nos ┆ ┆ ┆ ┆ ┆ ┆ n'est │
│ ┆ ┆ les médi… ┆ la ┆ portes, ┆ ┆ ┆ ┆ ┆ ┆ pas │
│ ┆ ┆ ┆ guerr… ┆ q… ┆ ┆ ┆ ┆ ┆ ┆ posée │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ dan… │
└──────────┴───────────┴───────────┴──────────┴──────────┴───┴────────────┴────────────┴─────────┴──────────┴──────────┘
Order of magnitude
One always likes to know “how big”. For instance, Le Monde regularly gets asked how many questions people send during live sessions, etc. We do not have those numbers, but article / comment counts are still nice to have. I am curious which comment metrics are available to Le Monde behind the scenes. Probably a lot.
Editorial & comment activity, elements of comparison
Our dataset excludes the “Lives”, which represent a substantial coverage effort from Le Monde. But just by reading the newspaper (or any, really) we know that they have been mobilizing a lot of resources. Also, as a regular lurker in the comments section, I know the topic is rather engaging. Now we have accurate numbers, at least.
Code
# Activity avg posts or comments per day
days = 365
total_articles = 2590
total_comments = 236643

# articles, (excluding lives/blogs posts) per day
print("Theme, Ukraine conflict:")
print(f" - avg articles per day: {total_articles/days:.2f}")
# comments per day
print(f" - avg comments per day: {total_comments/days:.2f}")
# avg n comments per article
print(f" - avg comments per article: {total_comments/total_articles:.2f}")
Theme, Ukraine conflict:
- avg articles per day: 7.10
- avg comments per day: 648.34
- avg comments per article: 91.37
Imagine publishing 7 articles a day on a topic, for one year. To put our data in perspective, I did a quick side scraping of two additional topics. Article counts are exhaustive (over a given one-month period), whereas comment activity was estimated from a random selection of articles for each topic, to move quickly.
- “Réforme retraites”: a very hot topic, whose “hottest” month (demonstrations, strikes) falls within the same time span as the conflict.
- “Trump”: an always hot/engaging topic in any media, though a bit out of fashion nowadays.
Code
# Collected benchmark data
themes = ["réforme retraites", "Donald trump"]
n_articles = [
    374,
    66,
]  # obtained on 1 month data (jan/febr 2023, exhaustive/no sampling)
n_days = 31
from_sample_avg_comments_per_articles = [
    124,
    40,
]  # obtained from a sample of 20 articles for each theme.

for idx, theme in enumerate(themes):
    print(f"Theme, {theme}:")
    print(f" - avg articles per day: {n_articles[idx]/n_days:.2f}")
    print(
        f" - avg comments per article: {from_sample_avg_comments_per_articles[idx]:.2f}\n"
    )
Theme, réforme retraites:
- avg articles per day: 12.06
- avg comments per article: 124.00
Theme, Donald trump:
- avg articles per day: 2.13
- avg comments per article: 40.00
Ukraine coverage has been a continuous, long-term effort by Le Monde, with high interest from the public. Even though I selected the most active month as a benchmark for Retraites, its coverage/engagement stays in the same range as Ukraine, whose numbers cover a full year. As for Trump, which is not the least engaging topic, subscribers' engagement (comments per article) is less than half that of Ukraine.
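To make the comparison a bit more concrete, here is a small sketch that recomputes the ratios from the figures printed above. The dictionary layout and variable names are mine and not part of the original notebook.

```python
# Figures copied from the outputs above (Ukraine: 1 year, benchmarks: 1 month).
articles_per_day = {
    "ukraine": 2590 / 365,
    "retraites": 374 / 31,
    "trump": 66 / 31,
}
comments_per_article = {
    "ukraine": 236643 / 2590,
    "retraites": 124,  # estimated from a sample of 20 articles
    "trump": 40,       # estimated from a sample of 20 articles
}

# Express each benchmark theme relative to the Ukraine numbers.
ratios = {}
for theme in ("retraites", "trump"):
    ratios[theme] = {
        "articles": articles_per_day[theme] / articles_per_day["ukraine"],
        "comments": comments_per_article[theme] / comments_per_article["ukraine"],
    }
    print(
        f"{theme}: x{ratios[theme]['articles']:.2f} articles/day, "
        f"x{ratios[theme]['comments']:.2f} comments/article vs. Ukraine"
    )
```

Keep in mind that the benchmark comment figures come from small samples, so these ratios are orders of magnitude rather than precise measurements.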
People's engagement with the war over time, subscribers’ comment activity as a proxy
TL;DR: people's activity stays high, independently of how often Le Monde publishes. Some peaks in activity would need further investigation, but are probably tied to the usual remarkable events (offensives, nukes…). Before the analysis I expected engagement to decrease, even slightly, over time, but it seems it does not. Also, we’re aware that comments as a proxy have biases [4].
[4] Le Monde's 500k subscribers are a particular demographic, and only a handful of them are active authors.
Code-wise, our general workflow revolves around time series group-bys / various aggregations using Polars, then converting the results back to Pandas for a quicker viz via Plotly. We will experiment with diverse metrics/windows to better render subscribers' activity over this first year: daily comment counts, weekly / monthly averages, lines vs. histograms, comments per article (daily / weekly), and a moving (rolling) mean over a 30-day period.
Daily number of comments, weekly, monthly averages
# 1. first thing first, number of comms per day
coms_daily_count = coms.groupby("date").count().sort("date", descending=False)

# 2. average number of comms per week (groupby window, using groupby_dynamic method in Polars)
weekly_avg = coms_daily_count.groupby_dynamic("date", every="1w").agg(
    [pl.col("count").mean()]
)

# 3. same as above but per month
monthly_avg = coms_daily_count.groupby_dynamic("date", every="1mo").agg(
    [pl.col("count").mean()]
)

# from left to right - average number of comments :
# weekly avg (line), weekly avg (bars), monthly avg (line), monthly avg (bar)
┌────────────┬─────────────┐
│ date ┆ count │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪═════════════╡
│ 2022-02-24 ┆ 1520.0 │
│ 2022-03-03 ┆ 1743.666667 │
│ 2022-03-10 ┆ 1363.0 │
└────────────┴─────────────┘
┌────────────┬─────────────┐
│ date ┆ count │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪═════════════╡
│ 2022-02-01 ┆ 2427.333333 │
│ 2022-03-01 ┆ 1431.833333 │
│ 2022-04-01 ┆ 837.4 │
└────────────┴─────────────┘
Code
fig1 = px.line(
    weekly_avg.to_pandas(),
    x="date",
    y="count",
    # width=200,
    height=300,
    template=template,
)
fig2 = px.bar(
    weekly_avg.to_pandas(),
    x="date",
    y="count",
    # width=200,
    height=300,
    template=template,
)
fig3 = px.line(
    monthly_avg.to_pandas(),
    x="date",
    y="count",
    # width=600,
    height=300,
    template=template,
)
fig4 = px.bar(
    monthly_avg.to_pandas(),
    x="date",
    y="count",
    # width=600,
    height=300,
    template=template,
)

fig1.show()
fig2.show()
fig3.show()
fig4.show()
Lowering the impact of article publication frequency: comments / articles ratio, rolling mean
When plotting the weekly / monthly average of comments (above), we clearly distinguish 3 periods of high activity (start of the conflict + 2 others), on top of a sustained, constant reader involvement.
But since the number of comments is probably tied to how many articles Le Monde published over the same period, let's visualize comment activity with some normalization: comments per article (removes the article frequency effect) and a rolling mean (smooths things out).
# daily ratio comms per article. Still using Polars syntax >.<
# 1. group by dates (daily), agg count articles, count comments
daily_coms_per_articles = (
    coms.groupby(by="date")
    .agg(
        [
            pl.col("article_id").n_unique().alias("count_articles"),
            pl.col("comment").count().alias("count_comments"),
        ]
    )
    .sort("date", descending=False)
)

# 2. then calculate coms per articles
daily_coms_per_articles = daily_coms_per_articles.with_columns(
    (pl.col("count_comments") / pl.col("count_articles")).alias("coms_per_article")
)
┌────────────┬────────────────┬────────────────┬──────────────────┐
│ date ┆ count_articles ┆ count_comments ┆ coms_per_article │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ u32 ┆ u32 ┆ f64 │
╞════════════╪════════════════╪════════════════╪══════════════════╡
│ 2022-02-24 ┆ 36 ┆ 3762 ┆ 104.5 │
│ 2022-02-25 ┆ 30 ┆ 2735 ┆ 91.166667 │
│ 2022-02-26 ┆ 7 ┆ 785 ┆ 112.142857 │
└────────────┴────────────────┴────────────────┴──────────────────┘
# weekly ratio coms per article. Polars method is .groupby_dynamic()
weekly_coms_per_articles = (
    coms.sort("date", descending=False)
    .groupby_dynamic("date", every="1w")
    .agg(
        [
            pl.col("article_id").n_unique().alias("count_articles"),
            pl.col("comment").count().alias("count_comments"),
        ]
    )
    .sort("date", descending=False)
)

weekly_coms_per_articles = weekly_coms_per_articles.with_columns(
    (pl.col("count_comments") / pl.col("count_articles")).alias("coms_per_article")
)
┌────────────┬────────────────┬────────────────┬──────────────────┐
│ date ┆ count_articles ┆ count_comments ┆ coms_per_article │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ u32 ┆ u32 ┆ f64 │
╞════════════╪════════════════╪════════════════╪══════════════════╡
│ 2022-02-24 ┆ 78 ┆ 7600 ┆ 97.435897 │
│ 2022-03-03 ┆ 130 ┆ 10462 ┆ 80.476923 │
│ 2022-03-10 ┆ 125 ┆ 9541 ┆ 76.328 │
└────────────┴────────────────┴────────────────┴──────────────────┘
Comment activity stays high throughout the “first” year of the conflict, whatever the article publication rhythm, with even more comments per article at the end of the period than at the very start.
Some context: first two weeks of September: Ukrainian counter-offensive in Kharkiv & Russian mobilization. January 2023: battle tanks?
Code
px.bar(
    weekly_coms_per_articles.to_pandas(),
    x="date",
    y="coms_per_article",
    # width=600,
    height=400,
    template=template,
)
Moving (rolling) mean: another way to (kind of) smooth out article frequency, without the hassle above.
moving_mean = coms_daily_count.with_columns(
    pl.col("count").rolling_mean(window_size=30).alias("moving_mean")
)
┌────────────┬───────┬─────────────┐
│ date ┆ count ┆ moving_mean │
│ --- ┆ --- ┆ --- │
│ date ┆ u32 ┆ f64 │
╞════════════╪═══════╪═════════════╡
│ 2023-02-24 ┆ 2322 ┆ 819.466667 │
│ 2023-02-25 ┆ 899 ┆ 777.233333 │
│ 2023-02-26 ┆ 288 ┆ 775.3 │
└────────────┴───────┴─────────────┘
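One detail worth keeping in mind with a trailing rolling mean is what happens at the start of the series, before a full window is available. The sketch below uses Pandas on made-up counts (not the dataset) with a 3-day window instead of 30. I would expect the Polars call above to leave the first 29 days empty in a similar way, but that is worth checking against your Polars version.

```python
import pandas as pd

# Toy daily counts (made-up numbers, not taken from the dataset).
daily = pd.DataFrame(
    {
        "date": pd.date_range("2022-02-24", periods=6, freq="D"),
        "count": [10, 20, 30, 40, 50, 60],
    }
)

# Trailing 3-day mean: the first (window - 1) rows have no full window, so they are NaN.
daily["moving_mean"] = daily["count"].rolling(window=3).mean()

# min_periods=1 averages whatever is available instead of returning NaN.
daily["moving_mean_partial"] = daily["count"].rolling(window=3, min_periods=1).mean()

print(daily)
```

Which variant to plot is a judgment call: the strict version hides the first weeks, while the partial one shows them but with noisier values.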
Code
px.bar(
    moving_mean.to_pandas(),
    x="date",
    y="moving_mean",
    # width=600,
    height=400,
    template=template,
)
Who are the most active contributors? Hardcore posters vs. the silent crowd
Fun fact: goupil_hardi acts as a true “sentinel” of the comments section, to the point where he manually counts & regularly cites the pro-Russian contributions under the articles. He is the one who made me decide to get the dataset and build this notebook.
One could also do a lot of interesting things on troll detection (if there are any: access to comments is pretty restricted), but we focused our efforts elsewhere.
Contribution shape, as expected, hardcore posters vs. the rest
10 700 authors, with an average of 22 comments a year, but a median of only 4. Two authors with more than 2K comments. See below how the top authors skew the distribution.
authors.describe()
┌────────────┬────────┬───────────┐
│ describe ┆ author ┆ count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞════════════╪════════╪═══════════╡
│ count ┆ 10700 ┆ 10700.0 │
│ null_count ┆ 0 ┆ 0.0 │
│ mean ┆ null ┆ 22.116168 │
│ std ┆ null ┆ 77.995757 │
│ min ┆ ┆ 1.0 │
│ max ┆ ㅤㅤ ┆ 2087.0 │
│ median ┆ null ┆ 4.0 │
│ 25% ┆ null ┆ 1.0 │
│ 75% ┆ null ┆ 12.0 │
└────────────┴────────┴───────────┘
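The cell that builds `authors` is not shown above. A minimal sketch of how such a per-author count could be obtained, assuming it is simply the number of comments per author, written here in Pandas on toy data (the `toy_` names and author labels are hypothetical):

```python
import pandas as pd

# Toy comments table (hypothetical authors, not from the dataset):
# one heavy poster and four occasional ones.
toy_coms = pd.DataFrame(
    {"author": ["a"] * 50 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"]}
)

# One row per author with the number of comments, most active first.
toy_authors = (
    toy_coms.groupby("author")
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False, ignore_index=True)
)

mean_count = toy_authors["count"].mean()      # pulled up by the single heavy poster
median_count = toy_authors["count"].median()  # closer to the typical author

print(toy_authors)
print(f"mean={mean_count:.1f}, median={median_count:.1f}")
```

The gap between mean and median in this toy example follows the same pattern as the real numbers above (mean 22 vs. median 4): a few hardcore posters account for a large share of the comments.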
Code
fig_violin = px.violin(
    authors.to_pandas(),
    y="count",
    box=True,
    points="all",  # add data points
    # width=600,
    height=400,
    template=template,
)
fig_violin.show()
Names that ring a bell
If you had time to spare, you could do some semantic search / analysis on the arguments of each side, e.g. the dissemination of pro-Russia arguments. But here is a simple overview of selected authors & comments.
“Goupil Hardi”, second top poster with 2034 comments in 365 days (more than 5 a day, on Ukraine only). Also note that the comment section is limited to one comment per author per article, + 3 replies-to-comment.
# in Polars, unless I'm doing it wrong, it's harder than with Pandas to extract a col values.
selected_coms = (
    coms.select(["author", "date", "comment"])
    .filter(pl.col("author") == "goupil hardi")
    .sample(n=3, seed=42)
    .get_column("comment")
    .to_list()
)

for i, com in enumerate(selected_coms):
    print(f"({i+1}) {com[0:90]}...")
- les PERLES de DENIS MONOD-BROCA (ci-dessous à 22h39 – heure de Moscou 🦊) 12/3/22 «L’Europ…
- J’attends toujours vos excuses, Benjamin Valberg, pour avoir falsifié mes propos et m’avo…
- Bien sûr qu’on finira par négocier, mais l’objectif poursuivi par tous les ennemis de la…
What about “Lux”, the top poster? I don’t remember seeing his name, but it is a similar profile (with less dedication ;)
- @Eric j’ai surpris plusieurs fois ce citoyen qui falsifiait ses sources soit en inventant …
- Non j’ai rien modifié, les chiffres sont issus de la propagande d’Alexandre Latsa sur son…
- Plf on vous a reconnu. Mais il faut vous rendre à l’évidence…
“Monod-Broca”. Well, to each their own.
- En l’affaire je vois plutôt un David russe face à un Goliath otanien si sûr de sa supre…
- @ kaiwin. Durant l’Antiquité le Kosovo fait partie de l’empire romain puis de l’Empire Se…
- Ce n’est pas un jeu, nous ne l’avons seulement « énervé », nous l’avons délibérément,…
Engagement through cohort analysis: what about the retention rate?
A fancier way to analyze people's engagement, over time, on the topic.
It would be interesting to benchmark this against other topics.
Steps
1. add/get comment month -> month of the comment, for each row
2. add/get cohort month -> first month in which the author posted a comment = cohort creation
3. add/get cohort index for each row -> number of months between the comment month and the cohort month (+ 1)
# clone data to avoid editing our original dataset
cohort = coms.clone()
Reminder of what our original data looks like:
┌──────────┬───────────┬───────────┬──────────┬──────────┬───┬────────────┬────────────┬─────────┬──────────┬──────────┐
│ article_ ┆ url ┆ title ┆ desc ┆ content ┆ … ┆ article_ty ┆ allow_comm ┆ premium ┆ author ┆ comment │
│ id ┆ --- ┆ --- ┆ --- ┆ --- ┆ ┆ pe ┆ ents ┆ --- ┆ --- ┆ --- │
│ --- ┆ str ┆ str ┆ str ┆ str ┆ ┆ --- ┆ --- ┆ bool ┆ str ┆ str │
│ i64 ┆ ┆ ┆ ┆ ┆ ┆ cat ┆ bool ┆ ┆ ┆ │
╞══════════╪═══════════╪═══════════╪══════════╪══════════╪═══╪════════════╪════════════╪═════════╪══════════╪══════════╡
│ 3259703 ┆ https://w ┆ Le ┆ Au ┆ Parce ┆ … ┆ Factuel ┆ true ┆ false ┆ Ricardo ┆ La │
│ ┆ ww.lemond ┆ conflit ┆ Festival ┆ qu’elle ┆ ┆ ┆ ┆ ┆ Uztarroz ┆ question │
│ ┆ e.fr/actu ┆ russo-ukr ┆ de journ ┆ est ┆ ┆ ┆ ┆ ┆ ┆ qui │
│ ┆ alite-med ┆ ainien, ┆ alisme ┆ revenue ┆ ┆ ┆ ┆ ┆ ┆ vaille │
│ ┆ ias/artic ┆ qui ┆ de Couth ┆ frapper ┆ ┆ ┆ ┆ ┆ ┆ et qui │
│ ┆ le/20… ┆ mobilise ┆ ures : ┆ à nos ┆ ┆ ┆ ┆ ┆ ┆ n'est │
│ ┆ ┆ les médi… ┆ la ┆ portes, ┆ ┆ ┆ ┆ ┆ ┆ pas │
│ ┆ ┆ ┆ guerr… ┆ q… ┆ ┆ ┆ ┆ ┆ ┆ posée │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ dan… │
└──────────┴───────────┴───────────┴──────────┴──────────┴───┴────────────┴────────────┴─────────┴──────────┴──────────┘
# We will only use authors, date, number of comments to render our cohort
# Also, switch to Pandas, more familiar with it for the following operations
relevant_columns = ["author", "date", "article_id"]
cohort = cohort.select(relevant_columns)
cohort = cohort.to_pandas()
cohort.head(2)
 | author | date | article_id |
---|---|---|---|
0 | Ricardo Uztarroz | 2022-07-16 | 3259703 |
1 | Ricardo Uztarroz | 2022-07-16 | 3259703 |
Shape to cohort (Pandas)
# 1. comment month
# tip : map is faster than apply, we can use it because we're dealing with one col at a time
cohort["comment_month"] = cohort["date"].map(lambda x: datetime(x.year, x.month, 1))
display(cohort.head(2))

# 2. cohort month
# tip : transform after a groupby returns a df with the same length
# and here returns the min for each entry
cohort["cohort_month"] = cohort.groupby("author")["comment_month"].transform("min")
display(cohort.head(2))
 | author | date | article_id | comment_month |
---|---|---|---|---|
0 | Ricardo Uztarroz | 2022-07-16 | 3259703 | 2022-07-01 |
1 | Ricardo Uztarroz | 2022-07-16 | 3259703 | 2022-07-01 |
 | author | date | article_id | comment_month | cohort_month |
---|---|---|---|---|---|
0 | Ricardo Uztarroz | 2022-07-16 | 3259703 | 2022-07-01 | 2022-02-01 |
1 | Ricardo Uztarroz | 2022-07-16 | 3259703 | 2022-07-01 | 2022-02-01 |
# 3. cohort index : for each row, difference in months
# between the comment month and the cohort month (+ 1, so the first month is index 1)
def get_date(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day


comment_year, comment_month, _ = get_date(cohort, "comment_month")
cohort_year, cohort_month, _ = get_date(cohort, "cohort_month")
year_diff = comment_year - cohort_year
month_diff = comment_month - cohort_month
cohort["cohort_index"] = year_diff * 12 + month_diff + 1
display(cohort.head(4))
 | author | date | article_id | comment_month | cohort_month | cohort_index |
---|---|---|---|---|---|---|
0 | Ricardo Uztarroz | 2022-07-16 | 3259703 | 2022-07-01 | 2022-02-01 | 6 |
1 | Ricardo Uztarroz | 2022-07-16 | 3259703 | 2022-07-01 | 2022-02-01 | 6 |
2 | Correcteur | 2022-07-16 | 3259703 | 2022-07-01 | 2022-02-01 | 6 |
3 | Jean-Doute | 2022-07-16 | 3259703 | 2022-07-01 | 2022-02-01 | 6 |
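The step that turns this table into the `active_authors` pivot used in the next section is not shown. A minimal sketch of how such a pivot could be built, assuming it counts distinct authors per cohort month and cohort index. The toy data and every `toy_` name are hypothetical, and the author's actual cell may differ.

```python
import pandas as pd

# Toy cohort table with the same columns as above (hypothetical values).
toy = pd.DataFrame(
    {
        "author": ["a", "a", "b", "b", "c", "c", "d"],
        "cohort_month": pd.to_datetime(
            [
                "2022-02-01", "2022-02-01", "2022-02-01", "2022-02-01",
                "2022-03-01", "2022-03-01", "2022-03-01",
            ]
        ),
        "cohort_index": [1, 2, 1, 3, 1, 2, 1],
    }
)

# Distinct authors active in each (cohort, months-since-first-comment) cell.
toy_counts = (
    toy.groupby(["cohort_month", "cohort_index"])["author"]
    .nunique()
    .reset_index(name="n_authors")
)

# Cohorts as rows, cohort index as columns; cells without data stay NaN.
toy_active_authors = toy_counts.pivot(
    index="cohort_month", columns="cohort_index", values="n_authors"
)

# Retention in %: each column divided by the cohort size (column 1).
toy_retention = toy_active_authors.div(toy_active_authors[1], axis=0).mul(100).round(2)

print(toy_retention)
```

Counting distinct authors (rather than comments) is what makes the table a retention measure: an author who posts fifty times in a month still counts once for that cell.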
Cohort percentage active users (retention rate, in %)
Code
# Clone previous dataframe (cohort users, count)
active_authors_pct = active_authors.copy(deep=True)

# get %
for col in active_authors_pct.columns[1:]:
    active_authors_pct[col] = round(
        active_authors_pct[col] / active_authors_pct[1] * 100, 2
    )
active_authors_pct[1] = 100

# Generate heatmap (cohort users, %)
labels = {"x": "n months", "y": "cohort (by month)", "color": "% author"}

fig_pct = px.imshow(active_authors_pct, text_auto=True, labels=labels)
fig_pct = fig_pct.update_xaxes(side="top", ticks="outside", tickson="boundaries", ticklen=5)
fig_pct = fig_pct.update_yaxes(showgrid=False)

fig_pct = fig_pct.update_layout(
    {
        "xaxis": {"tickmode": "linear", "showgrid": False},
        # "width": 800,
        "height": 500,
        "plot_bgcolor": "rgba(0, 0, 0, 0)",
        "paper_bgcolor": "rgba(0, 2, 0, 0)",
    }
)
cohort_index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cohort_month | |||||||||||||
2022-02-01 | 100 | 76.79 | 64.65 | 56.32 | 48.12 | 43.41 | 49.45 | 55.04 | 49.18 | 38.83 | 38.51 | 45.38 | 48.08 |
2022-03-01 | 100 | 37.87 | 29.88 | 22.93 | 19.19 | 24.85 | 28.97 | 25.49 | 16.68 | 17.99 | 21.46 | 24.05 | NaN |
2022-04-01 | 100 | 20.39 | 13.95 | 12.88 | 15.45 | 20.60 | 17.70 | 11.70 | 13.09 | 17.38 | 16.09 | NaN | NaN |
2022-05-01 | 100 | 12.06 | 9.58 | 14.99 | 18.24 | 13.91 | 9.74 | 9.27 | 13.29 | 15.15 | NaN | NaN | NaN |
2022-06-01 | 100 | 16.57 | 21.60 | 21.60 | 17.16 | 12.72 | 13.91 | 15.09 | 20.41 | NaN | NaN | NaN | NaN |
2022-07-01 | 100 | 20.97 | 23.22 | 17.23 | 11.24 | 13.11 | 17.98 | 18.35 | NaN | NaN | NaN | NaN | NaN |
2022-08-01 | 100 | 21.05 | 14.17 | 9.11 | 7.69 | 12.35 | 11.94 | NaN | NaN | NaN | NaN | NaN | NaN |
2022-09-01 | 100 | 12.09 | 10.53 | 9.55 | 14.42 | 12.28 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2022-10-01 | 100 | 8.10 | 7.26 | 13.13 | 15.36 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2022-11-01 | 100 | 6.34 | 11.27 | 14.79 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2022-12-01 | 100 | 18.22 | 16.36 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2023-01-01 | 100 | 20.29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2023-02-01 | 100 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Comments embedding & fast retrieval using SBERT, Faiss
Example of use: retrieving similar arguments / checking propaganda in an efficient way, with Faiss.
Semantic search : resources & performance overview of our curated models
Notes and resources I found to be useful
Model choice, must read : symmetric vs. asymmetric semantic search, language, tuning : Sbert is all u need
Models available and metrics: Sbert doc
Misc
Models trained for cosine similarity tend to retrieve shorter documents, whereas models trained for dot product tend to retrieve longer ones.
Faiss uses either the inner product (= dot product; = cosine score if the vectors are normalized, e.g. using faiss.normalize_L2) or L2 to measure distances (more here).
At first, we were not sure of our typical use case: short query => long comment (asymmetric), or comment => similar comment (symmetric).
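To illustrate the point about normalization, here is a small sketch in plain NumPy (standing in for Faiss, with random vectors instead of real embeddings). It checks that the inner product of L2-normalized vectors equals the cosine similarity. The helper `l2_normalize` is mine and only mimics what a normalization step does.

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=8).astype("float32")      # stand-in for a query embedding
docs = rng.normal(size=(5, 8)).astype("float32")  # stand-ins for comment embeddings


def l2_normalize(x):
    """Scale each vector (last axis) to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Cosine similarity computed from its definition.
cosine = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Inner product of the normalized vectors: same numbers as the cosine.
ip_normalized = l2_normalize(docs) @ l2_normalize(query)

# The raw inner product also depends on the vector norms, so rankings can differ.
ip_raw = docs @ query

print(np.allclose(cosine, ip_normalized, atol=1e-5))
```

This is why normalizing (or not) before adding vectors to an inner product index can change which documents come back first when the model does not already produce unit-length embeddings.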
Candidate models we curated & tested :
Example here (pytorch)
Post-run evaluation and remarks:
- mpnet-base-v2, distiluse, quora: fast encoding (20k documents < 1mn), results quite similar between models, each one finds our test query and pertinent results. A very good baseline.
- mpnet-base-v2, distiluse, quora: with a flat inner product Faiss index, no difference whether we normalize the vectors or not, maybe because they’re already optimized for cosine?
- camembert is a bigger model (1024 dimensions), with slower encoding (20k docs = 5mn) and nice (better?) results (spoiler alert: it is optimized for French). With a flat IP index, if normalize = False, it retrieves similar, short documents. If we normalize our embeddings, it retrieves our initial query + longer, similar documents.
Speeding up our model evaluation through prior random sampling, notes on speed
- To speed our experiments up, we will work with a sample of comments (around 10%: 236k -> 20k).
- FYI, embedding all 236k comments takes approx. 10mn on a 1080ti, i7700k, 32gb RAM, with the curated models; x1.5 to x2 when using the biggest model (Camembert).
- Encoding on our sample (10k) is < 1mn.
- No detailed measure of inference speed, but it is very fast with Faiss. We might want to try different (optimized) index types with a bigger dataset.
Convenience functions to repeat our experiments with different indexes / models
Load model, encode comments
Create (or load) our Faiss index, here a flat index
Let’s find similar comments
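The code for these last steps is not reproduced here. As a rough illustration of what a flat inner product index does conceptually (an exhaustive comparison of the query against every stored vector), here is a NumPy sketch. Random vectors stand in for comment embeddings, and `top_k_inner_product` is a hypothetical helper, not a Faiss or Sbert function.

```python
import numpy as np


def top_k_inner_product(index_vectors, query_vectors, k):
    """Exhaustive search: return (scores, ids) of the k best inner products per query."""
    scores = query_vectors @ index_vectors.T   # shape (n_queries, n_docs)
    ids = np.argsort(-scores, axis=1)[:, :k]   # highest score first
    return np.take_along_axis(scores, ids, axis=1), ids


# Random vectors standing in for comment embeddings (no real model involved).
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(1000, 64)).astype("float32")
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit norm -> cosine

# Query with an already indexed vector: it should come back as its own best match.
query = embeddings[123:124]
scores, ids = top_k_inner_product(embeddings, query, k=3)

print(ids[0], scores[0])
```

With real data, the rows of `embeddings` would be the encoded comments and the returned ids would be used to look the comment texts back up in the dataframe.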
Maybe later just for fun : 0 shot “tone” classification tests using OpenAI API