Le Monde, 1 y. of comments - Ukraine War

As a (partial) proxy to measure people engagement

Author

Inside

As a reader of Le Monde —and the comments section ;) I would regularly encounter familiar subscribers’ names. One in particular –more on that later, would manually keep track, count & cite “pro-russian” contributors, directly in the comments. That triggered my need to collect and perform some analysis in a bit more data-science oriented fashion.

After our initial data collection¹ we use Polars as an alternative to Pandas (still used in some parts when we had to code faster) to perform aggregations and Plotly to visualize².

¹ Custom API, dataset & scope on my other project

² This article itself = ipynb (source notebook) -> qmd -> html , via Quarto

The analysis focuses on comments/authors (big numbers, activity over time, cohort analysis…) rather than on articles & titles. To this end, we also lay the foundations to go deeper in the semantic analysis through semantic search on comments via Sbert embedding + a Faiss index.

import polars as pl
import pandas as pd
import numpy as np
from datetime import datetime, date
import pickle

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

from sentence_transformers import SentenceTransformer
import faiss

Some Polars / Plotly config. to better render our data.

Code

# Polars, render text columns nicer when printing / displaying df
pl.Config.set_fmt_str_lengths(50)
pl.Config.set_tbl_cols(10)
pl.Config.set_tbl_width_chars(120)
pl.Config.set_tbl_rows(10)
pl.Config.set_tbl_hide_dataframe_shape(True) # prevents systematic display of df/table shape

# change default plotly express theme
import plotly.io as pio

print(f" reminder plotly quick templates : {pio.templates}")
template = "simple_white"

Load data

236k comments collected, with associated articles & titles | 24th feb 2022 - 24 feb 2023
Reminder : conflict starts Febr the 24th 2022, if we exclude the prior Dombass “events”.
Load our .parquet dataset³ using Polars.

³ Used keywords, scope and limitations of our dataset

# Read parquet using Polars. Could also use scan + collect syntax for lazy execution
# If interested, I did some speed benchmark in the dataset project (lmd_ukr).
filepath = "data/lmd_ukraine.parquet"
coms = pl.read_parquet(filepath)

shape: (236643, 12)
┌──────────┬───────────┬───────────┬──────────┬──────────┬───┬────────────┬────────────┬─────────┬──────────┬──────────┐
│ article_ ┆ url       ┆ title     ┆ desc     ┆ content  ┆ … ┆ article_ty ┆ allow_comm ┆ premium ┆ author   ┆ comment  │
│ id       ┆ ---       ┆ ---       ┆ ---      ┆ ---      ┆   ┆ pe         ┆ ents       ┆ ---     ┆ ---      ┆ ---      │
│ ---      ┆ str       ┆ str       ┆ str      ┆ str      ┆   ┆ ---        ┆ ---        ┆ bool    ┆ str      ┆ str      │
│ i64      ┆           ┆           ┆          ┆          ┆   ┆ cat        ┆ bool       ┆         ┆          ┆          │
╞══════════╪═══════════╪═══════════╪══════════╪══════════╪═══╪════════════╪════════════╪═════════╪══════════╪══════════╡
│ 3259703  ┆ https://w ┆ Le        ┆ Au       ┆ Parce    ┆ … ┆ Factuel    ┆ true       ┆ false   ┆ Ricardo  ┆ La       │
│          ┆ ww.lemond ┆ conflit   ┆ Festival ┆ qu’elle  ┆   ┆            ┆            ┆         ┆ Uztarroz ┆ question │
│          ┆ e.fr/actu ┆ russo-ukr ┆ de journ ┆ est      ┆   ┆            ┆            ┆         ┆          ┆ qui      │
│          ┆ alite-med ┆ ainien,   ┆ alisme   ┆ revenue  ┆   ┆            ┆            ┆         ┆          ┆ vaille   │
│          ┆ ias/artic ┆ qui       ┆ de Couth ┆ frapper  ┆   ┆            ┆            ┆         ┆          ┆ et qui   │
│          ┆ le/20…    ┆ mobilise  ┆ ures :   ┆ à nos    ┆   ┆            ┆            ┆         ┆          ┆ n'est    │
│          ┆           ┆ les médi… ┆ la       ┆ portes,  ┆   ┆            ┆            ┆         ┆          ┆ pas      │
│          ┆           ┆           ┆ guerr…   ┆ q…       ┆   ┆            ┆            ┆         ┆          ┆ posée    │
│          ┆           ┆           ┆          ┆          ┆   ┆            ┆            ┆         ┆          ┆ dan…     │
└──────────┴───────────┴───────────┴──────────┴──────────┴───┴────────────┴────────────┴─────────┴──────────┴──────────┘

Order of magnitude

One always likes to know the “how big”. For instance, Le Monde regularly get asked how many questions people would send during live sessions etc. We do not have those numbers, but articles / coms counts are still nice to have. Curious of which metrics (for comments) are available to Le Monde behind the scene. Probably a lot.

Unique articles, comments count, unique authors

236k comments from 10 500 unique subscribers, under 2600 articles.

# Polars methods are quite similar to Pandas except for the slicing.

# number of comments; could also simply use .shape
count = coms.select([pl.col("comment").count()]).to_series()[0]

# number of unique articles
nunique_articles = coms.select("article_id").n_unique()

# n unique comments' authors
nunique_authors = coms.select("author").n_unique()

print(f"Number of comments: {count},\nUnique articles: {nunique_articles},\nUnique authors: {nunique_authors}")

Number of comments: 236643,
Unique articles: 2590,
Unique authors: 10700

After some googling : Le Monde has +-450k online subscribers (+-540k total). The comment section is open to subscribers only.
10 500 unique authors means that around 2,3% of the reader base have engaged in the comment section during the year, on this topic. Not surprising but not bad either, purely by rule of thumb – a bench. would be interesting.

Editorial & comment activity, elements of comparison

Our dataset excludes the “Lives” that represent a substancial coverage effort from Le Monde. But just by reading the newspaper (or any, really) we know that they have been mobilizing a lot of resources. Also, regularly lurking into comments section, I know the topic is rather engaging. Now we have the accurate numbers, at least.

Code

# Activity avg posts or comments per day
days = 365
total_articles = 2590
total_comments = 236643

# articles, (excluding lives/blogs posts) per day
print(f"Theme, Ukraine conflict:")
print(f" - avg articles per day: {total_articles/days:.2f}")

# comments per day
print(f" - avg comments per day: {total_comments/days:.2f}")

# avg n comments per article
print(f" - avg comments per article: {total_comments/total_articles:.2f}")

Theme, Ukraine conflict:
 - avg articles per day: 7.10
 - avg comments per day: 648.34
 - avg comments per article: 91.37

Imagine publishing 7 articles a day on a topic, for one year. To put some perspective on our data, I performed a side & quick additional scraping on two additional topics. Articles count is exhaustive (on a given 1 month period) whereas comments activity was sampled from a random selection of articles for each topic to advance quickly.

“Réforme retraites” : very hot topic during with “hottest” (demonstrations, strikes) month coverage happening in the same time span as the conflict.
“Trump” : an always hot/engaging topic in any media, though a bit out of fashion nowadays.

Code

# Collected benchmark data
themes = ["réforme retraites", "Donald trump"]
n_articles = [
    374,
    66,
]  # obtained on 1 month data (jan/febr 2023, exhaustive/no sampling)
n_days = 31
from_sample_avg_comments_per_articles = [
    124,
    40,
]  # obtained from a sample of 20 articles for each theme.

for idx, theme in enumerate(themes):
    print(f"Theme, {theme}:")
    print(f" - avg articles per day: {n_articles[idx]/n_days:.2f}")
    print(
        f" - avg comments per article: {from_sample_avg_comments_per_articles[idx]:.2f}\n"
    )

Theme, réforme retraites:
 - avg articles per day: 12.06
 - avg comments per article: 124.00

Theme, Donald trump:
 - avg articles per day: 2.13
 - avg comments per article: 40.00

Ukraine coverage have been a continuous, long term effort, by Le Monde, with high interest from the public. Whereas I selected the most active month as a benchmark for Retraites, it has a very similar coverage/engagement to Ukraine where numbers are on a one year period. Re. Trump, which is not the least engaging topic, subscribers engagement level is more than twice as low than Ukraine.

Misc : editorial share (type of articles)

Excluding Lives. Didn’t want to focus on editorial coverage, but here the share of articles types.
I believe that “factuel” is the AFP news feed, but it needs confirmation.

# our dataset has one row per comments
# we're interested in share of article types only
# Polars groupby + first() to keep article type value, then count()
article_types = (
    coms.groupby(by="article_id").first().select(["article_id", "article_type"])
)
editorial_share = article_types.groupby(by="article_type").count()

┌────────────┬──────────────┐
│ article_id ┆ article_type │
│ ---        ┆ ---          │
│ i64        ┆ cat          │
╞════════════╪══════════════╡
│ 3272736    ┆ Factuel      │
│ 3266176    ┆ Reportage    │
│ 3308424    ┆ Décryptages  │
│ 3263664    ┆ Factuel      │
│ 3265152    ┆ Reportage    │
│ …          ┆ …            │
│ 3270535    ┆ Tribune      │
│ 3280223    ┆ Factuel      │
│ 3307215    ┆ Factuel      │
│ 3270279    ┆ Reportage    │
│ 3279383    ┆ Entretien    │
└────────────┴──────────────┘
┌────────────────┬───────┐
│ article_type   ┆ count │
│ ---            ┆ ---   │
│ cat            ┆ u32   │
╞════════════════╪═══════╡
│ Factuel        ┆ 940   │
│ Décryptages    ┆ 310   │
│ Reportage      ┆ 303   │
│ Tribune        ┆ 225   │
│ Récit          ┆ 115   │
│ …              ┆ …     │
│ Lettre de…     ┆ 3     │
│ Série          ┆ 1     │
│ Brève de liens ┆ 1     │
│ Nécrologie     ┆ 1     │
│ Archive        ┆ 1     │
└────────────────┴───────┘

# Filter out least represented categories to lighten our pie chart viz
editorial_share = editorial_share.filter(pl.col("count") >= 40) 

labels = editorial_share.to_pandas()["article_type"]
values = editorial_share.to_pandas()["count"]
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_traces(
    hoverinfo="label+percent+name",
    textinfo="value",
    textfont_size=12,
    hole=0.3,
    marker=dict(line=dict(color="#000000", width=1)),
)
fig.update_layout(
    height=400,
    #width=600
)
# title_text="Ukraine coverage (Febr 2022 - Febr 2023), "articles types", 
fig.show()

Figure 1: Ukraine coverage, articles types (total: 2590 articles)

People engagement on war over time, subscribers’ comment activity as a proxy

TLDR, people activity keeps being high independently of Le Monde articles frequency. Some peaks in activity would need further investigations but prob. tied to the usual remarkable events (offensives, nukes…). Before our analysis I would think that engagement would decrease, even slightly, over time but it seems not. Also, we’re aware that comments as a proxy has biases⁴.

⁴ Le Monde 500k subscribers is a particular demographic, and only a handful of them are active authors.

Code wise, our general workflow revolves around time series groupby / various aggregations using Polars, then convert back the results to Pandas for a quicker viz via Plotly. We will experiment with diverse metrics/windows to better render subscribers activity over this first year : coms daily count, week / month avg, lines vs. hist, coms per article : daily / weekly, moving (rolling) mean on a 30 days period.

Daily number of comments, weekly, monthly averages

# 1. first thing first, number of comms per day
coms_daily_count = coms.groupby("date").count().sort("date", descending=False)


# 2. average number of comms per week (groupby window, using groupby_dynamic method in Polars)
weekly_avg = coms_daily_count.groupby_dynamic("date", every="1w").agg(
    [pl.col("count").mean()]
)

# 3.same as above but per month
monthly_avg = coms_daily_count.groupby_dynamic("date", every="1mo").agg(
    [pl.col("count").mean()]
)
# from left to right - average number of comments :
# weekly avg (line), weekly avg (bars), monthly avg (line), monthly avg (bar)

┌────────────┬─────────────┐
│ date       ┆ count       │
│ ---        ┆ ---         │
│ date       ┆ f64         │
╞════════════╪═════════════╡
│ 2022-02-24 ┆ 1520.0      │
│ 2022-03-03 ┆ 1743.666667 │
│ 2022-03-10 ┆ 1363.0      │
└────────────┴─────────────┘
┌────────────┬─────────────┐
│ date       ┆ count       │
│ ---        ┆ ---         │
│ date       ┆ f64         │
╞════════════╪═════════════╡
│ 2022-02-01 ┆ 2427.333333 │
│ 2022-03-01 ┆ 1431.833333 │
│ 2022-04-01 ┆ 837.4       │
└────────────┴─────────────┘

Code

fig1 = px.line(
    weekly_avg.to_pandas(),
    x="date",
    y="count",
    #width=200,
    height=300,
    template=template,
)
fig2 = px.bar(
    weekly_avg.to_pandas(),
    x="date",
    y="count",
    #width=200,
    height=300,
    template=template,
)
fig3 = px.line(
    monthly_avg.to_pandas(),
    x="date",
    y="count",
    #width=600,
    height=300,
    template=template,
)
fig4 = px.bar(
    monthly_avg.to_pandas(),
    x="date",
    y="count",
    #width=600,
    height=300,
    template=template,
)

fig1.show()
fig2.show()
fig3.show()
fig4.show()

Lower the impact of articles publication freq : ratio comm / articles, rolling mean

When plotting the weekly / monthly avg of comments (above), we clearly distinguish 3 periods of high activity (start of conflict + 2 others), still with a sustained, constant readers involvement.
But due to the number of comments prob. being tied to how many articles Le Monde published in the same time, lets visualize comments activity with normalization : coms per articles (removes articles frequency effect) and rolling mean (smoothen things out).

# daily ratio comms per article. Still using Polars syntax >.<

# 1. group by dates (daily), agg count articles, count comments
daily_coms_per_articles = (
    coms.groupby(by="date")
    .agg(
        [
            pl.col("article_id").n_unique().alias("count_articles"),
            pl.col("comment").count().alias("count_comments"),
        ]
    )
    .sort("date", descending=False)
)

# 2. then calculate coms per articles
daily_coms_per_articles = daily_coms_per_articles.with_columns(
    (pl.col("count_comments") / pl.col("count_articles")).alias("coms_per_article")
)

┌────────────┬────────────────┬────────────────┬──────────────────┐
│ date       ┆ count_articles ┆ count_comments ┆ coms_per_article │
│ ---        ┆ ---            ┆ ---            ┆ ---              │
│ date       ┆ u32            ┆ u32            ┆ f64              │
╞════════════╪════════════════╪════════════════╪══════════════════╡
│ 2022-02-24 ┆ 36             ┆ 3762           ┆ 104.5            │
│ 2022-02-25 ┆ 30             ┆ 2735           ┆ 91.166667        │
│ 2022-02-26 ┆ 7              ┆ 785            ┆ 112.142857       │
└────────────┴────────────────┴────────────────┴──────────────────┘

# weekly ratio coms per article. Polars method is .groupby_dynamic()

weekly_coms_per_articles = (
    coms.sort("date", descending=False)
    .groupby_dynamic("date", every="1w")
    .agg(
        [
            pl.col("article_id").n_unique().alias("count_articles"),
            pl.col("comment").count().alias("count_comments"),
        ]
    )
    .sort("date", descending=False)
)

weekly_coms_per_articles = weekly_coms_per_articles.with_columns(
    (pl.col("count_comments") / pl.col("count_articles")).alias("coms_per_article")
)

┌────────────┬────────────────┬────────────────┬──────────────────┐
│ date       ┆ count_articles ┆ count_comments ┆ coms_per_article │
│ ---        ┆ ---            ┆ ---            ┆ ---              │
│ date       ┆ u32            ┆ u32            ┆ f64              │
╞════════════╪════════════════╪════════════════╪══════════════════╡
│ 2022-02-24 ┆ 78             ┆ 7600           ┆ 97.435897        │
│ 2022-03-03 ┆ 130            ┆ 10462          ┆ 80.476923        │
│ 2022-03-10 ┆ 125            ┆ 9541           ┆ 76.328           │
└────────────┴────────────────┴────────────────┴──────────────────┘

Comments activity keeps being high throughout the “first” year of conflict whatever the articles publication rhythm, with even a bigger number of comments per articles in the end period than in the very start.
Some context : first two weeks of September : Ukraine counter-offensive in Karkhiv & Russian mobilization. January 2023 : battle tanks ?

Code

px.bar(
    weekly_coms_per_articles.to_pandas(),
    x="date",
    y="coms_per_article",
    #width=600,
    height=400,
    template=template,
)

Figure 2: Weekly comments per article

Moving (rolling) mean, another way to –kind of, smoothen out articles frequency, without the hassle above.

moving_mean = coms_daily_count.with_columns(
    pl.col("count").rolling_mean(window_size=30).alias("moving_mean")
)

┌────────────┬───────┬─────────────┐
│ date       ┆ count ┆ moving_mean │
│ ---        ┆ ---   ┆ ---         │
│ date       ┆ u32   ┆ f64         │
╞════════════╪═══════╪═════════════╡
│ 2023-02-24 ┆ 2322  ┆ 819.466667  │
│ 2023-02-25 ┆ 899   ┆ 777.233333  │
│ 2023-02-26 ┆ 288   ┆ 775.3       │
└────────────┴───────┴─────────────┘

Code

px.bar(
    moving_mean.to_pandas(),
    x="date",
    y="moving_mean",
    #width=600,
    height=400,
    template=template,
)

Figure 3: Rolling mean (windows_size=30) of daily count of comments

Who are the most active contributors ? Hardcore posters vs. the silent crowd

Fun fact: goupil_hardi acts as a true “sentinel” of the comments section, to a point where he manually counts & regularly cite the pro russian contributions under the articles. He is the one that made me decide to get the dataset and build this notebook.

Could also do a lot of interesting stuff on trolls detection (if any, access to comments is pretty restricted) but we focused our efforts elsewhere.

Top authors ; glad everyone is using a pseudonym ;)

# top commentators
authors = coms.groupby("author").count().sort("count", descending=True)

┌───────────────────┬───────┐
│ author            ┆ count │
│ ---               ┆ ---   │
│ str               ┆ u32   │
╞═══════════════════╪═══════╡
│ Lux               ┆ 2087  │
│ goupil hardi      ┆ 2034  │
│ Kaiwin            ┆ 1592  │
│ Bandera           ┆ 1572  │
│ Alexandre Kastals ┆ 1544  │
└───────────────────┴───────┘

Contribution shape, as expected, hardcore posters vs. the rest

10 700 authors, average of 20 comments a year but the median is 4 coms only. Two authors with more than 2K comments. See how the top authors skew the distribution below.

authors.describe()

┌────────────┬────────┬───────────┐
│ describe   ┆ author ┆ count     │
│ ---        ┆ ---    ┆ ---       │
│ str        ┆ str    ┆ f64       │
╞════════════╪════════╪═══════════╡
│ count      ┆ 10700  ┆ 10700.0   │
│ null_count ┆ 0      ┆ 0.0       │
│ mean       ┆ null   ┆ 22.116168 │
│ std        ┆ null   ┆ 77.995757 │
│ min        ┆        ┆ 1.0       │
│ max        ┆ ㅤㅤ   ┆ 2087.0    │
│ median     ┆ null   ┆ 4.0       │
│ 25%        ┆ null   ┆ 1.0       │
│ 75%        ┆ null   ┆ 12.0      │
└────────────┴────────┴───────────┘

Code

fig_violin = px.violin(
    authors.to_pandas(),
    y="count",
    box=True,
    points="all", # add data points
    #width=600,
    height=400,
    template=template,
)
fig_violin.show()

Figure 4: violin plot of authors, with all data points on the left side

Names that ring a bell

If you had time to spare , you could do some semantic search / analysis on the arguments of each side. E.g the dissemination of pro-Russia arguments. But here a simple overview of selected authors & comments.

“Goupil Hardi”, second top poster with 2034 comments in 365 days (5 a days, on Ukraine only). Also not that the comment section is limited to one comment per author, per article + 3 replies-to-comment.

# in Polars, unless I'm doing it wrong, it's harder than with Pandas to extract a col values.
selected_coms = coms.select(["author", "date", "comment"]).filter(
    pl.col("author") == "goupil hardi").sample(n=3, seed=42).get_column("comment").to_list()

for i, com in enumerate(selected_coms):
    print(f"({i+1}) {com[0:90]}...")

les PERLES de DENIS MONOD-BROCA (ci-dessous à 22h39 – heure de Moscou 🦊) 12/3/22 «L’Europ…
J’attends toujours vos excuses, Benjamin Valberg, pour avoir falsifié mes propos et m’avo…
Bien sûr qu’on finira par négocier, mais l’objectif poursuivi par tous les ennemis de la…

What about “Lux”, the top poster ? Don’t remember seeing his name, but a similar profile –with less dedication ;)

@Eric j’ai surpris plusieurs fois ce citoyen qui falsifiait ses sources soit en inventant …
Non j’ai rien modifié, les chiffres sont issus de la propagande d’Alexandre Latsa sur son…
Plf on vous a reconnu. Mais il faut vous rendre à l’évidence…

“Monod-Broca”. Well, to each their own.

En l’affaire je vois plutôt un David russe face à un Goliath otanien si sûr de sa supre…
@ kaiwin. Durant l’Antiquité le Kosovo fait partie de l’empire romain puis de l’Empire Se…
Ce n’est pas un jeu, nous ne l’avons seulement « énervé », nous l’avons délibérément,…

Engagement through cohort analysis, what’s about the retention rate ?

A fancier way to analyze people engagement, over time, on the topic.
Would be interesting to perform some benchmark on other topics.

Steps
1. add/get comment month -> month of the comment for each author
2. add/get cohort month (first month that user posted a comment) 
    -> first month the authors commented = cohort creation
3. add/get cohort index for each row

# clone data to avoid recursive edit of our dataset
cohort = coms.clone()

Reminder on how our original data looks like :

┌──────────┬───────────┬───────────┬──────────┬──────────┬───┬────────────┬────────────┬─────────┬──────────┬──────────┐
│ article_ ┆ url       ┆ title     ┆ desc     ┆ content  ┆ … ┆ article_ty ┆ allow_comm ┆ premium ┆ author   ┆ comment  │
│ id       ┆ ---       ┆ ---       ┆ ---      ┆ ---      ┆   ┆ pe         ┆ ents       ┆ ---     ┆ ---      ┆ ---      │
│ ---      ┆ str       ┆ str       ┆ str      ┆ str      ┆   ┆ ---        ┆ ---        ┆ bool    ┆ str      ┆ str      │
│ i64      ┆           ┆           ┆          ┆          ┆   ┆ cat        ┆ bool       ┆         ┆          ┆          │
╞══════════╪═══════════╪═══════════╪══════════╪══════════╪═══╪════════════╪════════════╪═════════╪══════════╪══════════╡
│ 3259703  ┆ https://w ┆ Le        ┆ Au       ┆ Parce    ┆ … ┆ Factuel    ┆ true       ┆ false   ┆ Ricardo  ┆ La       │
│          ┆ ww.lemond ┆ conflit   ┆ Festival ┆ qu’elle  ┆   ┆            ┆            ┆         ┆ Uztarroz ┆ question │
│          ┆ e.fr/actu ┆ russo-ukr ┆ de journ ┆ est      ┆   ┆            ┆            ┆         ┆          ┆ qui      │
│          ┆ alite-med ┆ ainien,   ┆ alisme   ┆ revenue  ┆   ┆            ┆            ┆         ┆          ┆ vaille   │
│          ┆ ias/artic ┆ qui       ┆ de Couth ┆ frapper  ┆   ┆            ┆            ┆         ┆          ┆ et qui   │
│          ┆ le/20…    ┆ mobilise  ┆ ures :   ┆ à nos    ┆   ┆            ┆            ┆         ┆          ┆ n'est    │
│          ┆           ┆ les médi… ┆ la       ┆ portes,  ┆   ┆            ┆            ┆         ┆          ┆ pas      │
│          ┆           ┆           ┆ guerr…   ┆ q…       ┆   ┆            ┆            ┆         ┆          ┆ posée    │
│          ┆           ┆           ┆          ┆          ┆   ┆            ┆            ┆         ┆          ┆ dan…     │
└──────────┴───────────┴───────────┴──────────┴──────────┴───┴────────────┴────────────┴─────────┴──────────┴──────────┘

# We will only use authors, date, number of comments to render our cohort
# Also, switch to Pandas, more familiar with it for the following operations
relevant_columns = ["author", "date", "article_id"]
cohort = cohort.select(relevant_columns)
cohort = cohort.to_pandas()
cohort.head(2)

	author	date	article_id
0	Ricardo Uztarroz	2022-07-16	3259703
1	Ricardo Uztarroz	2022-07-16	3259703

Shape to cohort (Pandas)

# 1. comment month
# tip : map faster than apply, we can use it cause we're dealin with one col at a time
cohort["comment_month"] = cohort["date"].map(lambda x: datetime(x.year, x.month, 1))
display(cohort.head(2))

# 2. cohort month
# tip : transform after a groupby,return a df with the same length
# and here return the min for each entry
cohort["cohort_month"] = cohort.groupby("author")["comment_month"].transform("min")
display(cohort.head(2))

	author	date	article_id	comment_month
0	Ricardo Uztarroz	2022-07-16	3259703	2022-07-01
1	Ricardo Uztarroz	2022-07-16	3259703	2022-07-01

	author	date	article_id	comment_month	cohort_month
0	Ricardo Uztarroz	2022-07-16	3259703	2022-07-01	2022-02-01
1	Ricardo Uztarroz	2022-07-16	3259703	2022-07-01	2022-02-01

# 3. cohort index : for each row, difference in months,
# between first comment month and cohort month
def get_date(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day


comment_year, comment_month, _ = get_date(cohort, "comment_month")
cohort_year, cohort_month, _ = get_date(cohort, "cohort_month")
year_diff = comment_year - cohort_year
month_diff = comment_month - cohort_month
cohort["cohort_index"] = year_diff * 12 + month_diff + 1
display(cohort.head(4))

	author	date	article_id	comment_month	cohort_month	cohort_index
0	Ricardo Uztarroz	2022-07-16	3259703	2022-07-01	2022-02-01	6
1	Ricardo Uztarroz	2022-07-16	3259703	2022-07-01	2022-02-01	6
2	Correcteur	2022-07-16	3259703	2022-07-01	2022-02-01	6
3	Jean-Doute	2022-07-16	3259703	2022-07-01	2022-02-01	6

Cohort active users (retention rate of authors)

# final shaping groupby cohort_month * cohort_index, count (unique) authors
# cohort active users (active authors / retention rate)
active_authors = (
    cohort.groupby(["cohort_month", "cohort_index"])["author"]
    .apply(pd.Series.nunique)
    .reset_index()
)
active_authors = active_authors.pivot_table(
    index="cohort_month", columns="cohort_index", values="author"
)

# generate cohort with Plotly, as a heatmap
fig = px.imshow(
    active_authors,
    text_auto=True,
    #width=1000,
    height=500
    )

Figure 5: From left to right : our first cohort of authors (febr.) counts 2184 users. After 1 month 1677 are still active and after 12 months (first line last cell), 1050 are still commenting. Sept. cohort (row 8) : 513 new unique authors of which 63 only are active after 5 months

cohort_index	1	2	3	4	5	6	7	8	9	10	11	12	13
cohort_month
2022-02-01	2184.0	1677.0	1412.0	1230.0	1051.0	948.0	1080.0	1202.0	1074.0	848.0	841.0	991.0	1050.0
2022-03-01	3742.0	1417.0	1118.0	858.0	718.0	930.0	1084.0	954.0	624.0	673.0	803.0	900.0	NaN
2022-04-01	932.0	190.0	130.0	120.0	144.0	192.0	165.0	109.0	122.0	162.0	150.0	NaN	NaN
2022-05-01	647.0	78.0	62.0	97.0	118.0	90.0	63.0	60.0	86.0	98.0	NaN	NaN	NaN
2022-06-01	338.0	56.0	73.0	73.0	58.0	43.0	47.0	51.0	69.0	NaN	NaN	NaN	NaN
2022-07-01	267.0	56.0	62.0	46.0	30.0	35.0	48.0	49.0	NaN	NaN	NaN	NaN	NaN
2022-08-01	494.0	104.0	70.0	45.0	38.0	61.0	59.0	NaN	NaN	NaN	NaN	NaN	NaN
2022-09-01	513.0	62.0	54.0	49.0	74.0	63.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2022-10-01	358.0	29.0	26.0	47.0	55.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2022-11-01	284.0	18.0	32.0	42.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2022-12-01	214.0	39.0	35.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2023-01-01	350.0	71.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2023-02-01	377.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Cohort percentage active users (retention rate, in %)

Code

# Clone previous dataframe (cohort users, count)
active_authors_pct = active_authors.copy(deep=True)

# get %
active_authors_pct = active_authors.copy(deep=True)
for col in active_authors_pct.columns[1:]:
    active_authors_pct[col] = round(
        active_authors_pct[col] / active_authors_pct[1] * 100, 2
    )
active_authors_pct[1] = 100

# Generate heatmap (cohort users, %)
labels = {"x": "n months", "y": "cohort (by month)", "color": "% author"}

fig_pct = px.imshow(active_authors_pct, text_auto=True, labels=labels)
fig_pct= fig_pct.update_xaxes(side="top", ticks="outside", tickson="boundaries", ticklen=5)
fig_pct = fig_pct.update_yaxes(showgrid=False)

fig_pct = fig_pct.update_layout(
    {
        "xaxis": {"tickmode": "linear", "showgrid": False},
        #"width": 800,
        "height": 500,
        "plot_bgcolor": "rgba(0, 0, 0, 0)",
        "paper_bgcolor": "rgba(0, 2, 0, 0)",
    }
)

Figure 6: Globally the retention rate is lower in later cohorts (authors that start posted later this year)

cohort_index	1	2	3	4	5	6	7	8	9	10	11	12	13
cohort_month
2022-02-01	100	76.79	64.65	56.32	48.12	43.41	49.45	55.04	49.18	38.83	38.51	45.38	48.08
2022-03-01	100	37.87	29.88	22.93	19.19	24.85	28.97	25.49	16.68	17.99	21.46	24.05	NaN
2022-04-01	100	20.39	13.95	12.88	15.45	20.60	17.70	11.70	13.09	17.38	16.09	NaN	NaN
2022-05-01	100	12.06	9.58	14.99	18.24	13.91	9.74	9.27	13.29	15.15	NaN	NaN	NaN
2022-06-01	100	16.57	21.60	21.60	17.16	12.72	13.91	15.09	20.41	NaN	NaN	NaN	NaN
2022-07-01	100	20.97	23.22	17.23	11.24	13.11	17.98	18.35	NaN	NaN	NaN	NaN	NaN
2022-08-01	100	21.05	14.17	9.11	7.69	12.35	11.94	NaN	NaN	NaN	NaN	NaN	NaN
2022-09-01	100	12.09	10.53	9.55	14.42	12.28	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2022-10-01	100	8.10	7.26	13.13	15.36	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2022-11-01	100	6.34	11.27	14.79	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2022-12-01	100	18.22	16.36	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2023-01-01	100	20.29	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2023-02-01	100	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Comments embedding & fast retrieval using SBERT, Faiss

Example of use : if we wanted to retrieve similar arguments / check propaganda –in an efficient way, with Faiss.

Semantic search : resources & performance overview of our curated models

Notes and resources I found to be useful
Model choice, must read : symmetric vs. asymmetric semantic search, language, tuning : Sbert is all u need
Models available and metrics: Sbert doc
Misc
Models trained for cosine prefer shorter document retrieval, vs. dot product (longer)
Faiss uses inner product (=dot product ; += cosine score if vectors are normalized e.g using faiss.normalize_L2 or L2 to measure distances (more here).
In the first place, we were not sure of our typical use case : short query => long comment (asymmetric), or comment => similar comment (symmetric).

Candidate models we curated & tested :

Curated models
Models	Quick notes
paraphrase-multilingual-mpnet-base-v2	multi languages, suitable score : optimized for cosine, max seq len 128
distiluse-base-multilingual-cased-v1	symmetric, multi lang., max seq len 128, optimized for cosine
quora-distilbert-multilingual	multilanguages, short text (questions) closest to our symm use case? Example here (pytorch)
dangvantuan/sentence-camembert-large	Bigger model, symmetric ?, french, optimized l2 + tbc others ? Size embed 1024

Post run evaluation and remarks :
mpnet-base-v2, distiluse, quora : fast encoding (20k documents < 1mn), results quite similar between models, each one finds our test query and pertinent results. A very good baseline.
mpnet-base-v2, distiluse, quora : with a flat inner product faiss index, no difference if we perform vectors normalization or not, maybe because they’re optimized for cosine already?
camembert is a bigger model (1024 dimension), slower encoding (20k docs = 5mn), nice (better?) results (spoiler alert : it is optimized for French). With a flat IP index, if normalize = False, retrieve similar, short documents. If we normalize our embeddings, it retrieves our initial query + longer, similar documents.

Fasten our models evaluation through prior rdm sampling, notes on speed.

To speed our experiments up, we will work with a sample of comments (around 10% : 236k -> 20k)
FYI,embedding of all comments (all 236k), takes approx 10mn on a 1080ti, i7700k, 32gb RAM, with curated models ; *1.5 to *2 when using the biggest model (Camembert).
Encoding on our sample (10k) is < 1mn.
No detailed measure on inference speed, but very fast with Faiss. Might want to try different –optimized, indexes typeswith a bigger dataset.

# Remember, our dataset was loaded as a Polars dataframe,
# sample method is very similar in Pandas though.
coms_sample = coms.sample(seed=42, n=20000, shuffle=True)
# We're removing articles content and some other cols we won't work with
keep_cols = [
    "article_id",
    "url",
    "title",
    "desc",
    "date",
    "keywords",
    "author",
    "comment",
]
coms_sample = coms_sample.select(pl.col(keep_cols))

print(coms_sample.shape)

(20000, 8)

┌────────────┬───────────────┬───────────────┬───────────────┬────────────┬───────────────┬─────────────┬──────────────┐
│ article_id ┆ url           ┆ title         ┆ desc          ┆ date       ┆ keywords      ┆ author      ┆ comment      │
│ ---        ┆ ---           ┆ ---           ┆ ---           ┆ ---        ┆ ---           ┆ ---         ┆ ---          │
│ i64        ┆ str           ┆ str           ┆ str           ┆ date       ┆ list[str]     ┆ str         ┆ str          │
╞════════════╪═══════════════╪═══════════════╪═══════════════╪════════════╪═══════════════╪═════════════╪══════════════╡
│ 3265305    ┆ https://www.l ┆ Trois         ┆ « Poutine,    ┆ 2022-03-17 ┆ ["idees", "id ┆ Strasgorod  ┆ Nos          │
│            ┆ emonde.fr/ide ┆ semaines      ┆ l’agresseur   ┆            ┆ ees-chronique ┆             ┆ désormais    │
│            ┆ es/article/20 ┆ après le      ┆ de l’Ukraine, ┆            ┆ s", … "crise- ┆             ┆ dissidents   │
│            ┆ 22/03/17/po…  ┆ début de      ┆ n’est pas le  ┆            ┆ ukrainienne…  ┆             ┆ russes (édit │
│            ┆               ┆ l’agression   ┆ …             ┆            ┆               ┆             ┆ orialistes … │
│            ┆               ┆ rus…          ┆               ┆            ┆               ┆             ┆              │
│ 3263263    ┆ https://www.l ┆ L’UE a décidé ┆ Avec les      ┆ 2022-03-04 ┆ ["internation ┆ Anthemius   ┆ A Ryu7:      │
│            ┆ emonde.fr/int ┆ d’appliquer   ┆ réfugiés      ┆            ┆ al",          ┆             ┆ l’Europe,    │
│            ┆ ernational/ar ┆ pour la       ┆ ukrainiens,   ┆            ┆ "europe", …   ┆             ┆ notamment    │
│            ┆ ticle/2022/…  ┆ première      ┆ les Européens ┆            ┆ "crise-ukrain ┆             ┆ l’Allemagne, │
│            ┆               ┆ fois…         ┆ ret…          ┆            ┆ ienne"]       ┆             ┆ a accueil…   │
└────────────┴───────────────┴───────────────┴───────────────┴────────────┴───────────────┴─────────────┴──────────────┘

# Quick cleaning, typically remove comments with 3 emojis only
# Just filter out small comments
coms_sample = coms_sample.filter(pl.col("comment").str.n_chars() >= 45)

# Finally, convert back to Pandas, for a "better" (re use of code;) workflow with Sbert and FAISS
coms_sample = coms_sample.to_pandas()

print(coms_sample.shape)

(19014, 8)

Convenience functions to repeat our experiments with different indexes / models

""" comments embedding """

def comments_to_list(df, column: str) -> list[str]:
    """Extract documents from dataframe"""
    return df[column].values.tolist()


def load_model(model_name: str):
    """Convenience fonction to load SBERT model"""
    return SentenceTransformer(model_name)


def encode_comments(model, comments):
    """Encode comments using previously loaded model"""
    return model.encode(comments, show_progress_bar=True)

""" create (a flat) Faiss index """

def create_faiss_index(embeddings, normalize: bool, index_type: str):
    """
    Create a flat index in Faiss of index_type "IP" or "L2"
    Index_types and prior vectors normalization varies
    according model output optimization and task.
    """
    dimension = embeddings.shape[1]
    embeddings = np.array(embeddings).astype("float32")
    if normalize:
        faiss.normalize_L2(embeddings)
    if index_type == "ip":
        index = faiss.IndexFlatIP(dimension)
        index.add(embeddings)
    else:
        index = faiss.IndexFlatL2(dimension)
        index.add(embeddings)
    return index


def save_index(index, filename: str):
    """Optional, save index to disk"""
    faiss.write_index(index, f"{filename}.index")


def load_index(filename):
    """Optional, load index from disk"""
    return faiss.read_index(filename)

""" query index """

def search_index(index, model, query:str, normalize: bool, top_k: int):
    # encode query
    vector = model.encode([query])
    if normalize:
        faiss.normalize_L2(vector)

    # search with Faiss
    Distances, Indexes = index.search(vector, top_k)
    # Distances, Indexes = index.search(np.array(vector).astype("float32"), top_k)
    return Distances, Indexes


def index_to_comments(df, column:str, Indexes):
    """Convenience function to retrieve top K comments
    from our original Dataframe
    """
    return df.iloc[Indexes[0]][column].tolist()

Load model, encode comments

# load comments (a list), pick our candidate model, load it
comments = comments_to_list(coms_sample, "comment")
model_name = "paraphrase-multilingual-mpnet-base-v2"
model = load_model(model_name)
normalize = False

# Encode comments. See notes above for elements of performance / speed
embeddings = encode_comments(model, comments)

Create (or load) our Faiss index, here a flat index

# create Faiss Index, here Flat Innner Product 
# (exhaustive search, "no" optimization)
index = create_faiss_index(embeddings, normalize, index_type="ip")

# optional : save Faiss index to disk
filename = "mpnet"
save_index(index, filename)

# Optional, load from disk the previously saved Faiss index
# so we do not rerun embeddings everytime we're executing the notebook
# we found mpnet (multilang) to be a very good baseline for our dataset.
filename = "mpnet.index"
index = load_index(filename)

Let’s find similar comments

# extract an existing comment (= will be our input query) from dataset
print(coms_sample["comment"].tolist()[1300])

Quelle arrogance et quel cynisme. Qu'y a t-il de plus terroriste que la Russie d'aujourd'hui?

# encode query, query index, retrieve top_k --here 8, nearest comments
query = "Quelle arrogance et quel cynisme. Qu'y a t-il de plus terroriste que la Russie d'aujourd'hui?"
top_k = 8
Distances, Indexes = search_index(index, model, query, normalize, top_k)

# display top similar comments
results = index_to_comments(coms_sample, "comment", Indexes)
for i, result in enumerate(results):
    print(f"{i+1}| : {result}")

1| : Quelle arrogance et quel cynisme. Qu'y a t-il de plus terroriste que la Russie d'aujourd'hui?
2| : Quand la Russie, cet Etat terroriste va-t-il être reconnu comme tel par la communauté internationale?
3| : Il ne s'agit pas de Poutine. Il s'agit de la Russie.
4| : Oui, ce que les russes ont fait à Hiroshima, Nagasaki, Bagdad, Vietnam... une honte.
5| : Vous avez raison : les russes sont nos ennemis. Depuis des dizaines d'années.
6| : Cet article confirme que la russie est devenue un territoire terroriste. Elle utilise toutes les outrances pour justifier ses actes, fait taire sa population, arme des régiments de trolls(cf contributions sur cet article) et envoie sa population à la boucherie sans états d’âmes. Même catégorie que l’Afghanistan et la Corée du Nord.
7| : Vous êtes naif si vous croyez que seuls les Russes ont ce genre de comportement en tant de guerre...vous devez être de ceux qui croient en la guerre "propre" que les Occidentaux prétendre faire depuis 40 ans (parfois avec les Russes comme alliés d'ailleurs).
8| : Les Russes sont des colonisateurs dangereux. Demandez aux pays de l'Europe de l'est, aux pays baltes, aux pays d'Asie centrale, à la Syrie, à l'Afghanistan, au Mali, au Soudan, à la Centre Afrique. Faut se protéger contre les nazis russes.

Maybe later just for fun : 0 shot “tone” classification tests using OpenAI API