Le Monde, 1 year of comments - Ukraine War

As a (partial) proxy to measure reader engagement

Inside

As a reader of Le Monde (and of its comments section ;) I would regularly encounter familiar subscribers’ names. One in particular (more on that later) would manually keep track of, count and cite “pro-Russian” contributors, directly in the comments. That triggered my urge to collect the data and run some analysis in a more data-science-oriented fashion.

After our initial data collection1, we use Polars as an alternative to Pandas (still used in places where it was quicker to write) to perform aggregations, and Plotly to visualize2.

  • 1 Custom API, dataset & scope on my other project

  • 2 This article itself = ipynb (source notebook) -> qmd -> html, via Quarto

  • The analysis focuses on comments/authors (headline numbers, activity over time, cohort analysis…) rather than on articles & titles. We also lay the foundations for deeper semantic analysis, with semantic search on comments via SBERT embeddings + a Faiss index.

    import polars as pl
    import pandas as pd
    import numpy as np
    from datetime import datetime, date
    import pickle
    
    import matplotlib.pyplot as plt
    import plotly.express as px
    import plotly.graph_objects as go
    
    from sentence_transformers import SentenceTransformer
    import faiss

    Some Polars / Plotly config. to better render our data.

    Code
    # Polars, render text columns nicer when printing / displaying df
    pl.Config.set_fmt_str_lengths(50)
    pl.Config.set_tbl_cols(10)
    pl.Config.set_tbl_width_chars(120)
    pl.Config.set_tbl_rows(10)
    pl.Config.set_tbl_hide_dataframe_shape(True) # prevents systematic display of df/table shape
    
    # change default plotly express theme
    import plotly.io as pio
    
    print(f" reminder plotly quick templates : {pio.templates}")
    template = "simple_white"

    Load data

    236k comments collected, with associated articles & titles | 24 Feb 2022 - 24 Feb 2023
    Reminder : the full-scale conflict started on February 24th, 2022, if we exclude the prior Donbass “events”.
    Load our .parquet dataset3 using Polars.

  • 3 Used keywords, scope and limitations of our dataset

    # Read parquet using Polars. Could also use scan + collect syntax for lazy execution
    # If interested, I did some speed benchmark in the dataset project (lmd_ukr).
    filepath = "data/lmd_ukraine.parquet"
    coms = pl.read_parquet(filepath)
    shape: (236643, 12)
    ┌──────────┬───────────┬───────────┬──────────┬──────────┬───┬────────────┬────────────┬─────────┬──────────┬──────────┐
    │ article_ ┆ url       ┆ title     ┆ desc     ┆ content  ┆ … ┆ article_ty ┆ allow_comm ┆ premium ┆ author   ┆ comment  │
    │ id       ┆ ---       ┆ ---       ┆ ---      ┆ ---      ┆   ┆ pe         ┆ ents       ┆ ---     ┆ ---      ┆ ---      │
    │ ---      ┆ str       ┆ str       ┆ str      ┆ str      ┆   ┆ ---        ┆ ---        ┆ bool    ┆ str      ┆ str      │
    │ i64      ┆           ┆           ┆          ┆          ┆   ┆ cat        ┆ bool       ┆         ┆          ┆          │
    ╞══════════╪═══════════╪═══════════╪══════════╪══════════╪═══╪════════════╪════════════╪═════════╪══════════╪══════════╡
    │ 3259703  ┆ https://w ┆ Le        ┆ Au       ┆ Parce    ┆ … ┆ Factuel    ┆ true       ┆ false   ┆ Ricardo  ┆ La       │
    │          ┆ ww.lemond ┆ conflit   ┆ Festival ┆ qu’elle  ┆   ┆            ┆            ┆         ┆ Uztarroz ┆ question │
    │          ┆ e.fr/actu ┆ russo-ukr ┆ de journ ┆ est      ┆   ┆            ┆            ┆         ┆          ┆ qui      │
    │          ┆ alite-med ┆ ainien,   ┆ alisme   ┆ revenue  ┆   ┆            ┆            ┆         ┆          ┆ vaille   │
    │          ┆ ias/artic ┆ qui       ┆ de Couth ┆ frapper  ┆   ┆            ┆            ┆         ┆          ┆ et qui   │
    │          ┆ le/20…    ┆ mobilise  ┆ ures :   ┆ à nos    ┆   ┆            ┆            ┆         ┆          ┆ n'est    │
    │          ┆           ┆ les médi… ┆ la       ┆ portes,  ┆   ┆            ┆            ┆         ┆          ┆ pas      │
    │          ┆           ┆           ┆ guerr…   ┆ q…       ┆   ┆            ┆            ┆         ┆          ┆ posée    │
    │          ┆           ┆           ┆          ┆          ┆   ┆            ┆            ┆         ┆          ┆ dan…     │
    └──────────┴───────────┴───────────┴──────────┴──────────┴───┴────────────┴────────────┴─────────┴──────────┴──────────┘
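
    As hinted in the comment above, the same load could be expressed lazily with Polars’ scan + collect syntax. A minimal sketch (the column projection is just an illustration; timings live in the dataset project, not here):

    # Lazy variant : scan_parquet builds a query plan,
    # collect() only materializes what the plan actually needs
    lazy_coms = (
        pl.scan_parquet(filepath)
        .select(["article_id", "author", "comment", "date"])  # example projection
        .collect()
    )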

    Order of magnitude

    One always likes to know “how big”. For instance, Le Monde regularly gets asked how many questions readers send in during its live sessions. We do not have those numbers, but article / comment counts are still nice to have. I am curious which comment metrics are available to Le Monde behind the scenes. Probably a lot.

    Unique articles, comments count, unique authors

    236k comments from 10 700 unique subscribers, under 2 600 articles.

    # Polars methods are quite similar to Pandas except for the slicing.
    
    # number of comments; could also simply use .shape
    count = coms.select([pl.col("comment").count()]).to_series()[0]
    
    # number of unique articles
    nunique_articles = coms.select("article_id").n_unique()
    
    # n unique comments' authors
    nunique_authors = coms.select("author").n_unique()
    
    print(f"Number of comments: {count},\nUnique articles: {nunique_articles},\nUnique authors: {nunique_authors}")
    Number of comments: 236643,
    Unique articles: 2590,
    Unique authors: 10700

    After some googling : Le Monde has around 450k online subscribers (roughly 540k in total). The comment section is open to subscribers only.
    10 700 unique authors means that around 2.4% of the online reader base engaged in the comment section during the year, on this topic. Not surprising, but not bad either, purely by rule of thumb; a benchmark would be interesting.
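
    For transparency, the back-of-the-envelope calculation behind that percentage (the subscriber count is the rough figure found online, not an official number):

    # Rough engagement share : unique commenting authors vs. online subscriber base
    online_subscribers = 450_000  # approximate, from public reporting
    print(f"Share of subscribers who commented at least once: {nunique_authors / online_subscribers:.1%}")
    # ~2.4% with 10 700 unique authors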

    Editorial & comment activity, elements of comparison

    Our dataset excludes the “Lives”, which represent a substantial coverage effort from Le Monde. But just by reading the newspaper (or any, really) we know that they have been mobilizing a lot of resources. Also, from regularly lurking in the comments section, I know the topic is rather engaging. Now we have accurate numbers, at least.

    Code
    # Activity avg posts or comments per day
    days = 365
    total_articles = 2590
    total_comments = 236643
    
    # articles, (excluding lives/blogs posts) per day
    print(f"Theme, Ukraine conflict:")
    print(f" - avg articles per day: {total_articles/days:.2f}")
    
    # comments per day
    print(f" - avg comments per day: {total_comments/days:.2f}")
    
    # avg n comments per article
    print(f" - avg comments per article: {total_comments/total_articles:.2f}")
    Theme, Ukraine conflict:
     - avg articles per day: 7.10
     - avg comments per day: 648.34
     - avg comments per article: 91.37

    Imagine publishing 7 articles a day on a single topic, for one year. To put our data in perspective, I did a quick side scraping of two additional topics. The article counts are exhaustive (over a given one-month period), whereas comment activity was sampled from a random selection of articles for each topic, to move quickly.

    • “Réforme retraites” : a very hot topic, whose “hottest” month of coverage (demonstrations, strikes) fell within the same time span as the conflict.
    • “Trump” : an always hot/engaging topic in any media, though a bit out of fashion nowadays.
    Code
    # Collected benchmark data
    themes = ["réforme retraites", "Donald trump"]
    n_articles = [
        374,
        66,
    ]  # obtained on 1 month data (jan/febr 2023, exhaustive/no sampling)
    n_days = 31
    from_sample_avg_comments_per_articles = [
        124,
        40,
    ]  # obtained from a sample of 20 articles for each theme.
    
    for idx, theme in enumerate(themes):
        print(f"Theme, {theme}:")
        print(f" - avg articles per day: {n_articles[idx]/n_days:.2f}")
        print(
            f" - avg comments per article: {from_sample_avg_comments_per_articles[idx]:.2f}\n"
        )
    Theme, réforme retraites:
     - avg articles per day: 12.06
     - avg comments per article: 124.00
    
    Theme, Donald trump:
     - avg articles per day: 2.13
     - avg comments per article: 40.00
    

    Ukraine coverage has been a continuous, long-term effort by Le Monde, with high interest from the public. Even though I selected the most active month as the benchmark for Retraites, its coverage/engagement is very similar to Ukraine, whose numbers span a whole year. As for Trump, hardly the least engaging topic, subscriber engagement is less than half that of Ukraine.
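
    The “less than half” claim comes straight from the per-article averages computed above:

    # Engagement ratio : Ukraine vs. the sampled Trump benchmark
    ukraine_avg = total_comments / total_articles           # ~91.4 comments per article
    trump_avg = from_sample_avg_comments_per_articles[1]    # 40, from the 20-article sample
    print(f"Ukraine vs. Trump comments per article: {ukraine_avg / trump_avg:.1f}x")  # ~2.3x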

    Misc : editorial share (type of articles)

    Excluding Lives. I didn’t want to focus on editorial coverage, but here is the share of article types.
    I believe that “Factuel” is the AFP news feed, but that needs confirmation.

    # our dataset has one row per comments
    # we're interested in share of article types only
    # Polars groupby + first() to keep article type value, then count()
    article_types = (
        coms.groupby(by="article_id").first().select(["article_id", "article_type"])
    )
    editorial_share = article_types.groupby(by="article_type").count()
    ┌────────────┬──────────────┐
    │ article_id ┆ article_type │
    │ ---        ┆ ---          │
    │ i64        ┆ cat          │
    ╞════════════╪══════════════╡
    │ 3272736    ┆ Factuel      │
    │ 3266176    ┆ Reportage    │
    │ 3308424    ┆ Décryptages  │
    │ 3263664    ┆ Factuel      │
    │ 3265152    ┆ Reportage    │
    │ …          ┆ …            │
    │ 3270535    ┆ Tribune      │
    │ 3280223    ┆ Factuel      │
    │ 3307215    ┆ Factuel      │
    │ 3270279    ┆ Reportage    │
    │ 3279383    ┆ Entretien    │
    └────────────┴──────────────┘
    ┌────────────────┬───────┐
    │ article_type   ┆ count │
    │ ---            ┆ ---   │
    │ cat            ┆ u32   │
    ╞════════════════╪═══════╡
    │ Factuel        ┆ 940   │
    │ Décryptages    ┆ 310   │
    │ Reportage      ┆ 303   │
    │ Tribune        ┆ 225   │
    │ Récit          ┆ 115   │
    │ …              ┆ …     │
    │ Lettre de…     ┆ 3     │
    │ Série          ┆ 1     │
    │ Brève de liens ┆ 1     │
    │ Nécrologie     ┆ 1     │
    │ Archive        ┆ 1     │
    └────────────────┴───────┘
    # Filter out least represented categories to lighten our pie chart viz
    editorial_share = editorial_share.filter(pl.col("count") >= 40) 
    
    labels = editorial_share.to_pandas()["article_type"]
    values = editorial_share.to_pandas()["count"]
    fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
    fig.update_traces(
        hoverinfo="label+percent+name",
        textinfo="value",
        textfont_size=12,
        hole=0.3,
        marker=dict(line=dict(color="#000000", width=1)),
    )
    fig.update_layout(
        height=400,
        #width=600
    )
    # title_text="Ukraine coverage (Feb 2022 - Feb 2023), articles types"
    fig.show()
    Figure 1: Ukraine coverage, articles types (total: 2590 articles)

    People engagement with the war over time, subscribers’ comment activity as a proxy

    TLDR : comment activity stays high independently of Le Monde’s article frequency. Some peaks in activity would need further investigation, but they are probably tied to the usual notable events (offensives, nuclear threats…). Before this analysis I expected engagement to decrease, even slightly, over time, but it seems not. Also, we are aware that comments as a proxy have biases4.

  • 4 Le Monde’s ~500k subscribers are a particular demographic, and only a handful of them are active comment authors.

  • Code-wise, our general workflow revolves around time-series groupbys / various aggregations using Polars, then converting the results back to Pandas for quicker visualization via Plotly. We will experiment with different metrics/windows to better render subscriber activity over this first year : daily comment count, weekly / monthly averages, lines vs. bars, comments per article (daily / weekly), and a moving (rolling) mean over a 30-day window.

    Daily number of comments, weekly, monthly averages

    # 1. first thing first, number of comms per day
    coms_daily_count = coms.groupby("date").count().sort("date", descending=False)
    
    
    # 2. average number of comms per week (groupby window, using groupby_dynamic method in Polars)
    weekly_avg = coms_daily_count.groupby_dynamic("date", every="1w").agg(
        [pl.col("count").mean()]
    )
    
    # 3.same as above but per month
    monthly_avg = coms_daily_count.groupby_dynamic("date", every="1mo").agg(
        [pl.col("count").mean()]
    )
    # from left to right - average number of comments :
    # weekly avg (line), weekly avg (bars), monthly avg (line), monthly avg (bar)
    ┌────────────┬─────────────┐
    │ date       ┆ count       │
    │ ---        ┆ ---         │
    │ date       ┆ f64         │
    ╞════════════╪═════════════╡
    │ 2022-02-24 ┆ 1520.0      │
    │ 2022-03-03 ┆ 1743.666667 │
    │ 2022-03-10 ┆ 1363.0      │
    └────────────┴─────────────┘
    ┌────────────┬─────────────┐
    │ date       ┆ count       │
    │ ---        ┆ ---         │
    │ date       ┆ f64         │
    ╞════════════╪═════════════╡
    │ 2022-02-01 ┆ 2427.333333 │
    │ 2022-03-01 ┆ 1431.833333 │
    │ 2022-04-01 ┆ 837.4       │
    └────────────┴─────────────┘
    Code
    fig1 = px.line(
        weekly_avg.to_pandas(),
        x="date",
        y="count",
        #width=200,
        height=300,
        template=template,
    )
    fig2 = px.bar(
        weekly_avg.to_pandas(),
        x="date",
        y="count",
        #width=200,
        height=300,
        template=template,
    )
    fig3 = px.line(
        monthly_avg.to_pandas(),
        x="date",
        y="count",
        #width=600,
        height=300,
        template=template,
    )
    fig4 = px.bar(
        monthly_avg.to_pandas(),
        x="date",
        y="count",
        #width=600,
        height=300,
        template=template,
    )
    
    fig1.show()
    fig2.show()
    fig3.show()
    fig4.show()

    Lowering the impact of article publication frequency : comments-per-article ratio, rolling mean

    When plotting the weekly / monthly average of comments (above), we clearly distinguish 3 periods of high activity (the start of the conflict + 2 others), with otherwise sustained, constant reader involvement.
    But since the number of comments is probably tied to how many articles Le Monde published over the same period, let’s visualize comment activity with some normalization : comments per article (removes the article-frequency effect) and a rolling mean (smooths things out).

    # daily ratio comms per article. Still using Polars syntax >.<
    
    # 1. group by dates (daily), agg count articles, count comments
    daily_coms_per_articles = (
        coms.groupby(by="date")
        .agg(
            [
                pl.col("article_id").n_unique().alias("count_articles"),
                pl.col("comment").count().alias("count_comments"),
            ]
        )
        .sort("date", descending=False)
    )
    
    # 2. then calculate coms per articles
    daily_coms_per_articles = daily_coms_per_articles.with_columns(
        (pl.col("count_comments") / pl.col("count_articles")).alias("coms_per_article")
    )
    ┌────────────┬────────────────┬────────────────┬──────────────────┐
    │ date       ┆ count_articles ┆ count_comments ┆ coms_per_article │
    │ ---        ┆ ---            ┆ ---            ┆ ---              │
    │ date       ┆ u32            ┆ u32            ┆ f64              │
    ╞════════════╪════════════════╪════════════════╪══════════════════╡
    │ 2022-02-24 ┆ 36             ┆ 3762           ┆ 104.5            │
    │ 2022-02-25 ┆ 30             ┆ 2735           ┆ 91.166667        │
    │ 2022-02-26 ┆ 7              ┆ 785            ┆ 112.142857       │
    └────────────┴────────────────┴────────────────┴──────────────────┘
    # weekly ratio coms per article. Polars method is .groupby_dynamic()
    
    weekly_coms_per_articles = (
        coms.sort("date", descending=False)
        .groupby_dynamic("date", every="1w")
        .agg(
            [
                pl.col("article_id").n_unique().alias("count_articles"),
                pl.col("comment").count().alias("count_comments"),
            ]
        )
        .sort("date", descending=False)
    )
    
    weekly_coms_per_articles = weekly_coms_per_articles.with_columns(
        (pl.col("count_comments") / pl.col("count_articles")).alias("coms_per_article")
    )
    ┌────────────┬────────────────┬────────────────┬──────────────────┐
    │ date       ┆ count_articles ┆ count_comments ┆ coms_per_article │
    │ ---        ┆ ---            ┆ ---            ┆ ---              │
    │ date       ┆ u32            ┆ u32            ┆ f64              │
    ╞════════════╪════════════════╪════════════════╪══════════════════╡
    │ 2022-02-24 ┆ 78             ┆ 7600           ┆ 97.435897        │
    │ 2022-03-03 ┆ 130            ┆ 10462          ┆ 80.476923        │
    │ 2022-03-10 ┆ 125            ┆ 9541           ┆ 76.328           │
    └────────────┴────────────────┴────────────────┴──────────────────┘

    Comment activity stays high throughout the “first” year of the conflict, whatever the article publication rhythm, with even more comments per article at the end of the period than at the very start.
    Some context : first two weeks of September : Ukrainian counter-offensive in Kharkiv & Russian mobilization. January 2023 : battle tanks ?

    Code
    px.bar(
        weekly_coms_per_articles.to_pandas(),
        x="date",
        y="coms_per_article",
        #width=600,
        height=400,
        template=template,
    )
    Figure 2: Weekly comments per article

    A moving (rolling) mean is another way to (kind of) smooth out article frequency, without the hassle above.

    moving_mean = coms_daily_count.with_columns(
        pl.col("count").rolling_mean(window_size=30).alias("moving_mean")
    )
    ┌────────────┬───────┬─────────────┐
    │ date       ┆ count ┆ moving_mean │
    │ ---        ┆ ---   ┆ ---         │
    │ date       ┆ u32   ┆ f64         │
    ╞════════════╪═══════╪═════════════╡
    │ 2023-02-24 ┆ 2322  ┆ 819.466667  │
    │ 2023-02-25 ┆ 899   ┆ 777.233333  │
    │ 2023-02-26 ┆ 288   ┆ 775.3       │
    └────────────┴───────┴─────────────┘
    Code
    px.bar(
        moving_mean.to_pandas(),
        x="date",
        y="moving_mean",
        #width=600,
        height=400,
        template=template,
    )
    Figure 3: Rolling mean (window_size=30) of the daily count of comments

    Who are the most active contributors ? Hardcore posters vs. the silent crowd

    Fun fact: goupil hardi acts as a true “sentinel” of the comments section, to the point of manually counting & regularly citing the pro-Russian contributions under the articles. He is the one who made me decide to collect the dataset and build this notebook.

    One could also do a lot of interesting work on troll detection (if there are any; access to comments is pretty restricted), but we focused our efforts elsewhere.

    Top authors ; glad everyone is using a pseudonym ;)

    # top commentators
    authors = coms.groupby("author").count().sort("count", descending=True)
    ┌───────────────────┬───────┐
    │ author            ┆ count │
    │ ---               ┆ ---   │
    │ str               ┆ u32   │
    ╞═══════════════════╪═══════╡
    │ Lux               ┆ 2087  │
    │ goupil hardi      ┆ 2034  │
    │ Kaiwin            ┆ 1592  │
    │ Bandera           ┆ 1572  │
    │ Alexandre Kastals ┆ 1544  │
    └───────────────────┴───────┘

    Contribution shape, as expected, hardcore posters vs. the rest

    10 700 authors, an average of around 22 comments over the year, but a median of only 4 comments. Two authors posted more than 2k comments. See how the top authors skew the distribution below.

    authors.describe()
    ┌────────────┬────────┬───────────┐
    │ describe   ┆ author ┆ count     │
    │ ---        ┆ ---    ┆ ---       │
    │ str        ┆ str    ┆ f64       │
    ╞════════════╪════════╪═══════════╡
    │ count      ┆ 10700  ┆ 10700.0   │
    │ null_count ┆ 0      ┆ 0.0       │
    │ mean       ┆ null   ┆ 22.116168 │
    │ std        ┆ null   ┆ 77.995757 │
    │ min        ┆        ┆ 1.0       │
    │ max        ┆ ㅤㅤ   ┆ 2087.0    │
    │ median     ┆ null   ┆ 4.0       │
    │ 25%        ┆ null   ┆ 1.0       │
    │ 75%        ┆ null   ┆ 12.0      │
    └────────────┴────────┴───────────┘
    Code
    fig_violin = px.violin(
        authors.to_pandas(),
        y="count",
        box=True,
        points="all", # add data points
        #width=600,
        height=400,
        template=template,
    )
    fig_violin.show()
    Figure 4: violin plot of authors, with all data points on the left side
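
    As a complement to the violin plot, a quick (hypothetical, not part of the original figures) way to quantify that skew is the share of all comments written by the most active authors:

    # Share of total comments produced by the top 1% most active authors
    # (authors is already sorted by comment count, descending)
    top_n = int(authors.height * 0.01)  # ~107 authors
    top_count = authors.head(top_n).get_column("count").sum()
    print(f"Top 1% of authors wrote {top_count / total_comments:.1%} of all comments")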

    Names that ring a bell

    If you had time to spare, you could do some semantic search / analysis on each side’s arguments, e.g. the dissemination of pro-Russia talking points. But here is a simple overview of selected authors & comments.

    “Goupil Hardi”, second top poster with 2034 comments in 365 days (more than 5 a day, on Ukraine alone). Also note that the comment section is limited to one comment per author per article, plus 3 replies to comments.
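
    A quick, hypothetical sanity check of that limit against our dataset (the stated rule of 1 comment + 3 replies would cap this at 4):

    # Max number of comments a single author left under a single article
    per_article_author = coms.groupby(["article_id", "author"]).count()
    print(per_article_author.get_column("count").max())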

    # in Polars, unless I'm doing it wrong, it's harder than with Pandas to extract a col values.
    selected_coms = coms.select(["author", "date", "comment"]).filter(
        pl.col("author") == "goupil hardi").sample(n=3, seed=42).get_column("comment").to_list()
    
    for i, com in enumerate(selected_coms):
        print(f"({i+1}) {com[0:90]}...")
    1. les PERLES de DENIS MONOD-BROCA (ci-dessous à 22h39 – heure de Moscou 🦊) 12/3/22 «L’Europ…
    2. J’attends toujours vos excuses, Benjamin Valberg, pour avoir falsifié mes propos et m’avo…
    3. Bien sûr qu’on finira par négocier, mais l’objectif poursuivi par tous les ennemis de la…

    What about “Lux”, the top poster ? I don’t remember seeing the name, but it is a similar profile, with less dedication ;)

    1. @Eric j’ai surpris plusieurs fois ce citoyen qui falsifiait ses sources soit en inventant …
    2. Non j’ai rien modifié, les chiffres sont issus de la propagande d’Alexandre Latsa sur son…
    3. Plf on vous a reconnu. Mais il faut vous rendre à l’évidence…

    “Monod-Broca”. Well, to each their own.

    1. En l’affaire je vois plutôt un David russe face à un Goliath otanien si sûr de sa supre…
    2. @ kaiwin. Durant l’Antiquité le Kosovo fait partie de l’empire romain puis de l’Empire Se…
    3. Ce n’est pas un jeu, nous ne l’avons seulement « énervé », nous l’avons délibérément,…

    Engagement through cohort analysis : what about the retention rate ?

    A fancier way to analyze reader engagement with the topic over time.
    It would be interesting to benchmark this against other topics.

    Steps
    1. add/get the comment month -> the month of each comment, per author
    2. add/get the cohort month (the first month an author posted a comment)
        -> the first month an author commented = cohort creation
    3. add/get the cohort index for each row
    # clone data to avoid modifying our original dataset in place
    cohort = coms.clone()

    A reminder of what our original data looks like :

    ┌──────────┬───────────┬───────────┬──────────┬──────────┬───┬────────────┬────────────┬─────────┬──────────┬──────────┐
    │ article_ ┆ url       ┆ title     ┆ desc     ┆ content  ┆ … ┆ article_ty ┆ allow_comm ┆ premium ┆ author   ┆ comment  │
    │ id       ┆ ---       ┆ ---       ┆ ---      ┆ ---      ┆   ┆ pe         ┆ ents       ┆ ---     ┆ ---      ┆ ---      │
    │ ---      ┆ str       ┆ str       ┆ str      ┆ str      ┆   ┆ ---        ┆ ---        ┆ bool    ┆ str      ┆ str      │
    │ i64      ┆           ┆           ┆          ┆          ┆   ┆ cat        ┆ bool       ┆         ┆          ┆          │
    ╞══════════╪═══════════╪═══════════╪══════════╪══════════╪═══╪════════════╪════════════╪═════════╪══════════╪══════════╡
    │ 3259703  ┆ https://w ┆ Le        ┆ Au       ┆ Parce    ┆ … ┆ Factuel    ┆ true       ┆ false   ┆ Ricardo  ┆ La       │
    │          ┆ ww.lemond ┆ conflit   ┆ Festival ┆ qu’elle  ┆   ┆            ┆            ┆         ┆ Uztarroz ┆ question │
    │          ┆ e.fr/actu ┆ russo-ukr ┆ de journ ┆ est      ┆   ┆            ┆            ┆         ┆          ┆ qui      │
    │          ┆ alite-med ┆ ainien,   ┆ alisme   ┆ revenue  ┆   ┆            ┆            ┆         ┆          ┆ vaille   │
    │          ┆ ias/artic ┆ qui       ┆ de Couth ┆ frapper  ┆   ┆            ┆            ┆         ┆          ┆ et qui   │
    │          ┆ le/20…    ┆ mobilise  ┆ ures :   ┆ à nos    ┆   ┆            ┆            ┆         ┆          ┆ n'est    │
    │          ┆           ┆ les médi… ┆ la       ┆ portes,  ┆   ┆            ┆            ┆         ┆          ┆ pas      │
    │          ┆           ┆           ┆ guerr…   ┆ q…       ┆   ┆            ┆            ┆         ┆          ┆ posée    │
    │          ┆           ┆           ┆          ┆          ┆   ┆            ┆            ┆         ┆          ┆ dan…     │
    └──────────┴───────────┴───────────┴──────────┴──────────┴───┴────────────┴────────────┴─────────┴──────────┴──────────┘
    # We will only use authors, date, number of comments to render our cohort
    # Also, switch to Pandas, more familiar with it for the following operations
    relevant_columns = ["author", "date", "article_id"]
    cohort = cohort.select(relevant_columns)
    cohort = cohort.to_pandas()
    cohort.head(2)
    author date article_id
    0 Ricardo Uztarroz 2022-07-16 3259703
    1 Ricardo Uztarroz 2022-07-16 3259703

    Shape to cohort (Pandas)

    # 1. comment month
    # tip : map is faster than apply; we can use it because we're dealing with one column at a time
    cohort["comment_month"] = cohort["date"].map(lambda x: datetime(x.year, x.month, 1))
    display(cohort.head(2))
    
    # 2. cohort month
    # tip : transform after a groupby returns a result with the same length as the df,
    # here the min comment_month for each author
    cohort["cohort_month"] = cohort.groupby("author")["comment_month"].transform("min")
    display(cohort.head(2))
    author date article_id comment_month
    0 Ricardo Uztarroz 2022-07-16 3259703 2022-07-01
    1 Ricardo Uztarroz 2022-07-16 3259703 2022-07-01
    author date article_id comment_month cohort_month
    0 Ricardo Uztarroz 2022-07-16 3259703 2022-07-01 2022-02-01
    1 Ricardo Uztarroz 2022-07-16 3259703 2022-07-01 2022-02-01
    # 3. cohort index : for each row, difference in months,
    # between first comment month and cohort month
    def get_date(df, column):
        year = df[column].dt.year
        month = df[column].dt.month
        day = df[column].dt.day
        return year, month, day
    
    
    comment_year, comment_month, _ = get_date(cohort, "comment_month")
    cohort_year, cohort_month, _ = get_date(cohort, "cohort_month")
    year_diff = comment_year - cohort_year
    month_diff = comment_month - cohort_month
    cohort["cohort_index"] = year_diff * 12 + month_diff + 1
    display(cohort.head(4))
    author date article_id comment_month cohort_month cohort_index
    0 Ricardo Uztarroz 2022-07-16 3259703 2022-07-01 2022-02-01 6
    1 Ricardo Uztarroz 2022-07-16 3259703 2022-07-01 2022-02-01 6
    2 Correcteur 2022-07-16 3259703 2022-07-01 2022-02-01 6
    3 Jean-Doute 2022-07-16 3259703 2022-07-01 2022-02-01 6

    Cohort active users (retention rate of authors)

    # final shaping groupby cohort_month * cohort_index, count (unique) authors
    # cohort active users (active authors / retention rate)
    active_authors = (
        cohort.groupby(["cohort_month", "cohort_index"])["author"]
        .apply(pd.Series.nunique)
        .reset_index()
    )
    active_authors = active_authors.pivot_table(
        index="cohort_month", columns="cohort_index", values="author"
    )
    # generate cohort with Plotly, as a heatmap
    fig = px.imshow(
        active_authors,
        text_auto=True,
        #width=1000,
        height=500
        )
    Figure 5: From left to right : our first cohort of authors (Feb.) counts 2184 users. After 1 month, 1677 are still active, and after 12 months (first line, last cell) 1050 are still commenting. Sept. cohort (row 8) : 513 new unique authors, of which only 63 are still active after 5 months
    cohort_index 1 2 3 4 5 6 7 8 9 10 11 12 13
    cohort_month
    2022-02-01 2184.0 1677.0 1412.0 1230.0 1051.0 948.0 1080.0 1202.0 1074.0 848.0 841.0 991.0 1050.0
    2022-03-01 3742.0 1417.0 1118.0 858.0 718.0 930.0 1084.0 954.0 624.0 673.0 803.0 900.0 NaN
    2022-04-01 932.0 190.0 130.0 120.0 144.0 192.0 165.0 109.0 122.0 162.0 150.0 NaN NaN
    2022-05-01 647.0 78.0 62.0 97.0 118.0 90.0 63.0 60.0 86.0 98.0 NaN NaN NaN
    2022-06-01 338.0 56.0 73.0 73.0 58.0 43.0 47.0 51.0 69.0 NaN NaN NaN NaN
    2022-07-01 267.0 56.0 62.0 46.0 30.0 35.0 48.0 49.0 NaN NaN NaN NaN NaN
    2022-08-01 494.0 104.0 70.0 45.0 38.0 61.0 59.0 NaN NaN NaN NaN NaN NaN
    2022-09-01 513.0 62.0 54.0 49.0 74.0 63.0 NaN NaN NaN NaN NaN NaN NaN
    2022-10-01 358.0 29.0 26.0 47.0 55.0 NaN NaN NaN NaN NaN NaN NaN NaN
    2022-11-01 284.0 18.0 32.0 42.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
    2022-12-01 214.0 39.0 35.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    2023-01-01 350.0 71.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    2023-02-01 377.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

    Cohort percentage active users (retention rate, in %)

    Code
    # Clone previous dataframe (cohort users, count)
    active_authors_pct = active_authors.copy(deep=True)
    
    # get %
    for col in active_authors_pct.columns[1:]:
        active_authors_pct[col] = round(
            active_authors_pct[col] / active_authors_pct[1] * 100, 2
        )
    active_authors_pct[1] = 100
    
    # Generate heatmap (cohort users, %)
    labels = {"x": "n months", "y": "cohort (by month)", "color": "% author"}
    
    fig_pct = px.imshow(active_authors_pct, text_auto=True, labels=labels)
    fig_pct= fig_pct.update_xaxes(side="top", ticks="outside", tickson="boundaries", ticklen=5)
    fig_pct = fig_pct.update_yaxes(showgrid=False)
    
    fig_pct = fig_pct.update_layout(
        {
            "xaxis": {"tickmode": "linear", "showgrid": False},
            #"width": 800,
            "height": 500,
            "plot_bgcolor": "rgba(0, 0, 0, 0)",
            "paper_bgcolor": "rgba(0, 2, 0, 0)",
        }
    )
    Figure 6: Globally, the retention rate is lower for later cohorts (authors who started posting later in the year)
    cohort_index 1 2 3 4 5 6 7 8 9 10 11 12 13
    cohort_month
    2022-02-01 100 76.79 64.65 56.32 48.12 43.41 49.45 55.04 49.18 38.83 38.51 45.38 48.08
    2022-03-01 100 37.87 29.88 22.93 19.19 24.85 28.97 25.49 16.68 17.99 21.46 24.05 NaN
    2022-04-01 100 20.39 13.95 12.88 15.45 20.60 17.70 11.70 13.09 17.38 16.09 NaN NaN
    2022-05-01 100 12.06 9.58 14.99 18.24 13.91 9.74 9.27 13.29 15.15 NaN NaN NaN
    2022-06-01 100 16.57 21.60 21.60 17.16 12.72 13.91 15.09 20.41 NaN NaN NaN NaN
    2022-07-01 100 20.97 23.22 17.23 11.24 13.11 17.98 18.35 NaN NaN NaN NaN NaN
    2022-08-01 100 21.05 14.17 9.11 7.69 12.35 11.94 NaN NaN NaN NaN NaN NaN
    2022-09-01 100 12.09 10.53 9.55 14.42 12.28 NaN NaN NaN NaN NaN NaN NaN
    2022-10-01 100 8.10 7.26 13.13 15.36 NaN NaN NaN NaN NaN NaN NaN NaN
    2022-11-01 100 6.34 11.27 14.79 NaN NaN NaN NaN NaN NaN NaN NaN NaN
    2022-12-01 100 18.22 16.36 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    2023-01-01 100 20.29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    2023-02-01 100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

    Comments embedding & fast retrieval using SBERT, Faiss

    Example of use : retrieving similar arguments / checking for propaganda, in an efficient way, with Faiss.

    Semantic search : resources & performance overview of our curated models

    • Notes and resources I found to be useful
      Model choice, must read : symmetric vs. asymmetric semantic search, language, tuning : Sbert is all u need
      Models available and metrics: Sbert doc

    • Misc
      Models trained for cosine similarity favour retrieving shorter documents, vs. dot product (longer documents).
      Faiss measures distances either with the inner product (= dot product, which equals the cosine score if vectors are normalized, e.g. using faiss.normalize_L2) or with L2 (more here); see the short sketch after this list.
      At first, we were not sure of our typical use case : short query => long comment (asymmetric), or comment => similar comment (symmetric).

    • Candidate models we curated & tested :

      Curated models (model : quick notes)
      - paraphrase-multilingual-mpnet-base-v2 : multilingual, suitable scores, optimized for cosine, max seq len 128
      - distiluse-base-multilingual-cased-v1 : symmetric, multilingual, max seq len 128, optimized for cosine
      - quora-distilbert-multilingual : multilingual, short texts (questions), closest to our symmetric use case? Example here (pytorch)
      - dangvantuan/sentence-camembert-large : bigger model, symmetric ?, French, optimized for L2 (+ tbc others ?), embedding size 1024
    • Post-run evaluation and remarks :
      mpnet-base-v2, distiluse, quora : fast encoding (20k documents < 1 mn), results quite similar across models, each one retrieves our test query plus pertinent results. A very good baseline.
      mpnet-base-v2, distiluse, quora : with a flat inner-product Faiss index, no difference whether we normalize the vectors or not, maybe because they are already optimized for cosine?
      camembert is a bigger model (1024 dimensions), slower to encode (20k docs = 5 mn), with nice (better?) results (spoiler alert : it is optimized for French). With a flat IP index and normalize = False, it retrieves similar, short documents. If we normalize our embeddings, it retrieves our initial query + longer, similar documents.
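
    As referenced in the notes above, a minimal sketch (with toy vectors, not our real embeddings) of why L2-normalizing the vectors makes a Faiss inner-product index behave like cosine similarity:

    import numpy as np
    import faiss
    
    # two toy vectors pointing in the same direction but with different magnitudes
    a = np.array([[1.0, 2.0, 3.0]], dtype="float32")
    b = np.array([[2.0, 4.0, 6.0]], dtype="float32")
    
    # the raw inner product depends on magnitude...
    print((a @ b.T).item())   # 28.0
    
    # ...but after in-place L2 normalization it equals the cosine similarity
    faiss.normalize_L2(a)
    faiss.normalize_L2(b)
    print((a @ b.T).item())   # ~1.0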

    Speeding up model evaluation through prior random sampling, notes on speed

    To speed our experiments up, we will work with a sample of comments (around 10% : 236k -> 20k).
    FYI, embedding all 236k comments takes approx. 10 mn on a 1080 Ti, i7-7700K, 32 GB RAM with the curated models ; *1.5 to *2 when using the biggest model (Camembert).
    Encoding our sample (20k) takes < 1 mn.
    No detailed measurement of inference speed, but it is very fast with Faiss. We might want to try different, optimized index types with a bigger dataset.

    # Remember, our dataset was loaded as a Polars dataframe,
    # sample method is very similar in Pandas though.
    coms_sample = coms.sample(seed=42, n=20000, shuffle=True)
    # We're removing articles content and some other cols we won't work with
    keep_cols = [
        "article_id",
        "url",
        "title",
        "desc",
        "date",
        "keywords",
        "author",
        "comment",
    ]
    coms_sample = coms_sample.select(pl.col(keep_cols))
    print(coms_sample.shape)
    (20000, 8)
    ┌────────────┬───────────────┬───────────────┬───────────────┬────────────┬───────────────┬─────────────┬──────────────┐
    │ article_id ┆ url           ┆ title         ┆ desc          ┆ date       ┆ keywords      ┆ author      ┆ comment      │
    │ ---        ┆ ---           ┆ ---           ┆ ---           ┆ ---        ┆ ---           ┆ ---         ┆ ---          │
    │ i64        ┆ str           ┆ str           ┆ str           ┆ date       ┆ list[str]     ┆ str         ┆ str          │
    ╞════════════╪═══════════════╪═══════════════╪═══════════════╪════════════╪═══════════════╪═════════════╪══════════════╡
    │ 3265305    ┆ https://www.l ┆ Trois         ┆ « Poutine,    ┆ 2022-03-17 ┆ ["idees", "id ┆ Strasgorod  ┆ Nos          │
    │            ┆ emonde.fr/ide ┆ semaines      ┆ l’agresseur   ┆            ┆ ees-chronique ┆             ┆ désormais    │
    │            ┆ es/article/20 ┆ après le      ┆ de l’Ukraine, ┆            ┆ s", … "crise- ┆             ┆ dissidents   │
    │            ┆ 22/03/17/po…  ┆ début de      ┆ n’est pas le  ┆            ┆ ukrainienne…  ┆             ┆ russes (édit │
    │            ┆               ┆ l’agression   ┆ …             ┆            ┆               ┆             ┆ orialistes … │
    │            ┆               ┆ rus…          ┆               ┆            ┆               ┆             ┆              │
    │ 3263263    ┆ https://www.l ┆ L’UE a décidé ┆ Avec les      ┆ 2022-03-04 ┆ ["internation ┆ Anthemius   ┆ A Ryu7:      │
    │            ┆ emonde.fr/int ┆ d’appliquer   ┆ réfugiés      ┆            ┆ al",          ┆             ┆ l’Europe,    │
    │            ┆ ernational/ar ┆ pour la       ┆ ukrainiens,   ┆            ┆ "europe", …   ┆             ┆ notamment    │
    │            ┆ ticle/2022/…  ┆ première      ┆ les Européens ┆            ┆ "crise-ukrain ┆             ┆ l’Allemagne, │
    │            ┆               ┆ fois…         ┆ ret…          ┆            ┆ ienne"]       ┆             ┆ a accueil…   │
    └────────────┴───────────────┴───────────────┴───────────────┴────────────┴───────────────┴─────────────┴──────────────┘
    # Quick cleaning, typically remove comments with 3 emojis only
    # Just filter out small comments
    coms_sample = coms_sample.filter(pl.col("comment").str.n_chars() >= 45)
    
    # Finally, convert back to Pandas, for a "better" (re use of code;) workflow with Sbert and FAISS
    coms_sample = coms_sample.to_pandas()
    print(coms_sample.shape)
    (19014, 8)

    Convenience functions to repeat our experiments with different indexes / models

    """ comments embedding """
    
    def comments_to_list(df, column: str) -> list[str]:
        """Extract documents from dataframe"""
        return df[column].values.tolist()
    
    
    def load_model(model_name: str):
        """Convenience fonction to load SBERT model"""
        return SentenceTransformer(model_name)
    
    
    def encode_comments(model, comments):
        """Encode comments using previously loaded model"""
        return model.encode(comments, show_progress_bar=True)
    
    """ create (a flat) Faiss index """
    
    def create_faiss_index(embeddings, normalize: bool, index_type: str):
        """
        Create a flat index in Faiss of index_type "IP" or "L2"
        Index_types and prior vectors normalization varies
        according model output optimization and task.
        """
        dimension = embeddings.shape[1]
        embeddings = np.array(embeddings).astype("float32")
        if normalize:
            faiss.normalize_L2(embeddings)
        if index_type == "ip":
            index = faiss.IndexFlatIP(dimension)
            index.add(embeddings)
        else:
            index = faiss.IndexFlatL2(dimension)
            index.add(embeddings)
        return index
    
    
    def save_index(index, filename: str):
        """Optional, save index to disk"""
        faiss.write_index(index, f"{filename}.index")
    
    
    def load_index(filename):
        """Optional, load index from disk"""
        return faiss.read_index(filename)
    
    """ query index """
    
    def search_index(index, model, query: str, normalize: bool, top_k: int):
        """Encode the query, then retrieve the top_k nearest comments from the index"""
        # encode query
        vector = model.encode([query])
        if normalize:
            faiss.normalize_L2(vector)
    
        # search with Faiss
        Distances, Indexes = index.search(vector, top_k)
        # Distances, Indexes = index.search(np.array(vector).astype("float32"), top_k)
        return Distances, Indexes
    
    
    def index_to_comments(df, column:str, Indexes):
        """Convenience function to retrieve top K comments
        from our original Dataframe
        """
        return df.iloc[Indexes[0]][column].tolist()

    Load model, encode comments

    # load comments (a list), pick our candidate model, load it
    comments = comments_to_list(coms_sample, "comment")
    model_name = "paraphrase-multilingual-mpnet-base-v2"
    model = load_model(model_name)
    normalize = False
    # Encode comments. See notes above for elements of performance / speed
    embeddings = encode_comments(model, comments)

    Create (or load) our Faiss index, here a flat index

    # create Faiss index, here Flat Inner Product
    # (exhaustive search, "no" optimization)
    index = create_faiss_index(embeddings, normalize, index_type="ip")
    # optional : save Faiss index to disk
    filename = "mpnet"
    save_index(index, filename)
    # Optional, load from disk the previously saved Faiss index
    # so we do not rerun embeddings everytime we're executing the notebook
    # we found mpnet (multilang) to be a very good baseline for our dataset.
    filename = "mpnet.index"
    index = load_index(filename)
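
    The flat index above performs an exhaustive search, which is perfectly fine at ~20k vectors. As noted earlier, a much bigger dataset might call for an optimized index type; below is a minimal, hypothetical IVF sketch (assuming the embeddings computed above are still in memory), not used in the rest of this notebook:

    # Hypothetical IVF variant : vectors are clustered into nlist cells,
    # and only nprobe cells are scanned at query time (approximate search)
    d = embeddings.shape[1]
    nlist = 100                                   # number of clusters / Voronoi cells
    quantizer = faiss.IndexFlatIP(d)              # coarse quantizer, same metric as before
    index_ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
    
    xb = np.array(embeddings).astype("float32")
    index_ivf.train(xb)                           # IVF indexes must be trained before adding vectors
    index_ivf.add(xb)
    index_ivf.nprobe = 10                         # trade a bit of recall for speed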

    Let’s find similar comments

    # extract an existing comment (= will be our input query) from dataset
    print(coms_sample["comment"].tolist()[1300])
    Quelle arrogance et quel cynisme. Qu'y a t-il de plus terroriste que la Russie d'aujourd'hui?
    # encode query, query index, retrieve top_k --here 8, nearest comments
    query = "Quelle arrogance et quel cynisme. Qu'y a t-il de plus terroriste que la Russie d'aujourd'hui?"
    top_k = 8
    Distances, Indexes = search_index(index, model, query, normalize, top_k)
    # display top similar comments
    results = index_to_comments(coms_sample, "comment", Indexes)
    for i, result in enumerate(results):
        print(f"{i+1}| : {result}")
    1| : Quelle arrogance et quel cynisme. Qu'y a t-il de plus terroriste que la Russie d'aujourd'hui?
    2| : Quand la Russie, cet Etat terroriste va-t-il être reconnu comme tel par la communauté internationale?
    3| : Il ne s'agit pas de Poutine. Il s'agit de la Russie.
    4| : Oui, ce que les russes ont fait à Hiroshima, Nagasaki, Bagdad, Vietnam... une honte.
    5| : Vous avez raison : les russes sont nos ennemis. Depuis des dizaines d'années.
    6| : Cet article confirme que la russie est devenue un territoire terroriste. Elle utilise toutes les outrances pour justifier ses actes, fait taire sa population, arme des régiments de trolls(cf contributions sur cet article) et envoie sa population à la boucherie sans états d’âmes. Même catégorie que l’Afghanistan et la Corée du Nord.
    7| : Vous êtes naif si vous croyez que seuls les Russes ont ce genre de comportement en tant de guerre...vous devez être de ceux qui croient en la guerre "propre" que les Occidentaux prétendre faire depuis 40 ans (parfois avec les Russes comme alliés d'ailleurs).
    8| : Les Russes sont des colonisateurs dangereux. Demandez aux pays de l'Europe de l'est, aux pays baltes, aux pays d'Asie centrale, à la Syrie, à l'Afghanistan, au Mali, au Soudan, à la Centre Afrique. Faut se protéger contre les nazis russes.

    Maybe later, just for fun : zero-shot “tone” classification tests using the OpenAI API.
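
    If that ever happens, here is a minimal sketch of what such a test could look like, assuming the openai>=1.0 Python client; the model name, labels and prompt are placeholders, not something used in this project:

    from openai import OpenAI
    
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    
    def classify_tone(comment: str) -> str:
        """Zero-shot 'tone' label for a single comment (hypothetical labels)."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "Classify the tone of this French news comment as one of: "
                            "pro-Ukraine, pro-Russia, neutral, sarcastic. Answer with the label only."},
                {"role": "user", "content": comment},
            ],
        )
        return response.choices[0].message.content.strip()
    
    # e.g. classify_tone(coms_sample["comment"].tolist()[1300])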