Riddler Classic: Spam begets spam

Problem

Over the course of three days, suppose the probability of any spammer making a new comment on this week’s Riddler column over a very short time interval is proportional to the length of that time interval. (For those in the know, I’m saying that spammers follow a Poisson process.) On average, the column gets one brand-new comment of spam per day that is not a reply to any previous comments. Each spam comment or reply also gets its own spam reply at an average rate of one per day.
  For example, after three days, I might have four comments that were not replies to any previous comments, and each of them might have a few replies (and their replies might have replies, which might have further replies, etc.).
  After the three days are up, how many total spam posts (comments plus replies) can I expect to have?

Simulation

num_trials <- 100000
spam_comments <- function(hrs, mult) {
  # Given a number of hours, and a multiplier to decide whether new spam
  # comments seed their own comment/reply spam stream.
  arrival_time <- rexp(n = 1, rate = 1/24)
  if(arrival_time > hrs) {
    return(0)
  } 
  return(1 + mult * spam_comments(hrs - arrival_time, mult))
}

df <- crossing(trial = 1:num_trials, mult = 1:2) %>%
  mutate(hrs = 24 * 3) %>%
  mutate(num_comments = purrr::map2_dbl(hrs, mult, spam_comments))
options(digits = 3)
df %>%
  group_by(mult) %>%
  summarise(
    mean_num_comments = mean(num_comments),
    sd_num_comments = sd(num_comments))
## # A tibble: 2 x 3
##    mult mean_num_comments sd_num_comments
##   <int>             <dbl>           <dbl>
## 1     1              3.00            1.73
## 2     2             19.7           132.

Conclusion

We already knew that the mean number of comments without spawning new threads with a rate of 1 per day for 3 days was 3 comments, using a basic Poisson process (with a variance of 3 as well.)

With our simulation, we can see a much higher number of comments (~19 over 3 days time), and an extremely high variance.

Update

Revising with a couple of sources:

  • Notes from MIT’s OpenCourseWare CS 6.262
  • David Robinson’s screencast on this Riddler

The MIT chapter points to one reminder – if you have two Poisson processes at different rates \(\lambda_1\) and \(\lambda_2\), it forms an aggregate Poisson process with rate \(\lambda = \lambda_1 + \lambda_2\).

For the first waiting time \(X_1\), it follows a simple exponential distribution with \(\lambda = 1\). This correspondingly has a mean \(\lambda\). Fair enough.

But now we have the original post and the new post, both of which gather replies. So at this point, we can consider this as two Poisson processes, both with \(\lambda_1 = \lambda_2 = 1\). The aggregate of these two is a Poisson process \(\lambda = 2\). The wait time for \(X_2\) is then \(\frac{1}{2}\).

With the next comment, we now are combining another Poisson process, so \(\lambda = 2 + 1 = 3\). So, now we have \(X_3 \sim \exp(3)\) waiting time, with an expectation \(E[X_3] = \frac{1}{3}\).

\[ \begin{aligned} S_n & = X_1 + X_2 + X_3 + \dots + X_n \\ E[S_n] & = E[X_1] + E[X_2] + E[X_3] + \dots + E[X_n] \\ & = 1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n} \\ & = \int_1^n \frac{1}{x} \; dx \\ & = \ln{x} \; \big|_1^n \\ 3 & = \ln{n} \\ n & = e^3 \\ \end{aligned} \] There’s an off-by-one issue here, since you cannot fit all the comments exactly within the window, so, we’re looking at \(n - 1 = e^3 - 1 = 19.1\).