If you are an avid reader of blogs, like I am, you may have come across the "estimated reading time" feature, like the one you can find on Medium. Indicating the reading time for a blog post or article is a great way of letting your readers know how long it will take them to consume the content. Many people lead busy (work) lives and have limited time available, and if they see that an article will take only a couple of minutes to read, they are much more likely to dive in and read it to the end.
Speedy Readers
According to a speed-reading test sponsored by Staples, these are the typical speeds at which humans read, and in theory comprehend, at various stages of educational development:
- Third-grade students = 150 words per minute (wpm)
- Eighth-grade students = 250
- Average college student = 450
- Average “high level exec” = 575
- Average college professor = 675
- Speed readers = 1,500
- World speed reading champion = 4,700
- Average adult = 300
This data tells us that the average reading speed varies across stages of educational and professional development. It's important to know your audience when estimating reading speed, but since not all content is equal, calibrating for a range of 250-300 words per minute is a good ballpark. We can leverage this data to estimate the time it would take the average reader to get through your blog post or article.
The Reading Time Formula
The formula for estimating reading time itself is incredibly simple: T = w / s, where T denotes the estimated reading time, w is the total number of words in your article or post, and s is the reading speed (either in words per minute or words per second) for your target audience.
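As a quick sanity check (the numbers here are illustrative), a 1,200-word article read at the 250 wpm ballpark takes 1200 / 250 = 4.8 minutes:

```ruby
# T = w / s: reading time in minutes, from a word count and a words-per-minute speed
def reading_time_minutes(word_count, words_per_minute)
  word_count.to_f / words_per_minute
end

reading_time_minutes(1200, 250)  # => 4.8
```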
Ruby Implementation
How can we put this formula into our Rails blog? Natural Language Processing (NLP) is a subfield of computational linguistics and gives us all the fundamental building blocks for processing and analyzing arbitrary amounts of digital natural language data. We start off with the notion of a Document:
```ruby
class Document
  def initialize(text)
  end
end
```
A Document is constructed from a chunk of raw natural language text; in our case, the contents of your blog post or article. Through NLP methods, we can cut up this text into smaller and smaller pieces such as paragraphs, sentences, tokens, and finally words.
We first start with paragraphs: consecutive runs of text that are separated by two or more line breaks. We can use a regular expression that splits the given text on two or more consecutive newline characters:
```ruby
def paragraphs(text)
  text.split(/[\n\r]{2,}/)
end
```
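As a small illustration (the sample text is made up), two blocks separated by a blank line come back as two paragraphs:

```ruby
def paragraphs(text)
  text.split(/[\n\r]{2,}/)
end

paragraphs("First paragraph.\n\nSecond paragraph.")
# => ["First paragraph.", "Second paragraph."]
```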
Next, we can further split each paragraph into sentences, again using a regular expression, albeit this time a little more complex:
```ruby
def sentences(paragraph)
  paragraph.split(/((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])/)
end
```
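Because the pattern contains capture groups, Ruby's split also returns the captured separators, so whitespace-only fragments need to be filtered out afterwards (the Document class below does exactly that). A small illustration with made-up text:

```ruby
def sentences(paragraph)
  paragraph.split(/((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])/)
end

# Drop the captured separators and empty fragments that split returns
sentences("It works. Try it!").reject { |s| s.strip.empty? }
# => ["It works.", "Try it!"]
```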
Each sentence can be further split into raw tokens, the elements of the sentence that are separated by whitespace:
```ruby
def raw_tokens(sentence)
  sentence.split(/\s+/)
end
```
Each raw token potentially contains punctuation characters. For example, if we were to split the sentence "Two books, one tall, one short." with the method above, we'd obtain the raw tokens ["Two", "books,", "one", "tall,", "one", "short."]. Hence, we need to further separate punctuation from each raw token:
```ruby
def split_with_punctuation(raw_token)
  # Keep possessives intact (tokens are upcased before this is called)
  return raw_token if raw_token.end_with?("'S")
  raw_token.split(/((?<=\p{P})|(?=\p{P}))/).map(&:strip)
end
```
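Chaining the two steps on the example sentence above (and rejecting the empty fragments the capture group produces, just as the Document class does) yields words and punctuation as separate tokens:

```ruby
def raw_tokens(sentence)
  sentence.split(/\s+/)
end

def split_with_punctuation(raw_token)
  return raw_token if raw_token.end_with?("'S")
  raw_token.split(/((?<=\p{P})|(?=\p{P}))/).map(&:strip)
end

raw_tokens("Two books, one tall, one short.")
  .flat_map { |t| split_with_punctuation(t) }
  .reject(&:empty?)
# => ["Two", "books", ",", "one", "tall", ",", "one", "short", "."]
```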
For each resulting token, we can now determine whether that token comprises a word or a punctuation element:
```ruby
def punctuation?(token)
  (token =~ /\p{P}/) != nil
end

def word?(token)
  !punctuation?(token)
end
```
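A quick check of both predicates:

```ruby
def punctuation?(token)
  (token =~ /\p{P}/) != nil
end

def word?(token)
  !punctuation?(token)
end

punctuation?(",")   # => true
word?("books")      # => true
word?("short.")     # => false (still contains punctuation)
```

Note that word? assumes punctuation has already been split off: a raw token like "short." still matches the punctuation pattern, which is why the tokenization step above matters.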
Putting it all together
Let's put all these pieces together in a complete solution:
```ruby
module Language
  def self.paragraphs(text)
    text.split(/[\n\r]{2,}/)
  end

  def self.sentences(text)
    text.split(/((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])/)
  end

  def self.tokenize(text)
    text.split(/\s+/)
  end

  def self.split_with_punctuation(text)
    return text if text.end_with?("'S")
    text.split(/((?<=\p{P})|(?=\p{P}))/).map(&:strip)
  end

  def self.punctuation?(token)
    (token =~ /\p{P}/) != nil
  end

  def self.word?(token)
    !punctuation?(token)
  end

  class Document
    attr_accessor :paragraphs, :sentences, :tokens

    def initialize(text)
      @text = text
      @paragraphs = Language.paragraphs(text)
      @sentences =
        @paragraphs
          .map { |paragraph| Language.sentences(paragraph) }
          .flatten
          .filter { |s| (s =~ /\A\s*\z/).nil? }
      @tokens =
        @sentences
          .map { |sentence| Language.tokenize(sentence) }
          .flatten
          .map(&:upcase)
          .map(&Language.method(:split_with_punctuation))
          .flatten
          .filter { |s| (s =~ /\A\s*\z/).nil? }
    end

    def words
      @tokens.filter(&Language.method(:word?))
    end

    # speed is in words per second; returns the estimated time in seconds
    def reading_time(speed = 3)
      n_words = words.count
      (n_words + speed / 2) / speed
    end
  end
end
```
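One detail worth calling out: with an integer speed, the expression (n_words + speed / 2) / speed is integer division rounded to the nearest whole second, rather than always rounding down. A standalone sketch of just that arithmetic:

```ruby
# Rounds n_words / speed to the nearest integer when speed is an Integer
def rounded_reading_time(n_words, speed)
  (n_words + speed / 2) / speed
end

rounded_reading_time(100, 3)  # => 33 (100 / 3 is about 33.3)
rounded_reading_time(101, 3)  # => 34 (101 / 3 is about 33.7)
```

When a Float speed such as 4.1 is passed instead, the same expression falls back to ordinary floating-point division.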
Now we can construct a Document from any given text and estimate the reading time:
```ruby
class Post < ApplicationRecord
  # other code ...

  # speed is in words per second (4.1 wps is roughly 250 wpm)
  def reading_time(speed = 4.1)
    Language::Document.new(content).reading_time(speed)
  end
end
```
For our KUY.io blog we are using a reading speed of 4.1 words per second (about 250 wpm), and display it on our blog posts like this:

```ruby
distance_of_time_in_words(@post.reading_time(4.1).seconds)
```
What if your blog is written in another language / framework?
There are a number of amazing NLP libraries out there for different languages and frameworks that come with built-in support for text parsing, chunking, and tokenization. Here is a round-up of the most popular libraries for different languages:
- Python: is quickly becoming the standard language for data scientists, and also features one of the best NLP libraries out there: the Natural Language Toolkit (NLTK)
- JavaScript: is another language that is very popular for web applications and blogs. Two NLP libraries stand out: Natural and NLP.js
- Java: is the grandfather language for NLP processing, with one of the most mature libraries, the Stanford NLP library
- Ruby: features a large collection of libraries for NLP tasks
- Elixir: has an implementation of basic NLP tools with the Essence library
How can we help?
Natural Language Processing is an incredibly fun sub-field of Artificial Intelligence and a powerful tool for teaching machines to process natural language text. If you have any questions about KUY.io or are looking for advice on implementing NLP workloads in your next project, please feel free to reach out to us.
👋 Cheers,
Nicolas Bettenburg, CEO and Founder of KUY.io