Automated Summary Experiment Pt.1

So, I decided to play around with Node.js and CoffeeScript. I thought it would be fun to program some toy command line script, so what toy problem would I solve?

Well, I’m sure you all heard about how Google and Yahoo have both acquired automated summary generation applications for big bucks. So I thought, why not write something similar so I can make the big bucks1?

I really wanted something like NumPy for Node. Luckily I found natural and Sylvester. “‘Natural’ is a general natural language facility for node.js.” Sylvester is a vector and matrix math library for JavaScript. I thought I’d have to write a lot more, but using these two libraries along with CoffeeScript, the code is around 100 lines.

The Idea

When generating summaries, we want to pick sentences that matter, but how do we decide what matters? Well, a long time ago, someone invented an algorithm called PageRank. PageRank takes a graph representation of a set of documents that link to each other. A transition matrix is generated where an element A[i,j] tells us the probability of a person on Page(i) going to Page(j). Using this, we can calculate a weight of how important a document is. The weight is something along the lines of “the probability a user will wind up on that page after a lot of clicks.”

Image by Felipe Micaroni Lalli / CC-BY-2.5

Someone smarter than me had the great idea of using this to rank sentences in text. But how do we take the idea of page links and use it on text?

TermFrequency-InverseDocumentFrequency vectors can be used to measure similarities between text. If we assign each sentence in the document a TF-IDF vector, then the “links” can be the amount of similarity between the documents. Luckily for us, natural can calculate the TF-IDF vector for each sentence for us. It doesn’t seem to have an easy way to calculate a similarity matrix, so we kind of hack something together that works okay. We’ll get into this later.

So using our Similarity Matrix, we can make each row add up to 1, and then run our PageRank algorithm on it. This will allow us to find important sentences. We can then pick the top rated sentences and treat this as our summary. Cool huh?

Let’s Dive In

The first thing I want to do is decide what type of documents I want to handle. I like writing in Markdown, but the extra characters might mess things up. So here’s a crappy function to strip out some of the markdown:

strip_markdown = (str) ->
  # Extra quotes
  str = str.replace /"/, ''

  # Links and Images
  str = str.replace /!\[(.*?)?\]\((.*?)?\)/g, '$1'
  str = str.replace /!?\[(.*?)?\]\((.*?)?\)/g, '$1'

  # Footnotes
  str = str.replace /\[\^(.*?)\]:?/g, ''

  # Bold and italics
  str = str.replace /\*{1,3}([^\*]*?)\*{1,3}/g, '$1'
  str = str.replace /_{1,3}([^\*]*?)_{1,3}/g, '$1'

  # List and blockquotes
  str = str.replace /^\s*[\*\s]+(.*?)/g, '$1'
  str = str.replace /^\s*[\>\s]*(.*?)/g, '$1'
  str = str.replace /^\s*[\*\s]+(.*?)/g, '$1'
  str = str.replace /^\s*[\>\s]*(.*?)/g, '$1'

  # Headings
  str = str.replace /#/g, ''

  # comments
  str = str.replace /<!--(.*?)-->/g, ''

  # Extra whitespace
  str = str.replace /\s+/g, ' '

  # Code
  str = str.replace /```(.*?)```/g, ''

The comments should explain what each replace call is trying to do. We remove links, strip out footnote tags, and attempt to remove other formatting characters.

Document Tokenizing

Before we can tokenize, we want to require natural and sylvester. Then, we want to read in 2 command line arguments: the file and the number of sentences in the summary.

natural = require 'natural'
sylvester = require 'sylvester'
fs = require 'fs'

if process.argv[2]? and process.argv[3]?
  # Read in the given file
  fs.readFile process.argv[2], (err, data) ->
    # Tokenize into sentences
    tokenizer = new natural.RegexpTokenizer({pattern: /[\.!\?\r\n#]+/})
    file = strip_markdown data.toString()
    sentences = tokenizer.tokenize file
    sentences = (sentence.trim() for sentence in sentences)

So this code makes sure we have both command line arguments. We then read in the file, convert the result to a string, and tokenize based on the standard punctuation marks ‘?’, ‘.’, and ‘!’. This results in an array where each element should be a sentence from the text.

Next, we want to take each sentence and load it into a TfIdf object. Then we see how each sentences compares to every other sentence. This isn’t the cosine similarity or anything like that. If I understand correctly, the resulting value returned from natural is the sum of the weights of every word in the sentence provided compared to every sentence in the corpus. Of course, the sentence compared to itself will have a very high score, but this is some alpha level stuff we’re coming up with, we can improve it later.

    # Take the sentences and generate the tf-idf vectors
    transition = [];
    TfIdf = natural.TfIdf
    tfidf = new TfIdf()

    tfidf.addDocument(sentence) for sentence in sentences
    for sentence in sentences
      row = [];
      tfidf.tfidfs sentence, (i, measure) ->
        row.push(measure + 1)

So what happens is we compare each sentence to every other sentence in the corpus and create a row in our transition matrix. We might want to add a check to prevent comparing the sentence to itself, but we can do that later. transition_modified(row) simply divides each element in the vector by the sum of all elements in the vector. The code looks like:

transition_modified = (r) ->
  divider = 0
  divider += score for score in r
  r = (score/divider for score in r)

Cool beans! So now we have a transition matrix.

Calculating the PageRanks

The way we calculate the PageRank is to take a rank vector, and multiply is against our transition matrix a bunch of times. Each time we multiply the rank vector against our transition matrix, it tells us the probability of being on a certain page after the number of clicks. So if we multiply our rank vector by our transition matrix 5 times, it would tell us the probability of winding up on each page after 5 clicks.

Sylvester doesn’t have an easy way to initialize our rank vector, so we write a function to do it for us.

make_vec = (len) ->
  new_vec = [];
  new_vec.push 1
  new_vec.push(0) for i in [1...len]

What this does is create a vector of zeros with a 1 in the first position. If we start on the first page, then the probability of us being on the first page after 0 clicks is 1.

So let’s use this to find our PageRanks:

    # Find the page rank off the given
    # Transition matrix...
    trans_mat = $M(transition)
    rank_vec = make_vec(sentences.length)
    for i in [1..1000]
      rank_vec = rank_vec.x(trans_mat)

I bet you thought it was going to be a lot tougher than that. We probably don’t have to multiply 1,000 times. Something about Markov Chains and other things I don’t understand too well myself. So it’s safe to decrease it if desired.

But using this, how do we get out the top rated sentences for our summary in the proper order? Well, JavaScript has a sort method ready for us. All we do is create some objects to hold each sentence, score, and original index. We then sort once by the score, take the top N sentences, then sort by the original index. It looks something like:

    # Create an array of objects with sentences and scores
    scored = []
    for i in [1..sentences.length]
      scored.push({score: rank_vec.e(1, i), sentence: sentences[i-1], idx: i})

    scored.sort compare_scores
    scored = scored.splice(0, process.argv[3])
    scored.sort compare_idx

    summary = []
    summary.push(good.sentence) for good in scored

    console.log(summary.join(".  ") + ".")

compare_idx = (a, b) ->
  a.idx - b.idx

compare_scores = (a, b) ->
  b.score - a.score

compare_idx and compare_scores are helper functions that tell the Arrays.sort method how to sort. We then join the sentences together to create an automated summary.

You can view the whole code here. To run it, simply type:

./ file.txt 10

And that will give you a 10 sentence summary.

How well does this work?

Probably not as well as I would have hoped, but exceeding my expectations. We’re not using a good similarity measure between sentences.

Movie Plots

Spoilers may be contained within.

The Amazing Spiderman

Let’s take some text from the Plot of The Amazing Spider-Man from Wikipedia. It has ~44 sentences. Using our CoffeeScript, we summarize the text in 10 sentences as

As a teenager, Peter is a student at Midtown Science High School, where he is bullied by Flash Thompson and is romantically interested in the beautiful Gwen Stacy, the daughter of police captain George Stacy. Peter finds an algorithm in his father’s documents and gives it to Connors−the missing key to his serum. In school, Peter damages school property during a confrontation with Flash, and Ben is forced to work late so that he can pick up Peter. Peter ignores Ben in favor of helping Connors test their serum on a three-legged mouse. At a grocery store, the clerk rudely refuses to let Peter buy a drink, and when a thief steals money from the register, Peter lets the thief escape. Peter later meets Connors in his office and suspects he is the Lizard, and later unsuccessfully confronts Connors’ Lizard form in the sewers, leaving behind his camera. Connors learns Peter’s identity via the name on the camera and pursues him to Midtown Science High School, where they fight. The police corner Spider-Man, and Captain Stacy discovers that he is Peter, but lets him escape to go save Gwen. Spider-Man manages to replace Connors’ serum with the antidote, reverting Connors to human form, but not before Connors mortally wounds Captain Stacy. In a post-credits scene, Connors, in a dark prison cell, is confronted by a man in the shadows who asks if Connors told Peter the truth about his father.

Pretty awesome. It’s almost like we choose this example first with 10 sentences because it looks great. Of course, we don’t learn Peter Parker is Spider-Man until the police do in this summary, but for the most part, I think it’s pretty good.

X-Men First Class

Plot summary from Wikipedia.

Simultaneously, Lensherr locates Shaw; Lensherr is initially defeated by Emma Frost, but his full magnetic power manifests in anger and he begins to destroy Shaw’s yacht. MacTaggert and Xavier find Shaw as Lensherr is attacking him, rescuing Lensherr from drowning as Shaw escapes in a concealed submarine. Shaw, now immensely powerful from having absorbed almost all the energy from the nuclear reactor, easily subdues Lensherr by throwing him around the reactor room with mere taps and absorbing the kinetic energy from the metal fragments Lensherr launches at him, insisting he wants to help Eric extend his powers further and questioning why he is bothering to fight for a ‘doomed race’ when he and Shaw could rise from the ashes of humanity. Lensherr tells Shaw that he shares Shaw’s exclusivist view of mutants but, to avenge his mother, kills Shaw — over Xavier’s objections — by forcing the Nazi coin of his youth through Shaw’s brain. In a struggle, Xavier keeps Lensherr from destroying the fleets with the missiles, but when MacTaggert shoots at Lensherr, a deflected bullet hits Xavier in the spine.

So this is only 5 sentences, but we can see that it starts off in the middle. A lot of things wouldn’t make sense if you didn’t see the movie (and still probably don’t make sense if you did).

Blog Posts

One of the more interesting applications (instead of movie summaries) would be news media or blogs posts. Reader’s want to know if something will interest them before they read the whole thing.

Matt Gemmell - Legacy

Matt Gemmell wrote an interesting piece called Legacy. I myself have never had such a health scare, but I do understand the idea of wanting to leave something behind. I believe it was Final Fantasy IX that introduced me to the quote “To be forgotten is worse than death”

I’ve always had an irrational fear of developing heart problems (I have no particular family history of it, and I hadn’t had any worrying symptoms before). We create software that will live only for so long, and whilst it indicates our design priorities and our insight into what other people need or want, we tend to see software as an entity unto itself. I think you can tell a lot about how a developer views their work, community and career by how much code they’ve released (assuming no artificial constraint exists to prevent them doing so). I was very nervous before I released my first piece of open source Cocoa code, years ago. It might even make others think of the author just a shade more than a software product itself. Everything I’ve mentioned – apps, open-source code, public speaking, and writing – has an element of ego. I’ve had conversations about it over the years, and I’ve learned that it’s not generally a motivator in people’s lives. I’m not searching for personal glory, but rather a marker that can endure more Winters than I will. Equally and consequently, the other authors whose work I most enjoy reading tend to talk about not only professional but also personal topics. My own fear of passing entirely from memory has been a companion for many years, but I’m sure I could adapt to its absence.

This summary is 10 sentences and doesn’t capture the entire essence of the post. I do believe that it peaks the readers interest enough to take them to the post. It does have some gaps that are hard to follow. I really should test this on articles I’ve never read before…

Summly 2011 Wired Article

Let’s try summarizing Wired’s Article on Summly in seven sentences.

Summly uses a more abstract method, starting with a special algorithm that extracts text from a web page using HTML processing. When you search for a topic using the app, it compiles results from different search engines, so you’ll notice it doesn’t deliver the same results as a Google search, or even a Bing search. D’Aloisio says that Summly works best with well-formulated articles that conform to a consistent structure. Tech articles and news articles tend to marry well with Summly’s algorithm, as does the consistently organized content from the New York Times and the BBC. The app doesn’t do quite as well with narrative text written in the third person, but D’Aloisio says that there are no areas that are seriously troublesome to his algorithm. D’Aloisio says that to get this number, they took a corpus of past documents and articles and compared the quality of human summaries to Summly’s output. D’Aloisio says he has other ideas and aspirations, but for now he’s happy to continue working on and improving Summly.

Why I Almost Cried During Wreck-It-Ralph

This is a blog post I wrote a while back.

The main issue–and it’s a big one–is that Disney hinted at a possible Happy Ending that didn’t have a wish come true feeling in the end. Since she is a glitch, she cannot leave the game and visit game central–she is trapped in her game. King Candy goes and informs Ralph that if Vanellope participates in the race and wins, she will be a playable character. If she is playable, then the players might not like her glitch and think the game is broken. The game could become unplugged, and Vanellope would be trapped in the game and trashed along with it. My tears almost came out of the concern that Disney may have made a movie where Ralph and Vanellope would find acceptance, but not get their goals (or other dreams) fulfilled. Their struggle throughout the movie would have led to another solution, one not on their path, one that is not the happy ending they were searching for. I don’t think any other Disney movies have suggested the possibility of a happy ending that the character didn’t yearn for. In real life, we can find happiness (not always as planned), but in Disney movies, Happy Endings always correspond with dreams come true. But, Disney movies always have a happy ending where dreams come true, and that’s why I only almost cried during ‘Wreck it Ralph’, but not ‘Wall-e’.

10 sentences that I think capture the essence of the blog post.

What about this blog post?

So I thought, why not write something similar so I can make the big bucks. So what happens is we compare each sentence to every other sentence in the corpus and create a row in our transition matrix. So if we multiply our rank vector by our transition matrix 5 times, it would tell us the probability of winding up on each page after 5 clicks. Using our CoffeeScript, we summarize the text in 10 sentences as: As a teenager, Peter is a student at Midtown Science High School, where he is bullied by Flash Thompson and is romantically interested in the beautiful Gwen Stacy, the daughter of police captain George Stacy. I think the downfall with this post is we have example where we’ve already choose some related summarized text, so some of the sentences creep up in score.

This is only 5 sentences. I think the downfall with this post is we have example where we’ve already choose some related summarized text, so some of the sentences creep up in score.

In Conclusion

I hope you found this post enlightening and useful. I wrote it to learn how to use CoffeeScript and some mathemagical libraries from npm. If you have any questions or feedback, please tweet me.

In the next few blog posts about automated summaries, we’ll work on making this smarter and look at some other techniques we can use.

  1. Of course, this is just for play and probably not as awesome as something worth $20 or $30 million.