rss2email - a Scheme program for converting RSS to email

The email that is sent to the mailing list on Technometria is generated automatically from the RSS feed for the "Newsletter" category on my blog. As I write my blog, I just properly categorize anything I want to be sent to the newsletter and it happens. The magic is a little python program called rss2email.py by Aaron Swartz. I modified the program to make it usable for my newsletter application.

The problem is that the program was designed for one feed and one email address. I've wanted to do regular mailings to the UtahPolitics.org mailing list in the same way, but that would require rewriting the Python script in some pretty serious ways. I decided that I'd rather rewrite it in Scheme, so I did. This page documents that program.

The Scheme program is also called rss2email let me use some of the things I've learned about as XML and URLs in Scheme and let me explore some things I didn't know very well like:

  • Perl regular expressions in Scheme
  • Using SMTP
  • Using command line arguments

Installing rss2email

Right now, making rss2email work requires more knowledge of Scheme than I'd like. For example, the configuration is stored in the program as an assoc list. I'd rather it be in an external file that people could edit and use. Also, the GDBM extension I use is not "just there" like the other libraries are.

Even so, if you're interested in making it work on your system, you would need system admin skills more than a knowledge of Scheme. installing MzScheme is fairly straightforward. So, if parentheses don't scare you too much, give it a go. I'm happy to answer questions about it.

Program Description

Almost every function in the program makes use of a structure containing the relevant information for the mailing list.

(define-struct mlist (name owner to url template gdbmfile))
                       
(define mailing-lists
  (list
   (cons "technometria"
         (make-mlist
          "Phil Windley's Technometria"    ; name
          phil                             ; owner's email
          `(,phil)                         ; to email
          "http://www.windley.com/rss.xml" ; rss url
          "/Users/pjw/prog/scheme/rss-to-email/technometria.tmpl" 
          "/Users/pjw/prog/scheme/rss-to-email/guid-technometria.db" 
          ))
   (cons "utahpolitics"
         (make-mlist
          "UtahPolitics.org"                      ; name
          phil                                    ; owner's email
          `(,phil)                                ; to email
          "http://www.utahpolitics.org/index.xml" ; rss url
          "/Users/pjw/prog/scheme/rss-to-email/utahpolitics.tmpl" 
          "/Users/pjw/prog/scheme/rss-to-email/guid-utahpolitics.db"
          ))
   ))

The structures representing the configuration are store in an association list so that we can use the key to select the right one later.

The main work of the program consists of

  • get the items from the RSS feed
  • filter them to remove any we've processed before
  • build a message from the filtered list
  • substitute it into the template
  • mail the result to the email in the mlist structure

For the first two tasks, I turned to a recipe for processing RSS from the Scheme Cookbook. The function uses Scheme pattern matching (which I'd never played with before):

(define (get-rss url)
  (xml->xexpr
   ((eliminate-whitespace '(rss channel item) (lambda (x) x))
    (document-element (call/input-url (string->url url) get-pure-port read-xml)))))


(define (rss->item rss)
  (letrec ((good-item (lambda (p) (and (pair? p) p))))
    (filter good-item
            (match rss
              (('rss _  ('channel _ . items))
               (map 
                (match-lambda
                    (('item _
                       ('title _ . titles) 
                       ('description _ . desc ) 
                       ('link _ link) . _)
                     (let ((title (apply string-append titles)))
                       (list link title desc)))
                  (('item
                     ('title _ . titles)
                     ('link _ link)
                     (_ . _)
                     ('body _ . body))
                   (let ((title (apply string-append titles)))
                       (list link title desc)))
                  (('item _                 ;;; old movable type
                     ('title _ . titles)
                     ('link _ link)
                     ('description _ . desc). _)
                   (let ((title (apply string-append titles)))
                     (list link title desc)))
                  (_ '()))
                items))
              (('rdf:RDF (_ ...) _ ('channel . _). items)
               (map 
                (match-lambda
                    (('item _
                       ('title _ title)
                       ('link _ link)
                       ('description _ . desc ) . _)
                     (list link title desc))
                  (_ '()))
                items))
              ))))

I had to modify this from what's in the cookbook to deal with some further variations I found in just the few RSS feeds I tried it with. It works pretty well for what I'm doing here, but really should be done using the SSAX parser and XPath for flexibility. The lesson: never send a state machine to do a parser's job.

If you don't keep track of which RSS items you've seen before, you end up sending the same thing over and over since RSS feeds usually contain things you've already seen. To to that, I needed some kind of persistence. I could have just used a file, but that won't scale well, so I decided to use a persistant hash. In fact, I revived the mzgdbm extension for this very purpose. The filter function takes a list, as formatted by rss->item and filters it to return only those items whose guid (actually the link) isn't a key in the GDBM hash given by the second argument:

(define (filter-item-list items guid-hash-db)
  (let loop ((entries items)
	     (new-items '())
	     )
    (if (null? entries)
	(reverse new-items)
	(let* ((entry (car entries))
	       (link (car entry))
	       )
	  (if (gdbm:exists? guid-hash-db link)
	      (loop (cdr entries) new-items)
	      (begin
		(gdbm:store guid-hash-db link 1)
		(loop (cdr entries) (cons entry new-items)))
	  )
	)
    )))

The next task is to build a message body from the list of RSS items. This requires recursing on the filtered list of RSS items and accumulating two things:

  • a string representing the formatted items
  • a string representing the titles to be used as an index

The function returns a single string that has the message string appended to the title index string. This represents a bit of hard-coded formatting that could be made more flexible, but suits my needs.

(define (build-message item-list mlist)
    (let loop ((entries item-list)
               (message (string-append line-sep "\n\n"))
	       (titles "In this issue: \n\n")
               )
      (if (null? entries)
          (string-append
	   titles "\n\n"
	   message)
          (let* ((entry (car entries))
                 (link (car entry))
                 (title (cadr entry))
                 (desc (car (caddr entry)))
                 )
            (loop 
             (cdr entries)
             (string-append message
                            title "\n"
                            link "\n"
                            (place-references desc) "\n"
                            line-sep "\n\n"
                            )
	     (string-append titles
			    "- " title "\n")
	     )
            )
          )
      ))

The descriptions in the RSS items are in HTML and I send my newsletters as plain text. I used Perl-style regular expressions to define a function for processing the descriptions. It doesn't just strip the HTML, but does a little processing on some HTMl tags to make the resulting text better. The biggest job is looping through the string, finding each anchor tag in turn, and turning them into a reference to the URL. The Anchor tag is replaced by a reference of the form [X] where X is a number and the URL in the tag is appended, with the right number to the end of the description:

;;; various regular expressions for matching pieces of HTML
(define anchor-regexp (pregexp "<(?i:a\\s*href)=\"(.*?)\".*?>"))
(define anchor-around-img-regexp 
        (pregexp "<[a|A].*?>\\s*<(?i:img)[^>]*>\\s*"))
(define anchor-close-regexp (pregexp ""))
(define list-item-regexp (pregexp "<(?i:li)>"))
(define break-regexp (pregexp "<(?i:br)[/]?>"))
(define quoteref-regexp (pregexp "<(?i:div class=\"quoteref\")>From"))
(define blockquote-regexp (pregexp "<(?i:blockquote)[^>]*>"))

(define (place-references text)
  (let* (;; get rid of image anchors
         (t1 (pregexp-replace* anchor-around-img-regexp text "")) 
         ;; replace list items with "-"
         (t2 (pregexp-replace* list-item-regexp t1 "- ")) 
         ;; replace "<br>" with newlines
         (t3 (pregexp-replace* break-regexp t2 "\n"))  
         ;; manage the "From"					
         (t4 (pregexp-replace* quoteref-regexp t3 "Quoted from"))  
         )
    (let loop ((str t4) 
               (cnt 1))
      (let ((match (pregexp-match anchor-regexp str))
            )
        (if match
            (let* ((ref (string-append "[" (number->string cnt) "]"))
                   (anchor (car match))
                   (url (cadr match))
                   (s1 (pregexp-replace anchor-regexp str ""))
                   (s2 (pregexp-replace anchor-close-regexp s1 
                                        (string-append " " ref)))
                   )
              (loop (string-append s2 "\n" ref " " url) 
                    (+ cnt 1))
              )
            (strip-all-tags str) ;; throw out everything else and return
            ))
      )
    )
  )

I've long been a fan of Perl regular expressions so I had fun with this. Even so, at some point it would be nice to replace this with a full-on HTML2text function that does a better job of formatting, etc.

Regular expressions made the next job pretty easy as well. Each mailing list has header and footer material that wraps the messages generated from the RSS feed. I wanted the templates to be in seperate files that could be edited as text. This function just substitutes a message into a file where it finds a "[body]" tag. This isn't any kind of templating language, just a simple template that's easy to do with a regular expression.

(define template-body-regexp (pregexp "\\[body\\]"))
(define (template-substitution mlist message)
  (let* ((read-all (lambda (path)
                     (let ((size (file-size path)))
                       (call-with-input-file path
                         (lambda (p)
                           (read-string size p))))))
         (template (read-all (mlist-template mlist)))
        )
    (pregexp-replace template-body-regexp template message)
  ))

The final task is to email the resulting message to the list. This function accomplishes that using the smtp.ss functions from the net library.

(define (send-email-to-list mlist message)
  (let ((from (mlist-owner mlist))
        (to (mlist-to mlist))
        (subj (string-append (mlist-name mlist) " Newsletter"))
        )
    (smtp-send-message 
     smtp-server
     from
     to
     (standard-message-header from to null null subj)
     message)))

The final piece is the code that puts all this together:

(define (go mlist-pair)
  (let* ((mlist-id (car mlist-pair))
         (mlist (cdr mlist-pair))
         (items (rss->item (get-rss (mlist-url mlist))))
	 (guid-hash-db (gdbm:open (mlist-gdbmfile mlist) :gdbm:write-create))  
	 (new-items (filter-item-list items guid-hash-db))
	 )
    (if (not (null? new-items))
	(send-email-to-list 
	 mlist 
	 (list (template-substitution mlist
				      (build-message new-items mlist))))
	(return-message "nothing to do for " mlist-id)
	)
    (gdbm:close guid-hash-db)
    ))

The code that actually runs to make things happen gets an id from the command line and then uses that to select one of the members of the assoc list.

(if (= (vector-length argv) 0)
    (show-usage mailing-lists) ;; we must have a name
    (let* ((list-name (vector-ref argv 0))
           (mlist-pair (assoc list-name mailing-lists))
          )
      (if (not mlist-pair)
          ;; it must match one of the ones in the assoc list
          (show-usage mailing-lists) 
          ;; otherwise do it
          (go mlist-pair)
      )))

The file is set up to run from the command line and installed in a crontab entry to run with the right mailing list id as its argument.


Last Modified: Tuesday, 12-Apr-2005 15:45:51 MDT