Errors and Error Handling in KRL


Summary

A small tutorial on how KRL developers can use error events to monitor applications and find problems.

Errors are events that say "something bad happened." Conveniently, KRL is event-driven. Consequently, using and handling errors in KRL feels natural. Moreover, it is entirely consistent with the rest of the language rather than being something tacked on. Even so, error handling features are not used often enough. This post explores how error events work in KRL and describes how I used them in building Fuse.

Built-In Error Processing

KRL programs run inside a pico and are executed by KRE, the pico engine. KRE automatically raises system:error events when certain problems happen during execution of a ruleset. These events are raised differently than normal explicit events. Rather than being raised on the pico's event bus by default, they are only raised within the current ruleset.

Because developers often want to process all errors from several rulesets in a consistent way, KRL provides a way of automatically routing error events from one ruleset to another. In the meta section of a ruleset, developers can declare another ruleset that is the designated error handler using the errors to pragma.

Developers can also raise error events explicitly using an error statement in the rule postlude.

Handling Errors in Practice

I used KRL's built-in error handling in building Fuse, a connected-car product. The result was a consistent notification of errors and easier debugging of run-time problems.

Responding to Errors

I chose to create a single ruleset for handling errors, fuse_error.krl, and refer all errors to it. This ruleset has a single rule, handle_error, that selects on a system:error event, formats the error, and emails it to me using the SendGrid module.
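A minimal sketch of what such a rule could look like follows. It is an illustration rather than the actual Fuse code: the attribute names other than level and genus, the recipient details, and the sendgrid:send() action signature are assumptions, and the sketch presumes the SendGrid module has been loaded in the meta block under the alias sendgrid.

```krl
rule handle_error {
  select when system error
  pre {
    // level and genus are mentioned in this post; the other
    // attribute names are assumptions made for this sketch
    level = event:attr("level");
    genus = event:attr("genus");
    rid = event:attr("rid");
    rule_name = event:attr("rule_name");
    msg = event:attr("msg");

    // format the error as a plain-text email body
    error_email = <<
A Fuse error occurred with the following details:
  RID: #{rid}
  Rule: #{rule_name}
  Level: #{level}
  Genus: #{genus}
  Message: #{msg}
>>;
  }
  // ASSUMPTION: hypothetical action signature (name, address, subject, body)
  sendgrid:send("Fuse Admin", "admin@example.com", "Fuse error", error_email);
}
```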

Meanwhile, all of the other rulesets in Fuse use the errors to pragma in their meta block to tell KRE to route all error events to fuse_error.krl, like so:

meta {
  ... 
  errors to v1_fuse_errors
  ...
}

This ensures that all errors in the Fuse rulesets are handled consistently by the same ruleset. A few points about generalizing this:

  • There's no reason to have just one rule. You could have multiple rules for handling errors and use the select statement to determine which rules execute based on attributes on the error like the level or genus.
  • There's no requirement that the error be emailed. That was convenient for me, but the rule could send them to online error management systems, log them, whatever.
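As a rough sketch of the first point, two rules could split the work by error level. The rule names are made up for illustration, the msg attribute name and the sendgrid:send() signature are assumptions, and the event expression reuses the attribute-matching style shown elsewhere in this post.

```krl
// Serious problems: notify a person
rule handle_serious_error {
  select when system error level re#^error$#
  pre {
    msg = event:attr("msg");  // ASSUMPTION: attribute name
  }
  // ASSUMPTION: hypothetical action signature (name, address, subject, body)
  sendgrid:send("Fuse Admin", "admin@example.com", "Fuse error", msg);
}

// Everything else: just write it to the log
rule note_minor_error {
  select when system error level re#^(warn|info|debug)$#
  pre {
    // klog() writes the value to the log and returns it unchanged
    msg = event:attr("msg").klog("Fuse warning: ");
  }
  noop();
}
```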

Raising Errors

As mentioned above, the system automatically raises errors for certain things like type mismatches, undefined functions, invalid operators, and so on. These are great for alerting you that something is wrong, although they don't always contain enough information to fix the problem. More on that below.

I also use explicit error statements in the rule postlude to pass on erroneous conditions in the code. For example, Fuse uses the Carvoyant API. Consequently, the Fuse rulesets make numerous HTTP calls that sometimes fail. KRL's HTTP actions can automatically raise events upon completion. An http:post() action, for example, will raise an http:post event with attributes that include the response code (as status_code) when the server responds.
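Here is a sketch of an action that asks for a completion event. The rule name, selecting event, URL, and body are placeholders, and the use of the autoraise parameter to request the completion event (with its value passed along as a label) reflects my understanding of the HTTP library rather than code taken from Fuse.

```krl
rule update_vehicle {
  select when fuse vehicle_update_needed  // hypothetical event
  pre {
    url = "https://api.example.com/vehicle";  // placeholder URL
  }
  // ASSUMPTION: autoraise requests an http:post completion event
  // whose attributes include status_code and the label given here
  http:post(url) with
    body = {"name": "My Car"} and
    autoraise = "vehicle_update";
}
```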

Completion events are useful for processing the response on success and handling the error when there is a problem. For example, the following rule handles HTTP responses when the status code is 4XX or 5XX:

rule carvoyant_http_fail {
  select when http post status_code re#([45]\d\d)# setting (status)
           or http put status_code re#([45]\d\d)# setting (status)
           or http delete status_code re#([45]\d\d)# setting (status) 
  pre {
  ... // all the processing code
  }
  event:send({"eci": owner}, "fuse", "vehicle_error") with
    attrs = {
          "error_type": returned{"label"},
          "reason": reason,
          "error_code": errorCode,
          "detail": detail,
          "field_errors": error_msg{["error","fieldErrors"]},
          "set_error": true
         };
  always {
    error warn msg
  }
}

I've skipped the processing that the prelude does to avoid too much detail. Note three things:

  1. The select statement handles failures from several HTTP methods as a group. If there were reasons to treat them differently, you could have different rules do different things depending on the HTTP method that failed, the status code, or even the task being performed.
  2. The action sends the fuse:vehicle_error event to another pico (in this case the fleet) so the fleet is informed.
  3. The postlude raises a system:error event that will be picked up and handled by the handle_error rule we saw in the last section.

This rule has proven very useful in debugging connection issues that tend to be intermittent or specific to a single user.

Using Explicit Errors to Debug

I ran into a type mismatch error for some users when a fuse:new_trip event was raised. I would receive, automatically, an error message that said "[hash_ref] Variable 'raw_trip_info' is not a hash" when the system tried to pull a new trip from the Carvoyant API. The error message doesn't have enough detail to track down what was really wrong. The message could be a little better (tell me what type it is, rather than just saying it is not a hash), but even that wouldn't have helped much.

My first thought was to dig into the system and see if I could enrich the error event with more data about what was happening. You tend to do that when you have the source code for the system. But after thinking about it for a few days, I realized that just wasn't possible to do in a generalized way. There are too many possibilities.

The answer was to raise an explicit error in the postlude to gather the right data. I added this statement to the rule that was generating the error:

error warn "Bad trip pull (tripId: #{tid}): " + raw_trip_info.encode() 
   if raw_trip_info.typeof() neq "hash";

This information was enlightening because I found out that the problem was not an HTTP failure disguised as success. Instead, the trip data was being requested without a trip ID, and the API was, quite correctly, returning a collection rather than a single item.

This pointed back to the rule that raises the fuse:new_trip event. That rule, ignition_status_changed, fires whenever the vehicle is turned on or off. I figured that the trip ID wasn't getting lost in transmission, but rather never getting sent in the first place. Adding this statement to the postlude of that rule confirmed my suspicions:

error warn "No trip ID " + trip_data.encode()  if not tid;

When this error occurred, I got an email with this trip data:

{
  "accountId": "4",
  "eventTimestamp": "20150617T130419+0000",
  "ignitionStatus": "OFF",
  "notificationPeriod": "STATECHANGE",
  "minimumTime": null,
  "subscriptionId": "4015",
  "vehicleId": "13",
  "dataSetId": "25857188",
  "timestamp": "20150617T135901+0000",
  "id": "3530587",
  "creatorClientId": "",
  "httpStatusCode": null
}

Note that there's no tripId, so the follow-on code never saw one either, causing the problem. This wasn't happening universally, just occasionally for a few users.

I was able to add a guard to ignition_status_changed so that it didn't raise a fuse:new_trip event if there were no trip ID. Problem solved.
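The guard might look something like the following postlude fragment. This is a sketch, not the actual Fuse code: the tripId attribute name is an assumption, and the rest of the rule is omitted.

```krl
always {
  // only start trip processing when Carvoyant supplied a trip ID
  raise fuse event "new_trip" with tripId = tid if tid;

  // keep reporting the missing ID so the cases stay visible
  error warn "No trip ID " + trip_data.encode() if not tid;
}
```

Keeping the error statement next to the guarded raise means the bad notifications are still reported even though they no longer trigger the downstream failure.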

Conclusion

One of the primary tools developers use for debugging is logging. In KRL, the Pico Logger and built-in language primitives like the log statement and the klog() operator make that easy to do and fairly fruitful if you know what you're looking for.

Error handling is primarily about being alerted to problems you may not know to look for. In the case I discuss above, built-in errors alerted me to a problem I didn't know about. And then I was able to use explicit errors to see intermittent problems and capture the relevant data to easily determine the real problem and solve it. Without the error primitives in KRL, I'd have been left to guess, make some changes, and see what happens.

Being able to raise explicit errors allows the developer, who knows the context, to gather the right data and send it off when appropriate. KRL gave me all the tools I needed to do this surgically and consistently.