Asynchronous Boundaries and Back Pressure

Vanishing Train

A significant component of reactive programming is the asynchronous boundary between the message sender and receiver. The problem is that the receiver might not work as fast as the sender and thus fall behind. When this happens the sender can block (blocking), the event bus can throw messages away (lossiness), or the receiver can store messages (an unbounded message queue). None of these is ideal.

In response, a key concept from reactive systems is non-blocking back pressure. Back pressure allows queues to be bounded. One issue is that back pressure can't be synchronous or you lose all the advantages. Another is that if the sender doesn't have anything else to do (or can't do it easily) then you effectively get blocking.

Picos, as implemented by KRL, are lossy. They will queue requests, but the queue is, obviously, finite, and when it reaches its limit, the event sender will receive a 5XX error. This could be interpreted as back pressure, a NACK of sorts. But nothing in the KRL event:send() action is set up to handle the 5XX gracefully. Ideally, the error code ought to be something the sender could understand and react to.
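
To make the idea concrete, here's a minimal sketch in Python (not KRL) of a bounded queue that answers with an overload status when it's full, and a sender that treats that status as a back-pressure signal rather than a failure. The class and function names, the 202 and 503 status codes, and the backoff numbers are all assumptions made for illustration; nothing here reflects how picos or event:send() actually work.

```python
import queue


class BoundedInbox:
    """A receiver-side queue with a hard limit (a stand-in for an event queue)."""

    def __init__(self, limit):
        self._events = queue.Queue(maxsize=limit)

    def offer(self, event):
        """Queue the event if there is room; otherwise signal overload.

        Returns 202 when queued and 503 when full. The specific codes are
        an assumption chosen for this sketch.
        """
        try:
            self._events.put_nowait(event)
            return 202
        except queue.Full:
            return 503

    def drain(self, count):
        """Simulate the receiver catching up by removing up to `count` events."""
        done = 0
        while done < count and not self._events.empty():
            self._events.get_nowait()
            done += 1
        return done


def send_with_backoff(inbox, event, max_attempts=4, on_wait=None):
    """Try to deliver `event`, treating 503 as a request to slow down.

    Returns True on delivery and False if the sender gives up.
    `on_wait(delay)` is called between attempts instead of sleeping, so the
    sender can do other work while it waits (which keeps it non-blocking).
    """
    delay = 0.1
    for _ in range(max_attempts):
        if inbox.offer(event) == 202:
            return True
        if on_wait is not None:
            on_wait(delay)
        delay *= 2  # wait longer before each further attempt
    return False
```

The point of `on_wait` is the second issue raised above: if the sender has nothing useful to do while it waits, backing off is just blocking under another name.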


Regaining Control of Our Data with User-Managed Access

Choices

User control is a central tenet of any online world that most of us will want to live in. You don't have to consider things like surveillance-based marketing or devices that spy on us for long to realize that a future that's an extrapolation of what we have now is a very real threat to personal autonomy and freedom.

A key part of the answer is developing protocols that make it easy to give control to users.

The Limitations of OAuth

Even if you don't know what OAuth is, you've likely used it. Any time you use Twitter, Facebook, or some other service to log in to some other Web site or app, you're using OAuth. But logging in is only part of the story. The "auth" in OAuth doesn't stand for "authentication" but "authorization." For better or worse, what you're really doing when you use OAuth is granting permission for the Web site or app that's requesting you "log in with Facebook" to access your Facebook profile.

Exactly what you're sharing depends on what the relying site is asking for and what you agree to. There's usually a pop-up or something that says "this app will be able to...." that you probably just click on to get it out of the way without reading it. For a fun look at the kinds of stuff you might be sharing, I suggest logging into Take This Lollipop.

But while I think we all need to be aware that we're granting permissions to the site asking us to "log in," my purpose isn't to scare you or make you think "OAuth is bad." In fact I think OAuth is very, very good. OAuth gives all of us the opportunity to control what we share and that's a good thing.

OAuth is destined to grow as more and more of us use services that provide or need access to APIs. OAuth is the predominant way that APIs let owners control access to another service.

But OAuth has a significant limitation. If I use OAuth to grant a site access to Twitter, the record that I did so and my dashboard for controlling it are both at Twitter. Sounds reasonable until you imagine OAuth being used for lots of things and the user having dozens of dashboards for controlling permissions. "Let's see...did I permission this site to access my Twitter profile? Facebook? BYU?" I've got to remember and go to each of them separately to control the permission grants. And because each site is building its own, they're all different and most aren't terribly sophisticated, well-designed, or easy to find.

User-Managed Access to the Rescue

The reason for this limitation is that while OAuth conceptually separates the idea of the authorization server (AS, the place granting permission) and the resource server (RS, the thing actually handing out data), it doesn't specify how they interact. Consequently everyone is left to determine that for themselves. So there's really no good way for two resources, for example, to use a single authorization server.

That's where UMA, or User-Managed Access, comes in. UMA specifies the relationship between the AS and RS. Further, UMA envisions that users could have authorization servers that are independent of the various resources that they're granting permission to access. UMA has been a topic at Internet Identity Workshop and other places for years, but it's suddenly gotten very real with the launch of ForgeRock's open-source OpenUMA project. Now there's code to run!

Side note: If you're a developer you can get involved in the UMA developer working group as well as the OpenUMA effort depending on whether your interests lie on the client or server side.

With UMA we could each have a dashboard, self-hosted or run by the vendor of our choice, where we control access permissions. This may sound complicated, like a big mixing board, but it doesn't have to be. Once there's a single place for managing access, it's easier for default policies and automation to take over much of the busy work and give owners better control at the same time.
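
Here's a toy model of that arrangement: one authorization server that several resource servers consult, so every grant shows up in a single place. This is only a conceptual sketch. It is not the UMA protocol, it doesn't use any OpenUMA API, and every class and method name is hypothetical.

```python
class AuthorizationServer:
    """One owner-controlled place that records grants for many resource servers."""

    def __init__(self):
        self._grants = set()  # (client, resource_server, scope) triples

    def grant(self, client, resource_server, scope):
        self._grants.add((client, resource_server, scope))

    def revoke(self, client, resource_server, scope):
        self._grants.discard((client, resource_server, scope))

    def is_permitted(self, client, resource_server, scope):
        return (client, resource_server, scope) in self._grants

    def dashboard(self):
        """Everything the owner has granted, across all resource servers."""
        return sorted(self._grants)


class ResourceServer:
    """Holds data but asks the shared authorization server before releasing it."""

    def __init__(self, name, authorization_server, data):
        self.name = name
        self._authz = authorization_server
        self._data = data

    def read(self, client, scope):
        if not self._authz.is_permitted(client, self.name, scope):
            raise PermissionError(f"{client} may not read {scope} at {self.name}")
        return self._data[scope]
```

Because both resource servers defer to the same object, revoking a grant or listing everything I've shared happens in one place rather than in a dashboard per site.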

UMA and Commerce

Doc Searls coined the term "vendor relationship management" or VRM years ago as a play on the popular customer relationship management (CRM) tools that businesses use to manage sales and customer communications. It's a perfect example of the kind of place where UMA could have a big impact. VRM is giving customers tools for managing their interactions with vendors. That sounds, in large part, like a permissioning task. And UMA could be a key piece of technology for unifying various VRM efforts.

Most of us hate seeing ads getting in the way of what we're trying to do online. The problem is that even with the best "targeting" technology, most of the ads you see are wasted. You don't want to see them. UMA could be used to send much stronger signals to vendors by granting them permission to access information that would let them help me and, in the process, make more money.

For example, I've written about social products. Social products provide a link back to their manufacturer, the retailer who sold them, the company that services them, and so on. These links are permissioned channels that share information with companies, telling them what products and services I need.

UMA is a natural fit for managing the permissions in a social product scenario, giving me a dashboard where I can manage the interactions I have with vendors, grant permission for new vendors to form a relationship, and run policies on my behalf that control those interactions.

Gaining Control

I'm very bullish on UMA and its potential to impact how we interact with various Web sites and apps. As the use of APIs grows there will be more and more opportunity to mix and mash them into new products and services. UMA is in a good position to ensure that such efforts don't die from user fatigue trying to keep track of it all or, worse, fear that they're losing control of their personal data.


Culture and Trustworthy Spaces

Karneval der Kulturen|Carnival of Cultures

In Social Things, Trustworthy Spaces, and the Internet of Things, I described trustworthy spaces as abstract places where various "things" could come together to accomplish tasks that none of them could do on their own.

For example, in that post I posit a scenario where a new electric car needs to work with other things in its owner's home to determine the best times to charge.

The five properties I discussed for trustworthy spaces were decentralized, event-driven, robust, anti-fragile, and trust-building. But while I can make points about why each of these is desirable in helping our car join a trustworthy space and carry out negotiations, none of them speak to how the car or space will actually do it.

In Systems Thinking, Jamshid Gharajedaghi discusses the importance of culture in self-organizing systems. He says "cultures act as default decision systems." Individuals identify with particular cultures when their self-image aligns with the shared image of a community.

Imagine a trustworthy space serving as a community for things that belong to me and use a lot of power. That space has default policies for power management negotiations. These aren't algorithms, necessarily, but heuristics that guide interactions between members.

In its turn, the car has a power management profile that defines part of its self-image and so it aligns nicely with the shared image of the power management space. Consequently, when the car is introduced to the household, it gravitates to the power management space because of the shared culture. It may join other spaces as well depending on its self image and their culture.

My description is short on detail about how this culture is encoded and how things discover the cultures of spaces upon being introduced to the household, but it does provide a nice way to think about how large collections of things could self organize and police themselves.

Gharajedaghi defines civilization as follows:

[C]ivilization is the emergent outcome of the interaction between culture (the software) and technology. Technology is universal, proliferating with no resistance, whereas cultures are local, resisting change with tenacity.

I like this idea of civilization emerging from a cultural overlay on our collections of things. By finding trustworthy spaces that are a cultural fit and then using that culture for decision making within a society of things, our connected things are tamed and become subject to our will.


Resources, Not Data

Rest Area?

You'll often hear people explain the mainstay HTTP verbs, GET, POST, PUT, and DELETE, in terms of the venerable CRUD (create, retrieve, update, and delete) functions of persistent storage systems. Heck, I do it myself. We need to stop.

In a RESTful API, the HTTP verbs are roughly analogous to the CRUD functions, but what they're acting on is quite different. CRUD functions act on data...static, stupid data. In REST, on the other hand, the verbs act on resources. While there are cases where a resource is just static data, that case is much less interesting than the general case.

To see how, let's pull out the old standby bank account example. In this example, I have a resource called /accounts and in a CRUD world, you could imagine deposits and withdrawals to an account with identifier :id being PUTs on the /accounts/:id resource.

Of course, we'd never expose an API where you could update an account balance with a PUT. In fact, I can't imagine anything you'd do with the account balance in such an API except GET it. There are too many necessary checks and balances (what we call "model invariants") that need to be maintained by the system.

Instead, what we'd do is design an account transfer resource. When we wanted to transfer $100.00 from /accounts/A to /accounts/B, we'd do this:

POST /transfers

{
  "source": "/accounts/A",
  "destination": "/accounts/B",
  "amount": 100.00
}

This creates a new transfers resource and while it's true that data will be recorded to establish that a new transfer was created, that's not why we're doing it. We're doing it to effect the transfer of money between two accounts. Underneath this resource creation is a whole host of processes to maintain the transactional integrity and consistency of the bank's books.
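
As a rough illustration, here's a sketch of what might sit behind that POST. The `Bank` class, its in-memory storage, and the choice of status codes for the error cases are all assumptions made for the example; a real system would do this inside a database transaction with far more checks.

```python
from decimal import Decimal
import itertools


class Bank:
    """In-memory stand-in for the system behind /accounts and /transfers."""

    def __init__(self, balances):
        self._balances = {acct: Decimal(amt) for acct, amt in balances.items()}
        self._transfers = {}
        self._ids = itertools.count(1)

    def get_balance(self, account):
        """The only thing a client can do with a balance: GET it."""
        return self._balances[account]

    def post_transfer(self, body):
        """Handle POST /transfers. Returns (status, resource)."""
        source, destination = body["source"], body["destination"]
        amount = Decimal(str(body["amount"]))
        if source not in self._balances or destination not in self._balances:
            return 404, {"error": "unknown account"}
        if amount <= 0:
            return 400, {"error": "amount must be positive"}
        # Model invariant: an account may not be overdrawn.
        if self._balances[source] < amount:
            return 409, {"error": "insufficient funds"}
        # Both sides change together so the books always balance.
        self._balances[source] -= amount
        self._balances[destination] += amount
        resource = {
            "id": f"/transfers/{next(self._ids)}",
            "source": source,
            "destination": destination,
            "amount": str(amount),
            "status": "completed",
        }
        self._transfers[resource["id"]] = resource
        return 201, resource
```

Notice that no client ever writes a balance. The balances change only as a side effect of creating a transfer, and only if the invariants hold.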

Interesting resources have workflow rather than just being a collection of data. So stop focusing on the HTTP verbs and think about the resources instead. REST is resource-oriented and that doesn't map neatly onto objects, relational databases, or remote procedure calls. Most bad APIs are the result of a mistaken attempt to understand REST in terms of these older programming paradigms.


Tesla is a Software Company, Jeep Isn't

Tesla Sightings

Marc Andreessen has famously said that "software is eating the world." Venkatesh Rao calls software, "only the third major soft technology to appear in human civilization."

"So what?" you say. "I'm not in software, what do I care?"

You care, or should, because the corollary to this is that your company is a software company, whether you like it or not. Software is so pervasive, so important, that it has or will impact every human activity.

The recent hacks of a Jeep Cherokee and Tesla Model S provide an important example of what it means to be a software company—even if you sell cars. Compare these headlines:

After Jeep Hack, Chrysler Recalls 1.4M Vehicles for Bug Fix

Researchers Hacked a Model S, But Tesla’s Already Released a Patch

If you were CEO of a car manufacturer, which of these headlines would you rather were written about you? The first speaks of a tired, old manufacturing model where fixes take months and involve expense and inconvenience. The second speaks of a nimble model more reminiscent of a smart phone than a car.

You might be thinking you'd rather not have either and, of course, that's true. But failure is inevitable; you can't avoid it. So mean-time-to-recovery (MTTR) is more important than mean-time-between-failures (MTBF) in the modern world. Tesla demonstrated that by not just having a fix, but by being able to deliver it over the air without inconvenience to their owners. If you're a Tesla owner, you might have been concerned for a few hours, but right now you're feeling like the company was there for you. Meanwhile Jeep owners are still wondering how this will all roll out.

The difference? Tesla is a software company. Jeep isn't.

Tesla can do over-the-air updates because the ideas of continuous delivery and after-sale updates are part of their DNA.

No matter what business you're in, there's someone, somewhere figuring out how to use software to beat or disrupt you. We've seen this over and over again with things like Uber, FedEx, Walmart, and other companies that have used IT expertise to gain an advantage their competitors didn't pursue.

Being a software company requires a shift in your mindset. You have to stop seeing IT as the people who run the payroll system and make the PCs work. IT has to be part of the way you compete. In other words, software isn't just something you use to run your company. Software becomes something you use to beat the competition.


Authorization, Workflow, and HATEOAS

APIs present different authorization challenges than when people access a Web site or other service. Typically, API access is granted using what are called "developer keys" but are really an application-specific identifier and password (secret). That allows the API to track who's making what call for purposes of authorization, throttling, or billing.

Often, more fine-grained permissioning is needed. If the desired access control is for data associated with a user, the API might use OAuth. OAuth is sometimes called "Alice-to-Alice sharing" because it's a way for a user on one service to grant access to their own account at some other service.

For more fine-grained authorization than just user-data control, I'm a proponent of policy-engine-based access control to resources. A policy engine works in concert with the API manager to answer questions like "Can Alice perform action X on resource Y?" The big advantages of a policy engine are as follows:

  • A policy engine allows access control policy to be specified as pattern-based declarations rather than in algorithms embedded deep in the code.
  • A policy engine stops access at the API manager, saving resources below the manager from being bothered with requests that will eventually be thrown out.
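
To show what "pattern-based declarations" might look like, here's a minimal sketch of a rule table and an evaluator. The rule format, the roles, and the first-match-wins and default-deny semantics are assumptions for illustration, not a description of any particular policy engine. It uses shell-style patterns from Python's standard fnmatch module.

```python
from fnmatch import fnmatchcase

# Each rule: (effect, role pattern, action pattern, resource pattern).
# Rules are checked in order and the first match decides.
POLICIES = [
    ("deny", "*", "DELETE", "/courses/*"),
    ("allow", "registrar", "*", "/courses/*"),
    ("allow", "*", "GET", "/courses/*"),
]


def is_allowed(role, action, resource, policies=POLICIES):
    """Answer "can this role perform this action on this resource?"

    Falls through to deny when no rule matches.
    """
    for effect, role_pat, action_pat, resource_pat in policies:
        if (
            fnmatchcase(role, role_pat)
            and fnmatchcase(action, action_pat)
            and fnmatchcase(resource, resource_pat)
        ):
            return effect == "allow"
    return False
```

Changing who can do what means editing the table, not hunting for conditionals in the code, and a request that fails the check never has to reach the service behind the API manager.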

Recently, Joel Dehlin at Kuali.co got me thinking of another pattern for API access control that relies on workflow.

Consider an API for course management at a university. The primary job of the course management API is to serve as a system of record for courses that the university teaches. There are lots of details about how courses relate to each other, how they're associated with programs, assigned to departments, tied to expected learning outcomes, and so on. But we can ignore that for now. Let's just focus on how a course gets added.

The university doesn't let just anyone add classes. In fact, other than for purposes of importing data in bulk, no one has authority to simply add a class. Only proposals that have gone through a certain workflow and received the approvals required by the university's procedure can be considered bona fide courses.

So the secretary for the University Curriculum Committee (UCC) might only be allowed to add the class if it's been proposed by a faculty member, approved by the department, reviewed by the college, and, finally, accepted by the UCC. That is, the secretary's authorization is dependent on the current state of the proposal and that state includes all the required steps.

This is essentially the idea of workflow as authorization. The authorization is dependent on being at the end of a long line of required steps. There could be alternative paths or exceptions along the way. At each step along the way, authorization to proceed is dependent on both the current state and the attributes of the person or system taking action.

In the same way that we'd use a policy engine to normalize the application of policy for access control, we can consider the use of a workflow engine for many of the same reasons:

  • A general-purpose workflow engine makes the required workflow declarative rather than algorithmic.
  • Workflow can be adjusted as procedures change without changing the code.
  • Declarative workflow specifications are more readable than workflow hidden in the code.
  • A workflow engine provides a standard way for developers to create workflow rather than requiring every team to make it up.
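
Here's a minimal sketch of what a declarative workflow for the course proposal example could look like. The state names, role names, and table format are all made up for illustration, and a real workflow engine would carry much more (history, attributes of the actor, exceptions, and so on).

```python
# (current state, action) -> (role allowed to take the action, next state).
# One state can have several actions, which is how alternative paths appear.
WORKFLOW = {
    ("draft", "propose"): ("faculty", "proposed"),
    ("proposed", "approve"): ("department", "department_approved"),
    ("proposed", "reject"): ("department", "rejected"),
    ("department_approved", "review"): ("college", "college_reviewed"),
    ("college_reviewed", "accept"): ("ucc", "ucc_accepted"),
    ("ucc_accepted", "add_course"): ("ucc_secretary", "course_added"),
}


def advance(state, action, role, workflow=WORKFLOW):
    """Return the next state, or raise if this role can't act from this state."""
    rule = workflow.get((state, action))
    if rule is None:
        raise PermissionError(f"'{action}' is not allowed from state '{state}'")
    required_role, next_state = rule
    if role != required_role:
        raise PermissionError(f"'{action}' requires role '{required_role}'")
    return next_state
```

The secretary's right to add the course isn't stated anywhere as a permission. It falls out of the table: `add_course` only exists as an action once the proposal has reached `ucc_accepted`.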

One of our principles for designing the University API at BYU is to keep workflow below the API since we can't rely on clients to enforce workflow requirements. What's more, developers writing the clients don't want that burden. As we contemplated how best to put the workflow into the API, we determined that HATEOAS links were the best option.

If you're not familiar with HATEOAS, it's an awkward acronym for "hypertext as the engine of application state." The idea is straightforward conceptually: your API returns links, in addition to data, that indicate the best ways to make progress from the current state. There can be more than one since there might be multiple paths from a given state. Webber et al.'s How to GET a Cup of Coffee is a pretty good introduction to the concept.

HATEOAS is similar to the way web pages work. Pages contain links that indicate the allowed or recommended next places to go. Users of the web browser determine from the context of the link what action to take. And thus they progress from page to page.

In the API, the data returned from the API contains links that are the allowed or recommended next actions and the client code uses semantic information in rel tags associated with each link to present the right choices to the user. The client code doesn't have to be responsible for determining the correct actions to present to the user. The API does that.

Consider the application of HATEOAS to the course management example from above. Suppose the state of a course is that it's just been proposed by a faculty member. The next step is that it needs approval by the department chair. GETting the course proposal via the course management API returns the data about the proposed course, as expected, regardless of who the GET is for. What's different are the HATEOAS links that are also returned:

  • For the faculty member, the links might allow for updating or deleting the proposal.
  • For the department chair, the links might be for approving or rejecting the course proposal.
  • For anyone else, the only link might be to return a collection of proposals.
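
The three cases above can be sketched as a function that picks links based on who is asking and the proposal's state. The field names, rel values, and URL layout are hypothetical, chosen only to illustrate the idea; they aren't taken from the University API.

```python
def proposal_links(proposal, user):
    """Return the links this user should see for this proposal."""
    href = f"/proposals/{proposal['id']}"
    if proposal["state"] == "proposed":
        # The faculty member who proposed the course can change or withdraw it.
        if user["id"] == proposal["proposed_by"]:
            return [
                {"rel": "edit", "method": "PUT", "href": href},
                {"rel": "delete", "method": "DELETE", "href": href},
            ]
        # The chair of the proposing department can approve or reject it.
        if user.get("chair_of") == proposal["department"]:
            return [
                {"rel": "approve", "method": "POST", "href": href + "/approval"},
                {"rel": "reject", "method": "POST", "href": href + "/rejection"},
            ]
    # Anyone else can only go back to the collection.
    return [{"rel": "collection", "method": "GET", "href": "/proposals"}]


def get_proposal(proposal, user):
    """Same data for every caller; only the links differ."""
    return {"data": proposal, "links": proposal_links(proposal, user)}
```

The client never decides whether to show an "approve" button. It just renders whatever rels come back.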

Seen this way, a workflow engine is a natural addition to an API management system in the same way a policy engine is. And HATEOAS becomes something that can be driven from the management tool rather than being hard coded in the underlying application. I'm interested in seeing how this plays out.


Social Things, Trustworthy Spaces, and the Internet of Things

20110529 Bee Swarm-3

Humans and other gregarious animals naturally and dynamically form groups. These groups have frequent changes in membership and establish trust requirements based on history and task. Similarly, the Internet of Things (IoT) will be built from devices that must be able to discover other interesting devices and services, form relationships with them, and build trust over time based on those interactions. One way to think about this problem is to envision things as social and imagine how sociality can help solve some of the hard problems of the IoT.

Previously I've written about a Facebook of Things and a Facebook for My Stuff that describe the idea of social products. This post expands that idea to take it beyond the commercial.

As I mentioned above, humans and other social animals have created a wide variety of social constructs that allow us to not only function, but thrive in environments where we encounter and interact with other independent agents, even when those agents are potentially harmful or even malicious. We form groups and, largely, we do it without some central planner putting it all together. Individuals in these groups learn to trust each other, or not, on the basis of social constructions that have evolved over time. Things do fail and security breaks down from time to time, but those are exceptions, not the rule. We're remarkably successful at protecting ourselves from harm and dealing with anomalous behavior from other group members or the environment, while getting things done.

There is no greater example of this than a city. Cities are social systems that grow and evolve. They are remarkably resilient. I've referenced Geoffrey West's remarkable TED talk on the surprising math of cities and corporations before. As West says "you can drop an atom bomb on a city and it will survive."

The most remarkable thing about city planning is perhaps the fact that cities don't really need planning. Cities happen. They are not only dynamic, but spontaneous. The greatness of a city is that it isn't planned. Similarly, the greatness of IoT will be in spontaneous interactions that no one could have foreseen.

My contention is that we want device collections on the Internet of Things to be more like cities, where things are heterarchical and spontaneous, than corporations, where things are hierarchical and planned. Where we've built static networks of devices with a priori determined relationships in the past, we have to create systems that support dynamic group forming based on available resources and goals. Devices on the Internet of Things will often be part of temporary, even transient, groups. For example, a meeting room will need to be constantly aware of its occupants and their devices so it can properly interact with them. I'm calling these groups of social things "trustworthy spaces."

My Electric Car

As a small example, consider the following: suppose I buy an electric car. The car needs to negotiate charging times with the air conditioner, home entertainment system, and so on. The charging time might change every day. There are several hard problems in that scenario, but the one I want to focus on is group forming. Several things need to happen:

  • The car must know that it belongs to me. Or, more generally, it has to know its place in the world.
  • The car must be able to discover that there's a group of things that also belong to me and care about power management.
  • Other things that belong to me must be able to dynamically evaluate the trustworthiness of the car.
  • Members of the group (including the car) must be able to adjust their interactions with each other on the basis of their individual calculations of trustworthiness.
  • The car may encounter other devices that misrepresent themselves and their intentions (whether due to fault or outright maliciousness).
  • Occasionally, unexpected, even unforeseen events will happen (e.g. a power outage). The car will have to adapt.

We could extend this situation to a group of devices that don't all belong to the same owner too. For example, I'm at my friend's house and want to charge the car.

The requirements outlined above imply several important principles:

  • Devices in the system interact as independent agents. They have a unique identity and are capable of maintaining state and running programs.
  • Devices have a verifiable provenance that includes significant events from their life-cycle, their relationships with other devices, and a history of their interactions (i.e. a transaction record).
  • Devices are able to independently calculate and use the reputation of other actors in the system.
  • Devices rely on protecting themselves from other devices rather than a system preventing bad things from happening.

I'm also struck that other factors, like allegiance, might be important, but I'm not sure how at the moment. Provenance and reputation might be general enough to take those things into account.

Trustworthy Spaces

A trustworthy space is an abstract extent within which a group of agents interact, not a physical room or even geographic area. It is trustworthy only to the extent an individual agent deems it so.

In a system of independent agents, trustworthiness is an emergent property of the relationships among a group of devices. Let's unpack that.

When I say "trustworthiness," that doesn't imply a relationship is trustworthy. The trustworthiness might be zero, meaning it's not trusted. When I say "emergent," I mean that this is a property that is derived from other attributes of the relationship.

Trustworthy spaces don't prevent bad things from happening, any more than we can keep every bad thing from happening in social interactions. I think it's important to distinguish safety from security. We are able to evaluate security in relatively static, controlled situations. But usually, when discussing interactions between independent agents, we focus on safety.

There are several properties of trustworthy spaces that are important to their correct functioning:

Decentralized

By definition, a trustworthy space is decentralized because the agents are independent. They may be owned, built, and operated by different entities and their interactions cross those boundaries.

Event-Driven

A trustworthy space is populated by independent agents. Their interactions with one another will be primarily event-driven. Event-based systems are more loosely coupled than other interaction methodologies. Events create a networked pattern of interaction with decentralized decision making. Because new players can enter the event system without others having to give permission, be reconfigured, or be reprogrammed, event-based systems grow organically.
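
A tiny sketch shows why this matters for growth. The event bus below is a hypothetical, in-process stand-in (real things would be talking over a network), and the event name is made up.

```python
from collections import defaultdict


class EventBus:
    """Routes events to whoever has subscribed; senders never name receivers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def raise_event(self, event_type, attrs):
        """Deliver to every current subscriber; returns how many received it."""
        handlers = list(self._subscribers.get(event_type, []))
        for handler in handlers:
            handler(attrs)
        return len(handlers)


bus = EventBus()
seen = []
bus.subscribe("power:price_changed", lambda e: seen.append(("air_conditioner", e["price"])))
bus.raise_event("power:price_changed", {"price": 0.12})

# The car joins later. Neither the sender nor the air conditioner changes.
bus.subscribe("power:price_changed", lambda e: seen.append(("car", e["price"])))
bus.raise_event("power:price_changed", {"price": 0.08})
```

Whoever raises the event doesn't know or care who is listening, which is what lets a new participant show up without anyone else being reconfigured.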

Robust

Trustworthy spaces are robust. That is, they don't break under stress. Rather than trying to prevent failure, systems of independent agents have to accept failure and be resilient.

In designed systems we rely on techniques such as transactions to ensure that the system remains in a consistent state. Decentralized systems rely on retries, compensating actions, or just plain giving up when something doesn't work as it should. We have some experience with this in distributed systems that are eventually consistent, but that's just a start at what the IoT needs.

Inconsistencies will happen. Self-healing is the process of recognizing inconsistent states and taking action to remediate the problem. Internal monitoring by the system of anything that might be wrong and then taking corrective action has to be automatic.
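
As one illustration of retries and compensating actions, here's a minimal sketch. The function name and the shape of the `steps` argument are assumptions made for the example; it is one simple way to structure the idea, not a prescription.

```python
def run_with_compensation(steps, max_retries=2):
    """Run each step in order, retrying failures.

    `steps` is a list of (action, compensate) pairs of zero-argument callables.
    If a step still fails after its retries, the compensating actions for the
    steps that already succeeded run in reverse order and the function gives up.
    Returns True if every step succeeded, False otherwise.
    """
    completed = []
    for action, compensate in steps:
        for attempt in range(max_retries + 1):
            try:
                action()
                completed.append(compensate)
                break
            except Exception:  # a sketch; real code would catch narrower errors
                if attempt == max_retries:
                    for undo in reversed(completed):
                        undo()
                    return False
    return True
```

There's no transaction holding everything together here. The system gets back to a consistent state by explicitly undoing what it already did.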

Antifragile

More than robustness, antifragility is the property systems exhibit when they don't just cope with anomalies, but instead thrive in their presence. Organic systems exhibit antifragility; they get better when faced with random events.

IoT devices will operate in environments that are replete with anomalies. Most anomalies are not bad or even errors. They're simply unexpected. Antifragility takes robustness to the next level by not merely tolerating anomalous activity, but using it to adapt and improve.

I don't believe we know a lot about building systems that exhibit antifragility, but I believe that we'll need to develop these techniques for a world with trillions of connected things.

Trust Building

Trust building will be an important factor in trustworthy spaces. Each agent must learn which other agents to trust and to what level. These calculations will be constantly adjusted. Trust, reputation, and reciprocity (interaction) are linked in some very interesting ways. Consider the following diagram from a paper by Mui et al. entitled A Computational Model of Trust and Reputation:

The relationship between reputation, trust, reciprocity, and social benefit

We define reputation as the perception about an entity's intentions and norms that it creates through past actions. Trust is a subjective expectation an entity has about another's future behavior based on the history of their encounters. Reciprocity is a mutual exchange of deeds (such as favor or revenge). Social benefit or harm derives from this mutual exchange.

If you want to build a system where entities can trust one another, it must support the creation of reputations since reputation is the foundation of trust. Reputation is based on several factors:

  • Provenance: the history of the agent, including a "chain of custody" that says where it's been and what it's done in the past, along with attributes of the agent, verified and unverified.
  • Reciprocity: the history of the agent's interaction with other agents. A given agent knows about its interactions and the outcomes. To the extent they are visible, interactions between other agents can also be used.

Reputation is not static and it might not be a single value. Moreover, reputation is not a global value, but a local one. Every agent continually calculates and evaluates the reputation of every other agent. Transparency is necessary for the creation of reputation.
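
To illustrate reputation as a local, continually updated value, here's a deliberately simple sketch. It is not the model from the Mui et al. paper; I've swapped in a plain exponentially weighted average of interaction outcomes, and the prior, weight, and threshold values are arbitrary assumptions.

```python
class Agent:
    """Keeps its own view of every other agent's reputation."""

    def __init__(self, name, prior=0.5, weight=0.3):
        self.name = name
        self._prior = prior      # score assumed for an agent never seen before
        self._weight = weight    # how strongly the latest outcome counts
        self._reputation = {}    # other agent's name -> score in [0, 1]

    def record_interaction(self, other, outcome):
        """Fold in one outcome: 1.0 for a good interaction, 0.0 for a bad one."""
        old = self._reputation.get(other, self._prior)
        self._reputation[other] = (1 - self._weight) * old + self._weight * outcome

    def reputation_of(self, other):
        return self._reputation.get(other, self._prior)

    def trusts(self, other, threshold=0.6):
        return self.reputation_of(other) >= threshold
```

Each agent holds its own table, so the car's reputation with the air conditioner can differ from its reputation with the entertainment system. There is no global score to look up.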

A Platform for Exploring Social Things

A few weeks ago I wrote about persistent compute objects, or picos. In the introduction to that piece, I write:

Persistent Compute Objects, or picos, are tools for modeling the Internet of Things. A pico represents an entity—something that has a unique identity and a long-lived existence. Picos can represent people, places, things, organizations, and even ideas.

The motivation for picos is to design infrastructure to support the Internet of Things that is decentralized, heterarchical, and interoperable. These three characteristics are essential to a workable solution and are sadly lacking in our current implementations.

Without these three characteristics, it's impossible to build an Internet of Things that respects people's privacy and independence.

Picos are a perfect platform for exploring social products. They come with all the necessary infrastructure built in. Their programmability makes them flexible enough and powerful enough to demonstrate how social products can interact through reputation to create trustworthy spaces.

Benefits of Social Things

Social things, interacting with each other in trustworthy spaces, offer significant advantages over static networks of devices:

  • Less configuration and set up time since things discover each other and set up mutual interactions on their own.
  • More freedom for people to buy devices from different manufacturers and have them work together.
  • Better overall protection from anomalies, perhaps even systems of devices that thrive in their presence.

Social things are a way of building a true Internet of Things instead of a CompuServe of Things.


My thoughts on this topic were influenced by a CyDentity workshop I attended last week put on by the Department of Homeland Security at Rutgers University. In particular, some of the terminology, such as "provenance" and "trustworthy spaces," were things I heard there that gelled with some of my thinking on reputation and social things.


Choosing a Car for its Infotainment System

Recently when I've rented cars I've increasingly asked for a Ford. Usually a Ford Fusion.

It's true that I like Fords, but that's not why I ask for them when renting. I'm more concerned about a consistent user experience in the car's infotainment system.

I have a 2010 F-150 that has been a great truck. I wrote about the truck and its use as a big iPhone accessory when I first got it. The truck is equipped with Microsoft Sync and I use it a lot.

I don't know if Sync is the best in-car infotainment system or not. First, I've not extensively tried others. Second, car companies haven't figured out that they're really software companies, so they don't regularly update their systems. I've reflashed the firmware in my truck a few times, but I never saw any significant new features.

Even so, when faced with a rental car, I'd rather get something that I know how to use. Sync is familiar, so I prefer to rent cars that have it. I get a consistent, known user experience that allows me to get more out of the vehicle.

What does this portend for the future? Will we become more committed to the car's infotainment system than we are to the brand itself? Ford is apparently ditching Sync for something else. Others use Apple's system. At CES this past January there were a bunch of them. I'm certain there's a big battle opening up here and we're not likely to see resolution anytime soon.

Car manufacturers don't necessarily get that they're being disrupted by the software in the console. And those that do aren't necessarily equipped to compete. Between the competition in self-driving cars, electric vehicles, and infotainment systems, car manufacturers are in a pinch.


API Management and Microservices

beehives

At Crazy Friday (OIT's summer developer workshop) we were talking about using the OAuth client-credential flow to manage access to internal APIs. An important part of enabling this is to use standard API management tools (like WSO2) to manage the OAuth credentials as well as access.

We went down a rabbit hole with people trying to figure out ways to optimize that: "Surely we don't need to check each time?!", "Can we cache the authorization?", and so on. We talked about not trying to optimize too soon, but that misses the bigger point.

The big idea here isn't using OAuth for internal APIs instead of some ad hoc solution. The big idea is to use API management for everything. We have to ensure that the only way to access service APIs is through the managed API endpoint. API management is about more than just authentication and authorization (the topic of our discussion on Friday).

API management handles discovery, security, identity, orchestration, interface uniformity, versioning, traffic shaping, monitoring, and metering. Internal APIs, even those between microservices, need those just as badly as external APIs. No service should have anything exposed that isn't managed. Otherwise we'll never succeed.

I can hear the hue and cry: "This is a performance nightmare!!" Of course, many of us said the same thing about object-relational mapping, transaction monitors, and dozens of other tools that we accept as best practices today. We were wrong then and we'd be wrong now to throw out all the advantages of API management for what are, at present, hypothetical performance problems. We'll solve the performance problems when they happen, not before.

But what about building frameworks to do the things we need without the overhead of the management platform? Two big problems: First, we want to use multiple languages and systems for microservices. This isn't a monolith and each team is free to choose their own. We can't build the framework for every language that comes along and we don't want to lose the flexibility of teams using the right tools for the job.

Second, and more importantly, if we use a standard API management tool, any performance problems we experience will also be seen by other customers. There will be dozens, even hundreds, of smart people trying to solve the problem. Using a standard tool gives us the advantage of having all the smart people who don't work for us worried about it too.

If there's anything we should have learned from the last 15 years, it's that standard tooling gives us tremendous leverage to do things we'd never be able to do otherwise. Consequently, regardless of any potential performance problems, we need to use API management between microservices.


Errors and Error Handling in KRL

fail

Errors are events that say "something bad happened." Conveniently, KRL is event-driven. Consequently, using and handling errors in KRL feels natural. Moreover, it is entirely consistent with the rest of the language rather than being something tacked on. Even so, error handling features are not used often enough. This post explores how error events work in KRL and describes how I used them in building Fuse.

Built-In Error Processing

KRL programs run inside a pico and are executed by KRE, the pico engine. KRE automatically raises system:error events when certain problems happen during execution of a ruleset. These events are raised differently than normal explicit events. Rather than being raised on the pico's event bus by default, they are only raised within the current ruleset.

Because developers often want to process all errors from several rulesets in a consistent way, KRL provides a way of automatically routing error events from one ruleset to another. In the meta section of a ruleset, developers can declare another ruleset that is the designated error handler using the errors to pragma.

Developers can also raise error events explicitly using an error statement in the rule postlude.

Handling Errors in Practice

I used KRL's built-in error handling in building Fuse, a connected-car product. The result was a consistent notification of errors and easier debugging of run-time problems.

Responding to Errors

I chose to create a single ruleset for handling errors, fuse_error.krl, and refer all errors to it. This ruleset has a single rule, handle_error, that selects on a system:error event, formats the error, and emails it to me using the SendGrid module.

Meanwhile, all of the other rulesets in Fuse use the errors to pragma in their meta block to tell KRE to route all error events to fuse_error.krl, like so:

meta {
  ... 
  errors to v1_fuse_errors
  ...
}

This ensures that all errors in the Fuse rulesets are handled consistently by the same ruleset. A few points about generalizing this:

  • There's no reason to have just one rule. You could have multiple rules for handling errors and use the select statement to determine which rules execute based on attributes on the error like the level or genus.
  • There's no requirement that the error be emailed. That was convenient for me, but the rule could send them to online error management systems, log them, whatever.
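
The first point can be modeled outside KRL. The Python below is only a sketch of the idea. It is not KRL and not how KRE dispatches events. Each handler declares the attribute values it cares about, and every handler whose pattern matches the error runs. The attribute names level and genus come from the error event. The handler names, the event shape, and the sample genus value are invented for illustration.

```python
# Hypothetical model of routing error events to different handlers
# based on their attributes. Not KRL; handler names are invented.

def notify_by_email(error):
    return "email: " + error["msg"]

def write_to_log(error):
    return "log: " + error["msg"]

# Each entry pairs a pattern of required attribute values with a handler.
HANDLERS = [
    ({"level": "error"}, notify_by_email),
    ({"level": "warn"}, write_to_log),
    ({"level": "warn", "genus": "http"}, notify_by_email),
]

def route(error):
    """Run every handler whose pattern matches the error's attributes."""
    results = []
    for pattern, handler in HANDLERS:
        if all(error.get(key) == value for key, value in pattern.items()):
            results.append(handler(error))
    return results
```

In this model a warning with genus "http" is both logged and emailed, while a plain warning is only logged. That is the kind of split that multiple rules with different select statements would give you.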

Raising Errors

As mentioned above, the system automatically raises errors for certain things like type mismatches, undefined functions, invalid operators, and so on. These are great for alerting you that something is wrong, although they don't always contain enough information to fix the problem. More on that below.

I also use explicit error statements in the rule postlude to pass on erroneous conditions in the code. For example, Fuse uses the Carvoyant API. Consequently, the Fuse rulesets make numerous HTTP calls that sometimes fail. KRL's HTTP actions can automatically raise events upon completion. An http:post() action, for example, will raise an http:post event with attributes that include the response code (as status_code) when the server responds.

Completion events are useful for processing the response on success and handling the error when there is a problem. For example, the following rule handles HTTP responses when the status code is 4XX or 5XX:

rule carvoyant_http_fail {
  select when http post status_code re#([45]\d\d)# setting (status)
           or http put status_code re#([45]\d\d)# setting (status)
           or http delete status_code re#([45]\d\d)# setting (status) 
  pre {
  ... // all the processing code
  }
  event:send({"eci": owner}, "fuse", "vehicle_error") with
    attrs = {
          "error_type": returned{"label"},
          "reason": reason,
          "error_code": errorCode,
          "detail": detail,
          "field_errors": error_msg{["error","fieldErrors"]},
          "set_error": true
         };
  always {
    error warn msg
  }
}

I've skipped the processing that the prelude does to avoid too much detail. Note three things:

  1. The select statement handles failures from several HTTP methods as a group. If there were reasons to treat them differently, you could have different rules do different things depending on the HTTP method that failed, the status code, or even the task being performed.
  2. The action sends the fuse:vehicle_error event to another pico (in this case the fleet) so the fleet is informed.
  3. The postlude raises a system:error event that will be picked up and handled by the handle_error rule we saw in the last section.

This rule has proven very useful in debugging connection issues that tend to be intermittent or specific to a single user.
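
To see what the pattern in the select statement accepts, here is the same regular expression exercised in Python. This is a sketch, not KRL. It assumes the match is unanchored (as with re.search) and that the status code can be treated as a string. The function name failed_status is made up for the example.

```python
import re

# Same pattern as in the rule's select statement.
FAILURE = re.compile(r"([45]\d\d)")

def failed_status(status_code):
    """Return the captured status if it is 4XX or 5XX, else None."""
    match = FAILURE.search(str(status_code))
    return match.group(1) if match else None
```

Under those assumptions, failed_status(404) and failed_status("503") return the code, while 200 and 302 return None. Because the pattern is not anchored, it relies on status codes being exactly three digits.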

Using Explicit Errors to Debug

I ran into a type mismatch error for some users when a fuse:new_trip event was raised. I would receive, automatically, an error message that said "[hash_ref] Variable 'raw_trip_info' is not a hash" when the system tried to pull a new trip from the Carvoyant API. The error message doesn't have enough detail to track down what was really wrong. The message could be a little better (tell me what type it is, rather than just saying it is not a hash), but even that wouldn't have helped much.

My first thought was to dig into the system and see if I could enrich the error event with more data about what was happening. You tend to do that when you have the source code for the system. But after thinking about it for a few days, I realized that it just wasn't possible to do in a generalized way. There are too many possibilities.

The answer was to raise an explicit error in the postlude to gather the right data. I added this statement to the rule that was generating the error:

error warn "Bad trip pull (tripId: #{tid}): " + raw_trip_info.encode() 
   if raw_trip_info.typeof() neq "hash";

This information was enlightening because I found out that rather than being an HTTP failure disguised as success, the problem was that the trip data was being pulled without a trip ID. As a consequence the API was, quite correctly, giving me a collection rather than a single item.
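
The logic of that error statement is small enough to restate in Python. This is a sketch, not the KRL itself. The names tid and raw_trip_info come from the statement above, and a Python dict stands in for a KRL hash. The function name bad_trip_warning is invented.

```python
import json

def bad_trip_warning(tid, raw_trip_info):
    """Return a warning message when the payload is not a hash, else None."""
    if isinstance(raw_trip_info, dict):
        return None  # a single item came back; nothing to report
    # Anything else (for example a collection) is reported with its contents.
    return "Bad trip pull (tripId: {}): {}".format(tid, json.dumps(raw_trip_info))
```

A single trip produces no warning. A collection, which is what the API returns when no trip ID is supplied, produces a message that carries the whole payload, and that payload is what made the cause visible.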

This pointed back to the rule that raises the fuse:new_trip event. That rule, ignition_status_changed, fires whenever the vehicle is turned on or off. I figured that the trip ID wasn't getting lost in transmission, but rather never getting sent in the first place. Adding this statement to the postlude of that rule confirmed my suspicions:

error warn "No trip ID " + trip_data.encode()  if not tid;

When this error occurred, I got an email with this trip data:

{
  "accountId": "4",
  "eventTimestamp": "20150617T130419+0000",
  "ignitionStatus": "OFF",
  "notificationPeriod": "STATECHANGE",
  "minimumTime": null,
  "subscriptionId": "4015",
  "vehicleId": "13",
  "dataSetId": "25857188",
  "timestamp": "20150617T135901+0000",
  "id": "3530587",
  "creatorClientId": "",
  "httpStatusCode": null
}

Note that there's no tripId, so the follow-on code never saw one either, causing the problem. This wasn't happening universally, just occasionally for a few users.

I was able to add a guard to ignition_status_changed so that it didn't raise a fuse:new_trip event if there were no trip ID. Problem solved.
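
The guard itself is a one-line condition. Here is the idea as a Python sketch, not the actual KRL change. It assumes the trip ID arrives under the key tripId, as the missing field in the payload above suggests. The dictionary used to represent the event is an invented shape.

```python
def new_trip_event(trip_data):
    """Return a fuse:new_trip event, or None when there is no trip ID."""
    tid = trip_data.get("tripId")
    if not tid:
        return None  # guard: nothing to pull without a trip ID
    # Hypothetical event representation, for illustration only.
    return {"domain": "fuse", "type": "new_trip", "attrs": {"tripId": tid}}
```

Called with a payload like the one in the email, which has no tripId, the function returns None and no follow-on event is raised.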

Conclusion

One of the primary tools developers use for debugging is logging. In KRL, the Pico Logger and built-in language primitives like the log statement and the klog() operator make that easy to do and fairly fruitful if you know what you're looking for.

Error handling is primarily about being alerted to problems you may not know to look for. In the case I discuss above, built-in errors alerted me to a problem I didn't know about. And then I was able to use explicit errors to see intermittent problems and capture the relevant data to easily determine the real problem and solve it. Without the error primitives in KRL, I'd have been left to guess, make some changes, and see what happens.

Being able to raise explicit errors allows the developer, who knows the context, to gather the right data and send it off when appropriate. KRL gave me all the tools I needed to do this surgically and consistently.