Advanced Analytics in the Anonymized Data Space: Jeff Jonas


Jeff Jonas gave a great keynote this morning. (Here's a paper from IEEE Security and Privacy that explains some of this.) This afternoon he's adding context. Literally. Contexts allow seemingly unrelated records to become related. The idea is that two records get created in two different data stores, because of some common event, but the common event is unobservable to the organization and the perceptions around that event are not connected.

When the organization queries these data sources to make a decision, the fact that these records are related might not be known. He calls this enterprise amnesia. The answer is a database that creates persistent context that relates these records. The query is done against the persistent context. The context is like the card catalog in a library, serving as an index to records.

A query might be any number of things. If you do not first process every new piece of data (every perception) like a query, you will not know whether it matters...until someone asks. Jeff treats queries as data. When a query is made against the context and gets no response, it's stored as data. Later, if data shows up that matches the query, you get a match. Treating queries like data means you don't have to ask every question every day.

  • Queries find data
  • Data finds queries
  • Data finds data
  • Queries find queries

The last of these puts users with similar interests in touch with one another.
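To make the "queries are data" idea concrete, here is a minimal sketch in Python. It is not Jeff's engine; the `Context` class, the `ingest` method, and the feature tuples are all made up for illustration. The only point is that data and queries go through the same call, and anything that finds no match is remembered so later arrivals can find it.

```python
from collections import defaultdict


class Context:
    """Toy persistent context (hypothetical): every observation,
    whether data or a query, is indexed by its features."""

    def __init__(self):
        # feature -> list of (kind, ident) observations seen so far
        self.index = defaultdict(list)

    def ingest(self, kind, ident, features):
        """Store an observation and return earlier ones sharing a feature."""
        matches = []
        for feature in features:
            for seen in self.index[feature]:
                if seen not in matches:
                    matches.append(seen)
        # Remember this observation, even if nothing matched it.
        for feature in features:
            self.index[feature].append((kind, ident))
        return matches


ctx = Context()

# A query arrives first and finds nothing, but it is stored.
first = ctx.ingest("query", "q1", {("phone", "555-0100")})

# Later, data arrives and finds the stored query (data finds queries).
second = ctx.ingest("data", "r1", {("phone", "555-0100"), ("name", "robert")})

# A second query finds both the data and the earlier query
# (queries find data, queries find queries).
third = ctx.ingest("query", "q2", {("phone", "555-0100")})

print(first, second, third)
```

Nobody had to re-ask `q1` for it to be matched; the arrival of `r1` did the asking.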

In the grand scheme of things, the context allows you to reconstruct the non-observable. More perceptions lead to reduced ambiguity. The time when you're ingesting data is the best time to make discoveries. Jeff calls this perpetual analytics.

Jeff's analytics engine takes queries and data in through the same pipe. No joins, no triggers, no stored procedures. When you discover something new, you fix all the related records. He tells the story of a con man who had six different identities with no overlap. One day the con man introduced a PO Box to the system that allowed all six identities to be tied together.
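The con man story can be sketched with a small union-find structure. This is my own illustration, not how NORA works internally, and the data is invented: I assume each identity carries one distinct phone number, and that the PO Box record happens to carry an attribute from each of the six. The names `EntityResolver`, `observe`, and `entities` are hypothetical.

```python
class EntityResolver:
    """Toy identity resolution (hypothetical): identities that share
    any feature are merged into one entity."""

    def __init__(self):
        self.parent = {}  # identity -> parent identity (union-find forest)
        self.owner = {}   # feature -> an identity known to carry it

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def observe(self, identity, features):
        """Ingest one record and merge it with anything it overlaps."""
        root = self._find(identity)
        for feature in features:
            if feature in self.owner:
                other = self._find(self.owner[feature])
                if other != root:
                    self.parent[other] = root  # fixes every related record at once
            else:
                self.owner[feature] = identity

    def entities(self):
        groups = {}
        for x in self.parent:
            groups.setdefault(self._find(x), set()).add(x)
        return sorted(sorted(group) for group in groups.values())


resolver = EntityResolver()

# Six identities with no overlap (invented data).
for i in range(1, 7):
    resolver.observe(f"id{i}", [f"phone:{i}"])
before = resolver.entities()

# One new record ties them together (assumption: it overlaps each identity).
resolver.observe("po_box", ["pobox:42"] + [f"phone:{i}" for i in range(1, 7)])
after = resolver.entities()

print(len(before), len(after))  # 6 entities before, 1 after
```

The merge happens at ingest time, which is the "make discoveries while the data is arriving" point from earlier.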

Context is "persistent" in the sense that it's not created on the fly by querying federated data sources. Sequence neutrality is crucial, since perceptions may arrive in any order.
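One way to read sequence neutrality is as a testable property: whatever order the records arrive in, the resulting context should be the same. Here is a small sketch that checks this for a toy resolver. The `resolve` function and the records are hypothetical, and a real engine would have to hold this property under far messier matching rules.

```python
from itertools import permutations


def resolve(records):
    """Group record ids that share a feature, directly or transitively."""
    groups = []  # list of (ids, features); feature sets stay pairwise disjoint
    for rec_id, features in records:
        ids, feats = {rec_id}, set(features)
        remaining = []
        for g_ids, g_feats in groups:
            if g_feats & feats:
                # The new record bridges into this group, so absorb it.
                ids |= g_ids
                feats |= g_feats
            else:
                remaining.append((g_ids, g_feats))
        groups = remaining + [(ids, feats)]
    return frozenset(frozenset(ids) for ids, _ in groups)


# Invented records: a-b share a phone, b-c share an address, d stands alone.
records = [
    ("a", {"name:x", "phone:1"}),
    ("b", {"phone:1", "addr:9"}),
    ("c", {"addr:9"}),
    ("d", {"name:z"}),
]

# Ingest the same records in every possible order.
results = {resolve(order) for order in permutations(records)}

print(len(results))  # 1: every arrival order yields the same context
```

Note that `a` and `c` share nothing directly; they only end up together once `b` shows up, whether `b` arrives first, last, or in between.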

The technology Jeff developed (called NORA) is for sharing context within an organization. What if you want to share data with others? For example, the government doesn't want to share its data with the cruise line, and the cruise line doesn't want to share customer information with the government. Can you encrypt the data and analyze it in the encrypted form? Jeff calls this ANNA.

The anonymizer doesn't just hash the data; it first processes it to create rooted forms. For example, Bob, Bobbie, and Rob are all rooted to Robert. This allows the encrypted form to be analyzed and queried for matches. The context then points back to specific records, so you can have a narrow conversation about those records rather than grabbing entire data sets.
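A rough sketch of the root-then-hash idea in Python follows. Jeff didn't say which hash ANNA uses or how keys are handled, so the HMAC-SHA256 choice, the shared key, the tiny nickname table, and the record ids are all my assumptions for illustration.

```python
import hashlib
import hmac

# Hypothetical nickname table; a real one would be far larger.
ROOTS = {"bob": "robert", "bobbie": "robert", "rob": "robert"}

# Assumption: both parties agree on a secret key out of band.
KEY = b"demo-shared-secret"


def root_name(name):
    """Normalize a name to its rooted form before hashing."""
    cleaned = name.strip().lower()
    return ROOTS.get(cleaned, cleaned)


def anonymize(name, key):
    """Return a keyed one-way token for the rooted form of the name."""
    return hmac.new(key, root_name(name).encode("utf-8"), hashlib.sha256).hexdigest()


# Each side anonymizes its own records locally (invented data).
gov_tokens = {anonymize(n, KEY): f"gov-record-{i}"
              for i, n in enumerate(["Bob", "Alice"])}
cruise_tokens = {anonymize(n, KEY): f"cruise-record-{i}"
                 for i, n in enumerate(["Bobbie", "Carol"])}

# Only tokens are compared; a match points back to specific records.
shared = sorted(set(gov_tokens) & set(cruise_tokens))
matches = [(gov_tokens[t], cruise_tokens[t]) for t in shared]

print(matches)  # [('gov-record-0', 'cruise-record-0')]
```

Without the rooting step, "Bob" and "Bobbie" would hash to unrelated tokens and the match would be missed, which is why hashing alone isn't enough.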