Ron Kohavi on Data Mining and eCommerce


Today's colloquium was Ron Kohavi from Microsoft Research. His talk was titled: Focus the Mining Beacon: Lessons and Challenges from the World of E-Commerce (PPT). Ron was at Blue Martini Software, where he was responsible for data mining. They developed an end-to-end eCommerce platform with integrated business intelligence spanning collection, ETL, data warehousing, reporting, mining, and visualization.

Later Ron was at Amazon doing the same thing. Again, simple things work (people who bought X bought Y). Human insight is the key--most good features come from human ideas, not extensive analysis. Amazon measures everything. Any change was introduced with a controlled experiment so that they could quantify the value of any change.

Really simple things help customers a lot. Customers want simple stuff. He references an experience at SGI where the Naive Bayes algorithm was what pleased customers the most even though it's one of the simpler machine learning algorithms.

For data mining to work you need:

  • Large amounts of data (lots of records)
  • Rich data with many attributes (wide records)
  • Clean data from reliable collections (no GIGO)
  • Actionable domain (have a real-world impact, experiment)
  • Measurable ROI.

eCommerce is a great domain for using data mining techniques.

Auto-creation of the data warehouse works well if you own both the operational and analysis systems. At Blue Martini, they had a DSSGen process that auto-generated a star-schema data warehouse.
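The talk doesn't describe what DSSGen actually emitted, but a minimal star schema for order data might look like the following sketch (table and column names are my own, illustrated with SQLite):

```python
import sqlite3

# A tiny star schema: one fact table of measures surrounded by
# dimension tables it points at via foreign keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, sku TEXT, category TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, day TEXT, day_of_week TEXT);
CREATE TABLE fact_order_line (
    customer_key INTEGER REFERENCES dim_customer,
    product_key  INTEGER REFERENCES dim_product,
    date_key     INTEGER REFERENCES dim_date,
    quantity     INTEGER,
    revenue      REAL
);
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Alice', 'Provo')")
conn.execute("INSERT INTO dim_product VALUES (1, 'SKU-42', 'Dresses')")
conn.execute("INSERT INTO dim_date VALUES (20051024, '2005-10-24', 'Monday')")
conn.execute("INSERT INTO fact_order_line VALUES (1, 1, 20051024, 2, 59.90)")

# Reports become joins from the fact table out to the dimensions.
row = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_order_line f JOIN dim_product p USING (product_key)
    GROUP BY p.category
""").fetchone()
assert row[0] == 'Dresses' and abs(row[1] - 59.9) < 1e-9
```

The point of the shape is that any business question ("revenue by category by day of week") is one join-and-group-by away.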

Collect business level data from the operational site. Collect search information, shopping cart stats, registration form stats. Don't forget external events (marketing, etc.). Put those in the warehouse as well to correlate them.

You have to avoid "priors." Businesses are amazingly biased and believe that what they know is right. Data can help.

Do you collect form errors on the site? They did this for BlueFly. When they ran the report after the homepage went live, they noticed thousands of form errors. People were putting search terms in an email sign-up box because it looked like a search box and there was no noticeable search box on the page.

Crawl, walk, run. Do basic reporting first. Generate simple reports and simple graphs. Then use OLAP for hypothesis testing, and finally ask characterization questions and use data mining algorithms.

Agree on terminology. For example, how is "Top Seller" defined? Amazon and Barnes & Noble have different definitions. Sales rank can be hard to calculate when you're doing it continuously.

Any statistic that appears interesting is almost certainly a mistake. He gives this example: "5% of customers were born on the same day, including year." This is because lots of people enter 11/11/11 for their birthday when the field is mandatory. Daylight saving time creates a small sales spike in October and a sales dip in April.

Simpson's paradox: if you don't understand this, you can reach mistaken conclusions. He shows the Bob and Ann reviewing papers example. Kidney stone example. Simpson's paradox happens when summed data leads to one conclusion, but when you segment the data you get the opposite conclusion. This happened in a study of UC Berkeley graduate admissions where the aggregate data showed a greater percentage of men than women were accepted, but when the data was segmented by department, most departments admitted a greater percentage of women than men. The key is understanding that the segmenting variable interacts with "success" and with the counts. This is non-intuitive. Here's a formulation:

if a/b < A/B and c/d < C/D,
then it's possible that (a+c)/(b+d) > (A+C)/(B+D)
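The kidney stone example he mentions makes this concrete. Here's a quick check in Python (numbers are from the well-known Charig et al. kidney stone study, not from the talk):

```python
# Simpson's paradox: treatment A wins in each segment,
# yet treatment B looks better in the aggregate.

small = {"A": (81, 87),  "B": (234, 270)}  # (successes, patients), small stones
large = {"A": (192, 263), "B": (55, 80)}   # large stones

def rate(successes, total):
    return successes / total

# A beats B within each segment...
for seg in (small, large):
    assert rate(*seg["A"]) > rate(*seg["B"])

# ...but summing the segments flips the conclusion.
agg_a = (small["A"][0] + large["A"][0], small["A"][1] + large["A"][1])
agg_b = (small["B"][0] + large["B"][0], small["B"][1] + large["B"][1])
assert rate(*agg_a) < rate(*agg_b)
print(f"A overall: {rate(*agg_a):.0%}, B overall: {rate(*agg_b):.0%}")
```

The flip happens because the segmenting variable (stone size) is correlated with both the treatment chosen and the success rate, exactly the interaction described above.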

Simpson's paradox happens in real life. During knowledge discovery, you can state correlations and associate them with causality, but you have to look for confounding variables. This can be complicated because the confounding variable may not be one you're collecting. Look for statements about confounding variables. Also, with controlled experiments that split the population randomly, you don't get the paradox.

On the Web, you can't run experiments on sequential days. You can't use IP to split the population (load-balancer randomization) because of proxy servers. Every customer must have an equal chance to fall into either population.

  • Duration: only measure short term impact
  • Primacy effect: changing navigation in a web site may degrade customer experience, even if the new navigation is better.
  • Multiple experiments: on a large site, you might have multiple experiments running in parallel. Scheduling and QA are complex.
  • Contamination: assignment is usually cookie based, but people may use multiple computers.
  • Normal distributions are rare (97% of customers don't purchase, leading to a skew toward zero).
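The talk doesn't say how Amazon implemented assignment beyond it being cookie based, but a common way to give every customer an equal chance while keeping assignment sticky across visits is to hash a cookie ID together with an experiment name (the salt also keeps parallel experiments independent). A sketch:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, n_variants: int = 2) -> int:
    """Deterministically bucket a user: the same cookie always lands in
    the same variant, and buckets come out approximately uniform."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_variants

# Assignment is sticky: the same cookie gets the same variant every visit.
assert assign_variant("cookie-123", "checkout-test") == \
       assign_variant("cookie-123", "checkout-test")

# A large population splits roughly 50/50.
buckets = [assign_variant(f"user-{i}", "checkout-test") for i in range(10_000)]
share = sum(buckets) / len(buckets)
assert 0.48 < share < 0.52
```

Note this doesn't solve the contamination bullet above: a person on two computers still carries two cookies and can land in both populations.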

Auditing data is important to make sure it's clean and you get good results:

  • Make sure time series data exists for the whole period. It's easy to conclude that this week was bad relative to last week because some data is missing.
  • Synchronize the clocks from all collection points. Make sure all servers are set to GMT.
  • Remove test data. The QA organization isn't using the system in ways consistent with customers.
  • Remove bot traffic. 5-40% of site traffic can come from search bots in some periods. These can significantly skew results.
  • Utilize hierarchies. Generalizations are hard to find when there are many attribute values. When you have 20 million SKUs, you may not see many trends. Generalize product categories.
  • Remember date and time attributes. Look for time of day correlations. Compute deltas between such attributes (e.g. ship date minus order date).
  • Mine at the right granularity level. Aggregate clickstreams, purchases, and other information to the customer level.
  • Phrase the problem to avoid leaks. A leak is an attribute that "gives away" the label. E.g. heavy spenders pay more sales tax. Phrasing the problem to avoid leaks is a key insight: instead of asking who is a heavy spender, ask which customers migrate from spending a small amount in period one to a large amount in period two.
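As a sketch of the reframing in that last bullet: the label is computed only from period-two spend, and features are drawn only from period one, so period-two artifacts like sales tax paid can't leak into the model. Thresholds and field names here are my own, not from the talk:

```python
from dataclasses import dataclass

LOW, HIGH = 50.0, 500.0  # illustrative spend thresholds, not from the talk

@dataclass
class Customer:
    period1_spend: float  # features may only come from this period
    period2_spend: float  # the label may only come from this period

def migration_label(c: Customer) -> bool:
    """True if a small spender in period one became a heavy spender
    in period two -- the 'migration' question, not 'who is heavy'."""
    return c.period1_spend < LOW and c.period2_spend >= HIGH

customers = [Customer(20, 800), Customer(20, 30), Customer(900, 950)]
assert [migration_label(c) for c in customers] == [True, False, False]
```

The already-heavy spender in the example is correctly excluded; predicting that someone who spends a lot will keep spending a lot is not an actionable result.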

Picking the right visualization is key to seeing patterns. A heatmap cross-referencing date by day of week will show anomalies in purchases more readily than the classic "purchases by day" kind of graph.
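For instance, here's a minimal text version of that week-by-day-of-week pivot, with made-up purchase counts. An odd day is compared against the other days in its column rather than buried in a long time series:

```python
from collections import defaultdict
from datetime import date, timedelta
import random

# Hypothetical daily purchase counts for four weeks.
random.seed(0)
start = date(2005, 10, 3)  # a Monday
purchases = {start + timedelta(days=i): random.randint(80, 120) for i in range(28)}
purchases[date(2005, 10, 12)] = 5  # an anomaly a line chart can bury

# Pivot into week rows x day-of-week columns: each cell is one day's
# total, so an odd Wednesday stands out against the other Wednesdays.
grid = defaultdict(dict)
for d, count in purchases.items():
    _, iso_week, iso_day = d.isocalendar()
    grid[iso_week][iso_day] = count

for week in sorted(grid):  # columns run Monday (1) through Sunday (7)
    row = " ".join(f"{grid[week].get(day, 0):4d}" for day in range(1, 8))
    print(f"week {week}: {row}")
```

With color instead of digits this is the heatmap he describes; the pivot is the part that makes the anomaly pop.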

UI Tweaks. Small changes to a UI make a large difference. He gives an example from Microsoft help. Changing the "was this helpful" prompt from a "yes/no" answer to "five stars" dropped the response rate by 3.5 times. Another example is the checkout page. There's a factor of 10 difference in conversion rate on a checkout page when you add an "enter a coupon code" box. People think they should go see if they can get a coupon somewhere and abandon the cart.

One challenge is finding ways to map business questions to data transformations. SQL designers thought they were making it easy for business people to interact with databases. Explaining models to users is difficult. How can you make models more comprehensible? Slowly changing dimensions are hard. Customer attributes drift over time. Think about making recommendations for maternity dresses. Also, products change. Detecting robots and spiders is difficult. There are heuristics, but they're far from perfect.

Ron finished with a few quotes: "One accurate measurement is worth a thousand expert opinions" (ADM Grace Hopper) and "Not everything that can be counted counts and not everything that counts can be counted." (Albert Einstein)