Data Talk: Five Concepts About Predictive Analytics
Here are five conceptual building blocks for having a conversation about predictive analytics. These building blocks raise some issues, but they don't take a stand on them. They don't answer every question or set a particular direction; instead, they suggest good questions to ask and ideas to pause on before proceeding.
1. Throw Out the Regression Assumptions (and Their Protections)
Do you remember when you took Statistics 101 and your professor told you that "correlation is not causation"? You can forget about that. And those pesky 0.05 significance levels? A thing of the past. And you don’t have to worry anymore about any of those other hard-to-pronounce difficulties like “homoscedasticity” or “multicollinearity.”
Predictive analytics is freeing, flexible, and powerful.
The downside, though, is that you don’t get any of the benefits that come with the regression assumptions. With predictive analytics, you really can’t tell the difference between a simple correlation and a true causal relationship. And you can’t know if your findings are really true, or just randomness in your sample.
2. Learning: Supervised or Unsupervised?
Predictive analytics comes in two flavors: with and without outcomes.
Supervised learning trains predictive analytics models to spot when an outcome is more likely to occur. These models typically sort through different potential predictor variables, looking for the variables that can be combined to make predictions about which cases are most likely to have (or not have) an outcome. This type of learning is based on reflecting upon previous trends to find groupings, patterns, and associations, and then applying this knowledge to new cases or examples.
In contrast, unsupervised learning trains the algorithm to spot common patterns and combinations. Unsupervised learning identifies similar attributes to establish clusters or groupings. For instance, unsupervised learning could classify animals into different species, but it does not tell us anything about those species. Unsupervised learning can tell you that supermarket customers who buy gourmet mustard also tend to buy gourmet cheese, but tend not to buy baby diapers.
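To make the contrast concrete, here is a minimal Python sketch; all of the data, labels, and thresholds are invented for illustration. The supervised half "trains" on cases with known outcomes and predicts the outcome of the most similar past case; the unsupervised half has no outcome labels at all and simply counts which items appear together.

```python
from collections import Counter
from itertools import combinations

# --- Supervised: past cases come with a known outcome label. ---
# (Hypothetical shopping data, invented for illustration.)
labeled_cases = [
    ({"tent": True, "boots": True}, "bought_sleeping_bag"),
    ({"tent": True, "boots": False}, "bought_sleeping_bag"),
    ({"tent": False, "boots": True}, "no_purchase"),
]

def predict(new_case, cases):
    """Return the outcome of the most similar labeled case (1-nearest neighbor)."""
    def overlap(a, b):
        return sum(a[k] == b[k] for k in a)  # count matching attributes
    return max(cases, key=lambda c: overlap(new_case, c[0]))[1]

print(predict({"tent": True, "boots": True}, labeled_cases))
# bought_sleeping_bag

# --- Unsupervised: no outcome labels, just baskets of items. ---
baskets = [
    {"gourmet mustard", "gourmet cheese"},
    {"gourmet mustard", "gourmet cheese", "wine"},
    {"diapers", "wipes"},
]

pair_counts = Counter()
for basket in baskets:
    pair_counts.update(combinations(sorted(basket), 2))

print(pair_counts.most_common(1))
# the mustard-and-cheese pair co-occurs most often
```

The supervised model can only be built because someone recorded which shoppers actually bought a sleeping bag; the unsupervised tally needs no such outcome and just surfaces the mustard-and-cheese pattern.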
3. Data Mining
Closely related to predictive analytics is data mining. Data mining is true to its name: digging through a mountain of dirt (data) looking for a gem (a pattern). Data mining is exploration, searching around and hoping to stumble on something fantastic. Data mining works best when you are comfortable saying that you always find things in the last place you look.
Later this month on NCCD’s blog, analyst Andrea Bogie will explore the difference between modeling with found data versus collected data—the difference between scraping up a meal from what you can find in the cabinet versus buying the ingredients you really need for the recipe you really want to make.
Data mining can show us a myriad of things, but we have to decide whether to bend practice to match what was found or be intentional about our analytics to ensure they meet real practice needs.
4. Three Common Model Outputs
Predictive analytics models have three common forms: trees, checklists, and “black boxes.”
Decision trees can be a visually appealing way to portray a model. In each tree, every case starts at the root; at the first branch, some cases split off according to some criterion, and at the next branch, others split again. Branches divide into ever smaller branches, with fewer and fewer cases going down each one.
Tree models are especially good at accounting for complex relationships, interdependencies, and nonlinear combinations.
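A small decision tree can be written out directly as nested conditionals. This sketch uses invented auto-insurance branching criteria and risk labels purely for illustration:

```python
# A hand-built decision tree as nested conditionals.
# Branch criteria and risk labels are hypothetical, for illustration only.
def classify(case):
    """Route a case from the root down the branches to a leaf label."""
    if case["past_accidents"] >= 2:            # first branch
        if case["miles_per_year"] > 15000:     # smaller branch
            return "higher risk"
        return "moderate risk"
    else:
        if case["age"] < 25:                   # another branch
            return "moderate risk"
        return "lower risk"

print(classify({"past_accidents": 3, "miles_per_year": 20000, "age": 40}))
# higher risk
```

Note how the second question asked depends on the answer to the first: that conditional structure is what lets trees capture interdependencies and nonlinear combinations.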
Another common model form is a checklist. Checklists assign different weights to particular factors or variables to predict an outcome. Checklists are the form most actuarial models take. For example, factors that can help predict an auto insurance claim might be owning a hot rod car, a history of past accidents, and living in an area with poorly maintained roads. The model could predict that someone with two or three of these factors would be more likely to have a future insurance claim than someone with none or one of these factors.
Checklist models are especially intuitive and useful when clarity is a priority.
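The auto insurance example above can be sketched as a checklist in a few lines of Python; the factors, weights, and cutoff are invented for illustration:

```python
# A checklist model: each present factor contributes a weight, and the total
# score maps to a prediction. Factors, weights, and the cutoff are hypothetical.
WEIGHTS = {
    "owns_hot_rod": 1,
    "past_accidents": 1,
    "poorly_maintained_roads": 1,
}

def checklist_score(case):
    """Sum the weights of the factors present for this case."""
    return sum(w for factor, w in WEIGHTS.items() if case.get(factor))

def prediction(case):
    # Two or more factors -> predicted more likely to have a future claim.
    return "more likely to claim" if checklist_score(case) >= 2 else "less likely to claim"

driver = {"owns_hot_rod": True, "past_accidents": True, "poorly_maintained_roads": False}
print(checklist_score(driver), prediction(driver))
# 2 more likely to claim
```

The entire model is visible in the weights table and the cutoff, which is exactly why checklist models are easy to explain and audit.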
A third common predictive analytics model form is a black box.
A model is called a black box when the computer algorithm making the prediction is so complex it cannot easily be interpreted. This type of modeling can be successful at accurately assigning likelihood scores, but the reasoning behind how it works may not be clear.
Whether this type of modeling is useful depends on whether transparency in decision making is necessary. For instance, it may work well for market research or for identifying potential purchases, but it might be less desirable in social service delivery decisions.
5. False Positives
All your favorite websites and apps use predictive analytics, often to predict buying behavior. For instance, if you put a tent in your shopping cart, a website may suggest you buy a sleeping bag too because other consumers who bought a tent often also bought a sleeping bag. Or, a website might recommend a movie to you based on your data, your past viewing history, or what other people watched.
With any predictive model, there will be some amount of error. False positives are an error type that occurs when a model incorrectly predicts a positive result for the outcome being tested—like a sleeping bag that you don't want to buy or a movie that you didn’t enjoy.
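Counting false positives is a simple comparison of predictions against what actually happened. The predictions and outcomes below are invented for illustration:

```python
# A false positive: the model predicted "yes" but the true outcome was "no".
# These prediction/outcome lists are hypothetical, for illustration only.
predicted = [True, True, False, True, False]   # model says "will buy sleeping bag"
actual    = [True, False, False, False, False] # what the customer actually did

false_positives = sum(p and not a for p, a in zip(predicted, actual))
actual_negatives = sum(not a for a in actual)

print(f"false positives: {false_positives} of {actual_negatives} actual negatives")
# false positives: 2 of 4 actual negatives
```

Even this toy model is wrong about half the people who were never going to buy; the question in any field is what happens to those people next.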
There are few repercussions if a consumer is incorrectly assumed to be interested in a sleeping bag or a particular movie. But in the social services and justice fields, the stakes are higher.
False positives are normal in predictive analytics. We should understand the repercussions when false positives can drive things like court dispositions, child protection case plans, or a government agency's assertion of jurisdiction over private lives.