Data Science: Theory Vs Practice

John Hawkins
4 min read · Sep 21, 2023

There is an expression you may have heard that offers a subtle value judgement against people who elevate theoretical ideas. Originally formulated by Benjamin Brewster in 1882, it goes something like:

In theory, theory and practice are the same. In practice, they are not.

Those of us who enjoy theoretical ideas tend to dismiss this expression as a kind of intellectual class politics. Practically minded people use it to express disdain for those who put theory on a pedestal. In this post I want to explore the idea that even if you appreciate the insights generated by theoretical work, there are good reasons to maintain an attitude of scepticism toward it, particularly when using theoretical results to make decisions and recommendations in the practice of data science. Let me begin with a war story taken from the working life of one of my colleagues.

This colleague was interested in the extent to which applying business rules to limit the number of potential customers in a marketing dataset would reduce the size of the set. While working on this problem he engaged the rest of our team in a number of lunchtime discussions. We debated the general nature of these reductions and concluded that, in general, you should expect the surviving fraction of customers to be smaller than ½ to the power of N, where N is the number of rules. The reason is that the product of N real numbers drawn uniformly between 0 and 1 is typically far below ½ to the power of N: although the expected value of the product is exactly (½)^N, the distribution of the product is so skewed that most outcomes fall well beneath it. This is an odd and slightly counterintuitive result, but one that we derived great pleasure in finding different ways to prove. I have long since lost the scraps of paper that contained those proofs, but you can play with the code below to simulate the result.

import math
import random
import numpy as np

# Set a value for N (the number of rules, i.e. uniform draws per product)
N = 20
# Set the number of samples per simulation
S = 1000
# Set the number of simulations
Z = 1000

limit = math.pow(0.5, N)
outcomes = np.zeros(Z)
for z in range(Z):
    products = np.ones(S)
    for p in range(S):
        for n in range(N):
            products[p] = products[p] * random.uniform(0, 1)
    # Record whether the mean product fell below the 1/2^N limit
    if products.mean() < limit:
        outcomes[z] = 1

print("Proportion of times that the mean product was below the limit:")
print(f" {outcomes.mean()}")

As you run the code above you will observe that, as you increase the value of N, the proportion of simulations where the mean product falls below the limit increases. The reason is that the logarithm of the product is a sum of N independent terms, each with mean −1, so the typical product is around e^−N, far below (½)^N ≈ e^−0.693N; the expectation of exactly (½)^N is only propped up by rare, unusually large products in the right tail, which a finite sample will usually miss.
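For a quick sanity check on that explanation (this snippet is my own addition, not part of the original discussion), you can compare the empirical mean and median of the product directly:

import math
import numpy as np

N = 20
rng = np.random.default_rng(0)

# 100,000 products, each the product of N uniform draws
products = rng.uniform(0, 1, size=(100_000, N)).prod(axis=1)

print(f"Mean product:   {products.mean():.3e} (theory (1/2)^N: {0.5 ** N:.3e})")
print(f"Median product: {np.median(products):.3e} (approx e^-N: {math.exp(-N):.3e})")

The mean hovers around (½)^N while the median sits several orders of magnitude lower, near e^−N, which is why the simulated averages fall below the limit more and more often as N grows.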

I later learned that the same colleague had used this result in a business meeting to explain the expected effects of applying multiple arbitrary business rules to the universe of potential customers. He provided the proofs and was met with scepticism and distrust. He reported that the experience was ultimately a negative interaction with his stakeholders, and he came away somewhat puzzled and discouraged.

I want to draw several lessons from this story.

The first is that theoretical results are difficult to communicate. My colleague could have relayed the contents of his proof perfectly, and the stakeholders would still likely not have understood it, meaning they would have had to take his word for it. This is what many data scientists believe their managers and clients should do: simply believe what they say. That may be reasonable in situations where you have a long working history with someone, but data scientists are often external consultants or relatively new employees in an organisation. It is simply too much to expect a client to believe what you say on the basis of an arcane set of symbols they do not understand, especially when it contradicts their experience and intuition.

The second point is that theoretical results are only as reliable as the assumptions upon which they are based. If those assumptions fail, which they often do, then the theoretical understanding may be worthless. In this particular story, the assumption is that the rules produce retention ratios drawn uniformly between 0 and 1. If this does not hold, and the ratios are in fact drawn from a skewed distribution with values generally closer to 1, then the result is invalid. It is entirely possible that the stakeholder team had in mind a set of rules that was not random at all, but would in fact carve off only a small proportion of customers with each rule.
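To see how fragile the result is, here is a sketch of my own that repeats the simulation with skewed ratios, using a Beta(5, 1) distribution as a stand-in for rules that each remove only a small slice of customers:

import numpy as np

# Same simulation as before, but the ratios are now drawn from a
# Beta(5, 1) distribution, which is skewed toward 1: each rule
# keeps most of the remaining customers.
N = 20
S = 1000
Z = 1000

rng = np.random.default_rng(1)
limit = 0.5 ** N
outcomes = np.zeros(Z)
for z in range(Z):
    # Each row is one product of N skewed retention ratios
    products = rng.beta(5, 1, size=(S, N)).prod(axis=1)
    if products.mean() < limit:
        outcomes[z] = 1

print("Proportion of simulations below the 1/2^N limit:")
print(f" {outcomes.mean()}")

With ratios concentrated near 1 the mean product essentially never drops below the (½)^N limit: the expected surviving fraction is (5/6)^N, roughly 2.6% for N = 20, many orders of magnitude above (½)^20 ≈ 10^−6.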

A more reasonable approach to this situation is to request actual numerical examples from several real cases and look at how they behave. Do they appear to come from a skewed distribution? How does their combined application affect the universe of potential customers? Exploring multiple real examples gives you the ammunition either to demonstrate to your client how the theory plays out in practice, or to show yourself whether or not the theory is informative for the real situation you are working with. A minimal version of that check is sketched below.
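Here is one way it might look, assuming you can measure a retention ratio for each rule as it is applied; the numbers below are invented for illustration:

import numpy as np

# Hypothetical retention ratios measured from real rule applications:
# the fraction of remaining customers kept after each rule.
observed_ratios = np.array([0.91, 0.87, 0.95, 0.60, 0.98, 0.89])

# Cumulative fraction of the original universe that survives
retained = np.cumprod(observed_ratios)
# The uniform-assumption yardstick, (1/2)^n after n rules
baseline = 0.5 ** np.arange(1, len(observed_ratios) + 1)

for n, (r, b) in enumerate(zip(retained, baseline), start=1):
    print(f"After {n} rule(s): retained {r:.3f} (uniform-theory yardstick {b:.3f})")

Even a handful of real ratios like these makes it immediately obvious whether the uniform assumption bears any resemblance to your actual rules.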

The ultimate lesson is that the mindset of a data scientist should be scepticism. Your scepticism should be most ruthlessly applied to your own ideas. Any time you find yourself in love with a theoretical result, you should lay bare each and every assumption, and then test them without compromise.
