Blog | February 26, 2018

What is a model?

BY The Sensa Team

In this post, I want to expand our understanding of the term ‘model’ – the machine learning type, not the Derek Zoolander type. The fundamental task in machine learning is to turn data into models, and that is what this post focuses on.

The dictionary definition of a model is:

“a simplified description, especially a mathematical one, of a system or process, to assist calculations and predictions”

Let’s pick it apart in a slightly different order:

  1. a system or process – our ultimate goal is to emulate a system or process.
  2. simplified description – we want this description to be simple. By ‘simple’, we usually mean compact, so there is an aspect of compression. This is the most important facet of a model: a compact description implies a deeper understanding of the process we are trying to model. In the context of machine learning, compressing data into a model means chipping away at redundancy; the only reason we expect machine learning to work is that we believe the input data contains redundancies that we can exploit to construct this compact description (see the sketch after this list).
  3. especially mathematical – typical models are mathematical, but not all of them are; think of the physical scale models used to understand and prepare for floods. In this post I will focus exclusively on mathematical models, but it is important to recognize the amazing achievements of physical models.
  4. assist calculations and predictions – obviously, we would like to be able to predict, but note that we are happy with mere assistance; compact descriptions that are not predictive are also considered models. The final point to consider is the word ‘prediction’ – personally, I find it more useful to think of a mapping from the space of inputs to the space of outputs, because the word ‘prediction’ carries a vague implication of causality, while most predictive systems are not causal.
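To make the compression point concrete, here is a minimal sketch (the data and the numbers are invented for illustration): a thousand noisy, roughly linear observations can be summarized by just two parameters, a slope and an intercept, and those two numbers are enough to map new inputs to outputs.

```python
import numpy as np

# Invented example: 1,000 noisy observations of a roughly linear process.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=1000)

# The "model" compresses 1,000 (x, y) pairs into two numbers.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope ≈ {slope:.2f}, intercept ≈ {intercept:.2f}")

# Those two numbers are enough to map new inputs to outputs.
print(f"predicted y at x = 7: {slope * 7 + intercept:.2f}")
```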

Types of Models

Algebraic

An equation that describes a system.

As an example, consider Ohm’s Law, which states that

V = IR

This simple, compact description allows us to compute any one of the three quantities given the other two. It fits all points in the definition above. Note that this model is statistical in nature, i.e. we do not know or reason about each individual atom or electron in the system we are measuring; we are happy to measure lower-resolution properties of the system instead.
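As a trivial illustration, the whole model fits in a few lines of code (the function name below is my own); given any two of voltage, current, and resistance, it computes the third:

```python
def ohms_law(voltage=None, current=None, resistance=None):
    """Given any two of V, I, R (leave the third as None), return the missing one via V = I * R."""
    if voltage is None:
        return current * resistance      # V = I * R
    if current is None:
        return voltage / resistance      # I = V / R
    if resistance is None:
        return voltage / current         # R = V / I
    raise ValueError("Leave exactly one argument as None.")

print(ohms_law(current=2.0, resistance=5.0))    # V = 10.0 volts
print(ohms_law(voltage=10.0, resistance=5.0))   # I = 2.0 amps
```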

Combinatorial

A network of connections that describes a system.

There are many systems that are best described as networks, e.g.:

  1. Flowcharts. Flowcharts are compact representations of algorithms and are predictive (in the sense that they map their inputs to a (usually) finite set of outputs).
  2. Process flow diagrams. A process flow diagram (PFD) is a diagram commonly used in chemical and process engineering to indicate the general flow of plant processes and equipment. The PFD displays the relationship between major equipment of a plant facility and does not show minor details such as piping details and designations.
  3. Biological pathways. KEGG is a good example of biological models that are best described as networks.
  4. Topological Models. Topology is the study of shape, and when used to analyze data it almost always relies on or constructs network representations of the data. Simplicial complexes, such as those used for persistent homology calculations, and extended Reeb graphs used for meta-modeling are examples.

Typically, combinatorial models such as these are not thought of as models, but it is important to recognize that this is a very flexible class of models that:

  1. depict systems,
  2. are compact,
  3. are mathematical (since graphs are easily represented as matrices – see the sketch after this list), and
  4. assist calculations and predictions.
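To see the ‘mathematical’ point in action, here is a minimal sketch of a small, made-up network encoded as an adjacency matrix; powers of the matrix then answer questions about the network, such as how many two-step paths connect a pair of nodes.

```python
import numpy as np

# A made-up four-node network (e.g. steps in a tiny flowchart).
nodes = ["start", "validate", "process", "done"]
A = np.array([
    [0, 1, 0, 0],   # start    -> validate
    [0, 0, 1, 0],   # validate -> process
    [0, 1, 0, 1],   # process  -> validate (retry) or done
    [0, 0, 0, 0],   # done     -> (terminal)
])

# (A @ A)[i, j] counts the two-step paths from node i to node j.
two_step = A @ A
print(two_step[nodes.index("start"), nodes.index("process")])   # 1
```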

Software

A piece of software that describes a system.

As an example, consider a rule-processing system. Such systems are used to represent knowledge in specific domains, e.g. most transaction monitoring systems used to prevent money laundering at banks.
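As a miniature sketch of the idea (the rule names and thresholds below are invented, not taken from any real product), such a system is just a collection of hand-written rules that map an input transaction to an output flag:

```python
# A toy rule-processing "model" for transaction monitoring.
# Each rule maps a transaction (the input space) to an alert (the output space).
RULES = [
    ("large_cash_deposit", lambda txn: txn["type"] == "cash" and txn["amount"] > 10_000),
    ("rapid_movement",     lambda txn: txn["pass_through_ratio"] > 0.9 and txn["amount"] > 5_000),
]

def evaluate(txn):
    """Return the names of all rules that the transaction trips."""
    return [name for name, predicate in RULES if predicate(txn)]

print(evaluate({"type": "cash", "amount": 12_000, "pass_through_ratio": 0.2}))
# ['large_cash_deposit']
```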

Considerations

Regularization

We discussed the need for models to be compact earlier. Regularization is a set of methods in machine learning that force models to be as compact as possible. Over and above leading to smaller models, regularization also helps reduce generalization error (i.e. the tendency of a model to produce better results on training data than on new/unseen data).

With four parameters I can fit an elephant, and with five I can make him wiggle his trunk. – John von Neumann
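Here is a minimal sketch of the idea using plain L2 (ridge) regularization on a linear model; the data is synthetic and the penalty strength is an arbitrary choice, but it shows how the penalty shrinks coefficients toward zero and so favours a more compact model.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))                  # 20 features, most of them irrelevant
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(scale=0.1, size=100)

# Ordinary least squares vs. ridge: the lam * I penalty term shrinks
# the coefficients toward zero, favouring a more compact model.
lam = 1.0
w_ols   = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(20), X.T @ y)

print(np.abs(w_ols).sum(), np.abs(w_ridge).sum())   # the ridge weights are smaller overall
```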

Implicit causality

The Fundamental Attribution Error is at play here. In terms of modeling, people treat models as agents, i.e. we imbue a piece of software with intent. The fact that models can be used to predict sometimes lulls us into thinking that models carry some causal information (i.e. that the independent variables cause the dependent variable). This is not the case, and it is why I prefer to think of models as functions that map from the space of inputs to the space of outputs rather than as being ‘predictive’.

Amusingly, I see many economists make this fundamental error.

Okay, so what if you did want to argue about causality? There are ONLY two ways of ever being able to do so:

  1. The data used to construct the model has causality built in. Sometimes the data has a time dependency built in – as an example, think about the relationship between the number of tellers and the average wait time of a customer at a retail store. This is obviously causal, so any model we derive from this data will allow us to argue about cause and effect.
  2. We find a model that does well and then run an experiment in real life. In many biological settings, e.g. biomarker discovery, this is how it is done: find correlations, then run an experiment in real life to figure out whether the correlations were causal or not.

Outside of these two fundamental methods, there is NO way to argue about causality based on a model.
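One way to see why a fitted model alone cannot settle the question (a purely synthetic illustration, with both observed variables driven by a hidden common factor): we can ‘predict’ either variable from the other equally well, so the fit itself carries no arrow of causation.

```python
import numpy as np

rng = np.random.default_rng(2)
temperature = rng.uniform(15, 35, size=500)                            # hidden common cause
ice_cream_sales = 4.0 * temperature + rng.normal(scale=2.0, size=500)
drownings       = 0.2 * temperature + rng.normal(scale=0.5, size=500)

# The two observed variables correlate only through the hidden factor,
# yet a model can "predict" either one from the other equally well.
a, b = np.polyfit(ice_cream_sales, drownings, deg=1)
c, d = np.polyfit(drownings, ice_cream_sales, deg=1)
print(f"drownings ≈ {a:.2f} · sales + {b:.2f}")
print(f"sales ≈ {c:.2f} · drownings + {d:.2f}")
```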

Summary

To summarize: in the context of machine learning, compressing data into a model means chipping away at redundancy. The only reason we expect machine learning to work is that we believe the input data contains redundancies that we can exploit to construct a compact description of the underlying system or process.

All models are wrong but some are useful. – George Box
