NUMB3RS REVISIT∃D: putting the Science into Data Science


Numb3rs was a TV crime series whose backdrop was a range of themes related to numbers and mathematics. One of its objectives was to promote numeracy among the general public, and it was reviewed by a team of real mathematicians to ensure its accuracy and authenticity. Each episode covered specific mathematical themes. For example, the episode “Traffic” considered how to determine whether something is random or not, together with how to choose road widths to optimize traffic flow.

In the following, we take a ‘shallow dive’ into two types of data-generating systems, “chaotic” and “random”, giving some practical examples (for a deep dive, check the references).

  1. Chaotic

Chaotic systems are often defined in terms of a dynamic environment which has a set of initial conditions as starting point, and which is very sensitive to any change in the initial conditions. For example, the relative position and state of several balls on a billiard table or the starting position of a double pendulum. Although chaotic systems seem dynamically unstable, it can be demonstrated that their event sequences are in fact generated deterministically.

Meteorology and stock markets provide two examples of chaotic systems. Moving averages of the time series are typically used as short-term predictors. However, an apparently small cause can have large effects and “spook” the market or change the evolution of a weather front.
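As a minimal sketch of the moving-average predictor mentioned above, the following Python snippet forecasts the next value of a series as the mean of its most recent observations. The function name, the window size and the price values are all hypothetical and chosen only for illustration.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical daily closing prices, made up for illustration.
prices = [100.0, 101.0, 102.0, 101.0, 103.0, 104.0]
forecast = moving_average_forecast(prices, window=3)
print(forecast)

# A sudden shock (also made up) is not anticipated by an average of past values.
shocked_price = 90.0
forecast_error = abs(shocked_price - forecast)
print(forecast_error)
```

The forecast follows the recent trend, but a single “spooked” value lies far from it, which is why such predictors are only trusted over short horizons.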


Chaos is potentially predictable if there is a stable statistical spread of outcomes over a sufficiently long time period, independent of the starting state. Chaos becomes unpredictable when we lack a long historical record and when the initial state of the system is itself uncertain.

A simple way of modelling chaotic behaviour when the initial state is known is the “Logistic Map”, which is based on a simple second-degree polynomial equation. The state of a system is represented by a number x which evolves in discrete time steps. At each step, the state is changed according to:

x_{n+1} = r x_n (1 - x_n)

where r is a constant parameter.

For some values of r, the behaviour of x_n is relatively simple: for large n, x_n will oscillate between a finite set of values. However, for most values of r beyond about 3.57, the long-term behaviour of the system is highly sensitive to the initial condition (that is, the initial value x_0).
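The two regimes can be illustrated with a short Python sketch. The function name, the values of r (3.2 and 3.9), the starting point 0.2 and the perturbation of 1e-9 are assumptions chosen for illustration, not values taken from a specific study.

```python
def logistic_map(r, x0, steps):
    """Iterate x_{n+1} = r * x_n * (1 - x_n) and return [x_0, ..., x_steps]."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# r = 3.2: the orbit settles into a cycle between two values.
periodic = logistic_map(r=3.2, x0=0.2, steps=1000)
print(periodic[-2], periodic[-1])

# r = 3.9: two starting points differing by 1e-9 drift far apart.
a = logistic_map(r=3.9, x0=0.2, steps=100)
b = logistic_map(r=3.9, x0=0.2 + 1e-9, steps=100)
max_gap = max(abs(u - v) for u, v in zip(a, b))
print(max_gap)
```

Running the function twice with exactly the same r and x0 returns exactly the same sequence, which is the deterministic character of chaos; only the tiny change in x0 makes the two trajectories diverge.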

At IRIS, we have worked on several projects which depend on the weather. For example, one of our H2020 projects, RICE GUARD, seeks to predict the appearance of a rice disease called Rice Blast, where causality is often dependent on meteorological readings such as moving averages of humidity, temperature and dew point metrics over time.

  2. Random

Random (sometimes called stochastic) processes imply unpredictability. In contrast to chaos, two successive executions of a random process will give different sequences, even if the initial state is the same.

For example, the results of tossing a coin or the outcomes of a lottery should be randomly distributed. So, if we train a predictive model on a sample of ten thousand coin tosses with the toss outcome as the output, its accuracy should be close to 50%. A similar outcome should be found for winning lottery number ranges. A scientific study was even conducted to evaluate the popular legend that buttered toast tends to fall buttered side down. Another, more classical example is Brownian motion, which refers to the random motion of particles suspended in a fluid (a liquid or a gas), resulting from their collisions.
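The coin-toss claim can be checked with a small Python sketch. The “model” here is a deliberately naive assumption (it predicts that each toss repeats the previous one), and the seed is fixed only so that the sketch is repeatable.

```python
import random

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
tosses = [rng.choice("HT") for _ in range(10_000)]

# Naive "model" (an assumption for illustration): predict that each
# toss repeats the outcome of the previous toss.
hits = sum(1 for prev, cur in zip(tosses, tosses[1:]) if prev == cur)
accuracy = hits / (len(tosses) - 1)
print(f"accuracy: {accuracy:.3f}")
```

Note that Python's generator is pseudo-random: with a fixed seed it reproduces the same sequence on every run, so it only imitates a truly random process.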

However, randomness can be highly useful for data modelling. For example, by applying Monte Carlo methods [2] we can find a combination of inputs which corresponds to one or more target output values of a data model. In a Monte Carlo method, we generate random numbers with a given distribution (for example, Gaussian or Normal) based on a mean and a standard deviation [3]. For a symmetrical distribution such as the Gaussian, this makes it more likely for numbers to be generated close to the mean, and less likely in the tails. We generate Gaussians for each input to the data model, and loop until the model produces an output which is close to a required target (plus or minus a given tolerance). This technique can be used, for example, to calibrate machine parameters for a complex production process.
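The loop described above can be sketched in Python as follows. The model function, its two inputs (temperature and pressure), their means and standard deviations, and the target and tolerance are all hypothetical stand-ins; a real application would use its own trained data model.

```python
import random


def model(temperature, pressure):
    """Hypothetical stand-in for a data model; not taken from a real process."""
    return 0.5 * temperature + 2.0 * pressure


def monte_carlo_search(target, tolerance, max_iter=100_000, seed=0):
    """Draw Gaussian inputs until the model output is within tolerance of target."""
    rng = random.Random(seed)
    for i in range(1, max_iter + 1):
        # Assumed mean and standard deviation for each input.
        temperature = rng.gauss(80.0, 10.0)
        pressure = rng.gauss(5.0, 1.0)
        output = model(temperature, pressure)
        if abs(output - target) <= tolerance:
            return temperature, pressure, output, i
    return None  # no match found within max_iter draws


result = monte_carlo_search(target=52.0, tolerance=0.05)
print(result)
```

The iteration cap is a design choice of this sketch: if the target lies far out in the tails of the output distribution, the loop could otherwise run for a very long time.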

 

Random and chaotic behaviour is also found, for example, in image recognition, and it has to be compensated for when a learner builds a data model. For example, at IRIS we have developed different commercial devices based on infrared spectroscopy, such as VISUM Palm, which performs in-situ analysis of different raw materials, and HYPERA, which detects foreign bodies by hyperspectral imaging. In this type of measurement, noise can originate in the product we are scanning (a pizza, a piece of chicken, a pharmaceutical tablet) or can be generated by the instrumentation itself. Particle scattering can also behave chaotically. Different techniques are used to improve the signal-to-noise ratio and avoid undesirable effects, such as the Savitzky-Golay filter [4] and Multiplicative Scatter Correction (MSC), among others.
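To give an idea of how Savitzky-Golay smoothing works, here is a minimal pure-Python sketch of the 5-point filter with a quadratic fit, using the standard convolution weights (-3, 12, 17, 12, -3)/35. The synthetic “spectrum”, the noise level and the function names are assumptions for illustration only; in practice a library routine such as `scipy.signal.savgol_filter` would normally be used, and this is not necessarily the configuration used in the devices mentioned above.

```python
import random

# Savitzky-Golay convolution weights for a 5-point window and a
# quadratic (or cubic) polynomial fit.
SG5_WEIGHTS = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]


def savgol5(signal):
    """Smooth `signal`; the two points at each edge are left unchanged."""
    half = len(SG5_WEIGHTS) // 2
    smoothed = list(signal)
    for i in range(half, len(signal) - half):
        window = signal[i - half:i + half + 1]
        smoothed[i] = sum(w * v for w, v in zip(SG5_WEIGHTS, window))
    return smoothed


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


# Hypothetical smooth "spectrum" plus Gaussian instrument noise.
rng = random.Random(1)
clean = [0.01 * (x - 50) ** 2 for x in range(101)]
noisy = [v + rng.gauss(0.0, 0.5) for v in clean]
smoothed = savgol5(noisy)
print(mse(noisy, clean), mse(smoothed, clean))
```

Because the filter fits a local polynomial rather than taking a plain average, it reduces the noise while leaving a quadratic trend unchanged, which helps to preserve the shape of spectral peaks.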

Acknowledgements: Thanks to Idoia Martí and Laura Rodriguez of IRIS’s Science Dept. for their help with some of the content in this article.

References:

[1] http://numb3rs.wolfram.com/303/demonstrations.html

[2] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, E. Teller, Equation of State Calculations by Fast Computing Machines, The Journal of Chemical Physics, Vol. 21, No. 6, pp. 1087-1092, 1953.

[3] D. B. Thomas, W. Luk, P. H. W. Leong, J. D. Villasenor, Gaussian Random Number Generators, ACM Computing Surveys, Vol. 39, No. 4, Article 11, 2007.

[4] A. Savitzky, M. J. E. Golay, Smoothing and Differentiation of Data by Simplified Least Squares Procedures, Analytical Chemistry, Vol. 36, No. 8, pp. 1627-1639, 1964.

Refs images:

  1. https://www.pinterest.com/pin/502995852105519658/
  2. http://www.cs4fn.org/geography/tornadointexas.php
  3. http://python3.codes/random-walk/
PhD David Nettleton
Data Scientist
PhD in Artificial Intelligence
