Gaining insights and generating meaningful actions in a range of applications
Understanding multimodal data streams – complex sequences of data from different sources – is a key challenge of data science. By developing mathematical descriptions of these streams, using ‘rough path’ theory, it’s possible to gain insights into the data that can be used to generate meaningful decisions and actions.
Applications of these models range from recognising Chinese handwriting on a smartphone, to classifying physical human actions and assisting mental health diagnosis.
When streamed data arrives, it rarely comes all at once, or from a single scalar source, but in multiple modes. Multimodal data streams are found in a huge range of situations and on all scales, and successfully summarising them is key to understanding both the streams themselves and many facets of the world around us. The order in which signals arrive across the different modes carries key information: for example, the order in which glucose levels and insulin levels rise and fall in someone’s blood.
The key question that can be asked about a data stream is how to summarise it over short intervals, creating actionable information that can predict the stream’s effect on, and interaction with, other systems. One example is summarising web click history well enough to discuss and evaluate a range of strategies for effective advert placement, in systematic and automatic ways.
Many analysis techniques aren’t well equipped to deal with multimodal data: they often treat each mode independently, and they do not deal well with randomness in the data. This is where rough path theory, a highly abstract and universal description of complex multimodal data streams, is incredibly useful. It allows us to directly capture the order in which events happen and better model the effects of these data streams, without needing to do high-dimensional recovery of the individual data points.
This is done by generating what’s called the ‘signature’ of the data stream: a set of step-by-step descriptions (or iterated integrals). The elements of the signature form, over short intervals of time, an ideal ‘feature set’ of inputs that can then be used to enhance conventional machine learning algorithms. The signature can dramatically reduce the size of certain learning problems and therefore the amount of data needed to train the related algorithms.
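To make the idea of iterated integrals concrete, here is a minimal sketch in Python that computes the first two levels of the signature of a short, piecewise-linear, two-channel stream. It is an illustration only, not the programme's own software: the function name `signature_level2` and the toy paths are assumptions made for this example, and NumPy is the only dependency. It shows that two streams with the same overall change in each channel, but with the changes arriving in a different order, produce different second-level terms.

```python
import numpy as np


def signature_level2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    `path` has shape (n_points, d): one row per observation, one column
    per channel (mode). This helper is hypothetical and written only for
    illustration.
    """
    path = np.asarray(path, dtype=float)
    d = path.shape[1]
    s1 = np.zeros(d)        # level 1: total increment in each channel
    s2 = np.zeros((d, d))   # level 2: iterated integrals over channel pairs
    for dx in np.diff(path, axis=0):
        # Chen's identity for appending one straight segment with increment dx.
        s2 += np.outer(s1, dx) + 0.5 * np.outer(dx, dx)
        s1 += dx
    return s1, s2


# Toy streams: channel A rises before channel B, and the reverse order.
a_then_b = [[0, 0], [1, 0], [1, 1]]
b_then_a = [[0, 0], [0, 1], [1, 1]]

s1_ab, s2_ab = signature_level2(a_then_b)
s1_ba, s2_ba = signature_level2(b_then_a)

# The level-1 terms agree, because both streams end at the same point.
print(s1_ab, s1_ba)

# The antisymmetric part of level 2 (the signed area) records the order.
area_ab = 0.5 * (s2_ab[0, 1] - s2_ab[1, 0])
area_ba = 0.5 * (s2_ba[0, 1] - s2_ba[1, 0])
print(area_ab, area_ba)
```

The two streams share the same level-1 terms, so a summary built only from the net change in each channel could not tell them apart. The signed-area term flips sign when the order of the two rises is swapped, which is the kind of ordering information described above.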
The DataSig programme is looking at further developing fundamental, signature-based mathematical tools and introducing them to contexts where it is possible to achieve significant outcomes.
One part of the work is the development of useful open-source software tools that could be utilised in various machine learning environments. Another, bigger part is the interaction with complex, real-world data, so that questions involving a variety of different data can be tackled easily. Themes include mental health, action detection, human-computer interfaces and astronomy.
These projects aim to build bridges between high-quality, fundamental mathematics and data science applications, and bring these new mathematical technologies into wider use.