Sunday 24 May 2020

On the beauty of creative mathematics

Some years ago I wrote about the dangers of mathematical modelling. There I explained how using formal models might lead us to ignore aspects of reality that are not represented in the model. This is a particular risk when we use models without understanding them deeply. With this I don't want to take sides in the mathematicians vs engineers wars - I'm quite sure there's space for both. This time I want to focus instead on the idea that understanding a model well allows us to creatively exploit it. My ultimate goal is to make a case that both programmable and creative maths should have their space in teaching. This might be a false dichotomy, but let me expand on what I mean.

Think of these (vaguely indicative) mathematical pairs:
Origami | Japanese Traditional Geometry
Graph Theory | Mathematical Optimisation (and in particular Integer Programming)
Stereometry | Volume integration
Knot theory | Topology
Geometry | Linear algebra

What do they have in common? Here's my answer. In my subjective experience, the first column contains sets of mentally stimulating problems that have provoked the development of intuitive heuristics. The second column contains extended formal theories that generalise, and aim to solve, the problems in the space of the former. My table is only illustrative and alternative formal theories exist. I'm not saying these theories are not creative. Rather, my point is that they are far too complex to provoke untrained intuition and interest.

This is why I am a strong believer that, in order to provoke interest in mathematics in all its complexity, the problems in the first column need to be presented to children, and children need to be allowed time to play with their own heuristics. Based on my personal experience, I hypothesise that children should then be taught advanced heuristics - a form of history of mathematics, but also an incremental provocation to improve the self-taught intuition they originally developed. I think of this as a build-up in preparation for being introduced to complex mathematical theories. The build-up should allow students to develop a comprehensive understanding of the complex model that formulas only partially represent. To illustrate: my thinking is that certain intuition - regardless of whether it turns out to be misleading - is an important instrument for developing knowledge that is actively usable (in contrast to passive knowledge, as in languages). From the perspective of threshold concepts, this might be a more informed way to enter the oscillation that precedes the Eureka effect of mastering the relevant threshold concept.

Thursday 19 December 2019

How can we teach machines to perform better?

In my previous article, I provided examples that illustrate how far AI (in practice, machine learning) is from what some like to call general intelligence. I illustrated a gap between what is called deep learning for machines and for people. I also provided examples of forms of knowledge that at this point appear out of reach for AI, in some cases being difficult even to define. In this article, I explain how to programmatically address one particular category of knowledge that is difficult for machines to tackle - what is popularly known as know-how.

To this end, let's take a step back and consider one simple categorisation of knowledge that captures well the aspects that are challenging for machine learning. The knowledge categorisation in questions is the one introduced by David Perkins in his article titled “Beyond Understanding” (2008). Although Perkins limits himself to introducing just three categories of knowledge, he manages to find extremely powerful wording: possessive, performative and proactive knowledge. Here possessive knowledge is the one that answers know-that questions, performative knowledge - know-how questions, and proactive knowledge answers know-where-and-when questions. Possibly, proactive knowledge also encompasses know-why. With some generalisation which is admittedly going too far, we can say that machines are already outperforming people in possessive knowledge due to digital memory which allows for practically unlimited storage. However, even when this is combined with the current impressive growth of computational power, it leaves a lot to ask when it comes to performative knowledge, and despite lots of ambition, we’ve barely scratched the surface of machines successfully engaging with proactive knowledge.

So, let’s consider some examples that could give us an idea of where we stand with performative knowledge. Deep machine learning is not just smarter number crunching; it is becoming more aware of the relevant context and inherent structure of its application domains. However, it is already evident that it is no longer enough to test an algorithm’s ability via standard datasets and metrics, such as ImageNet, BLEU or FaceForensics (Thomas 2019). Much like standardised tests for humans, these predispose to superficially memorising the test answers rather than learning to reason about the actual subject. Measuring machine learning would need to grow into learning assessment. To this end, we would need ways to better understand how machine learning algorithms reason, through what has become known as explainable AI. This would allow us to engage with algorithms in what organisational learning commonly calls double-loop learning (Argyris 1991): a discussion of meaning, and of relating it to other ideas, that enables algorithms to improve the machine learning models themselves, not just their outcomes. Such an approach is not unlike the hierarchical learning currently in use, but it would engage in interactions aimed at refining the hierarchies used. Taking perhaps a different view of the same idea, we would need to look at the interplay of questions for relevant insights, as opposed to searching for individually optimal answers that might lead towards suboptimal solutions. To this end, we might borrow techniques from analytics and think about how to compose their results into more nuanced and elaborate answers that lead to more insightful findings. Although this might implicitly be the direction in which deep learning is already evolving, with the birth of more sophisticated learning architectures, individual design choices would need to be defended in detail and rooted much more deeply in particular contextual evidence.

One particular approach that takes advantage of contextual information comes from traditional industrial management: process mining, which is designed to address performative knowledge. To do this, models are built around the temporal structure of the information provided, i.e. around processes. Process mining is a broad term for a range of techniques that use event logs to discover and analyse business processes; in practice, the approach is applicable to any collection of events. Process mining relies on structured data composed of timestamped combinations of activities and case identifiers (think e.g. of names, reference identifiers or tracking numbers). Process discovery models the path of each case through the recorded activities. Including temporal assumptions arising from the underlying process allows for a broad range of interpretations - clearly among them the ability to answer a variety of performative, or if one still prefers, know-how, questions. I consider some examples below. The interpretative potential of a generic process mining dataset expands hugely when additional information about events is provided - such as duration, performing actor, valuation, location, or any particular expectations. Conversely, variations of process mining could work with data that provides less than its generic assumptions require. For example, when the precision of the collected data is questionable, only the ordering of events could be analysed instead of specific timestamps. Activities might be unlabelled, with their identity derived as a composite signature of their other attributes. Identifiers could be partially or completely omitted, for example when working with energy or financial flows.
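To make the shape of process mining data concrete, here is a minimal sketch of the idea at the heart of process discovery: counting which activity directly follows which, per case, to build a directly-follows graph. The event log and activity names are hypothetical toy data of my own; production tools implement far richer discovery algorithms on top of this basic structure.

```python
from collections import Counter

# Toy event log of (case_id, timestamp, activity) triples - hypothetical data.
event_log = [
    ("case1", 1, "register"), ("case1", 2, "review"), ("case1", 3, "approve"),
    ("case2", 1, "register"), ("case2", 2, "review"), ("case2", 3, "reject"),
    ("case3", 1, "register"), ("case3", 2, "approve"),
]

def directly_follows(log):
    """Count how often activity b directly follows activity a within a case."""
    traces = {}
    # Group events by case, ordered by timestamp, to recover each case's path.
    for case, ts, act in sorted(log, key=lambda e: (e[0], e[1])):
        traces.setdefault(case, []).append(act)
    dfg = Counter()
    for acts in traces.values():
        for a, b in zip(acts, acts[1:]):
            dfg[(a, b)] += 1
    return dfg

dfg = directly_follows(event_log)
print(dfg[("register", "review")])  # 2: two cases go register -> review
```

Even this toy graph already answers simple know-how questions, such as which paths cases actually take and how often; the extra attributes mentioned above (duration, actor, location) would be added as further fields on each event.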

For some examples, consider crime prevention, the application domain that I mentioned in my article on deep learning and types of knowledge. Various machine learning techniques are already used for surveillance (Moody 2019), crime mapping (Greene 2019) or fraud prevention. However, widespread approaches are arguably limited to identifying patterns that could hardly be considered transferable. This is exactly because there are limits to the depth at which behaviour can be captured without building on the structural properties inherent in the sequential character of processes. With process mining, both the personal and the situational circumstances of public nuisance can be captured by the model. As an example of addressing the personal aspects of crime, consider offender reintegration. Educational process mining (Bogarin, Cerezo & Romero 2017) can be used to identify patterns of support for ex-offenders. This would allow identifying important reintegration activities whose completion is indicative of successful reintegration. It would also reveal better reintegration paths that have actually been walked by others, thus exhibiting role models, both among ex-offenders and among the social workers involved. This is one possible approach to answering the know-who-to-involve question. As for the situational circumstances, the Italian traffic police (Leuzzi, Del Signore & Ferranti 2017) have identified a number of applications of process mining - such as conformance verification and traffic forensics - where the answers to know-where and know-when questions could provide invaluable insights that current approaches lack. All this potential is yet to be tapped, and we've arrived at the point where we can do so. A range of open source and commercial process mining tools have reached a level of maturity that allows their wide use in several production and service sectors. At this point, the only way to reach the limits of what can be achieved with process mining is to try to apply it to new domains.

* The author of this blog is currently working at myInvenio Srl, a company whose flagship product is a process mining suite, combining applications for automatic process discovery, simulation and analytics.

Monday 16 December 2019

To what extent is deep learning really deep?

There’s a sense of history repeating among those following AI news. Time and again we are told that computers will reach the level of human intelligence, only to have the myth busted shortly thereafter. Take the latest hype in deep machine learning. We were told that machines would take over the world. The glamorous visionary Elon Musk literally told us that “AI is our biggest existential threat” (Gibbs 2014). Yet we’ve just scratched the surface of a new generation of challenges. Think of seemingly trivial tasks like parking that stump self-driving cars (Marshall 2019), or the huge crowds working behind supposedly automatic text processing algorithms (Dickson 2018), or the fancy fashion that turns out to be enough to mislead mass surveillance (Thys, Ranst & Goedemé 2019). We’ve found that it is not enough for computers to outperform humans in games - as they did in sophisticated ones like chess, Go or even StarCraft - for this to have any significant impact on how artificial intelligence compares to that of humans.

In order to understand this comparison better, one useful perspective could be looking into how intelligence is nurtured. This is surprisingly similar for both AI and human intelligence, and - quite obviously - it is through learning. Yet, under the surface there are huge differences between how we perceive learning in both cases. And this is a gap that is definitely worth understanding better if we are to talk about what possibilities we have of closing it.

Let’s look closer at deep learning. It is named after the multitude of hidden layers of computation that aim to automatically discover derived, complex features of the learned data. We hear of amazing new developments on this front literally daily. Yet there’s an interesting coincidence that is not much spoken about: the concept of deep learning also exists in education - the traditional kind, involving people - and was studied by John Biggs (1987), among others. There it is characterised by learners engaging in “seeking meaning, relating ideas, using evidence, and having an interest in ideas”, and contrasted with surface learning, where learners engage in “unrelated memorizing of information, confining learning of the subject to the syllabus, and a fear of failing”. Put in this perspective, it could be questioned to what extent machine deep learning actually seeks meaning, rather than applying numerical algorithms to what it receives as memorised information. What is clearer is that, for the neural networks used in machine learning, relating ideas is scoped by the assumptions and design choices of the data scientists, and limited by the architecture of the neural network. One could argue that such scope definition corresponds very closely to a student confining themselves to a syllabus.
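As a reminder of what the “deep” in deep learning literally refers to, here is a minimal forward pass through two hidden layers. The weights are arbitrary toy numbers of my own, not a trained model; the point is only that each layer computes on the derived features produced by the previous one.

```python
# Minimal illustration of "depth": each hidden layer transforms the previous
# layer's output, so later layers operate on derived features.
# All weights below are arbitrary illustrative values, not a trained model.
def relu(v):
    """Standard rectified-linear activation, applied element-wise."""
    return [max(0.0, x) for x in v]

def layer(inputs, weights):
    """One dense layer: weights holds one row of input weights per neuron."""
    return relu([sum(w * x for w, x in zip(row, inputs)) for row in weights])

x = [1.0, -2.0]
h1 = layer(x, [[0.5, -0.5], [1.0, 1.0]])   # first hidden layer: raw inputs in
h2 = layer(h1, [[1.0, -1.0], [0.5, 0.5]])  # second layer sees derived features
print(h2)  # [1.5, 0.75]
```

Stacking more such layers is what makes a network “deep”; training then adjusts the weights so that the derived features become useful for the task.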

Consider how evidence is used. A prudent human learner would actively seek out the right evidence to shed more light on their own uncertainties. This relates to active machine learning, a collective term for approaches that ask users for help with data points that machine learning algorithms cannot confidently classify on their own - a setting closely related to what machine learning calls semi-supervised learning. Yet it would be a stretch to claim that algorithms endeavour to search for evidence that might be contradictory, or that they could reasonably decide when a paradigm shift would be appropriate, beyond a measure of better performance against an optimisation function. It is beyond their design to decide when the optimisation function they use doesn't adequately capture all the relevant features of the problem space.
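The active-learning idea mentioned above can be sketched in a few lines via uncertainty sampling: query the human about the unlabelled points the model is least sure about. The one-dimensional logistic scorer and the pool of numbers below are hypothetical stand-ins of my own; a real system would use a trained classifier's predicted probabilities.

```python
import math

def predict_proba(x, boundary=0.5):
    """Toy logistic model: probability of the positive class for a 1-D input."""
    return 1.0 / (1.0 + math.exp(-10 * (x - boundary)))

def most_uncertain(unlabelled, k=2):
    """Uncertainty sampling: the k points whose probability is closest to 0.5,
    i.e. the ones the model is least confident about."""
    return sorted(unlabelled, key=lambda x: abs(predict_proba(x) - 0.5))[:k]

pool = [0.05, 0.48, 0.52, 0.95]
queries = most_uncertain(pool)
print(queries)  # the two points nearest the decision boundary: 0.48 and 0.52
```

Note how this matches the text's point: the algorithm seeks evidence only where its current model is uncertain, not evidence that might contradict the model's framing of the problem itself.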

To see where deep machine learning stands in the final contrast, between interest in ideas and fear of failing, we could try to understand these concepts somewhat better. Let’s turn to the work of a developmental psychologist, Carol Dweck. In her flagship work on mindset, she identifies two distinct approaches to learning among people, which she calls a growth mindset and a fixed mindset (Dweck 2017). Dweck describes the growth mindset as “stretching yourself to learn something new”, which - one could suggest - is a different way of describing someone interested in ideas, in particular new ones. In contrast, she describes the fixed mindset as defining success as “proving you’re smart or talented”. Such a predisposition is a natural prerequisite for the fear of failing itself, as failure would be perceived as indicating the contrary of what is sought to be proven. For a machine learning algorithm, proving itself in mathematical terms means performing better than the baseline with respect to a given optimisation function. The growth mindset is about expanding the limits of one’s knowledge and developing the questions being asked; in formal terms this arguably corresponds to working to improve the optimisation function (i.e. the goal) itself. This points to one particular shortcoming of machine learning: despite the existence of techniques to escape mathematical local optima, contemporary deep learning algorithms are focused on finding a better (as in global, not local) optimum of a given quantified and fixed goal. They don’t engage with the possibility of refining, or even evolving, the questions being asked when evidence accumulates that the current formulation is missing critical factors, which in turn might be distorting the result.
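To make the local-versus-global distinction concrete, here is a toy illustration (the function and numbers are my own, not from any cited work): plain gradient descent can settle into whichever minimum is nearest its starting point, and random restarts are one of the standard escape techniques the text alludes to. Note that both strategies still optimise the same fixed goal f; neither questions whether f is the right goal.

```python
# Toy function with two minima: a shallower one near x = 1 and a deeper
# (global) one near x = -1, because of the tilt added by the 0.3*x term.
def f(x):
    return (x * x - 1) ** 2 + 0.3 * x

def grad(x):
    """Analytic derivative of f."""
    return 4 * x * (x * x - 1) + 0.3

def descend(x, lr=0.01, steps=2000):
    """Plain gradient descent from a single starting point."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

local = descend(1.0)  # starts in the right-hand basin, stays there
# Restarting from several points and keeping the best escapes the local basin.
best = min((descend(x0) for x0 in [-2.0, -1.0, 0.5, 1.0, 2.0]), key=f)
print(f(best) < f(local))  # True: restarts found a strictly lower minimum
```

Both answers are "correct" for the question as posed; deciding that f itself mis-states the problem is exactly the step that remains outside the algorithm's design.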

Going back to the distinction between seeking meaning and unrelated memorising of information, one might ask what different types of knowledge one could develop. Turning back to the writings of John Biggs, in an article titled “Modes of learning, forms of knowing, and ways of schooling” (2005) he notes the existence of a multitude of forms of knowledge. Beyond the widely discussed know-that (declarative knowledge) and know-how (procedural knowledge), he considers a range of others. Some of these could be seen as variations of know-that, such as theoretical and metatheoretical knowledge - the latter relating to state-of-the-art research, where what is known might change. Others, such as tacit and intuitive knowledge, raise the question of whether machines could learn what is not explicitly given in the training data. Finally, there is a category widely referred to as conditional knowledge, which Biggs calls knowing-how-and-why. However, in another example from a specific - and admittedly very complex - domain, crime prevention, a much more nuanced picture emerges for this category. While discussing the challenges of implementing crime prevention and the reasons behind failure, Ekblom (2001) identifies knowledge categories like know-about crime problems, know-who to involve, know-when to act, know-where to dedicate resources, know-why, and know-how to put into practice. These question-based knowledge categories should serve as a hint of how difficult it is to learn and to know, be it for a person or a machine.

Yet, I’m in no way implying that the difficulty of defining the scope of knowledge should stop us from trying to develop better machine learning and expand our own limited horizons. On the contrary, instead of making unfounded claims about the fantastic future of artificial intelligence, I find it more valuable to seek to identify pending real-life problems that could be realistically solvable with the current state of the art in machine learning. In the process of solving them, we can further push the boundaries of what we know and what we can do.

I see this article as a conversation opener for a range of different topics. One of them: how machine learning can address know-how knowledge with the help of process mining.