Sapfundament – Experiment with neural network for rapidly capturing human expression

Written by  on May 30, 2017 


I’ve written up the results of my experiment with robots, AI and dance at the Choreographic Coding Lab Amsterdam, part of Fiber Festival 2017.
Read about it below or at the original Dropbox Paper article here.



My interactive artworks involve capturing human expression and using it to animate something non-human. What is human expression when it’s removed from the human body, when a performer or participant is empowered to express themselves with light, sound, architecture etc.?
This ‘weird mirroring’ is a common theme in many interactive arts practices. In my experience the essential task here is: taking an expressive human as input, analysing the expression, and then applying it to something else. 
Breaking it down that’s three sub tasks:
  • Sensing/Input  
  • Analysis/Mapping
  • Output 
For my time at the Amsterdam Choreographic Coding Lab I want to take a stab at answering a very specific question about the mapping task: can human expression be extracted by a broadly trained neural network? Particularly a neural network trained and generated (relatively) rapidly, so that it would be suitable for deployment in a dance rehearsal process. How does this approach compare to the ‘traditional’ process, e.g. coding interaction by hand (logical description)? If the broad neural approach has merit that could be very useful, as I spend hours trying to get a sense of human expression out of code. Maybe the neural network could do it faster. 
We’ll also take a swipe at a related philosophical question: ‘Can machines understand human expression?’ Even though I’d never expect to generate anything as concrete as ‘HumanExpression.dat’, it’s often surprising with machine learning to discover that concepts that are quite abstract to humans can be seen clearly as patterns by machines. People often attribute this success to machines somehow becoming more like us, but I think that’s a misleadingly mystical approach. I think the truth is instead that humans have more pattern and logic to our emotional and artistic behaviour than we realise. 
Like all art we are dealing with subjective results here. Beauty in the eye of the beholder etc. 
The form of this experiment is definitely influenced by my experience of making art, I’d understand if it doesn’t reflect the approach you would take. 


Going back to the subtasks
  • Sensing/Input  
  • Mapping
  • Output 
I reckon there are two approaches to the mapping task. 
The first is what I’ll call ‘traditional’. 
-Looking at the animation above: To tell a machine that some dots should move when a human moves I’ve got to logically map the dots to the human. This is coding. I might code that: when the body is still the dots should be still. When the body moves the dots vibrate. Some rules might be simple, some could be very complicated. I might also use machine learning in an isolated, logical way. Eg I train the machine to recognise that specific gesture X generates specific output Y. 
-The logical approach is not just between me and the machine.  It also applies to working with the performer or participant. If I can tell the machine that jumping generates a sound I can also tell that to the performer. We have the OPTION of building choreography based on this logic (or vice versa, intending that certain choreography is interpreted logically by the machine). 
-The logical approach to mapping also applies to the audience. The audience can see that jumping generates a sound. Of course if we choose we can hide the logic from the audience, make rules so complicated they can’t work them out. The point is that if we’re working with coded logical interaction we have the OPTION of showing those logical relationships to the audience. 
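To make the ‘traditional’ approach concrete, here is a minimal Python sketch of the still/vibrate rule described above. It is only an illustration: the function names, the threshold and the gain are all assumptions I have made up for the example, and joints are assumed to arrive as (x, y, z) tuples.

```python
import math

def motion_energy(prev_joints, curr_joints):
    """Mean distance each joint moved between two consecutive frames."""
    total = 0.0
    for (px, py, pz), (cx, cy, cz) in zip(prev_joints, curr_joints):
        total += math.sqrt((cx - px) ** 2 + (cy - py) ** 2 + (cz - pz) ** 2)
    return total / len(curr_joints)

def dot_vibration(prev_joints, curr_joints, still_threshold=0.01, gain=5.0):
    """Hand-written rule: still body gives still dots, moving body gives vibrating dots.

    Returns a vibration amplitude between 0.0 and 1.0.
    The threshold and gain are hypothetical values that would be tuned by eye.
    """
    energy = motion_energy(prev_joints, curr_joints)
    if energy < still_threshold:
        return 0.0
    return min(1.0, energy * gain)
```

Every behaviour added this way is another explicit rule like `dot_vibration`, which is exactly the time cost this experiment is trying to avoid.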
The approach I want to experiment with is offered by neural networks. I want to call it something like the ‘broad description’ method. 
  • Neural networks are trained with an input and an output, but we don’t have to code or describe logically what the relationships are in the middle. 
  • This is why neural networks are suited to tasks that defy normal logical description. Eg it’s very hard to describe what every cat in the world could look like in a photo, but if you train a neural network with enough photos of cats it will become really good at that job. 
  • I mentioned above you can use neural networks as part of a ‘traditional’ logical mapping system. This is the case when we train specific neural networks for highly logical, specific tasks. EG when gesture X is recognised we turn the stage green. This is more or less just an enhanced traditional process and whilst it still offers very exciting possibilities it’s not what this experiment is about. 
  • Instead what I want to explore is training a neural network with very minimal logical decisions about how the input and the output should map. I propose to simply set up compelling configurations of the performer and match them with compelling outputs in the interactive environment, i.e. only providing the network with very broad logic. Then once the training is complete we see what it generates with new inputs (e.g. we dance in front of it and see what happens). 
  • The question is can we get a sense of human expression in the output using this method? Or will it seem random and disconnected?  
  • To be clear the inputs and outputs can have broad logic, we can say that a subjectively neutral performer position should match a subjectively neutral output position. But we aren’t going to do any fine decision mapping logic. The point of the exercise isn’t about making or not making the decisions, it’s about not having to spend the TIME making those finely grained logic decisions. We describe what we find compelling to the machine by showing it, not by coding thousands of tiny rules. 
  • If successful it could prove to be an interesting technique for rapidly creating performing environments in the rehearsal room or for standalone installations.  
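As a rough illustration of the ‘broad description’ idea, the sketch below trains a tiny one-hidden-layer network on a handful of (pose, output) pairs and then lets it respond to poses it has never seen. This is not Wekinator’s implementation; the function name, network size, learning rate and the 0..1 output range are all assumptions chosen to keep the example short.

```python
import math
import random

def train_pose_mapper(examples, hidden=6, epochs=2000, rate=0.5, seed=1):
    """Fit a one-hidden-layer network to (input_vector, output_vector) pairs.

    Returns a predict(input_vector) function. Target outputs should lie in 0..1.
    """
    rng = random.Random(seed)
    n_in, n_out = len(examples[0][0]), len(examples[0][1])
    # Each weight row has one extra entry for the bias term.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)] for _ in range(n_out)]

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    def forward(x):
        xb = list(x) + [1.0]  # inputs plus bias
        hb = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w1] + [1.0]
        y = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w2]
        return xb, hb, y

    for _ in range(epochs):
        for x, target in examples:
            xb, hb, y = forward(x)
            # Error terms for squared error with sigmoid outputs.
            d_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, target)]
            # Back-propagate to the hidden units before changing any weights.
            d_hid = [hb[j] * (1.0 - hb[j]) * sum(d_out[k] * w2[k][j] for k in range(n_out))
                     for j in range(hidden)]
            for k in range(n_out):
                for j in range(hidden + 1):
                    w2[k][j] -= rate * d_out[k] * hb[j]
            for j in range(hidden):
                for i in range(n_in + 1):
                    w1[j][i] -= rate * d_hid[j] * xb[i]

    return lambda x: forward(x)[2]
```

The only ‘logic’ supplied is the list of example pairs, e.g. a neutral pose matched with `[0.5, 0.5]` and an extreme pose matched with `[0.9, 0.1]`; scaling the two outputs to servo degrees would happen afterwards. Pure Python like this gets slow with many frames of 75 inputs, so treat it as a toy, not a replacement for a proper tool.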


For input we will use the 25 skeleton joints from a Kinect2 sensor. This is a reasonably common sensor in the interactive arts world. 
For mapping we will use Wekinator by Dr. Rebecca Fiebrink.
Wekinator is free and available for Mac, Windows and Linux. 
It allows simple OSC IO for working with neural networks. 
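In this experiment vvvv handled the OSC traffic, but for readers without vvvv here is a hedged Python sketch of what sending one frame of inputs to Wekinator could look like, using only the standard library. It assumes Wekinator’s default settings as I understand them (listening on UDP port 6448 for a `/wek/inputs` message made of floats); check your own project settings, as these can be changed. The helper names are my own.

```python
import socket
import struct

def osc_pad(text):
    """Encode an OSC string: ASCII bytes, null-terminated, padded to a multiple of 4."""
    data = text.encode("ascii") + b"\x00"
    return data + b"\x00" * (-len(data) % 4)

def build_osc_floats(address, values):
    """Build one OSC message whose arguments are all 32-bit big-endian floats."""
    type_tags = "," + "f" * len(values)
    payload = struct.pack(">%df" % len(values), *values)
    return osc_pad(address) + osc_pad(type_tags) + payload

def send_to_wekinator(values, host="127.0.0.1", port=6448):
    """Send one frame of input values over UDP (host and port are assumed defaults)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(build_osc_floats("/wek/inputs", values), (host, port))
    finally:
        sock.close()
```

One call to `send_to_wekinator` per Kinect frame, with the 75 joint coordinates as `values`, would be the equivalent of what the vvvv patch did.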
For output we will use a primitive robot called ‘Sapfundament’ – Dutch for ‘Juice Foundation’.
It’s a prototype made from an orange juice bottle and marshmallow sticks. 
Sapfundament has two servo motors, which allow positioning between approximately 0 and 180 degrees. 
I’ve chosen the robot as output for a few reasons 
  • Simple control model. It only has two parameters which are the position of the first and second motors. It’s a good test for our experiment of ‘neural’ vs ‘coding’ as it’s very easy for us to understand what’s happening in the output, it’s just two continuous values. 
  • The robot has a very inhuman body. It’s loosely based on a double pendulum, which is known as one of the simplest devices that produces chaotic motion. 
  • The robot has good potential for anthropomorphisation. I.E with the right behaviour we can imagine it has the character of a human. 
  • Sapfundament also has an LED on the end of the lower arm. In combination with long exposure photos we can use this to record a trail of the robot’s movement. This will provide a visual comparison between different input stimuli. 
I’m using vvvv to do all the necessary communications between the Kinect sensor, wekinator, and the arduino microcontroller that runs the robot. The only logical processing vvvv did was 
  • Limit the most extreme robot movement so as not to damage the motors by instructing them to move beyond their range
  • Normalise the input skeleton to its own centre. This means the global position of the performer in the space isn’t taken into account; the network only analyses the performer’s shape. E.g. you don’t have to stand on the exact same spot every time. 
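The two bits of processing above are simple enough to sketch in a few lines of Python. This is an approximation of what the vvvv patch did, not a copy of it: I am assuming the ‘centre’ is the mean of all joint positions, and the servo limits of 10 and 170 degrees are hypothetical values, not the ones used on Sapfundament.

```python
def centre_skeleton(joints):
    """Subtract the mean joint position so the pose no longer depends on where the performer stands."""
    n = len(joints)
    cx = sum(j[0] for j in joints) / n
    cy = sum(j[1] for j in joints) / n
    cz = sum(j[2] for j in joints) / n
    return [(x - cx, y - cy, z - cz) for x, y, z in joints]

def flatten(joints):
    """Turn 25 (x, y, z) joints into the flat list of 75 values fed to the network."""
    return [coordinate for joint in joints for coordinate in joint]

def clamp_servo(angle, low=10.0, high=170.0):
    """Keep a requested servo angle inside a safe range (limits are assumed values)."""
    return max(low, min(high, angle))
```

A frame would go through `flatten(centre_skeleton(joints))` on the way in, and each of the two outputs through `clamp_servo` on the way out to the Arduino.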
There are three stages to the process
  • Training
We set up the neural network with several pairs of human/robot pose combinations. We will experiment with different training regimes and whether it’s possible to reuse the same training for multiple performers or if the system needs to be retrained for each individual. 
  • Running live
Once the neural network is trained we run it live. The robot will respond in real time to the performer. We record the live data during this period. 
  • Playback
For evaluation we play back the data generated by the performer in the ‘Running live’ mode. This is also when we will take the long exposure photographs. 
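The record and playback stages can be sketched very simply. The Python below is a hypothetical illustration, not the vvvv patch: it assumes each recorded frame is just a timestamp plus a list of values (which could be the skeleton data or the two servo positions), and the function names are my own.

```python
def record_frame(recording, timestamp, values):
    """Append one frame of live data, stored with the time it arrived (in seconds)."""
    recording.append((timestamp, tuple(values)))

def playback(recording, speed=1.0):
    """Yield (delay_seconds, values) pairs that reproduce the original timing.

    The caller waits delay_seconds before sending each frame to the robot.
    """
    previous = None
    for timestamp, values in recording:
        delay = 0.0 if previous is None else (timestamp - previous) / speed
        previous = timestamp
        yield delay, values
```

Keeping the original timing matters here, because the hesitations and pauses of the performer are part of what gets played back.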


—-Experiment 1: 
All 25 kinect joints into wekinator, normalised to the centre of the skeleton 
Performer Sarah Levinsky 
We trained for 2 neutral poses and 2 ‘extreme’ poses, each had 50 frames of training.  Only 4 poses in total. 
Wekinator had direct control of the robot position via 2 channels 
Results 1: 
— Training resulted in surprisingly good pose recognition, with less than 250 frames of training. The robot very clearly went to the positions trained for the poses. 
Bear in mind the robot only requires 2 channels of control, may not be the case for a more complicated output…. 
— When running live initially it was fun to control the robot but overall it did feel kind of limited in what it could express. There is more to investigate here, maybe with exercises like trying to make the robot breathe. 
— The robot was somewhat compelling to observe as a mirror of the performer. Be interesting to see how other observers would react to this. I’m somewhat biased. 
— When we had the robot playing back the movement of the performer we noticed two things
  • The robot did appear to take on a more human character being driven by human input. It appeared to hesitate, breathe a little, it appeared to have inscrutable but organic purposes in moving. 
  • Overall this particular robot became more compelling to watch when playing back the recording than when under live control. Conceptually the idea of it having agency (even though it was a recording) is more appealing than it mirroring a live person. 
Next test: Same setup but we’ll increase the complexity of the training a little. Also during live mode we’ll attempt to ‘puppet’ the robot with pseudo-emotional performances (e.g. angry, sad) and see if that expression translates to the robot’s movement. 
—-Experiment 2: 
All 25 kinect joints into wekinator, normalised to the centre of the skeleton 
Performer Larissa Viaene
We trained for 9 poses, each with 50 frames of training. Training regime below.
Wekinator had direct control of the robot position via 2 channels 
— As per previous test training resulted in strong and repeatable recognition of the training poses. There was a strong and effective link that when the performer moved the robot moved. 
— The marginally more complex training seemed to result in a more expressive performance. However it was harder for both performer and observer to discern cause and effect links between performer and robot movement. 
Bear in mind overall this is not ‘complex’ training, just a little more complex than experiment 1. 
— We corrected a latency problem that unsurprisingly made the connection more responsive and satisfying for the performer
— During exercises where the performer was focused on puppeting the robot (e.g. ‘make it breathe’ or ‘make it seem angry’) the performer’s improvised choreography was at times quite interesting.
— It was often possible to give an apparent intention (curious, angry) to the robot in this way with enough practice, but it wasn’t always intuitive…. see the next point. 
— On the negative side the neural network would often perceive subjectively related behaviours as very different. During one exercise Larissa established a movement that made the robot seem angry, when she performed a more intense version of this movement the robot switched behaviours entirely, as opposed to responding with a more intense version of the movement. 
Next:  Keep the training but try it with a different performer. Continue ‘expressive puppeting’ exercises. 
—Experiment 3:
All 25 kinect joints into wekinator, normalised to the centre of the skeleton 
Performer Tatou Dede 
Using the same training as set in experiment 2 
Wekinator had direct control of the robot position via 2 channels 
— The robot generally responded well to Tatou despite being trained by Larissa. This has to be partially attributed to the Kinect sensor presenting the data in a very similar way (eg as skeleton joints) although there are differences in bone length. If we introduced a child performer to the system I suspect the bone length difference would cause a bigger issue. 
— The robot wasn’t quite as good at achieving the trained response to the trained input poses. It still got it 80% of the time which was impressive. This is probably due to both the bone length issue and individual performer differences between interpretations of each pose. We aren’t really using the neural network for this purpose but if you wanted to use it for hard pose recognition I would recommend training it on multiple performers with the same pose. 
— When we attempted the ‘expressive puppeting’ exercises I thought the results were relatively strong. This is quite a general observation but it seemed the more emotion Tatou put into her performance the more emotion the robot would express. It was definitely possible to make it seem subjectively agitated or peaceful. That kind of relationship would have taken me many hours to code by hand and to see the neural network handle it with just 20 minutes of training was very interesting. 
— As per last time the choreography generated by the performer was quite compelling to watch.
— As per the first experiment the robot itself became most compelling once it was in playback mode and we could believe it was moving due to its own agency. 
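On the bone length issue noted above: one thing that might help, and which was not tried in this experiment, is scaling each centred skeleton by a reference bone length so that tall and short performers produce similar numbers. The Python sketch below is purely an assumption about how that could be done; the function name is made up and the reference joint indices are hypothetical placeholders for whichever two joints you pick (e.g. two spine joints).

```python
import math

def scale_normalise(centred_joints, ref_a=0, ref_b=1):
    """Divide every coordinate by the distance between two reference joints.

    ref_a and ref_b are hypothetical indices of two joints whose distance
    stands in for the performer's overall size.
    """
    ax, ay, az = centred_joints[ref_a]
    bx, by, bz = centred_joints[ref_b]
    length = math.sqrt((bx - ax) ** 2 + (by - ay) ** 2 + (bz - az) ** 2)
    if length == 0.0:
        raise ValueError("reference joints coincide, cannot scale")
    return [(x / length, y / length, z / length) for x, y, z in centred_joints]
```

Whether this actually improves results between performers is untested; differences in how each person interprets a pose would still remain.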


Our question was
Can human expression be extracted by a broadly trained neural network?
From this experiment my answer to this question is yes. 
  • We got surprisingly good results using neural networks in this way. It definitely imbued a sense of human expression in the robot, which is particularly surprising given the very quick and broad training. 
  • The neural net successfully took the 75 linear inputs (25 XYZ) and created an expressive relationship to 2 linear outputs. Training took <20 minutes. 
  • I can see myself using a neural network as part of a hybrid approach to interactive art creation  in future. Curating the use of the neural network and taking over with coded behaviours at other times. 
  • This could also be a great way to discover new potential behaviours if you had exhausted your imagined logical behaviours. 
  • However at times it could also be a frustrating way to work, as natural cause and effect relationships allow us to understand an artwork as creators. Because the neural network obscures those relationships I can’t answer why a particular behaviour does or doesn’t occur for a given input. I also have limited scope to make modifications to the neural net to accommodate a desired behaviour change. E.g. “I like behaviour X but can I increase it by 10%?” “No.” I can’t modify what I can’t understand. I can only try to control the behaviour with training, or circumvent the neural network entirely at those times with the hybrid approach.
  • But in the end is a neural network that’s heavily curated by coded behaviours really different to a fully coded approach? It certainly would impact the time/labour advantage of the neural  approach. Whether it’s beneficial or not would come down to the individual artist and their experience. 
  • Certainly here in the year 2017 observers are intrigued knowing behaviour is generated by a neural net. Could be a selling point. 
  • In the special case where you had an output that was considered very difficult to organise logically then the neural network may be able to organise it ‘expressively’ with good results. Needs more testing but that’s my intuition. 
  • Bear in mind this test was entirely done with a two dimensional output, on a robot whose design is easily anthropomorphised by observers. Because of its low data dimensionality it’s very quick to pose the robot for training, almost as quick as posing the performer. If you had a very complicated output (like a field of dots) posing each individual dot would be time consuming. You could of course use a function to pose the dots in interesting ways (algorithmic) although I do not know how you get those output states into Wekinator (maybe directly generating example files). I would like to do more experiments with large dimensional outputs. 
Our bonus philosophical question was Can machines understand human expression? 
I reckon yes, but for a limited definition of the word ‘understand’. 
To briefly sidetrack I saw some other presentations at the Fiber festival (that CCL Amsterdam was part of) covering deep learning and neural networks. Most were great but I found it frustrating that several presenters implied that:
— we can’t understand what is happening inside neural networks 
— that because we can’t understand what is happening that there is potentially some kind of general intelligence. 
My two cents is that whilst it’s difficult to understand what’s going on inside neural networks it’s not because what’s happening is mystical or intelligent. Technically it’s a lot of basic maths which is perfectly explainable, it’s just difficult to explain WHY a certain configuration of this maths produces such subjectively meaningful output. But the problem isn’t the networks, it’s our limited ability to describe them logically. It’s difficult precisely because the problem neural networks are intended to solve can’t be described logically (e.g. describe what any cat might look like in pixels). If it were so easy to describe how the neural network comes to its decisions we wouldn’t need them, as we could just code a cat recogniser or similar with a regular coding language. 
The fact that neural networks can perform tasks we previously thought of as exclusively the domain of humans is not about machines becoming more intelligent, it’s about those artistic and emotional expressions having much more pattern and logic to them than we acknowledge. 
To get back to the question ‘Can machines understand human expression?’ I say the answer is that a machine can understand an expression you train it for, understand it enough to potentially generate its own human-like response. But it can’t understand expression as a concept. 
