Sapfundament2 – Neural Networks, Motion Capture and Flamenco

Written on April 24, 2018

Part 2 of a series on combining neural networks with little robots.


////Background 

In January 2018 I spent a week with choreographer and dancer Annalouise Paul at The Drill Hall in Rushcutters Bay, Sydney. The week was kindly facilitated by Critical Path. We spent the time in creative development for a show called Mother Tongue that we hope to remount in 2019.
During this creative development I used and refined machine learning techniques, and this article covers what I discovered during the development and in subsequent analysis.
This article directly continues from my previous artistic research on machine learning, conducted at the Choreographic Coding Lab 2017 in Amsterdam.
Support from the School of Communication and Creative Arts, Deakin University, provided a budget for robot parts and the post-week analysis time.
A quick summary of my earlier conclusions:
  • Machine learning allows us to program a computer by showing it results rather than describing what to do as logical code.
  • This has potential applications for recognising and interpreting human dynamic and emotional input, e.g. dance or body language.
  • It’s important to understand that just because the machine can recognise or interpret emotions in a subjectively intuitive way doesn’t mean it has a general or animal-like intelligence. Everything the machine does follows logic; it’s just that this logic would have been very difficult to describe to the machine as hand-written code.
  • I find this artistically interesting for a few reasons:
    1. Machine learning techniques potentially save time and provide greater nuance than hand coding for some tasks in interactive art. Machine learning is essentially another tool in the toolkit of an interactive designer.
    2. With the right context you could create an artwork that demystified human expression using machine learning.
    3. There is much interest in creating artworks where the machine and the human dance together. Making the machine a more interesting dancer is always useful.
In the earlier research most of the input was contemporary-style movement: flowing, abstract, and so on.
This week gave me a chance to work with flamenco style. Annalouise is an accomplished flamenco choreographer and dancer.
Flamenco is interesting because it has a long history, formal structures and is highly rhythmic. As far as body language goes, we could say contemporary dance and flamenco are two different languages.
When reading the following, bear in mind that my context as an artist is real-time live performance. I’m less interested in motion capture techniques from the film sector that require post-processing or specific costumes.

////Goals and Questions

  • Evaluate current real-time motion capture techniques and their suitability for use with Flamenco 
  • Interpret Flamenco style dancing using a neural network and compare the results to previous experiments with contemporary dance. 
  • Does the more structured nature of flamenco provide an opportunity for more structured input to the machine learning process? For example, could it recognise specific footwork?
  • Is it possible to create a machine that comprehends rhythm, in a sense more sophisticated than simple BPM recognition?
Regarding the last point, I should explain that to produce rhythm (and arguably to understand discrete footwork elements) we need some kind of concept of time in our system, and that could be a radical change to the system. The nature of a neural network’s input is an extremely important part of establishing the network’s capabilities. It makes natural sense that the output of the machine, whether that’s a projector or a robot, is like its body, and if that body shape changes, the expression changes. But the same is also true of the input, because the input represents sensors embedded within the machine’s body (whether they are physically embedded or not).
From Rolf Pfeifer and Josh Bongard in their book on embodied intelligence, How the Body Shapes the Way We Think:
if the sensors of a robot or organism are physically positioned on the body in the right places, some kind of preprocessing of the incoming sensory stimulation is performed by the very arrangement of the sensors, rather than by the neural system. That is, through the proper distribution of the sensors over the body, "good" sensory signals are delivered to the brain; it gets good "raw material" to work on.
I’m actually not entirely sure I agree with Pfeifer and Bongard on all their views yet. But I do agree that if we pre-process the input data in a way that represents a physical sensor phenomenon, e.g. perception of time, then it should change the expression of the machine. The pre-processing could be done with regular code or with another neural network.
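To make that idea concrete, here is a hypothetical Python sketch of "embedding time in the sensors": rather than feeding the network only the current pose, the input carries a short history window plus per-joint velocities. The function name, window size and array shapes are my own assumptions, not part of the system I actually built.

```python
import numpy as np

def temporal_features(pose_history, n_frames=4):
    """pose_history: list of (25, 3) joint arrays, oldest first, newest last.
    Returns one flat feature vector that carries a short window of time."""
    window = np.stack(pose_history[-n_frames:])   # (n_frames, 25, 3)
    velocity = window[-1] - window[-2]            # per-joint delta over one frame
    return np.concatenate([window.reshape(-1), velocity.reshape(-1)])
```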

////Procedure 

As per the previous research I will use a Kinect2 for input, Wekinator for the neural network and a small two-dimensional robot (Sapfundament) as output. This will allow me to directly compare results with the previous experiments.
For reference, the setup chains the Kinect2 sensor through vvvv to Wekinator and back out through vvvv to the Arduino that drives the robot.
For input we will use the 25 skeleton joints from a Kinect2 sensor. This is a reasonably common sensor in the interactive arts world. 
For mapping we will use Wekinator by Dr. Rebecca Fiebrink.
Wekinator is free and available for Mac, Windows and Linux. 
It allows simple OSC IO for working with neural networks. 
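As an illustration of that OSC plumbing, here is a minimal Python sketch (the actual project did this in vvvv). It assumes the python-osc library and Wekinator’s default ports: inputs received on 6448 at /wek/inputs, outputs sent to 12000 at /wek/outputs.

```python
import threading
from pythonosc.udp_client import SimpleUDPClient
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

# Send feature vectors to Wekinator (default: port 6448, address /wek/inputs).
client = SimpleUDPClient("127.0.0.1", 6448)

def send_pose(joints):
    # joints: flat list of floats, e.g. 25 joints x 3 axes = 75 values
    client.send_message("/wek/inputs", [float(v) for v in joints])

# Receive Wekinator's trained outputs (default: port 12000, /wek/outputs).
def on_outputs(address, *values):
    # For Sapfundament this would be two continuous values, one per servo.
    print(address, values)

dispatcher = Dispatcher()
dispatcher.map("/wek/outputs", on_outputs)
server = BlockingOSCUDPServer(("127.0.0.1", 12000), dispatcher)
threading.Thread(target=server.serve_forever, daemon=True).start()
```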
For output we will use a primitive robot called ‘Sapfundament’ - Dutch for ‘Juice Foundation’.
It’s a prototype made from an orange juice bottle and marshmallow sticks. 
Sapfundament has two servo motors, which allow positioning between approximately 0 and 180 degrees. 
I’ve chosen the robot as output for a few reasons:
- Simple control model. It only has two parameters: the positions of the first and second motors. It’s a good test for our ‘neural’ vs ‘coded’ comparison, as it’s very easy to understand what’s happening in the output; it’s just two continuous values.
- The robot has a very inhuman body. It’s loosely based on a double pendulum, which is known as one of the simplest devices that produces chaotic motion.
- The robot has good potential for anthropomorphisation, i.e. with the right behaviour we can imagine it has the character of a human.
- Sapfundament also has an LED on the end of the lower arm. In combination with long exposure photos we can use this to record a trail of the robot’s movement, providing a visual comparison between different input stimuli.
I’m using vvvv to handle all the necessary communication between the Kinect sensor, Wekinator and the Arduino microcontroller that runs the robot. The only logical processing vvvv performed was to:
- Limit the most extreme robot movements, so the motors aren’t damaged by being instructed to move beyond their range.
- Normalise the input skeleton to its own centre. This means the performer’s global position in the space is ignored and only their body shape is analysed, e.g. you don’t have to stand on exactly the same spot every time.
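A minimal Python sketch of those two steps (the real versions lived in a vvvv patch; the servo limits here are placeholder safety margins, not the actual values used):

```python
import numpy as np

SERVO_MIN, SERVO_MAX = 10.0, 170.0  # assumed safe margins inside the 0-180 range

def normalise_to_centre(joints):
    """joints: (25, 3) array. Subtracting the skeleton's own centre removes
    global position, leaving only the performer's shape."""
    return joints - joints.mean(axis=0)  # the spine-base joint would also work

def clamp_servo(angle):
    """Never command a motor beyond its mechanical range."""
    return min(max(angle, SERVO_MIN), SERVO_MAX)
```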
There are three stages to the process:
- Training
We set up the neural network with several pairs of human/robot pose combinations. We will experiment with different training regimes and whether it’s possible to reuse the same training for multiple performers, or if the system needs to be retrained for each individual.
- Running live
Once the neural network is trained we run it live. The robot responds in real time to the performer. We record the live data during this period.
- Playback
For evaluation we play back the data generated by the performer in the ‘Running live’ mode. This is also when we take the long exposure photographs.

////Experiments and Results

Experiment 1 

  • Use the training set from the original contemporary-style experiments with flamenco dancing.
  • I expected it to work, but wasn’t sure how well.
  • This experiment acts as a control for the next experiment, where we train with flamenco and compare the results.
///Results
  • Some emotional responses made sense, others didn’t. It’s highly subjective, but I felt it was more difficult to get a coherent relationship between the robot and Annalouise using the mis-styled training set than it was in Amsterdam.
  • Annalouise is slightly taller than most of the dancers in Amsterdam. My instinct is this didn’t make a major difference, but it’s hard to know. In future, to reduce this variable, I should scale the system for similar bone lengths based on, say, knee-to-pelvis length in a specific pose (see the sketch after this list). Then I would get more consistent pose data regardless of both a performer’s world position and their height.
  • This was Annalouise’s first chance to work with the robot, although none of the dancers in the previous experiment ever worked with it more than a few times over a few days. I don’t think performer experience with the technology is yet a major factor.
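A hypothetical sketch of that scaling idea: normalise every skeleton by a reference bone so pose data is comparable across dancers of different heights. The joint indices are placeholders, not real Kinect2 constants.

```python
import numpy as np

PELVIS, LEFT_KNEE = 0, 13  # placeholder joint indices

def scale_by_reference_bone(joints, target_length=1.0):
    """joints: (25, 3) centred skeleton. Scales so the knee-to-pelvis
    distance is a fixed length, removing height differences."""
    ref = np.linalg.norm(joints[LEFT_KNEE] - joints[PELVIS])
    return joints * (target_length / ref) if ref > 1e-6 else joints
```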

Experiment 2

  • Made a new training set for the neural network with Annalouise moving in a flamenco style; however, we still used the same horizontal/vertical grid training methodology as in the original experiments.
///Results 
  • As expected, making a training set customised for Annalouise’s style of movement provided better results.
  • I wouldn’t say that the robot ‘danced flamenco’, but I would say that it responded with more vigour to her movements than in Experiment 1.
  • Annalouise commented that it felt to her like a stronger relationship as she was moving. 

Experiment 3 

  • Annalouise and I discussed the difference between emotional movement and dynamic movement, e.g. ‘angry’ movements performed with emotion vs nominally similar ‘fast and sharp’ movements performed without emotion.
  • We wanted to see if these were perceived differently by the system.  
  • In this experiment Annalouise performed two pairs of emotion/dynamics.
    • Set 1 ‘Angry’ vs ‘Fast and Sharp’
    • Set 2 ‘Love’ vs ‘Slow and Soft’ 
  • We didn’t just do strict flamenco for these experiments.   
///Results
  • This is even more subjective than usual, but I did feel that you could see the difference in the robot between performing an emotion and performing a dynamic.
  • This could of course be because the emotions and dynamics are actually performed differently by humans. Even if logic and memory would regard them as quite similar, the actual body language, as sensed by the system and reproduced by the robot, is distinctly different. I still regard this as a valid result.
  • If this were to be investigated further, a randomised test would be better, e.g. with new observers who didn’t know which version they were seeing.

Experiments 4 and 5 – Flamenco fundamental motions 

  • Experiments 4 and 5 relate to capturing flamenco structures over time. They are only first attempts; this turned out to be a large (but not impossible) task.
  • For these I stepped outside the process outlined earlier and attempted to modify the system in different ways to accept new input.
  • Flamenco moves can be broken down into fundamental motions including stamp, tap, toe (point), dig, heel, a muted clap, a hard clap, a thigh clap, a click, front brush and back brush.
  • I thought capturing these fundamental motions would be significant because they would potentially allow the machine learning system to have a sense of rhythm.
  • There are quite a few assumptions in that statement, some of which I ran into as we conducted the experiments.
  • Comprehending rhythm well enough for accurate dancing requires the system to comprehend the fundamental motions very fast, e.g. if the system only understands that a tap has occurred several seconds after the event, it will not make a very responsive dancer.
///Experiment 4 – single state analysis of a stamp motion
  • Experiment 4 was an attempt to capture flamenco fundamental motions from pose data on a single-frame basis, i.e. without memory (other than a single frame of pose velocity). I knew this approach was flawed, but for the purpose of elimination it was still worth trying.
  • If this were possible it would allow relatively simple programs to understand fundamental motions.
  • If you showed me a photo of a flamenco dancer mid-stamp, I could understand what was happening from the still image alone. Can the machine do the same?
///Results Experiment 4 
  • From Kinect2 skeleton pose data it is possible to understand that a foot is on the ground. 
  • If you have the simple memory of one frame’s worth of velocity, you can understand the direction the foot is moving. Theoretically, by combining the velocity, the pose and a height value for the floor, we can sense that the foot has arrived at or left the floor this frame. While this isn’t a stamp, I would call it ‘foot up’ / ‘foot down’ recognition, and we tested it to see how far towards recognising a stamp it got us (see the sketch after this list).
  • In our short experiment I found the Kinect data was only moderately reliable at recognising the ‘foot down’ trigger, and it couldn’t recognise it fast enough to feel right. The short lag on the Kinect was more than enough to make the triggers feel sluggish.
  • Theoretically, with a pose sensor of higher framerate and accuracy, you could make the detection more reliable and perhaps faster. It was also a short experiment; you could probably tease somewhat better accuracy out of the Kinect2 as well.
  • As for the photo analysis above, I think my reading of such an image is based on cultural memory of what’s going on in it. While it would be possible to train a convolutional neural network to recognise that image as a stamp, I don’t think that’s feasible as a general solution based on photos (though possibly based on video, see below).
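For reference, the single-frame ‘foot down’ trigger we tested reduces to something like the following sketch. The floor height and velocity thresholds are illustrative only and would need tuning per setup.

```python
FLOOR_Y = 0.05    # assumed floor height in the sensor's coordinate space (y up)
DOWN_VEL = 0.01   # minimum downward movement per frame to count as a strike

def foot_down(foot_y, prev_foot_y):
    """True on the single frame where the foot crosses down onto the floor."""
    moving_down = (prev_foot_y - foot_y) >= DOWN_VEL
    crossed = prev_foot_y > FLOOR_Y >= foot_y
    return moving_down and crossed
```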
///Experiment 5 – time-based analysis and prediction of a stamp motion 
  • This was an attempt to sense the stamp motion from pose data, but with simple logic applied over time. 
  • Theoretically it should be possible for a small state machine to understand from pose data (see the sketch after this list):
    • that the foot is in the air – classify as ready for a stamp,
    • then that the foot is moving downwards with significant velocity – classify as moving towards a stamp,
    • then, with knowledge of the floor height, to predict the timing of the foot actually hitting the ground,
    • hence creating an accurate trigger for the stamp motion.
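A minimal Python sketch of that state machine, assuming y-up pose data; the floor height and speed thresholds are placeholders that, as noted in the results below, would need per-dancer tuning.

```python
class StampPredictor:
    READY, DESCENDING = "ready", "descending"

    def __init__(self, floor_y=0.05, min_down_speed=0.5, fps=30.0):
        self.state = self.READY
        self.floor_y, self.min_down_speed = floor_y, min_down_speed
        self.dt = 1.0 / fps

    def update(self, foot_y, prev_foot_y):
        """Feed one frame of foot height. Returns predicted seconds until
        impact, 0.0 on the impact frame itself, or None otherwise."""
        down_speed = (prev_foot_y - foot_y) / self.dt  # positive = descending
        if self.state == self.READY:
            if foot_y > self.floor_y and down_speed >= self.min_down_speed:
                self.state = self.DESCENDING       # foot raised and falling fast
        if self.state == self.DESCENDING:
            if foot_y <= self.floor_y:
                self.state = self.READY
                return 0.0                          # impact: fire the stamp trigger
            if down_speed > 0:
                return (foot_y - self.floor_y) / down_speed  # time to floor
        return None
```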
///Results Experiment 5 
  • While this was a promising approach, my implementation didn’t work very well. As before, accuracy was choppy (approx. 50%), and when stamps were recognised the timing of the triggers was sluggish.
  • I do believe this approach still has merit, but you would need to tune a lot of the parameters for the individual dancer, e.g. every dancer will have a different velocity that feels like a tap or stamp to them. 
  • We didn’t get to analysing, for example, the difference between different footwork motions, e.g. toe vs heel impact. But this is also theoretically possible from Kinect2 pose data by analysing the angle between the ankle point and foot point (sketched below).
    • But first you would need to test how accurate that data is, particularly at fast velocities and with the typical camera angle. It may be that Kinect2 doesn’t report significant angles between foot and ankle when the foot is pointed straight at the depth camera. 
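For illustration, that ankle-to-foot angle could be estimated as below (assuming y-up coordinates; whether Kinect2’s foot joint is stable enough at speed is exactly the open question):

```python
import math

def foot_pitch_degrees(ankle, foot_tip):
    """ankle, foot_tip: (x, y, z) tuples, y up. Positive = toes above heel,
    suggesting a heel strike; negative suggests a toe strike."""
    dx, dy, dz = (f - a for f, a in zip(foot_tip, ankle))
    return math.degrees(math.atan2(dy, math.hypot(dx, dz)))
```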

////Conclusions 

Our questions were:
////

Evaluate current real-time motion capture techniques and their suitability for use with Flamenco 

Sensors that give skeletal pose data (Kinect1 and 2) pick up both large and small flamenco movements well enough for a machine to respond with subtlety. 
However, with pose data alone I don’t think they are accurate enough to detect fast rhythmic motions.
That said, I think the addition of a contact microphone, either on the stage or potentially wearable, could supply accurate information about when the floor is struck. See further down for a detailed explanation.
Another possibility is wearable motion sensors. I know from experience trying to build alternative drum controllers that the current generation of easily available wearable accelerometers does not respond fast enough for a drummer to feel like they are playing in natural time. I think it would be the same for a flamenco dancer.
////

Interpret Flamenco style dancing using a neural network and compare the results to previous experiments with contemporary dance. 

I found that a neural network trained on flamenco with a flamenco dancer can produce a similar range of expression to a network trained on contemporary with a contemporary dancer. 
I.e. the neural network isn’t only suitable for ‘flowing’ body language. In these experiments it was just as expressive when trained with the sharp and formal language of flamenco. It responded to both coarse and subtle movements with an appropriate dynamic response.
////

Does the more structured nature of flamenco provide an opportunity for more structured input to the machine learning process? For example, could it recognise specific footwork?

From these experiments I think it’s entirely possible to build a footwork recogniser that would work on flamenco and possibly other styles.
If I was setting out to build a footwork recogniser for realtime performance I would:
  • Collect pose data from Kinect AND audio data through a contact microphone.
    • The sound of the foot hitting the ground is the actual physical event we want to sense. The pose data gives it context (which dancer, which foot, what kind of strike).
  • I would combine this data with a classifier as discussed above in Experiment 5.
  • When the classifier predicts that a foot should hit the ground in the next window of time, the system takes the next spike in amplitude from the contact microphone as the footwork trigger (see the sketch after this list).
  • Audio analysis occurs at very high speed and should be fast enough to feel ‘musical’. 
  • I think we would get the triggers with very high accuracy. With Kinect2 the classification between taps, stamps, heel impacts and toe impacts would be imperfect but still of usable accuracy. You should also be able to classify left and right feet with good accuracy.
  • You would also be able to separately detect multiple performers, as long as they are in view of a sensor.
    • Upgrading the system to handle multiple performers is not trivial: it requires running a separate neural network for each performer, and then a further layer of intelligence on top of that to handle special cases, e.g. two or more performers tapping at the same time.
    • Like other Kinect systems, you also need a strategy for dealing with multiple performers occluding each other.
  • Rather than manually writing a state machine, I would use a separate neural network analysing pose and audio amplitude data together as the main logic.
  • To do this successfully I think you would need a more sophisticated neural net training interface than Wekinator offers. You need a way to notate classifiers and trigger events over time. Perhaps Wekinator with a custom vvvv program built around it, taking over the interface, could offer this level of functionality. 
  • I think you could build a general recogniser that worked on any dancer but 
    • you would need a lot of training 
    • you would need to remove world position AND orientation from the pose data so the exact position and angle the performer is facing doesn’t matter. 
    • Due to skeleton errors introduced by occlusion, this won’t always work from every angle. You should also add a condition that only provides results for one foot when the other is occluded. 
  • Or rather than a general solution it would be easier to train a recogniser for a specific dancer with a specific orientation. 
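Pulling those points together, the mic/classifier fusion rule could look roughly like this sketch: the pose classifier opens a short time window, and the first contact-mic amplitude spike inside it becomes the labelled trigger. All names, window lengths and thresholds here are hypothetical.

```python
import time

class FootworkTrigger:
    """Fuse a pose classifier (slow but descriptive) with a contact mic
    (fast but unlabelled) into labelled, musically timed triggers."""

    def __init__(self, window_s=0.15, spike_level=0.3):
        self.window_s, self.spike_level = window_s, spike_level
        self.window_ends, self.pending_label = 0.0, None

    def on_classifier(self, label):
        """Called when the pose classifier predicts an imminent strike,
        e.g. 'left-heel'. Arms the trigger window."""
        self.pending_label = label
        self.window_ends = time.monotonic() + self.window_s

    def on_audio_block(self, amplitude):
        """Called per audio block at audio rate. Returns the label on the
        first spike inside the armed window, otherwise None."""
        if (self.pending_label and amplitude >= self.spike_level
                and time.monotonic() <= self.window_ends):
            label, self.pending_label = self.pending_label, None
            return label
        return None
```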
////

Is it possible to create a machine that comprehends rhythm, in a sense more sophisticated than simple BPM recognition?

This is a lofty goal, and in a week of practical development we unfortunately did not get close to testing it. But I did gain some insights into the question.
  • Once you have detected individual footwork motions, you could indeed supply these to a rhythm detection process.
    • You could also use only a contact mic and assume every loud noise is a tap; in a controlled environment this would work (for prototyping at least).
  • To begin considering rhythm involves working out, at a holistic level, how time will be dealt with by your machine, because that determines how rhythm will be reflected in the output. Is the machine going to copy a dancer’s rhythms? Respond with its own rhythms? Try to play in time with you? Engage in call and response? Each of these requires a different approach.
  • Options for developing ‘rhythm awareness’ are:
    • A coded component, supplied with footwork inputs, that essentially does BPM tracking for each one (see the sketch after this list). The BPM outputs would give some useful information to a main dancer network, but nothing very exciting.
    • An isolated neural network instance for rhythm detection. This would again feed your main dancer network. Train it with footwork inputs against notated beats and bars and it could potentially understand micro features (e.g. continuous stamps = golpe, three brushes = latigos), time signature, tempo and, most important (and most difficult), when a new bar begins.
      • Note that overall tempo and rhythm pattern are different concepts, so your training needs to take multiple tempos into account.
      • I claim that the bar starting is the most important rhythmic feature. That’s my artistic opinion, because that’s when an audience will recognise a moment of synchronicity (even if the time signature is culturally alien to them). A great time for a dramatic moment in the machine’s output.
    • Integrate the rhythm concept into your main dancer neural network. For me this would require a ground-up redesign of the system I’ve been using so far. It’s hard to conceive of building a machine that ‘understands’ time, as it’s not totally clear how our own consciousness relates to time. E.g. you could give the agent the ability to ‘decide’ when to act or not act, but does this constitute understanding time?
      • There’s some metaphysics to think about if you go down this route.
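As a baseline, the coded BPM-tracking option in the first bullet could be as simple as the sketch below: the median inter-onset interval over recent footwork triggers. A real version would need outlier rejection and tempo smoothing; the function and parameter names are my own.

```python
from statistics import median

def bpm_from_onsets(onset_times, window=8):
    """onset_times: ascending trigger timestamps in seconds (e.g. stamps).
    Returns an estimated BPM, or None until enough onsets have arrived."""
    recent = onset_times[-window:]
    if len(recent) < 2:
        return None
    intervals = [b - a for a, b in zip(recent, recent[1:])]
    return 60.0 / median(intervals)
```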

As always I would stress the informal and subjective nature of this research. The results will certainly help me make more compelling dance tech performances and spur me to think more about the nature of humans and machines. 
Big thanks to Annalouise Paul, Scott deLahunta, Papermoose, Rebecca Fiebrink (and Wekinator), and Critical Path.
