History & Formats
“There is no non-spatial hearing”
Written by Steven Maes
The inventor of immersive music is … a Fleming. As early as 1555, Adriaan Willaert was experimenting with double-choir writing in the Basilica of San Marco in Venice. He placed two choirs on balconies to the left and right of the Doge and thereby, for the first time in music history, experimented with space and the positioning of vocalists.
It is not chauvinism, nationalism or unbridled pride that leads me to mention this: Braxton Boren writes extensively about Willaert in ‘Immersive Sound: The Art and Science of Binaural and Multi-Channel Audio’ (p. 44).
Likewise, the Dutchman N.V. Fransen does so in Stereophonica (1962). He describes how human directional hearing initially developed primarily as a mechanism for detecting danger and only later acquired an artistic dimension with – indeed – Adriaan Willaert as the originator of this idea. The ingenious Willaert is presented here as a Dutchman, which, unfortunately, is incorrect for our northern neighbors, as the man hails from Rumbeke near Roeselare.
Perhaps a touch of chauvinism, for I wish to address the core of our matter, namely “immersive music”, and thereby frame our research at PXL-Music in Hasselt. We continue the tradition of our compatriot and have made the creation and exploration of music in three-dimensional audio our core business.

Music begins where noise ends
While other research centers primarily focus on how pink noise or a tone moves in space, we replace that with natural sounds or instruments, and we explore which factors are important to translate music as beautifully as possible into spatial audio.
I do not wish to dismiss purely theoretical and academic research in any way: it inspires us and encourages experimentation. Ultimately, we too arrive at more theoretical questions that we want to answer.
Immersive
If Willaert can be regarded as the pioneer of “immersive music”, then the Frenchman Clément Ader is the originator of spatial music transmission. He made no recordings, but in 1881 he sent live sound over electric telephone lines to a listening room at the International Exhibition of Electricity in Paris.
The listeners present held two telephone handsets to their ears and were able to perceive the spatial position of sound sources by the interaural differences transmitted over the lines.
The most straightforward method of spatial transmission.

Are mono and stereo also immersive?
The term stereo is derived from Greek and means solid, steadfast, or sturdy. In scientific texts it is also translated as “three-dimensional cube” (e.g. in the works of Plato, Aristotle and Euclid). It is this second meaning that clearly refers to space.
(M. Philippa, F. Debrabandere, A. Quak, T. Schoonheim and N. van der Sijs (2003–2009), Etymologisch Woordenboek van het Nederlands, Amsterdam).
Stereo is a reproduction method for representing a space using two channels. Even mono has a spatial aspect: a distant source sounds subtle and indirect, while a nearby source is loud and concrete. But what makes something immersive?
That is not so simple. Personally, I draw the line between mono/stereo and 3D sound where one can speak of a clear distinction between front, back and height. It then concerns not only spaciousness but also localization: the ability to place and assign a well-defined source within a space. For example: a speaker that you can perceive to the rear left at 30°, or a helicopter that flies overhead and of which you can clearly hear that it moves from the rear right to the front left above you.
Historically, there have been various experiments since the 1930s to reproduce and record immersive sound.
I shall reduce them to the following three:
1/ The reproduction of spatial audio with a multi-speaker setup via various channels.
2/ Ambisonics.
3/ Binaural headphone technology.
Wave field synthesis and transaural reproduction are other interesting methods for presenting immersive content via loudspeakers.
Because they are of little relevance for the distribution of music, they will not be further discussed here.
1/ Channel & Object based
channel-oriented transmission methods
By placing various speakers in a space, we can reproduce auditory cues and spatial information.
We do this by panning signals in loudness between different speakers, and we call this a ‘channel based’ system.
Additionally, there are ‘object based’ systems in which sounds are assigned dynamic information about their position via metadata during the mix. During playback, the system processes (renders) this data and accordingly positions the sound within the space, distributed across the available speakers. It does not matter whether the playback is in stereo, via headphones, or in a cinema with 64 speakers.
The metadata enables an ‘object based’ system to determine where a sound should be placed and rendered.
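The contrast can be sketched in a few lines of code. Below is a minimal, hypothetical illustration (the function names and the toy renderer are my own, not any vendor’s API): channel-based panning bakes a position into fixed speaker gains at mix time, while an object-based system stores the position as metadata and computes gains at playback for whatever layout is present.

```python
import math

def constant_power_pan(position: float) -> tuple[float, float]:
    """Channel-based: bake a position (-1 = left, +1 = right) into
    fixed gains for a two-speaker layout at mix time."""
    angle = (position + 1) * math.pi / 4       # map [-1, 1] to [0, pi/2]
    return math.cos(angle), math.sin(angle)    # (left gain, right gain)

# Object-based: the mix stores audio plus position metadata;
# the renderer computes gains at playback for the layout actually present.
vocal_object = {"audio": "vocal.wav", "azimuth_deg": -30.0, "elevation_deg": 15.0}

def render_object(obj: dict, speaker_azimuths_deg: list[float]) -> list[float]:
    """Toy renderer: weight each speaker by its proximity to the
    object's azimuth, then normalize the gains."""
    weights = [1.0 / (1.0 + abs(obj["azimuth_deg"] - az))
               for az in speaker_azimuths_deg]
    total = sum(weights)
    return [w / total for w in weights]
```

The same `vocal_object` can thus be rendered to stereo, headphones or a 64-speaker cinema simply by handing the renderer a different speaker list; a channel-based mix would have to be re-panned.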
Object based versus channel based? There are, of course, various companies that have brought technology to the market for the creation of immersive sound, and naturally the technology used (often also patented) differs per company. The first commercially available technology was Auro-3D, also a Belgian invention, and essentially a ‘channel based’ system with a focus on as natural a music reproduction and cinematic experience as possible.
A few years later, the American company Dolby followed with Dolby Atmos, essentially the first ‘object based’ system, initially also aimed at the cinematic experience. Subsequently, the American DTS introduced DTS:X, followed by several other companies (including Fraunhofer) around the world. A logical question might be: ‘what is better: object based or channel based audio’?
The (perhaps) surprising answer is: ‘a combination of both – depending on the sounds in your mix and whether you wish to clearly perceive movement’. Indeed, all systems available today allow the combination of objects and channels: Auro-3D has added the possibility of working with objects (AuroMax, Auro cX) to its channel based technology, and in Dolby Atmos the so-called ‘bed’ (a channel configuration) is standard in the Dolby Atmos Renderer. Thus, one need no longer make a choice oneself, which is a beneficial development.
One of the earliest and simplest forms of a multi-speaker system is a quad sound installation in which four speakers are placed in the corners of a room. In 1968, the earliest ‘quad’ system was proposed by Peter Scheiber. He developed a matrix to reduce four analog channels to just two: through phase rotation, two additional signals were transmitted within the standard stereo recording, and the original four channels were reconstructed during playback. There were, however, certain limitations regarding channel separation and phase artifacts (Immersive Sound: The Art and Science of Binaural and Multi-Channel Audio, p. 52; ‘The Quadraphonic Story’, realspin, March 13, 2015, https://zstereo.co.uk/2015/03/13/the-quadraphonic-story). Despite aggressive marketing by the producers of quad formats in the 1970s, they failed to achieve commercial success. This resulted in a widespread loss of confidence in the future of 3D sound technology. It was not until 1976, when Dolby Laboratories introduced its 4-2-4 matrix, that surround sound became a success in cinemas.
I am using the term ‘surround’ loosely here. What is the difference between immersive and surround? Perhaps a distinction should be drawn somewhere, but etymologically and technologically I see no necessity to separate these terms; for me, surround and immersive mean the same. Surround was introduced to us by the film industry. It was the suffix used in standards and products such as Dolby, DTS, 5.1, Pro Logic and 7.1 surround to indicate that audio was now also being rendered from the rear and sides of a cinema, and that one would be “surrounded by” sound to enhance a film dramatically and emotionally. It was not until Auro-3D and NHK 22.2 added a height layer in the early 2000s that I first heard the term immersive used commercially. That, in my view, is the generally accepted distinction. There is likely little or no difference between 3D and immersive.
It is certain that all manufacturers of immersive technology today disseminate both horizontal and vertical auditory information. Speakers are positioned around us at various heights to cover and render all directions. Dolby Atmos is perhaps the best known and is commercially the most widespread in cinemas. Online, Apple Music, Tidal, Deezer and Amazon also distribute their immersive music with the Dolby Atmos codec. Dolby Atmos is gradually becoming synonymous with immersive, just as “Kodak” once was with cameras and a “Frigo” with a refrigerator: Dolby likewise succeeds in tying its brand to the technology. Nevertheless, MPEG-H, NHK 22.2, Sony 360 and Auro-3D are also immersive formats that do not differ greatly, yet the brand name ‘Dolby Atmos’ is often used to indicate that audio is immersive with a height layer.
2/ Ambisonics
scene-based (sound field) transmission methods
Ambisonics is another method for recording and rendering three-dimensional sound. It originated in the 1960s and 1970s and was conceived by the English mathematician Michael A. Gerzon and the physicist Peter Fellgett. As with MS (mid-side) technology, we work with a representation of the sound field. We do not use a channel-oriented transmission method as in Dolby Atmos or Auro-3D, but instead divide the space using bidirectional (figure-of-eight) polar patterns. In first-order Ambisonics, the space is divided into three directions: from front to back via an X-axis, from left to right via a Y-axis and from top to bottom via a Z-axis.
The process aims to reconstruct the sound direction at the listener.
There is also a fourth W-component, a sphere of 360° or an omnidirectional characteristic.
The four components W, X, Y, Z are referred to as the B-format in first-order Ambisonics. By summing W with one of the X, Y, Z components, the sensitivity of the axes can be altered, as you can see below.
Add an omni to a bidirectional microphone and you obtain a cardioid. This is the basis of the directional characteristics of standard microphones.
By mixing the X, Y, Z components with one another, we can also rotate the direction in which the resulting pattern points.
If we combine the rotation and the sensitivity as discussed above, then from the B-format any physical positioning within a sphere can be achieved through the addition and subtraction of W, X, Y, Z.
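Written out, the sums described above amount to steering a virtual microphone inside the B-format field. A minimal sketch, assuming FuMa-style first-order B-format (where the W channel carries a 1/√2 gain) and angles in radians; the function name and the pattern parameter `p` are illustrative:

```python
import math

def virtual_mic(w, x, y, z, azimuth, elevation, p=0.5):
    """Point a virtual microphone at (azimuth, elevation) inside a
    first-order B-format field.

    p = 1.0 gives an omni, p = 0.5 a cardioid (omni plus bidirectional,
    as described above), p = 0.0 a pure figure-of-eight.
    Assumes FuMa weighting: W carries a 1/sqrt(2) gain.
    """
    direction = (x * math.cos(azimuth) * math.cos(elevation)
                 + y * math.sin(azimuth) * math.cos(elevation)
                 + z * math.sin(elevation))
    return p * math.sqrt(2) * w + (1 - p) * direction
```

For a plane wave arriving from the front, a cardioid pointed forward returns full level, while the same cardioid pointed to the rear returns silence, exactly the omni-plus-figure-of-eight behavior described above.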
The principle of Ambisonics can be applied both in microphone recording and in mixing. Just as one can convert a stereo mix to MS with a matrix, one can also convert it to Ambisonics according to the same principles. The disadvantage of first-order Ambisonics is its limited accuracy.
Dividing a space into only three directions is not precise. Therefore, it became necessary to achieve a finer segmentation of space with additional bidirectional components.
As a result, the placement of objects in Ambisonics has become much more precise. In recent years, we speak of higher order Ambisonics (HOA). With a second-order HOA, you have, for instance, 9 channels, and with a third-order HOA 16, as opposed to the 4 channels in first-order Ambisonics. Today, we even reach up to 15th-order Ambisonics.
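The channel counts quoted here follow directly from the order: a full-sphere Ambisonic stream of order N carries (N + 1)² spherical-harmonic components.

```python
def hoa_channel_count(order: int) -> int:
    """Number of channels in a full-sphere Ambisonic stream of the given order."""
    return (order + 1) ** 2

# first order: 4, second: 9, third: 16, fifteenth: 256
```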
A-format //
When recording in Ambisonics, one can make use of soundfield microphones. These are mics that record in the A-format.
The A-format is the raw format that comes unprocessed from the microphone capsules. The capsules are arranged as four cardioids in a tetrahedron. The simplest soundfield microphones record in first-order Ambisonics, and the B-format is obtained by summing the various microphone capsules through a matrix (plugin) so that you obtain the B-format with the components W, X, Y, Z. First-order soundfield mics include the RØDE NT-SF1 and the AMBEO VR.
W = LF + LB + RF + RB
X = LF − LB + RF − RB
Y = LF + LB − RF − RB
Z = LF − LB − RF + RB
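These four sums can be applied per sample. A minimal sketch of the matrix above (capsule order LF, LB, RF, RB as in the equations); note that real converter plugins also apply frequency-dependent correction for the capsule spacing, which is omitted here:

```python
def a_to_b(lf, lb, rf, rb):
    """Convert tetrahedral A-format samples to first-order B-format
    (W, X, Y, Z) with the sum/difference matrix above."""
    w = lf + lb + rf + rb
    x = lf - lb + rf - rb
    y = lf + lb - rf - rb
    z = lf - lb - rf + rb
    return w, x, y, z
```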
With the rise of HOA, there are now also mics that can record in this format. Notable examples include the Zylia Pro and the Eigenmike, the latter with 32 microphone capsules capable of recording up to 4th-order HOA.
Applications //
1/ As a 360° playback system for Facebook and YouTube.
2/ The sound format in Virtual Reality (VR).
3/ The audio in gaming.
4/ In immersive audio recordings such as music or film.
Ambisonics is used both for gaming and VR as well as for the immersive video content on YouTube and Facebook. In video games, a fully head-tracked audio experience becomes possible. In gaming, as a listener, you are situated at the center of the action.
This is ideal for Ambisonics, where the optimal listening position coincides with the central axis of the B-format.
Objects to the left of a gamer will sound to the left, until they turn their head, after which the sound will shift to the center as they look in that direction. This is all made possible by the real-time manipulation of the B-format audio signals that are integrated into all major VR and 360° video platforms. In recording, Ambisonics has the advantage that you capture raw audio, allowing all manipulations to be performed later, as with MS. Through plugins, one can easily convert an Ambisonic A-format into a stereo or immersive playback in Auro-3D, Dolby, Sony 360…
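For a pure head turn (yaw), the real-time manipulation mentioned here is simply a rotation of the X and Y components; W and Z are untouched. A minimal sketch, assuming the convention used above (X points front, Y left, Z up) and the head-turn angle in radians:

```python
import math

def rotate_yaw(w, x, y, z, head_angle):
    """Counter-rotate a first-order B-format field for a listener's head
    turn about the vertical axis (positive angle = turning left).
    W (omni) and Z (vertical) are unaffected by a yaw rotation."""
    c, s = math.cos(head_angle), math.sin(head_angle)
    return w, x * c + y * s, -x * s + y * c, z

# a source hard left (X = 0, Y = 1): after the listener turns 90° left,
# it lies straight ahead (X = 1, Y = 0)
```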

// Ambisonic examples
These are examples of audiovisual productions that utilize Ambisonics technology for audio. When you view these videos in a browser that supports Ambisonics (such as Google Chrome), the audio rotates in accordance with your viewing direction.
3/ Binaural
headphone systems
In its most basic form, listening with our two ears to systems that deliver two-channel sound is a binaural system. This can already be found in texts by Bell (1880) and in the famous patent by Blumlein (1933) regarding stereo. For Bell, a stereo reproduction was considered binaural if it captured the spatiality as we can perceive it with our two ears.
A. Bell: ‘Experiments relating to binaural audition’, Am. J. Otol. (1880); A. Blumlein: ‘Improvements in and relating to sound-transmission, sound-recording and sound-reproducing systems’, British Patent 394,325, applied for 14 December 1931, granted 14 June 1933; partially reprinted in J. Audio Eng. Soc., vol. 6, pp. 91–98, 130 (April 1958).
It is not at all surprising that the term binaural was used in this context.
The word is composed of two Latin words: “binarius”, meaning composed of two, and “auris”, meaning ear. In the context of this website, and when it comes to the reproduction of 3D audio and music, binaural means something different and mainly concerns the use of dummy head microphone technology. Fletcher was the first to experiment with this at Bell Laboratories in the 1930s. He used a dress form nicknamed ‘Oscar’. The microphones were mounted above the cheekbones because they were too large to be placed in the ear canal. This did not entirely conform to the ideal, since the microphones are better positioned in the ear canal, inside the head.
Binaural Recording Technology: A Historical Review and Possible Future Developments
The entire principle of dummy head microphone technology is to approximate as closely as possible the principles of human hearing and to replicate them physically with an artificial head. In this way, the same conditions are created that enable humans to localize sound. The principle of a dummy head, like that of the human head, rests on three interaural differences: loudness or ILD (Interaural Level Difference), time or ITD (Interaural Time Difference) and finally the frequency response of our head. The latter are filtered cues that are added to the sound before it reaches the eardrum. The filtering occurs primarily at higher frequencies with shorter wavelengths, where the torso, the head and the pinnae act as mechanical reflectors or filters. ILD, ITD and these filtered cues are collectively referred to as the HRTF or head-related transfer function.
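To give a feel for the magnitudes involved: the time cue can be estimated with Woodworth’s classic spherical-head approximation (an assumption of this sketch, not a model the text itself uses), with a typical head radius of about 8.75 cm:

```python
import math

def itd_woodworth(azimuth_rad: float, head_radius_m: float = 0.0875,
                  speed_of_sound: float = 343.0) -> float:
    """Interaural time difference in seconds for a source at the given
    azimuth (0 = straight ahead, pi/2 = fully to one side), using
    Woodworth's spherical-head approximation:
    ITD = (r / c) * (theta + sin(theta))."""
    return (head_radius_m / speed_of_sound) * (azimuth_rad + math.sin(azimuth_rad))

# a source fully to the side yields a delay on the order of 0.65 ms
side = itd_woodworth(math.pi / 2)
```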
The ultimate aim of the reproduction of binaural signals is to deliver an acoustic signal to the listener’s eardrums that is equivalent to the signal a listener would hear under natural conditions. There are many advantages associated with the playback of binaural signals via headphones. The most significant advantage is that headphones provide a controlled listening environment. This controlled environment results from two important factors. The first is that the left channel signal is delivered directly to the left ear and the right channel signal directly to the right ear. The signals intended to reach each ear are, in effect, delivered to the appropriate ear. Playback via headphones is not affected by factors such as the location or orientation of the listener.
This is different when using a headtracker, which registers the movement of the head and, through additional calculations, compensates for the listener’s positioning, thereby delivering signals in accordance with the position of the ears. Without headtracking, the delivered signal will be independent of the listener’s location or orientation. Another advantage of the direct transmission of signals via headphones is that there is no crosstalk between the ears. If one were to reproduce a binaural signal via loudspeakers, crosstalk would occur between the left and right channels, leading to positional confusion and changes in timbre. Therefore, binaural is not suitable for reproduction over loudspeakers. Furthermore, the signals are affected twice by a head-related transfer function (HRTF): first, between the original source and the dummy head used for the recording, and second, between the virtual source (loudspeaker) and the human head of the listener. This double application of the HRTF can be avoided by listening on headphones.
A few considerations //
Since every individual is unique and possesses a very specific physical structure, personal HRTFs vary for everyone. Everyone essentially has an auditory fingerprint. Many researchers therefore assume that this is very important information. In theory, if we could measure everyone’s HRTF—for instance, using small microphones in the ears and playing test tones around a head—then we would be able to create a personalized HRTF for every person and reproduce an exact representation of localization via binaural technology on headphones.
In the 1990s, Leckschat and Schmitz of the Institute of Technical Acoustics at RWTH Aachen University searched for a dummy head whose dimensions were chosen for the best localization. To determine these dimensions, tests were conducted with a large group of people. The HRTF finally selected was the one that gave the best localization, or sound placement, performance for a large number of individuals.
Neumann used the same approach for the creation of its famous KU 100 dummy head, also known as “Fritz”, which remains the industry standard to this day. At present, many universities and companies such as Dolby, Sony and Genelec are developing personalized HRTFs, thereby moving away from the concept of an average HRTF obtained from localization tests.
At PXL-Music Research, we contend that personalized HRTFs are not crucial and that an average value derived from various HRTFs is sufficient for source localization. We wish to go further and argue that for an immersive musical experience, the precise placement of sound is subordinate to timbre and tonal quality. An auditory cue from, for example, a shaker positioned to the rear right may be equally effective at 45° or at 30°. Sound quality is at least as important. Therefore, we advocate for an HRTF that is optimized for music (M-HRTF).
For natural positioning, we also believe that other cues, such as head movement or visual input, contribute much more to the localization of a source. Apple AirPods are equipped with a gyroscope and the capability to perform head-tracking, and already actively use this for listening to immersive music via spatial audio on Apple Music. In our research with our own “kiwi”, a pseudo-dummy head (more on this once our paper is ready), several significant observations have emerged that further support this assertion.
Another phenomenon in binaural and headphone playback in general is that a source never appears ideally in the front but is usually perceived as located within the head, between the ears. Inside-the-head-locatedness (IHL) is a problem that certainly arises when a singer is positioned centrally in a mix. In binaural playback, you hear the singer inside your head, between your two ears. This, too, is not acceptable for an ultimate musical experience. In my view, there are few clear explanations as to why IHL occurs.
One thing we ourselves have observed is that when we couple headtracking with an HRTF, the problem almost immediately disappears.
I will take some liberty here. Through minute head movements, we continuously scan back and forth and provide our brain with information about where a sound is located. Furthermore, I am convinced that a sound we do not see, assuming we keep our head completely still, is interpreted by our brain as if it were located between our ears (IHL).
Our brains do this because we do not have eyes on our back and by default place everything between the ears as a potential danger approaching from behind. A minimal head movement or visual confirmation can rule out the danger. In this, I see an explanation for IHL. With static binaural playback, you lack visual confirmation of the sound’s origin and cannot, through head movement, distinguish between front and back.

Binaural and headphones, our belief… amen
At present, we firmly believe that the breakthrough of 3D audio depends on high-quality headphone listening. Everyone has a smartphone and can listen to music via AirPods. For now, headphones are not yet state of the art when it comes to the immersive musical experience described above. But there is hope: in our laboratory we have a prototype headphone with HRTF and headtracking that comes close. With it, you can already listen as if you were in a studio surrounded by 22 speakers, or truly experience the double-choir effect of Willaert as the Doge did 470 years ago in the Basilica of San Marco in Venice.
As stated, it is a prototype – but one that leads us to believe that sooner or later we will forget stereo and consider immersive as the norm.