Loudspeaker Imaging Theorem

Introduction

I was asked to share my thoughts on what it takes to make a speaker that images well. Hmmm…that’s a bigger question than I thought it was. After I was asked, I started thinking about exactly how I would answer it, and my thoughts made a natural progression from one thing that influences imaging to another. Suddenly I realized that several of these things were interrelated, and that led me to a theory on loudspeaker imaging that I had not fully formed in my mind previously, nor is it one that I have heard espoused before (dangerous territory to be in, I assure you). The interesting thing about this theory is that it actually addresses a possible "scientific" reason why some people prefer analog source material over digital; why people who use speakers with minimalist crossovers, minimum phase crossovers, and especially series crossovers say they image better; why dipole speakers image well; why small mini-speakers image well; and even a few other things tossed in too.

Now, I am an engineer whose job focuses on problem solving – that’s what I do every day. I am presented with complex quality and engineering problems (in automatic transmissions, no less) and I have to arrive at not only the root cause but also the best corrective action. As a result I am trained to look for patterns and relationships and to recognize when such relationships, or correlations, are unlikely to exist by chance. This is the root of statistical problem solving. I guess I can’t help but approach all questions that arise in this same manner to some extent.

However, we need to keep in mind that although I use certain types of physics regularly in my profession, I am not a physicist (even though I approach cosmology as a hobby), and I am certainly not an acoustician either, which means that my theory may be all wet. Then again, good problem solving techniques don’t care what field you are working in, because statistical significance applies to nature in general. So, as an automotive engineer applying his tools to speaker design, here are my thoughts on what impacts a speaker’s ability to image well. Someone, somewhere, may have researched this already (most likely has) and may have come to similar or very different conclusions, I don’t know, but these are my thoughts.

The Theory:

First, we probably need a good working definition of "imaging" and then we can build on that foundation and see where this takes us. I will define "imaging" as the loudspeaker’s ability to recreate a sense of the original localization information that was present in the recording environment in such a way that vocals and instruments seem to be placed in space, even to the point of creating a sense of three-dimensional depth.

Will that definition work OK for everyone?

Now, let’s take a look at what this definition implies. First, it implies that this localization information is present, which is not necessarily a given when you consider how many recordings are mixed. But when they are done correctly this information should be there. Second, it implies that there are considerations that go into loudspeaker design that seem to influence, constructively or destructively, a speaker’s ability to reproduce this information. My theory will only deal with implication number two and will assume that the criteria for implication number one have been fully met. I now present my theory:

In higher mathematics there is something called Fractal Analysis. A fractal is actually a mathematical expression that continues to produce a pattern into the infinitesimal. It is often demonstrated with electron micrographs of crystal edges. No matter how far you zoom in there exists a structural pattern along the edge. This continues to the "infinitely" small, or in this case to the atomic level. I bring all of this up because I believe (well, theorize, anyway) that localization cues within recordings are fractal in nature, because these cues consist of very low level secondary, tertiary, and further delayed sound containing the acoustic "fingerprint" of the original environment. This information contains various levels of ambient frequency, phase, and delay encoding that reduce to a fractal expression. And this is precisely why certain conditions are either constructive or destructive to a speaker’s ability to reproduce this information.

Unveiling lower level imaging information requires removing the multilayered veil that hangs over the sound in the first place. All localization cues, phase and path length relationships, and the "fingerprint" of the original acoustic space (as long as the recording is not mixed in such a way as to destroy this information) exist on the recording in progressive fractal levels and can be retrieved if these veils are stripped away. The key is to progressively strip away these veils, the effects of which are cumulative in nature.

For instance, by nature analog recordings are fractal in structure, but digital sources are not. This is because very low level information is still encoded within analog material, but with digital material there is an information "floor" and nothing at all exists below the floor. It would stand to reason then that analog recordings have a much higher potential of capturing and maintaining the low level localization cues than can possibly be maintained in a digital medium. How much does this affect things? I do not know, only that there is some unknown level of effect. There are still many people who swear that LPs played on high-end turntables create a sense of space that is missing from the digital world. And if what they are saying is true, then I theorize that the reason is the fractal nature of the analog source versus the non-fractal digital source. Possibly this difference can be heard and maybe the digital vs analog arguments carry some weight. (I do not know, but quite possibly this is true with tubes versus transistors too. I would like to hear some of your thoughts on that one.)
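To put a rough number on where a digital "floor" might sit, here is a minimal sketch. It assumes the floor I am talking about is the quantization step of the digital word, and it uses the textbook figure for an ideal uniform quantizer driven by a full-scale sine wave. The function name and the 16-bit and 24-bit word lengths are my own choices for illustration, not measurements of any particular recording or player.

```python
import math

def quantization_snr_db(bits: int) -> float:
    """Theoretical signal-to-noise ratio (dB) of an ideal uniform quantizer
    driven by a full-scale sine wave. Illustrative assumption only."""
    # 20*log10(2**bits) is the ratio of full scale to one quantization step.
    # The 10*log10(1.5) term (about 1.76 dB) comes from comparing the power
    # of a sine wave with the power of uniformly distributed rounding error.
    return 20 * math.log10(2 ** bits) + 10 * math.log10(1.5)

for bits in (16, 24):
    print(f"{bits}-bit word: about {quantization_snr_db(bits):.1f} dB below full scale")
```

Under these assumptions the floor of a 16-bit word sits roughly 98 dB below a full-scale sine, which at least gives a scale for how "low level" the information in question would have to be.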

Here is a personal experience to describe this: Several (many, many) years ago my wife and I were visiting a high-end audio salon. I was (foolishly) discussing the virtues of digital music and also my belief that expensive turntables had nothing to offer over less expensive ones. The salesman chuckled and asked if I would be willing to take part in a little experiment, to which we said, "sure". He proceeded to set up some music on a Rega turntable (which is actually a decent table) and we listened for a while. Then, through the same system, he played the same music on a Linn Sondek with an Ittok arm (I don’t recall the cartridge, but believe it was a Linn as well). It only took a few seconds for both of us (my wife and I) to look at each other and exclaim at the difference we heard. The sound was almost three-dimensional over the Linn; it was virtually "flat" sounding over the Rega. I inquired as to how this could be, and the salesman explained that the suspension on the Linn made it possible to pick up much lower level information that contained this 3-D space. We then listened to a CD of the same music. It was even "flatter" than it was on the Rega. This is just one example. Localization cues are very low level on a recording, just as they are in life. If information retrieval is cut off before reaching this level, a loss of imaging will result.

My theory is based on the fact that psychoacoustically we perceive the size of a sonic image by the ratio of direct to reflected sound and the delay between the two. This is related to the Haas Effect, which tells us that there exist specific thresholds for image broadening and ambience based upon this ratio of reflected to direct sound. We also know that the direction we perceive sound to be coming from is based in part on the frequency response, or the transfer function, of that sound. Therefore I theorize that the inverse is also true, because it must hold to the same rules, and that there exist thresholds on the other end of the spectrum for the perception of localization cues as well. The problem is that the very things that tend to broaden an image are also the things that overshadow or veil the localization cues. I believe these are two different extremes on the same continuum.

I have already mentioned the difference between analog and digital source material. Now, in part two, let me point out some veils that exist in loudspeaker design, how they affect imaging, and how some designs attempt to remove that veil and restore the image cues (whether they realize it or not).

The Room

Essentially all of these culprits are things that "smear" the first arrival sound in such a way as to veil the subtle fractal elements of the original imaging cues. This refers to the Law of First Arrival Sound, which says the first information to reach your ear will establish your perception of frequency response, direction, etc. In most of these cases we are dealing with external, additional sound sources. These additional sources are the veils.

One of the first veils that we all have to contend with is room reflections. We all know that reflections are secondary sources. In normal listening rooms they reach the listener with only a very small amount of delay. If this delay is less than 10 msec then we will tend to hear it as part of the direct sound (this is called the Precedence Effect), but that sound will be smeared due to the mixture of delayed sound, and this will cause us to lose some perception of the subtle imaging elements within the recording. However, since all rooms have this problem to some extent, and most of us listen midfield in fairly dead living rooms, this item may not affect us as much as some of the other items do, but I am only speculating here. I know room reflections can sometimes be a big problem in response, so it stands to reason that the deader the room the better the speakers will image. This is probably where the LEDE (Live End Dead End) arrangement came from in the first place. There are a lot of devices out there designed to trap, absorb, and control room reflections. If you have a live room I would recommend trying them. (See Uncle Dave Elledge here, he has the scoop on taming rooms).
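If you want to check where a reflection in your own room falls relative to that 10 msec figure, the geometry is simple enough to sketch. This is only a rough illustration: it handles a single side-wall bounce using the mirror-image construction, assumes a speed of sound of 343 m/s, and the speaker and listener positions are made-up numbers, not a recommended layout.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature

def reflection_delay_ms(speaker, listener, wall_x=0.0):
    """Extra arrival time (ms) of one side-wall reflection versus the direct sound.

    speaker and listener are (x, y) positions in metres; the wall is the
    line x = wall_x. All positions here are hypothetical.
    """
    sx, sy = speaker
    lx, ly = listener
    direct = math.hypot(lx - sx, ly - sy)
    # Mirror the speaker across the wall; the reflected path has the same
    # length as a straight line from this image source to the listener.
    image_x = 2 * wall_x - sx
    reflected = math.hypot(lx - image_x, ly - sy)
    return (reflected - direct) / SPEED_OF_SOUND_M_S * 1000.0

# Hypothetical layout: speaker 1 m from the wall, listener 2 m from the
# wall and 3 m further down the room.
delay = reflection_delay_ms(speaker=(1.0, 0.0), listener=(2.0, 3.0))
print(f"Side-wall reflection arrives {delay:.2f} ms after the direct sound")
```

For this made-up layout the reflection lands only a few milliseconds behind the direct sound, comfortably inside the 10 msec window.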

The Enclosure

A similar problem, one that does affect first arrival sound because the reflection distances are small relative to the path length to the listener, is cabinet reflections and diffraction (see Paul Verdone’s magnificent spreadsheet and user’s manual for more details on this topic). Baffle diffraction is a major culprit in smearing the first arrival sound’s localization cues.
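To get a feel for how closely diffracted sound follows the direct sound, here is a small sketch under simplifying assumptions: the listener sits on the driver’s axis, the diffracted sound travels along the baffle to an edge and then straight to the listener, and the speed of sound is 343 m/s. The 10 cm edge distance and 3 m listening distance are invented for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature

def edge_diffraction_delay_ms(edge_distance_m, listening_distance_m):
    """Delay (ms) of sound re-radiated from a baffle edge versus the direct sound.

    Assumes an on-axis listener and a single straight driver-to-edge path.
    """
    direct = listening_distance_m
    # Path along the baffle to the edge, then from the edge to the listener.
    via_edge = edge_distance_m + math.hypot(listening_distance_m, edge_distance_m)
    return (via_edge - direct) / SPEED_OF_SOUND_M_S * 1000.0

# Hypothetical case: tweeter 10 cm from the nearest edge, listener at 3 m.
delay = edge_diffraction_delay_ms(0.10, 3.0)
print(f"Edge diffraction trails the direct sound by {delay:.2f} ms")
```

With these numbers the diffracted sound trails the direct sound by a fraction of a millisecond, far closer in time than a typical room reflection.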

I see two ways that this problem is dealt with in loudspeaker designs that attempt to address it. The most common way is to modify the baffle shape in an attempt to reduce the amount of diffracted and reflected acoustic energy. You will see this in speakers like the Avalon designs with their faceted fronts, Waveform Acoustics with their egg enclosures, Thiel’s curved design, Dunlavy’s felt covered baffle, and many other brands that round edges or recess drivers. All of these are attempts to lower the ratio of diffracted sound to direct sound. I also think it is generally accepted that speakers designed in this way tend to image better than more standard designs. Just coincidence? My theory says maybe not, because diffraction affects our perceived first arrival response, so it has already contaminated that sound and affected the low level information as well. The higher level diffraction will cast a veil over the localization fractal information.

The other way this is sometimes dealt with is by making the speaker a dipole (sound radiating equally, but out of phase, from the front and the back together). Dipole cancellation occurs at 90 degrees from the front and rear axis, and as a result there is almost no acoustic energy at the baffle edges to diffract; hence diffraction is almost entirely eliminated in a dipole speaker. Because of this cancellation, dipoles reduce room reflections from the sides too. Now, you may ask, "What about the rear reflected energy?" Well, that will arrive long enough after the first arrival sound that it will not affect the localization cues as much as diffraction from a monopole speaker does. This relates back to the Law of First Arrival Sound. Therefore, according to my theory, dipoles should image better, all other things being equal (which they seldom are, because dipoles have some problems of their own).
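The side cancellation can be pictured with the textbook figure-eight pattern. The sketch below assumes an ideal dipole whose pressure falls off as the cosine of the angle from the front axis; a real dipole panel only approximates this, so treat the numbers as illustrative.

```python
import math

def dipole_level_db(angle_deg: float) -> float:
    """Level of an ideal dipole relative to on-axis, in dB.

    Assumes the idealized cos(angle) figure-eight pattern; returns -inf
    at the side null.
    """
    pressure = abs(math.cos(math.radians(angle_deg)))
    if pressure < 1e-12:  # treat floating-point residue at 90 degrees as a true null
        return float("-inf")
    return 20 * math.log10(pressure)

for angle in (0, 30, 60, 80, 90):
    print(f"{angle:>3} degrees off axis: {dipole_level_db(angle):.1f} dB")
```

The output falls away slowly at first and then collapses toward the null at 90 degrees, which is the direction of the baffle edges and the side walls.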

One of the areas where dipoles, as well as all larger speakers, have real problems is cabinet resonance, the kind of thing that shows up in a cumulative spectral decay plot. These resonances are similar to diffraction but of a different class, because they are simply another external source of delayed sound. In this case we have a cabinet that acts like a big capacitor, storing acoustic energy, delaying it, and then releasing it in the form of a mechanical resonance. This resonance will combine with the sound from the drivers and, like diffraction, will smear the first arrival sound, throwing a veil over the subtle imaging cues.

Small speakers image better. Everyone seems to agree on this. However, the "why" is the argument. Most people assume it goes back to the small speaker having less diffraction, but in reality the opposite is true – small speakers have bigger diffraction problems. I believe the reason is that most small speakers are significantly stronger structurally than most large speakers are. I think this is likely making the difference in the long run. Years ago I noticed that the speakers that kept getting the best reviews had one thing in common: they all had extremely well built cabinets that attempted to eliminate as many panel resonances as possible. Consider the Wilson WATT with its lead-lined walls, the 3" thick baffle on the Avalons, the cast marble baffle on the Thiel CS5s, the extensive bracing inside, etc. My theory says that by improving the cumulative spectral decay of the enclosure (faster decay, less smearing) you will reduce its effect in veiling the subtle fractal information that establishes this sense of the original space that we are trying to recreate. The moral? Watch that structure and those mechanical sound sources; they are real and they will rob you of your low-level information.

The Crossover

Most loudspeakers are multi-way (at least two-way) and as a result use more than one driver, with different acoustic centers, that produce different path lengths to the listener. These different path lengths are a source of time distortion because sound from one driver (usually the tweeter) will arrive before the sound from the other driver (the woofer). Now, you may not be able to hear this offset consciously, but I believe it is the source of another one of our veils. Test me on this and try this experiment. Tilt your two-way speaker back a little so that the path length to your ear is very close to the same for both drivers. Now, play some music and tell me if there is more depth in the presentation. I bet most of you will say that there is. Every speaker has a Zero Delay Plane (ZDP) on some axis, and bringing the voice coils into alignment on the listening axis will have a significant and measurable effect on step and impulse response. Since the step and impulse are cleaner, it stands to reason that the distortion of low level information will be reduced as well; there is simply less time smear present to affect it. Aligning acoustic centers preserves arrival time cues across the whole spectrum.
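For anyone who wants a starting point for the tilt experiment, here is a rough sketch of the arithmetic. It assumes a far-field listener, a tweeter mounted above the woofer, and a woofer acoustic center that sits some distance behind the tweeter’s. Under those assumptions the path lengths match when the tangent of the tilt-back angle equals the depth offset divided by the driver spacing. The 25 mm offset and 150 mm spacing are hypothetical values, not data for any real driver pair.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature

def arrival_offset_ms(depth_offset_m: float) -> float:
    """How much earlier the tweeter arrives on the baffle axis, in ms."""
    return depth_offset_m / SPEED_OF_SOUND_M_S * 1000.0

def tilt_back_angle_deg(depth_offset_m: float, driver_spacing_m: float) -> float:
    """Tilt-back angle that equalizes the two path lengths (far-field approximation).

    After tilting back by theta, the woofer's extra path is
    depth*cos(theta) - spacing*sin(theta), which is zero when
    tan(theta) = depth / spacing.
    """
    return math.degrees(math.atan2(depth_offset_m, driver_spacing_m))

# Hypothetical two-way: woofer center 25 mm behind the tweeter's, drivers 150 mm apart.
depth, spacing = 0.025, 0.150
print(f"Untilted arrival offset: {arrival_offset_ms(depth):.3f} ms")
print(f"Tilt back by about {tilt_back_angle_deg(depth, spacing):.1f} degrees")
```

The result is only a place to begin; the real acoustic centers vary with frequency, so the final angle is best found by listening or by measuring the step response.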

The folks on the "Fullrange Forum" would swear that this is one of the reasons fullrange drivers are "superior" (in their estimation, anyway). However, having listened to a pair of Jordan JX92S’s I can personally say that these things image better than anything else that I have had in my system. I cannot think of a reason this would be the case unless it is the fact that the full spectrum is time and phase coherent. (Well, there is one more reason which I have yet to touch on, and that is the Jordan’s lack of an electrical filter, which creates problems of its own).

Another area that is related to the time domain (path length) but is not exactly the same is the phase response of the speaker. Speakers themselves tend to be minimum phase within their linear operating range, meaning the phase shift is the minimum possible for a given frequency response, with no excess delay added on top, hence the term minimum phase. But when you work with a multi-way speaker, the physical offsets, the resulting path lengths, the acoustic roll-offs of the drivers, their different phase responses, and the crossover used all alter the combined system phase response, making it no longer minimum phase.

I believe in preserving the phase response of the original signal as much as possible in the crossover design. However, this is difficult and its need is even a bit controversial. There is a strong school of thought out there that says failing to preserve the absolute phase response of the original signal is not audible, and double-blind tests seem to indicate that they are correct. However, I don’t necessarily agree and here’s why. The problem with this double-blind testing is that it still did not address one factor that makes all the difference: the recording. Most recordings consist of layers of sound recorded at different times, and even in different rooms, mixed together. If localization cues exist at all in these recordings, they are a confused mess of overlaid cues, which would be meaningless. But do these tests with a proper Blumlein or binaural two-mike recording and I bet people would begin to pick out the difference in depth of presentation between two speakers, one that preserved the minimum phase integrity of the signal and one that did not.

I noticed years ago that some speakers produced a much deeper image with better localization than other speakers did. The first time was about 1980, listening to a pair of Dahlquist DQ-10’s (remember them?). I have also noticed this effect from Thiels, Vandersteens, Martin Logans, Magnepans, and Spicas, just to name a few. Guess what these systems all have in common? Yep, they preserve the original phase response of the signal. I have listened to other very highly regarded speakers with excellent response that just did not produce this depth. They did not preserve this phase information. I think it does matter and it looks like there are still a lot of people out there who agree with me. I have heard a lot of people say, "All a minimum phase crossover does that’s special is pass a square wave, and I listen to music, not square waves." Well, doesn’t it stand to reason that if it passes a square wave with low transient distortion that this would mean better things for music too? Remember, we are after the retrieval of the low-level ambient and placement information. The ability to reproduce that square wave may be important here.

But this is a real problem for the DIYer, so I will propose a compromise. You see, it is nearly impossible for the DIYer to construct a speaker that is actually minimum phase. Aligning the drivers and using first order crossovers will reduce the phase distortion but it won’t get you home. The problem lies in the speakers themselves. Take any good dome tweeter. They all begin to roll off somewhere; most of the well-damped ones do this around 2 kHz. On a Dynaudio D28/2 this roll-off is first order from about 2 kHz to around 600 Hz, where it makes a transition to second order. Now if you add a single capacitor to this you end up with a second to third order acoustic slope and an additional 90 degrees of phase rotation that you do not want. Woofers, also, usually have a rapid roll-off above some point and this introduces some phase shifts as well. In other words it is almost impossible to achieve minimum phase results with simple crossovers.
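To see where that extra 90 degrees comes from, here is a minimal sketch of the phase of a single-capacitor (first order) high-pass filter. It assumes an ideal capacitor feeding a purely resistive load, which a real tweeter is not, and the 2 kHz corner is just an example value.

```python
import cmath
import math

def first_order_highpass_phase_deg(freq_hz: float, corner_hz: float) -> float:
    """Phase lead (degrees) of an ideal first order high-pass at freq_hz.

    Uses the normalized transfer function H(s) = s / (1 + s) with
    s = j * f / fc. The resistive load is an idealizing assumption.
    """
    s = 1j * freq_hz / corner_hz
    h = s / (1 + s)
    return math.degrees(cmath.phase(h))

corner = 2000.0  # example corner frequency in Hz
for f in (20, 200, 2000, 20000):
    phase = first_order_highpass_phase_deg(f, corner)
    print(f"{f:>6} Hz: {phase:5.1f} degrees")
```

The filter contributes 45 degrees at its corner and approaches the full 90 degrees well below it, and this rotation stacks on top of whatever the tweeter’s own roll-off is already doing.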

So the next best thing may be to preserve linear phase tracking between the two drivers, thus keeping them in relative phase with each other for a very wide range of frequencies. This is easily achieved by designing a 2nd or 4th order Linkwitz/Riley acoustic crossover. And I have found that I have been able to arrive at an acoustic 4th order L/R response with very simple electrical networks, generally using nothing more than damped first or second order filters that arrive at flat frequency response and maintain great relative phase tracking. This being the case, I recommend the L/R crossover as a design goal for the DIYer. The reverse null makes it relatively easy to verify, and then you have a good relative phase relationship between the two drivers. This is not to be confused with transient perfect, because there is still some ringing; however, in our quest for a better imaging loudspeaker we are incrementally reducing the distortions that veil our fractal information retrieval. Remember, these veils are cumulative.
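The flat sum and the reverse null can both be checked on paper. The sketch below models ideal 4th order L/R target responses as two cascaded 2nd order Butterworth sections, which is the standard textbook form; it is a model of the acoustic targets only, not of any particular electrical network, and the 2 kHz crossover point is an arbitrary example.

```python
import math

def lr4_responses(freq_hz: float, crossover_hz: float):
    """Ideal 4th order Linkwitz/Riley low-pass and high-pass responses.

    Each is a squared 2nd order Butterworth response, evaluated at
    s = j * f / fc. These are idealized acoustic targets.
    """
    s = 1j * freq_hz / crossover_hz
    butterworth2 = s * s + math.sqrt(2) * s + 1
    low = 1 / butterworth2 ** 2
    high = (s * s) * (s * s) / butterworth2 ** 2
    return low, high

crossover = 2000.0  # example crossover frequency in Hz
for f in (500, 1000, 2000, 4000, 8000):
    low, high = lr4_responses(f, crossover)
    in_phase_sum = abs(low + high)   # drivers wired in normal polarity
    reversed_sum = abs(low - high)   # one driver's polarity flipped
    print(f"{f:>5} Hz: sum = {in_phase_sum:.3f}, reversed = {reversed_sum:.3f}")
```

With normal polarity the summed magnitude stays at 1.000 at every frequency, and with one driver reversed the sum drops to zero right at the crossover frequency. That deep notch is the reverse null you look for on a measurement.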

I have one more comment to make about crossover design and this one may explain why so many people say things like, "I switched to a series crossover and the depth and imaging just blew me away, it was night and day between this and the parallel network!"

Here’s what I think may be going on here. Capacitors and inductors are simple reactive components, but they are also more than that because they all alter the sound in some way. If this were not the case we would all be using El Cheapo brand electrolytics and little iron core chokes, and nobody would be spending $40 for an Audio Cap Theta (well, some of you might, but not as many). The truth is these things do impact the sound. As a result my rule is to avoid overly complex electrical networks with lots of reactive components, keep the crossover as simple as possible for good results, and use high quality components when you do. From my own experience, if you keep adding components to a circuit to correct for everything you can think of, you will find that somewhere along the way your sound became compressed, flat, and lifeless. There is a cumulative effect from all of those components. Let’s use the K.I.S.S. principle for crossover design (use your own words there, OK?). This is why I believe people love first order circuits, especially series designs, even though the objectivist will tell you it doesn’t matter (the people who use them know better and I think their comments often support my theorem). Series circuits reduce the number of reactive components in series with the drivers. Maybe this in turn reduces the thickness of the "crossover veil" and allows better imaging cues to be reproduced.

Conclusion

Remember again, these veils are cumulative. It is just like adding layers of Saran wrap over a box: at first you can see right through and identify what’s in the box. In fact, you would say the Saran wrap is clear. But add several more layers and you find that you can no longer make out what’s inside. The cumulative effect of these layers has added up to veil our box’s contents. It is the same way with all of these items I am listing. If we remove a few layers of the sonic Saran wrap we may find our music is beginning to blossom into a beautiful three-dimensional panorama of sound with instruments just hanging there in space. You may find that you can actually hear height and depth as well as width, and the width may extend past the speakers (I have heard all of these things). I didn’t even go into things like the M.A.R.S. system that Irving M. Fried used in his last incarnations, the old Carver Sonic Holography, and the Polk S.D.A.’s. All of these operate on the same principle of canceling interaural crosstalk between the two stereo channels and your two ears. I have designed a system that did this very simply and very effectively, and on some recordings the effect was almost unbelievable. Talk about three-dimensional, wow! But we can save that for another day. For now let’s just focus on reducing the veils that cover our ability to retrieve and reproduce the complex yet subtle fractal ambient and localization information that exists in the recording.

My conclusion – get rid of the smear campaign against imaging.

These are my thoughts on imaging, what are yours?