The advent of mobile phones has made us tolerant of poor voice quality in our phone conversations. VoIP technology has made voice and video communication accessible and affordable to many people. Service providers and carriers have been striving to provide high-quality voice using VoIP technology. So, what is so different about VoIP compared with the “old” technology?
VoIP carries voice signals over an IP network. This is fundamentally different from the “old” circuit-switched technology, where digital voice signals were carried on dedicated channels. IP was designed to carry data, which means a file you want to download can take its own time. You wouldn’t be concerned whether it downloaded in 20 seconds or 100 seconds, as long as it completed in a reasonable time. A telephone conversation, however, needs to be much closer to “real-time”. You expect to hear a “Hello” from the other end immediately after you’ve said “Hello”, so that you can have a meaningful conversation.
So, what network parameters affect voice quality in VoIP?
Think of bandwidth as a pipe that delivers data. The bigger the pipe, the more simultaneous conversations you can have. But how big a pipe do you need for one conversation? The answer depends on the codec. With a codec such as G.711, analog voice is sampled 8000 times per second and each sample is encoded in 8 bits, so one direction of a call needs 64,000 bits per second (64 kbps). Nowadays, available bandwidth is rarely a concern, as long as everyone at home is not streaming their own movie over the internet at the same time.
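The arithmetic above can be sketched in a few lines. The sample rate and sample size come from the text; the 20 ms packet interval and the 40 bytes of IPv4, UDP and RTP headers per packet are assumptions added here to show why the rate on the wire is somewhat higher than the codec rate. The function names are illustrative only.

```python
# Rough per-call bandwidth estimate (one direction).
SAMPLE_RATE_HZ = 8000        # samples per second (from the text)
BITS_PER_SAMPLE = 8          # bits per sample (from the text)
PACKET_MS = 20               # assumed packetization interval
HEADER_BYTES = 20 + 8 + 12   # assumed IPv4 + UDP + RTP headers, no options


def codec_bitrate_bps(sample_rate_hz=SAMPLE_RATE_HZ, bits_per_sample=BITS_PER_SAMPLE):
    """Raw codec bitrate: samples per second times bits per sample."""
    return sample_rate_hz * bits_per_sample


def ip_bitrate_bps(packet_ms=PACKET_MS, header_bytes=HEADER_BYTES):
    """Bitrate at the IP layer once per-packet headers are added."""
    packets_per_second = 1000 / packet_ms
    payload_bytes = codec_bitrate_bps() / 8 * packet_ms / 1000
    return (payload_bytes + header_bytes) * 8 * packets_per_second


print(codec_bitrate_bps())   # codec rate in bits per second
print(ip_bitrate_bps())      # codec rate plus header overhead
```

Under these assumptions the 64 kbps codec stream becomes about 80 kbps at the IP layer, before any link-layer framing is counted.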
An IP network is expected to be lossy: packets can be dropped at random. Today’s IP networks are fairly reliable, and core networks advertise maximum loss rates of 0 to 0.5%. Even so, packet loss is the main contributor to bad voice quality, because codecs are very sensitive to it. Ideally, there should be zero packet loss.
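To get a feel for what a loss rate means for a call, here is a back-of-the-envelope sketch. It assumes 20 ms of audio per packet and losses spread evenly over time; real losses tend to come in bursts, so treat the result as an average, not a prediction. The function name is hypothetical.

```python
def lost_packets_per_minute(loss_rate, packet_ms=20):
    """Average number of voice packets lost per minute of speech.

    loss_rate is a fraction (0.005 means 0.5%); packet_ms is the
    assumed amount of audio carried in each packet.
    """
    packets_per_minute = 60_000 / packet_ms
    return packets_per_minute * loss_rate


worst_case = lost_packets_per_minute(0.005)   # 0.5% loss, 20 ms packets
print(round(worst_case, 1), "packets lost per minute")
print(round(worst_case * 20), "ms of audio missing per minute")
```

At the advertised 0.5% maximum, that is roughly 15 lost packets per minute, or one small gap in the audio every few seconds.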
We saw earlier that voice is sampled 8000 times per second. With the Real-time Transport Protocol (RTP), each packet typically carries 20 ms or 30 ms of audio, so with 20 ms packets the client expects a new packet every 20 ms. Due to the nature of the IP network, packets may be delayed on the way, and the delay differs from packet to packet. This variation in delay is called jitter.
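One common way to put a number on jitter is the running interarrival jitter estimate described in RFC 3550, which smooths the change in transit time between consecutive packets with a gain of 1/16. The sketch below follows that formula but works in milliseconds rather than RTP timestamp units, and the sample timestamps are made up for illustration.

```python
def interarrival_jitter(send_times_ms, recv_times_ms):
    """Running jitter estimate in the style of RFC 3550, in milliseconds.

    For each consecutive pair of packets, D is the change in transit
    time; the estimate moves 1/16 of the way toward |D| each step.
    """
    jitter = 0.0
    for i in range(1, len(send_times_ms)):
        transit_prev = recv_times_ms[i - 1] - send_times_ms[i - 1]
        transit_cur = recv_times_ms[i] - send_times_ms[i]
        d = abs(transit_cur - transit_prev)
        jitter += (d - jitter) / 16.0
    return jitter


# Hypothetical packets sent every 20 ms; the third one arrives 5 ms late.
sent = [0, 20, 40, 60]
received = [50, 70, 95, 110]
print(interarrival_jitter(sent, received))
```

If every packet takes exactly the same time to cross the network, the estimate stays at zero regardless of how large that delay is, which is the difference between jitter and latency.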
One way to tackle jitter is to use jitter buffers. But this can be a double-edged sword: a big buffer introduces latency, and a very small one can be useless. Usually, it is better to have the jitter buffers on the endpoints and not in the network. This ensures that no additional delays are introduced along the way while still smoothing out the voice signal that is played.
Latency is the time it takes packets to travel from source to destination. Physics tells us that signals take some finite time to travel. Added to that are all the routers and switches these packets have to pass through. So, packets spend a certain amount of time in the network.
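The trade-off can be illustrated with a deliberately simplified model of a fixed jitter buffer. It assumes playout is scheduled at the smallest observed network delay plus the buffer depth, and that any packet arriving after that deadline is too late to play. The delay values and the function name are invented for the example; real jitter buffers are usually adaptive.

```python
def late_packets(network_delays_ms, buffer_ms):
    """Count packets that miss their playout deadline.

    Simplified model: the deadline is the fastest packet's delay plus
    the buffer depth, so buffer_ms is also the latency the buffer adds.
    """
    deadline = min(network_delays_ms) + buffer_ms
    return sum(1 for delay in network_delays_ms if delay > deadline)


# Hypothetical one-way delays for six consecutive packets, in ms.
delays = [50, 52, 75, 51, 90, 50]
for depth in (10, 30, 40):
    print(depth, "ms buffer ->", late_packets(delays, depth), "late packets")
```

A deeper buffer rescues more late packets, but every millisecond of depth is a millisecond added to the end-to-end delay.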
The question is how much latency can be tolerated? The answer is: it depends.
Remember that TV correspondent in the field, staring into the camera through a seemingly long silence before speaking? Assuming the only delay is the latency in the network, the question becomes: how long can you wait for a response after you’ve spoken? If the latency is very low, the responses will seem immediate. If it is, say, 1 second, then you’ll hear a response 2 seconds after you’ve spoken, because your voice packet takes 1 second to reach the other side and the response takes 1 second to reach you.
ITU-T Recommendation G.114 suggests a one-way mouth-to-ear delay of about 150 ms or less for excellent voice quality, and treats up to 400 ms as acceptable.
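These thresholds are easy to turn into a quick check. In the sketch below the 150 ms and 400 ms limits come from the text, while the function names and the "poor" label for anything above 400 ms are choices made here for illustration.

```python
EXCELLENT_MAX_MS = 150    # one-way limit for excellent quality (from the text)
ACCEPTABLE_MAX_MS = 400   # one-way limit for acceptable quality (from the text)


def rate_one_way_delay(delay_ms):
    """Label a one-way delay using the thresholds quoted above."""
    if delay_ms <= EXCELLENT_MAX_MS:
        return "excellent"
    if delay_ms <= ACCEPTABLE_MAX_MS:
        return "acceptable"
    return "poor"   # label chosen for this sketch


def response_gap_ms(one_way_delay_ms):
    """Extra wait before hearing a reply: your voice out plus the reply back."""
    return 2 * one_way_delay_ms


for delay in (100, 318, 1000):
    print(delay, "ms one-way:", rate_one_way_delay(delay),
          "| reply arrives", response_gap_ms(delay), "ms later than in person")
```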
So, how can we get a rough measure of end-to-end latency? I learnt a trick here at Bandwidth.
Assume you have two mobile phones, Phone A and Phone B, plus a laptop and a microphone. The end-to-end delay can be measured like this:
- Connect the microphone to the laptop.
- Run the Audacity application on the laptop.
- Call Phone B from Phone A.
- Answer Phone B, put it on speakerphone, and place the microphone near its speaker.
- Start recording in Audacity.
- Tap on the table so that Phone A picks up the sound. The microphone captures this tap directly through the air.
- After some delay, the tap is played from Phone B’s speaker and captured by the microphone a second time.
- Stop recording.
- Look at the waveform shown in Audacity and identify the first tap and its second occurrence.
- The time difference between the two represents the end-to-end delay.
This gives you a fairly accurate measure of the delay.
In the figure above, the first noise is the tap on the table, caught by the microphone through the air. The second noise is the same tap played back by the phone that had the microphone near its speaker. As can be seen, the selection starts at 10.311 s and ends at 10.629 s. The difference, 318 ms, is the end-to-end delay.
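If you would rather not read the two taps off the waveform by eye, the same idea can be sketched in code: find the first two moments where the recording gets loud, and subtract. The example below is a minimal sketch that runs on a synthetic signal, not a real recording. The threshold, the minimum gap between taps, and the function name are all assumptions you would need to tune for an actual recording exported from Audacity.

```python
def find_onsets(samples, sample_rate_hz, threshold=0.5, min_gap_s=0.1):
    """Return the times (in seconds) at which loud events begin.

    A new onset is reported when the absolute amplitude reaches the
    threshold and at least min_gap_s has passed since the last onset,
    so a single tap is not counted more than once.
    """
    onsets = []
    for i, sample in enumerate(samples):
        t = i / sample_rate_hz
        if abs(sample) >= threshold and (not onsets or t - onsets[-1] >= min_gap_s):
            onsets.append(t)
    return onsets


# Synthetic two-second "recording" with two short taps (illustrative only).
rate = 8000
signal = [0.0] * (2 * rate)
for start in (4000, 6544):               # taps at 0.500 s and 0.818 s
    for i in range(start, start + 80):   # each tap lasts 10 ms
        signal[i] = 0.9

first, second = find_onsets(signal, rate)
delay_ms = (second - first) * 1000
print(f"end-to-end delay: {delay_ms:.0f} ms")
```

On a real recording, background noise and the echo of the tap in the room can trigger false onsets, so it is worth checking the detected times against the waveform.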
So the next time you hear a distorted voice on your phone, you know what could be the possible causes.