Audio Coding Report October 19th, 2013
Abstract:
For our main sending scheme, we used a form of multiple On-Off Keying (OOK). It differs in traditional OOK in that 0’s are sent as an actually frequency, as to increase the robustness with which a zero is sent. The error correction we used is a combination of Reed Solomon codes, Hamming codes, scrambling/interleaving techniques, and a primitive hash-checker to verify if the message was correct or not on the receiver’s end. Aside from the theoretical tools we used, we also had a number of promising practical features in our implementation. For example, our scheme is completely dynamic and real-time. In particular, not only does the receiver detect the start and stop on the fly so that any variable-length message could be sent, but it also processes in real-time as it’s receiving the message, which takes away what would’ve been a huge hit in bitrate which comes from processing tons of data after receiving it. Finally, for synchronization, we would send long, strong chirp tones at the beginning and end of our message, and smaller chirps on top of each small burst or packet, so that our receiver would auto-sync with each packet it received. Overall, our scheme gets a rate of over 300 bits per second when sending a reasonably long message. This bitrate is the number of actual message bits we sent divided by the time it took to send it, which is measured from when the receiver detected the start of the message until when the receiver was completely done decoding the message.
Planning Stage:
Before starting off, we had to decide which encoding scheme we should use. The first one we discussed in class was on-off keying (OOK). We decided against using pure OOK because as simple as it is to implement, it is extremely slow. For example, if we use a message duration of 0.05 seconds, which is reasonable for the hardware we have, that would only be 20 bits/second, and that is before any sort of error correcting (although we can reasonably assume that the probability of error would be very close to 0 given an intensity threshold and adequate synchronization and equalization). Next, we covered frequency-shift keying. This is a very enticing option, because as opposed to OOK, we can send multiple bits in one message, which would really ramp up the bitrate. However, the basic schemes that we learned in class can always be modified to increase their bitrate. For example, OOK doesn’t make use of the entire bandwidth. So in order to increase the bitrate, we can use a modified form of OOK where instead of having an “on” frequency and an “off” frequency (OOK normally uses silence for a zero, but we found it would be much easier to use a different “off” frequency), we have multiple “on” and “off” frequencies. We can divide our bandwidth into many slices, and have two frequencies in each slice, one for a bit being a 0 and one for a bit being a 1. The advantage of this is that to send n messages per second, we only need 2n frequencies. However, we need a good decoding and synchronization scheme, because we are playing n frequencies at once for the n bits. Instead of this, we can use FSK, where we only have one tone per message. This is appealing, because decoding is physically easier since we’re just picking out one frequency. In addition, we are still sending multiple bits per burst, so it isn’t going to be as slow. Furthermore, if we can use multiple intensities of the tone, we can increase the bitrate even more, because we’ll have more options for messages within our bandwidth. Thus, we had to decide between two good options, FSK and OOK on multiple tones. We looked online for other schemes and found the likes of phase-shift keying and orthogonal frequency-division multiplexing. However, we weren’t convinced that they would be that much better than our own implementations. For example, OFDM would require very little sensitivity to noise, and we have to be careful with bandwidth, because our speaker/microphone combination isn’t as sensitive as we would like. Thus, we decided to work with both the schemes we had thought of, and ultimately pick the one that ended up working better.
First Approach: Frequency Shift Keying
I. A Note on Languages, Libraries, and Implementation
In general, the encoding methods and error correcting schemes found in this report can be replicated in any higher level programming language, as demonstrated by our dual Python/Matlab implementations. However, the tools used do reveal the overall design approach and software engineering principles behind our project, so I will briefly describe the reasoning behind the technical implementation of our modulation system. For FSK encoding, we essentially followed the agile programming philosophy, creating modular pieces of code that functioned on a high level, then deconstructing for more subtle details. As such, we aggressively used Python’s extensive library frameworks to quickly implement a working audio encoder/decoder, relying on the libraries’ black-box functionality and down-dressing from there. I believe this was elegant software engineering; at the early stages, it seemed better to be bogged down by algorithmic considerations rather than syntactic ones. For example, the self-explanatory WinSound library was initially great for writing a basic sending-computer class, as playing tones for a given frequency and duration was a simple method call. Its ease of use allowed us to quickly sketch out and implement the outline of our FSK algorithm. However, WinSound was decidedly inflexible when we wanted to convolve signals, play simultaneous signals, or even play slightly more complicated tones, such as a chirp. To accomplish that, we ended up using Python’s system library Wave to build the sinusoids ourselves, deobfuscating the black-box. PyAudio , Numpy, and Scipy deserve special note. Numpy and Scipy essentially replicate Matlab functionality, making tasks such as Fast Fourier Transforms and convolution simple. PyAudio was our central method of interacting with the machine’s speaker and microphone samples, allowing us to load audio frames into buffers we could manipulate with Numpy, set sampling rates, output .wav files, and create busy-waiting synchronization loops. Much of the mechanical functionality of was left to PyAudio (note- this refers to the actual interaction with audio drivers and system protections. We had to consider the frequency response of the system, audio delays, etc., manually.) The other libraries- Sys, Random, Time, Itertools, Collections- were used just to access system variables or take advantage of Python’s functional programming methods.
II. Overview of Implementation
Using frequencies between 250 hertz-1200 hertz, we create 64 distinct orthogonal signals corresponding messages of bit length 6. After mapping length-6 messages to the signals, we transmit the frequencies over our audio channel. For context, here is the dictionary of messages to hertz:
{'110000': 8266, '110001': 8433, '110101': 9101, '110100': 8934, '010100': 3590, '010101': 3757, '001100': 2254, '001101': 2421, '011110': 5260, '011111': 5427, '001001': 1753, '001000': 1586, '011011': 4759, '011010': 4592, '000110': 1252, '000111': 1419, '000011': 751, '000010': 584, '100100': 6262, '100101': 6429, '111100': 10270, '111101': 10437, '100010': 5928, '100011': 6095, '101110': 7932, '101111': 8099, '111001': 9769, '111000': 9602, '101011': 7431, '101010': 7264, '110011': 8767, '110010': 8600, '010010': 3256, '010011': 3423, '010111': 4091, '010110': 3924, '110110': 9268, '110111': 9435, '011000': 4258, '011001': 4425, '001111': 2755, '001110': 2588, '011101': 5093, '011100': 4926, '001010': 1920, '001011': 2087, '101101': 7765, '000000': 250, '000001': 417, '100111': 6763, '100110': 6596, '000101': 1085, '000100': 918, '111111': 10771, '111110': 10604, '100001': 5761, '100000': 5594, '010001': 3089, '010000': 2922, '101100': 7598, '111010': 9936, '111011': 10103, '101000': 6930, '101001': 7097}
On the receiving end, we open our audio stream, waiting for the start tone. Once the start tone ends, we read a pre-defined CHUNK of audio frames to a buffer (our CHUNK was 1024 frames per 1 second.) We take the Fast Fourier Transform and Inverse Fast Fourier Transform to translate our buffer into the frequency domain. We then find the peak frequency over the frame, multiply it by the sampling rate, and divide it by our CHUNK size. We also added intensity analysis, so we could transmit approximately 3 more bits in a message by mapping frequency-intensity combinations. To quickly summarize that approach, we use the same buffer analysis, but instead of using Fourier transforms to work in the frequency domain, we simply find the amplitude peaks. For context, here’s the (important part of the) decoding code:
data = stream.read(CHUNK)
b = numpy.fromstring(data, dtype='int16')
fftData = numpy.fft.rfft(b)
highestFreq = fftData[1:].argmax() + 1
frequency = (highestFreq * RATE)/CHUNK
amps.append(numpy.abs(b).mean())
freqs.append(frequency)
frames.append(data)
We split our buffer, now in the frequency domain, into the number of messages we expected to receive given the time period. To be clear, this does not require prior communication except the length of each message- it is clear that a 10 second transmission of 100 millisecond messages should be split into 100 segments. We then rely on the orthogonality of the sent signals; the inner product of our received frequency with the set of possible sinusoids would be maximized by the correct symbol. Therefore, we match the received signal and possible signals with the largest inner product. Then, we take the mode of the match frequencies for each CHUNK of the buffer. This should correct for synchronization errors that are less than half of our message length (in this case, sync errors less than 50 milliseconds.) For context, here is the implementation of our inner product:
hard = range(250, 11000, (11000-250)/64) #for i in range(len(li)): #li[i] = hard[min(range(len(li)), key=lambda i: abs(hard[i]-li[i]))] #for i in range(len(li)): #li[i] = min(li, key=lambda x:abs(x-li[i])) takeClosest = lambda num,collection:min(collection,key=lambda x:abs(x-num)) for i in range(len(li)): li[i] = takeClosest(li[i], hard) inQuestion = collections.Counter(li) return inQuestion.most_common(1)[0][0]
Finally, we decode our received frequencies with the original codebook.
III. Error Analysis of Frequency Shift Keying
Now, we can look at the error analysis of our schemes. Beginning with our attempted FSK scheme, we tested two different message durations: 250 ms, and 100 ms. Without implementing any sort of continuous synchronization, we noticed that we had one error every 16 messages. Since we sent 7 bits per message, this would mean that every 112 bits, we’d corrupt 7 consecutive bits. However, the earlier messages were usually decoded quite well, while the error appeared in the later messages. Therefore, it stands to reason that if we had continued with this scheme and implemented periodic synchronization, say, a chirp every 10 messages, we would not have seen any error at all. On the other hand, when we sent messages with a duration of 100 ms, we definitely encountered a lot more error. We encountered 1 erasure approximately every 5 messages, and 1 incorrect message every 15 messages. This ended up looking like an overall error rate of around 27%, which is not good at all. We could see that the error rate was due to our decoding scheme. We tried to decode by taking 100 ms chunks of our received frequencies and figuring out what the mode frequency was out of those. However, while debugging, we printed out the frequencies, and could see that while the chunks were supposed to be mostly contained within the 100 ms time period, there was a lot of overlap, and that prevented the decoding from being accurate. Once again, periodic synchronization would have alleviated this problem. In our ideal FSK scenario where synchronization would be implemented perfectly, we would be able to send a message about every 1 ms. Obviously, this would be close to impossible in practice, as that would imply a bitrate of 7000 bits/sec.
IV. Shortcomings of Frequency Shift Keying
Our implementation summary has something of an elephant in the room- it makes no mention of our synchronization mechanism. Initially, our synchronization method was similar to our buffer/frame method for decoding- the receiving computer simply busy-waited until it heard a special frequency/intensity tone, waited for the tone to finish, and began analysis. This would be repeated every 20 or so messages, and provided sync accuracy to 115 milliseconds. This is by no means negligible. But by relying on the orthogonality of the sent signals and by finding the mode inner product of each CHUNK, this could be corrected by ensuring the received computer would trim the sync tones to the expected size. Let me expand on that. Essentially, thanks to FSK’s inner product minimizing a wrong orthogonal signal matching our received signal, each CHUNK decoded should overwhelmingly map to the correct signal, ensuring the entire data segment is decode correctly. At worst, we can expect a linear packet shift- the sync tone will last an extra message-duration. Hence the trim function. This functioned almost perfectly for sending 250 millisecond messages- which is not quite our goal bitrate. Including our intensity mappings, we could reasonably expect to send 10 bits per message. We aimed to send 1 message every 50 milliseconds. A quick glance at our decoding of each frame (1 CHUNK/second) proved this was completely plausible; in fact, we were accurately detecting frequencies down to a fraction of this time period. However, our sync accuracy was a problem. 115 milliseconds delay leaked into many messages, pushing many tones into the next- and even next-to-next- message. As a result, we would have many duplicate messages following any resynchronization. The obvious answer to this was to implement the chirp functionality and convolve with it the received chirp. However, we ran into many implementation problems along the way. At this point, our other encoding scheme had already surpassed our goal bitrate for FSK.
Main Approach: Modified Multiple On-Off Keying
I. The overall approach
For our main approach, we used OOK with actual frequencies sent for 0’s. What we would do is send multiple bits at once for a single burst. For example, if we sent 8 bits at once, then we would divide up the frequency bandwidth we use into 8 equally-spaced bands, and divide those in half, giving us 16 different evenly spaced frequencies. For each of the 8 frequency bands, the frequency on the bottom half of it would correspond to a zero in that position, while a frequency in the top half would correspond to a 1 in that position. On the receiver side, an FFT would be taken of the burst. Whichever of the 0 or 1 frequencies was higher would determine what the bit in that position should be received as. It turns out that this scheme was very robust, with an error rate of at most around .2%, but very frequently an error rate of 0 (this is assuming correct syncing). The reason we decided to use a specific frequency for 0’s, as opposed to traditional OOK, is because we thought that 0’s just don’t transmit too well through our channel, as opposed to a dedicated frequency. And it’s hard to measure the intensity of a 0 tone, especially with a high noise environment. Furthermore, by having a measurable intensity for both the 0 and 1 frequencies, we could detect erasures if the two intensities weren’t an ample amount apart.This had disadvantages as well; we have to send twice as many frequencies as traditional OOK (OOK would send frequencies for just the 1’s, where we have to send frequencies for the different 0’s as well). As a result, we are twice as band-limited. We accepted this as a necessary loss though, because we preferred the robustness and flexibility that our scheme offered over the reduced band-usage of traditional OOK.
II. Decisions
The tradeoffs in our project primarily consisted of balancing bits sent per burst, the duration of a burst, and the parameters of frequency bandwidth, We wanted to send as many bits as we could per burst, but this required us to strain our frequency bandwidth even more. That is, different frequencies would have less and less wiggle space, leading to inter-burst interference. Moreover, sending too many bits per burst reduced the intensity of each frequency. The signal sent out had to be normalized to an amplitude of 1, which means the more frequencies I try to send, each frequency would get a smaller and smaller slice of that amplitude-1-pi, and be more easily corruptible. On the other side of the tradeoff is the duration of the burst. If I try to push the duration too short, then I won’t be sending my messages with enough energy, and so noise could easily corrupt it all. Moreover, this makes each burst very burst-error prone, because a clap or some other burst-error would likely last over many bursts if each one is really short. Of course, I also can’t make my duration too long, or else I take huge hits in transmission time. Not only that, but I also take huge hits in computation time as well, because longer-length bursts require more time to convolve/match filter. Intermingled in between these two tradeoffs is the frequency band, which also came with its own set of problems. We want as large a frequency band as possible to have as much room for different bits as possible. We also wanted to have as high frequencies as possible, so that many periods of each frequency would fit into the duration. But both of these wants came with costs, such as finicky performance of equipment for higher frequencies, and the channel in general preferring specific frequencies. In the end, to make our decisions on these tradeoffs, we tried to find the points of diminishing returns for all of them. Sending 4 bits per burst instead of 2 bits per burst wasn’t much harder at all. However, sending 64 or 128 bits per burst as opposed to 32 bits per burst would be a little harder, and might be asking for too much while giving too little. On the other side, sending a .5 second burst as opposed to a 1 second burst is fairly simple. However, sending a .02 second burst as opposed to a .05 second burst was asking for too much and giving too little. Throughout all this, we also tried to calibrate what frequency band is worth using, and what we could handle. So in the end, the blend we chose (which could probably be further optimized) was 32 bits per burst, with .1 seconds per burst. The frequency bandwidth we used was between 2 kHz and 10 kHz.
III. Synchronization (and what went wrong in the presentation)
We wanted to implement something that was completely dynamic, variable, and real-time. Syncing was a very important factor. The vision we had in mind was that the receiver would stand idly listening and waiting. It should then detect a start signal, and with each incoming packet, it should auto-sync, and finally, it should dynamically detect a stop-signal. To accomplish this, we sent a linearly ascending chirp signal at the beginning that’s twice as long as our burst duration of .1 seconds. At the end of our message, we send another linearly ascending chirp signal with twice the normal duration of a burst. Within our message, we send a linearly ascending chirp along with each and every burst, as displayed in the figure above. The start and end chirps are particularly drawn-out so that they will absolutely be detected and so that they are different than the auto-syncing chirps sent with each burst. The auto-sync chirps sent with each burst are used to pinpoint where each burst started and ended. The pinpointing is extrapolated from the maximum of the convolution of the signal we receive with a time-reversed chirp. Sending a chirp with each burst was a bold move, but it pays off in that we remain perfectly synced throughout a long drawn-out message, and very robustly too, down to the sample. However, every now and then (which was frequent enough to destroy our scheme), these short-and-frequent auto-sync chirps get mismatched, and the receiver thinks that it’s a whole half a burst-duration ahead or behind where it should be all of a sudden. This should never happen, as we expect the receiver to only be off by a small portion of the duration at any given burst. For the presentation in class, we had designed the receiver to terminate with a syncing error if it ever thought that it was behind or ahead by half a duration all of a sudden. This is why our messages would always cut off during a very long message, because every now and then there would be a syncing error, and so our receiver would just terminate. What we should’ve done is, when the receiver accidentally thinks it’s off by more than is reasonable, we should assume that it’s just an infrequent chirp-detection error and assume that we should just advance by a burst-duration anyways. This should solve the problem, because we found that the receiver would almost always simply advance by a burst-duration anyways because it’s usually perfectly synced. Another problem with syncing is how we dynamically detected the stop and the start. We thought and thought, and the best we thought of was that we want some comparison. So what we did was we just kept on cross-correlating what we received with the start chirp and measuring the intensity (the max) of that cross-correlation. If the intensity we measured was a significant factor higher than the previous several measured intensities, then we assumed that we got a start. Similarly, for the end chirp, we would keep cross-correlating until we detected an intensity a large factor higher than the previous several measured intensities. The only problem with the previous scheme was figuring out which “large factor” to use. For our presentation, we had too small of a “large factor”, so we could get false-starts simply from someone whistling like a chirp. So for making our project better, we had to be more picky and selective and ask for a much larger factor, which we definitely could do easily. With all these syncing elements combined, we found that we had a very robust transmission with extremely low error rates, if there were any errors at all.
III. Error Correction Because we were focused on syncing issues and bugs in the implementation of the syncing protocols, which contained the main reason why our scheme would simply not work at all, we didn’t have time to add what few lines the ECC’s would be. But after our presentation, it took us not much time to add them in at all. Since our experimented error rate is so low (about 0.1%) we are using just a Hamming Code with a high rate as the core part of error correction. The length of the total bits are counted first, and recorded in 17 bits which will be sent along with the data bits. This is because data are sent in 32 bits in a packet and the end is paddled with zeros to 32 bits. The receiver end knows which 17 bits represents the length of the message.
After the length Count, the entire message is passed into a hash function which return a 5 bits hash-code for the total bits. The hash function guarantees the authenticity of the entire message so that after decoded from the hamming code, the receiver end has a way to make sure the received message is not corrupted. The receiver end knows the hash function. The Hamming code is a Hamming(31, 26). For every 31 bits sent there is 26 bits from the data, which gives a rate of 83.9%. The Hamming code is expected to detect and correct single error/erasure. Since the error rate is so low, it is not likely to have two bits error in every 31 bits and even if that does happen, the hash function will know that the message is wrong. At last, the bits will be randomly permuted across the packets. The permutation is reversible and the receiver end knows the pattern of permutation. This scheme is to deal with lost packets mainly. Since there is no re-transmission at the moment, a lost packet would be devastating. Since the hamming code can correct single error/erase, if the errored packet’s bits are spread across the the whole message by the random permutation, the lost packet would cause only one error in actual groups of bits, and the lost packet can still be recovered. This utilizes the figure of the Hamming code and the low error rate.
IV. Performance To measure performance, we had our receiver start timing when it detected the start of our message, and then end the timer when it was completely done decoding and had the original message. Error rates would be calculated by manually passing in what should have been received and finding discrepancies with what was received. Finally, we also checked for several things like whether our hashcode matched up or not, whether the length was too short or not, and how many errors the Hamming code had to correct (if the Hamming code was really strained, it might lead us to raise our eyebrows and chiggity check ourselves). Our data rate would be the number of bits we wanted to send (excluding all the extra bits we add on) divided by the amount of time the receiver saw from the beginning of the message to the end of decoding it. When all was said and done, we got over 300 bits sent per second. One part of why this is a little higher than we’d expect is because our receiver is processing when the signal starts somewhat slowly (it is a large convolution). So there is a computational delay between when the signal actually starts and when we detect the signal to start. In real life, we thought that when you detect the signal to start is all that really matters, so we used the signal start detection time, which would admittedly push the time up slightly. Another factor which makes our bitrate so high is the fact that we are doing the lionshare of computation while we’re listening, and not after we’re done listening, so almost as soon as we detect a stop, we’re ready to spit out what we got. V. Closing thoughts We were satisfied with our ultimate algorithm and data rate, but suffered from mediocre software engineering principles. A disporpotante amount of time was spent on this project. In retrospect, our code could’ve been modularized more, workload could have been more evenly allocated, and the overall scheme could have been better planned from the beginning so that our legacy code from previous attempts and homework assignments could have been of more use. A lot of agony came (a long with copious amounts of wasted time) from using unacceptably poor equipment without realizing it. And as soon as we ran the scheme with a half-decent external speaker and microphone, we got great results. Some more techniques that should have been employed from the start was to test our scheme in the ideal case extensively. As soon as we just started passing in our signal directly, with AWGN, it helped us to debug our code much faster. The theory was correct, just the coding involved was perilously error-prone, so we needed an easy environment to debug in, and passing in ideal signal with AWGN was definitely that easy environment. Our greatest takeaway from this project was the practical implementation of the various coding schemes we encountered in lecture. We already understood the mathematical rigour powering the different algorithms, but actually balancing the various tradeoffs between bandwidth, energy, and bitrates supplied the intuition behind the coding, illuminating the subtleties of classical digital communication.