Saturday, November 30, 2013

Communications security: Watch out for middlemen

Imagine a middle-school classroom. I know we all want to forget, but this will just take a moment. So little Suzie has a crush on little Johnny. She's shy because, you know, middle school, so she passes a note along the row of desks. It's a folded up note with "Johnny" written on the outside; the inside professes her adolescent love. Halfway through class, a note makes it back to her that says "Suzie" on the outside and "Ew gross you're a big booger brain don't talk to me again" on the inside (because, you know, middle school). Not looking so good for Suzie and Johnny, right?

Meanwhile, little Lucy is laughing to herself. Her desk lies between Suzie and Johnny, so the notes were passed through her. Without either of them seeing, she replaced the notes with ones she had written: Johnny got a note that said "You're yucky and have boy cooties so don't talk to me ever," and when he wrote back, heartbroken, his letter was altered to be about booger brains.

Lucy has basically just pulled off an analog man-in-the-middle attack. In cyber security, this is a type of eavesdropping where the Lucys of the Internet insert themselves into a communication between two parties. The attacker intercepts data from both sides, reading information never meant for them and inserting their own messages into the conversation.

The man in the middle breaks the normal flow of communication (source: veracode.com)

The attacker splits the TCP connection between the parties into two new connections: client-to-attacker and attacker-to-server. As you might have guessed, this is bad. What if you're trying to communicate with a bank server so you can transfer money (to pay for that privacy visor you've always wanted)? You send $24.99 for your purchase. If someone has man-in-the-middle'd you, they can not only grab all the passwords and account numbers you're using but also change the destination of your transfer to their own account and change the amount to $2499.00.
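To make "two connections" a little more concrete, here's a minimal sketch of a TCP relay in Python, which is the bare skeleton of what a man in the middle does. It's purely illustrative: the addresses are made up, and a real attack also has to trick the victim into connecting to the relay in the first place (rogue Wi-Fi, ARP spoofing, DNS tricks, and so on).

```python
# Illustrative sketch only: a tiny TCP relay showing how a man in the middle
# sits between two endpoints. Host/port values are made up for the example.
import socket
import threading

LISTEN_ADDR = ("0.0.0.0", 8080)          # where the victim thinks the server is
REAL_SERVER = ("bank.example.com", 80)   # the real destination (hypothetical)

def pump(src, dst, label):
    """Copy bytes from one socket to the other, observing (or altering) them."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        print(f"[{label}] {len(data)} bytes")  # the attacker can read the data...
        dst.sendall(data)                      # ...or rewrite it before forwarding

listener = socket.socket()
listener.bind(LISTEN_ADDR)
listener.listen(1)
client, _ = listener.accept()                   # connection #1: client-to-attacker
server = socket.create_connection(REAL_SERVER)  # connection #2: attacker-to-server

threading.Thread(target=pump, args=(client, server, "client->server")).start()
threading.Thread(target=pump, args=(server, client, "server->client")).start()
```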

Unfortunately, this kind of attack is really difficult to prevent. Any time you trust someone who's not you with confidential information, you run the risk of having that information misused. Using HTTPS or a VPN can greatly reduce the chance that someone will insert themselves into your conversation: HTTPS uses SSL (Secure Sockets Layer) in your browser to encrypt traffic and verify the identity of the servers you're connecting to, and a VPN tunnels your traffic through an encrypted connection to a trusted access point. It is, however, possible for men in the middle to intercept HTTPS connections. In that case, the user's browser generally gives a warning.

You've probably seen one of these before

And the user often ignores that warning because they don't know better. Admit it: you've clicked "proceed anyway," haven't you? I hope you didn't give your credit card information...
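For the curious, here's roughly what that identity check looks like in code: a hedged little sketch using Python's standard ssl module, with www.google.com standing in for whatever site you're visiting. A man in the middle presenting a forged or self-signed certificate makes this handshake fail; the browser warning above is just the friendly version of that failure.

```python
# A minimal sketch of the certificate check HTTPS does for you.
# If the certificate doesn't chain to a trusted CA or doesn't match the
# hostname, the handshake fails -- that's the browser warning in code form.
import socket
import ssl

hostname = "www.google.com"  # any HTTPS site works here

context = ssl.create_default_context()  # loads the system's trusted CAs
with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        cert = tls.getpeercert()
        print("Verified certificate for:", cert["subject"])
        # A man in the middle presenting a self-signed certificate would make
        # wrap_socket() raise ssl.SSLCertVerificationError instead.
```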

But besides the fact that we should know about Bad Things People Can Do (tm), why do we care? As long as we stay away from sketchy sites and use secure connections, we're probably going to be okay, right? Well let me ask you something: do you use an iPhone? A lot of people do. There's a slight problem, though: last month, Skycure announced at the RSA Europe conference that many iOS apps are vulnerable to a man-in-the-middle type of attack that "lets the attacker take dynamic control over the app." Again, when you think of financial data being sent and received through an app, this is bad news.

And it's not just people hacking your apps. Since Apple controls the encryption infrastructure for its iMessage protocol and manages the directory of keys for each user, they can basically have a man-in-the-middle eavesdropping party any time they (or a friendly government agency) want to.

And speaking of friendly government agencies, they're pretty well-versed in being middle men. The NSA's Quantum servers are placed strategically enough to run man-in-the-middle interceptions of traffic headed to Google services.

The Snowden / NSA / communications security blowup has really brought this sort of eavesdropping attack to the forefront of everyone's mind. Hopefully this will mean more development in actually-secure communications, but I wouldn't hold your breath. If it can be built, it can be broken.

Also, since mentioning the NSA has probably put me on some sort of list, I'd like to say hello to whatever bored government agent has been told to read this over and make sure I'm not a terrorist. You should check out my other posts; you don't guest-star in any of them, but they might be a nice change from having to read every Facebook post that calls someone a terrorist.

Thursday, November 28, 2013

Artificial intelligence: Coming soon to a stage near you

Some things are naturally polarizing; people tend to have very strong opinions about things like jazz music, operating systems, and the cats-vs-dogs debate. I know I can't do much to convince you that cats are simply better, but I think I found something that will appeal to both sides of the jazz debate: GenJam.

GenJam is a very special jazz musician; it's a really clever computer program that improvises jazz music in real time. Its name is short for Genetic Jammer: it jams with RIT professor Al Biles using musical phrases developed with a genetic algorithm. It's probably my favorite use of AI because 1) it's impressive that it works at all, 2) it's not a pretentious project, and 3) I don't really know many uses of AI, so my selection pool is pretty small.

AI mostly just makes me think of robots, but everyone knows about robots so that's not as interesting

GenJam was developed by Biles in 1993 and has been consistently improved and extended ever since. It can now improvise over more than 300 songs (including a bunch of Christmas music, in case you need to book an act for an upcoming Christmas party). If you want to see the kinds of things it's capable of, check out this video of Biles and GenJam having an improv session. Those solos weren't preset in GenJam's memory; it had to learn how to create a good solo and use those techniques on the fly, the same as any human musician.

Before talking about how GenJam learned to play jazz, we should talk about genetic algorithms, since that's how GenJam does its learning and it's probably not something you've seen before. AI-Junkie has a fantastic introduction to genetic algorithms if you're interested in the details, but chances are if you're reading this it's for a grade and not because you actually care about what I have to say. So I'll paraphrase the important parts. All information credit goes to AI-Junkie, though.

Unless you slept through seventh grade, you probably know how genes work in nature: mommy and daddy have separate DNA with slightly different chromosomes; these chromosomes can cross over and create new chromosomes for their offspring that are combinations of genes from both parents. The offspring then mate and mix up their own DNA and novel populations are created. 

Chromosomes cross over and mix DNA. This looks like a bad MS Paint drawing but it's apparently an official McGraw-Hill image. I should apply to their graphics department.

Mutations occasionally happen, changing one or two G's to C's and so on. Sometimes this causes a noticeable change in the organism; sometimes it doesn't. Some of the changes that occur (both through mutations and mating) give the organism a slight advantage: they can run faster, jump farther, and make cuter faces at humans. All of these help them survive and give them a better chance of passing on their genes to another generation, so over time the most helpful genes tend to survive while the less helpful ones become much less common. 

This exact process is modeled in genetic algorithms. "Chromosomes" are the possible solutions to a problem. They're encoded as strings of several "genes," each representing a part of the solution. (For example, trying to find all English words that can be made with five letters? Your "chromosomes" will be five-letter combinations and each "gene" will be a letter.) These chromosomes can cross over and swap genes.

You start with a population of n chromosomes made up of random genes. For each chromosome, you check whether it's any good at solving your problem and give it a "fitness score" assessing its usefulness. (Sticking with the five-letter-word example: any chromosome with only consonants or only vowels isn't likely to be a word and gets a very negative fitness score, things with impossible consonant combinations get a slightly less negative score, and actual words get the highest possible positive score.) You then breed your chromosomes so that ones with higher fitness scores are more likely to reproduce and ones with lower scores are more likely to die off (so ZARPB will breed since it's sort of word-like, but ZXSDF should get axed pretty early). You keep doing this until you decide you're done, at which point you should have found a bunch of solutions to whatever problem you're working on. (In our example, we'd stop once we're satisfied we've found the real words among the 26^5 possible five-letter chromosomes.)
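If you'd rather see that as code, here's a bare-bones Python sketch of the five-letter-word idea. The fitness function is a stand-in (a tiny hard-coded word list plus a crude vowel-count heuristic), so treat it as an illustration of the select-breed-mutate loop rather than a serious word finder.

```python
# A bare-bones genetic algorithm in the spirit of the five-letter-word example.
import random
import string

WORDS = {"apple", "zebra", "table", "chair", "bread"}  # toy dictionary
VOWELS = set("aeiou")

def fitness(chrom):
    """Stand-in scoring: real words win, vaguely word-like strings score okay."""
    if chrom in WORDS:
        return 100
    vowels = sum(c in VOWELS for c in chrom)
    return vowels if 1 <= vowels <= 3 else -10

def crossover(a, b):
    point = random.randint(1, 4)        # swap genes after a random crossover point
    return a[:point] + b[point:]

def mutate(chrom, rate=0.1):
    return "".join(random.choice(string.ascii_lowercase) if random.random() < rate else c
                   for c in chrom)

# Start with a random population, then select, breed, and mutate repeatedly.
population = ["".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(50)]
for generation in range(200):
    population.sort(key=fitness, reverse=True)   # fitter chromosomes go first
    parents = population[:20]                    # the rest "die off"
    population = [mutate(crossover(*random.sample(parents, 2))) for _ in range(50)]

print(sorted(population, key=fitness, reverse=True)[:5])
```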

Really, though, just read the intro from AI-junkie. It'll make more sense than I ever could. 

For those of you who like pretty pictures, genetic algorithms basically go like this: 

They seem much easier in flowchart form...
Still with me? Maybe? Okay cool. That's the general idea behind genetic algorithms; here's how GenJam applies it. (Note: all of the following information is from the original 1994 paper announcing GenJam to the scientific community. More on that in a bit.) GenJam has two separate populations to build up: one for individual measures and one for four-measure phrases. It's not trying to select one perfect solo; instead, it's trying to get a feel for what sorts of measures and phrases work well for the song it's learning. Fitness scores for measures and phrases are determined when GenJam plays them to a "mentor," who gives feedback by pressing "g" for good and "b" for bad. Any given phrase or measure can be "upvoted" or "downvoted" more than once, so particularly bad measures can be given a score of "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb" and hopefully never be played again. There are several ways of mutating measures and phrases, including sorting the notes and reversing their order. With the mentor's feedback, GenJam learns how to create good solos for each song in its repertoire.


The basic layout of how GenJam works
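Here's a toy version of that mentor-feedback loop in Python. To be clear, this is not GenJam's actual code or encoding (a real GenJam measure maps its numbers onto a scale that fits the current chord, and the populations evolve over many generations); it's just the smallest thing that shows how interactive fitness scoring works.

```python
# Miniature of GenJam's "mentor" idea: a measure's fitness goes up or down
# each time the listener types g or b (repeated letters add weight).
import random

# A "measure" here is just eight scale degrees; GenJam's real encoding is richer.
population = [[random.randint(1, 14) for _ in range(8)] for _ in range(4)]
fitness = [0] * len(population)

for i, measure in enumerate(population):
    print("Measure", i, ":", measure)
    feedback = input("Type g/b (e.g. 'ggg' or 'bb'): ")
    fitness[i] += feedback.count("g") - feedback.count("b")

print("Fitness scores:", fitness)
# In the full system, low-scoring measures get replaced by mutated or
# crossed-over children of high-scoring ones, generation after generation.
```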

To actually play, GenJam is first given a file that tells it the tempo of the piece, the style (straight or swing), the chord progression, and what's happening in the song [PDF warning] (for example, "four bars intro, Biles solos for eight bars, GenJam solos for eight bars," etc -- check out slide 14 from that link to see what I'm talking about). It also gets the MIDI tracks for any background music (drums, piano, bass, etc.). During a performance, GenJam and Biles listen to each other and improvise on the fly, the way two human musicians would at a jam session.

Biles and GenJam jamming

All the information for GenJam's specific implementation (and the source of the information for the last few paragraphs) is here in the paper Biles wrote for the 1994 International Computer Music Conference. It really is worth checking out: it's extremely accessible, especially as scientific papers go, and is an honestly interesting and engaging read.

Biles did a TEDx talk last year about GenJam, if you'd rather watch and listen instead of read:



This, I think, is one of the coolest uses of artificial intelligence. Not because it's super technologically relevant or useful (I'm firmly in the "the world doesn't need more jazz music" camp) but because music is something we consider uniquely human. And a computer is learning it. It's a different kind of scary than we're used to, and it makes you think. Which, really, is the whole point.

__

Credit where credit is due: I would never have known about this project if one of my professors wasn't really into computer music and always excited to share cool links with the class, so stop by Computer Music Blog and check out Professor Merz's stuff.

Saturday, November 23, 2013

Computer science: the prequel

So I've skimmed a few blogs that have already been written and it's weird to see how many of them are just about C. I mean, yeah it's been incredibly influential and is the precursor of at least a dozen languages, but it's hardly a starting point. It's like writing a biography about Lincoln when you've been asked to summarize the history of the United States. I'm going to take you back a little farther, to the great-great grandparents of computer science.

Hopefully you've heard of Ada Lovelace and Charles Babbage; if not, I'm honestly disappointed in you. It's like being a music major and never having heard of Bach. These two invented computer science back when the lightbulb was still a new and exciting development in technology.

Wow. Such light. So patented.

Charles Babbage is known today as "the father of computing," and for good reason. Babbage, like any other mathematician of his time, relied on mathematical tables for his computations. These tables included the values of logarithms, exponents, and other calculations that we now rely on calculators to solve.

"Table of common logarithms." Instead of running a calculation through a calculator, your great-great-granddad looked it up in a book like this. 

Every calculation in one of these books (and you can see from the picture that there are thousands of them) was done by a computer. Not a computer like you and I are using, but a human computer. That's the origin of the word -- a "computer" in the nineteenth century was a person who did computations for a living. If that sounds like a massive pain to you, you're right: it took a group of three people six months, working nearly nonstop, to do the calculations to predict the return of Halley's comet.

Babbage thought there had to be a better way. In 1821 he conceived of a machine to autonomously compile mathematical tables for polynomial functions; he called this the Difference Engine and quickly sought government funding. Though he initially received some money to work on the project, insufficient funds and personal conflicts with his engineer led to the abandonment of the Difference Engine in 1834. About fifteen years later, he designed a revamped engine (creatively named the Difference Engine No. 2) that would use fewer parts and could calculate seventh-order polynomials with numbers up to thirty-one digits long. Babbage made no effort to actually construct this engine, but the London Science Museum has since built two working Difference Engine No. 2s, the first completed in 1991. You can see a video about how these work here, or you could take fifteen minutes in a car and go see one yourself, since the Computer History Museum (yes, there is one; it's in Mountain View) has the second of those two engines.

And it's absolutely gorgeous.
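The trick that makes a Difference Engine work is the method of finite differences: seed the machine with a polynomial's initial value and its differences, and every subsequent table entry falls out of nothing but repeated addition. Here's a short Python sketch of that idea (the function names are mine, and the mechanical engine obviously doesn't look like this inside):

```python
# The Difference Engine tabulated polynomials using finite differences:
# after the initial setup, new table entries need only addition.

def initial_differences(poly, start, order):
    """Seed the engine: the value and its forward differences at x = start."""
    def f(x):
        return sum(c * x ** k for k, c in enumerate(poly))
    row = [f(start + i) for i in range(order + 1)]
    diffs = []
    while row:
        diffs.append(row[0])
        row = [b - a for a, b in zip(row, row[1:])]
    return diffs  # [f(start), first difference, second difference, ...]

def tabulate(poly, start, count):
    """Produce count table entries by 'turning the crank' (additions only)."""
    diffs = initial_differences(poly, start, len(poly) - 1)
    table = []
    for _ in range(count):
        table.append(diffs[0])
        for i in range(len(diffs) - 1):
            diffs[i] += diffs[i + 1]   # each difference absorbs the one above it
    return table

# x^2 + 1 for x = 0..5 -> [1, 2, 5, 10, 17, 26]
print(tabulate([1, 0, 1], 0, 6))
```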

But wait, there's more! Despite everything your algebra II teacher may have told you, there's more to life than polynomials. Babbage figured that out somewhere between Difference Engines one and two, and began conceptualizing an all-purpose programmable computing machine. This machine was called the Analytical Engine (creative naming wasn't really a thing in the 1800s) and is what really earned Babbage his title as the first computer pioneer:

It was programmable using punched cards, an idea borrowed from the Jacquard loom used for weaving complex patterns in textiles. The Engine had a 'Store' [a cache!!] where numbers and intermediate results could be held, and a separate 'Mill' [a CPU!!] where the arithmetic processing was performed. It had an internal repertoire of the four arithmetical functions and could perform direct multiplication and division. It was also capable of functions for which we have modern names: conditional branching, looping (iteration), microprogramming, parallel processing, iteration, latching, polling, and pulse-shaping, amongst others, though Babbage nowhere used these terms. It had a variety of outputs including hardcopy printout, punched cards, graph plotting and the automatic production of stereotypes - trays of soft material into which results were impressed that could be used as molds for making printing plates. [text source here, but the link and notes are mine]

But hardware's no good without software, right? That's where Augusta Ada King, Countess of Lovelace -- more commonly known as Ada Lovelace -- comes in (you've probably read her dad's poetry). Lovelace was a longtime friend of Babbage and took a great interest in his work. When the Italian mathematician Luigi Federico Menabrea published a paper about Babbage's Analytical Engine in French (not Italian; it's weird), Babbage asked Lovelace to translate the paper to English and to add her own notes. The two collaborated for a year on a series of notes that ended up longer than the paper itself. In these, Lovelace included step-by-step sequences for solving several mathematical problems. These were the world's first programs written for a general-purpose computer. And they were written in 1843. Seriously. The first computer programs were written while the Oregon Trail was the new big thing.

Lovelace fully understood the impact the Analytical Engine could have on the world. She toyed with the idea of an engine that could be fed any series of characters, not just numbers, and have rules to manipulate those characters. Everything, Lovelace recognized, could ultimately be expressed by numbers: Numbers could represent "[letters] of the alphabet. They could represent musical notes. They could represent positions on a chess board" [source]. Electronic instruments hadn't even been invented yet, and Lovelace was imagining computer-generated music. She's been called the "prophet of the computer age," and that title is well-deserved. 
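That insight -- everything can be encoded as numbers -- is still the foundation of computing. Here's a trivial Python illustration; the particular encodings are just one arbitrary choice (and the note numbers follow the MIDI convention, which of course didn't exist in 1843):

```python
# Lovelace's insight in miniature: letters, musical notes, and chess squares
# are all just numbers once you pick an encoding.
letters = [ord(c) for c in "ADA"]          # characters as numbers: [65, 68, 65]
notes = {"C4": 60, "E4": 64, "G4": 67}     # MIDI note numbers for a C major triad
square = 4 * 8 + 4                         # chess square e5 as rank*8 + file = 36

print(letters, notes, square)
```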

Computer history long predates computers. Thinking that all major developments came with the C language and its syntax completely ignores the beauty of a programmable mechanical computer and the incredible foresight that predicted developments over a century before they were ultimately realized.

If you got this far, here's a link to a webcomic that I found about Lovelace and Babbage, which I haven't read yet but seems to not suck. 

Tuesday, November 12, 2013

File sharing and the music industry

The idea of sharing has completely evolved in the last few decades. For thousands of years, sharing something with someone meant you had to physically give them something of yours so your friend (or sibling) could enjoy it, too. With computers, though, everything is different: I can share my music without ever giving up my own copy. It's a fantastic system that's allowed my friends and me to show each other obscure bands we would never have known about otherwise. The issue, though, is that this is technically illegal. Music is, after all, copyrighted, and I don't have the legal right to copy an album to a flash drive to show a friend.

Ever since Napster showed up in 1999, peer-to-peer file sharing has skyrocketed. This is great for regular people like you and me, since we now have this weird endless world of free media at our digital fingertips -- it's estimated that 1.2 billion songs were illegally downloaded in 2010.

And the US is number one for downloading. USA! USA!
Naturally, the people who could be making money from all this music aren't happy. The Recording Industry Association of America (RIAA) insists that illegal music downloads have caused $12.5 billion in losses to the American economy and the loss of 70,000 jobs. Sounds pretty bad, right? Legal-minded people certainly seem to think so, too, since they've developed a track record of severe punishment for what seems like a minor offense. When LimeWire was taken to court, the legal battle lasted five years and ended in a $105 million settlement. People caught illegally downloading music have been forced to pay up to $22,500 per song when taken to court. (Interesting note about that one: the jury decided on a $675,000 total fine for the defendant -- that's the $22,500-per-song part -- and the presiding judge thought a fine that enormous might be unconstitutional, presumably because it's pretty cruel and unusual. He suggested it be lowered to $67,500 -- $2,250 per song, so much more reasonable -- but the First Circuit Court of Appeals reinstated the original fine.) More recently, a major supplier of YouTube videos is being sued because people are uploading unprofessional covers of popular songs.

Better watch out, Pentatonix

This might seem a little much, but pirates are ruining our economy, right?

Aren't they?

Turns out they might not be.

While there are conflicting reports on the effect illegal music downloads have had, one thing is certain: music sales are just not how artists make money anymore. A lot of this is probably because it's really hard to make money selling music in a digital age (and digital music sales are here to stay, piracy or no piracy). Money mostly comes from merchandise and tour tickets, which, hopefully unsurprisingly, pirates pay for. This study [PDF warning] notes that file sharing has greatly increased consumption of recorded music and has made music more widely accessible -- definitely a social advantage for the artist, since it's much easier to build a fanbase when more people are familiar with your work. There is also evidence that, while sales of recorded music have fallen greatly, concert ticket sales and other sources of revenue have grown in the last several years.

As record sales fell, concert sales grew rapidly

But what's the bottom line? Is piracy ruining America or creating new fans who will be more than willing to pay $40 for a t-shirt at a show? Here's my take: if you were never going to pay for a song in the first place, but someone emailed it to you or you downloaded it for free on a whim, you haven't hurt anyone. The RIAA would never have gotten your money anyway. But maybe now you'll go out and buy the album, or a poster, or an expensive concert ticket. Maybe you won't. They're either exactly where they would have been or better off. The band benefits regardless, since they either get money or exposure (and exposure is how you get people to like you enough that they'll give you money).

Anecdotally, I can say that piracy can definitely lead to an increase in revenue. I heard about a band from a casual mention in the newspaper once. Not wanting to commit money to something I might hate, I downloaded their albums illegally. It turned out that I really liked them, and I've since seen them in concert at least four times and bought a fair amount of merchandise. That probably adds up to at least $300, whereas buying their music on iTunes would have cost $60, of which probably only $25 would have made it to the band.

This isn't to say I only ever get music illegally. I actually pay for music; I just don't pay full price. There are plenty of places (many of them shady and Russian) that host legal mp3 and movie downloads for about a dollar per album (maybe four for a huge collection) and about five dollars for a movie, and I'm more than happy to pay that. If media weren't so expensive, fewer people would resort to illegal means to get it. As a song a friend gave me in high school put it: "you've overcharged us for music for years, and now we're just trying to find a fair balance." Basically, the RIAA and MPAA have done this to themselves.

Sunday, November 3, 2013

Data structures: They're pretty important

I don't know why, but there seems to be this trend where people sleep through CS 146. I've noticed it myself, and I've heard enough stories from other people to know it's definitely a thing. First of all, I'm not sure how it's even possible to sleep when Dr. Taylor is running around being excited about Dijkstra. Secondly, though, this is just bad -- that class is probably the most important one in the CS department. Algorithms and data structures are what people want you to know. While I'd love to discuss the intricacies of shortest-path algorithms with you, the topic for today is data structures. I'm not going to pretend that decision is because I find them more interesting than algorithms; it's more because I don't pick the topics for this blog. I don't have that kind of power.

So. Data structures. If you're one of those people who slept through Dr. Taylor's class (you can keep it a secret, I swear), let's review. Data structures are, probably unsurprisingly, different ways to store data -- the structure that your data has, if you will. The key thing is to store data efficiently; the point of a data structure is optimization. Some data structures are better for databases, for example, while others are great for looking up information. This efficiency is a big deal: "efficient data structures are a key to designing efficient algorithms. Some formal design methods and programming languages emphasize data structures, rather than algorithms, as the key organizing factor in software design" [thanks, Wikipedia]. Entire programming languages are structured around data structures. And you slept through that class. Way to go, hero.

Having the right data structure can make a huge difference. For example, suppose you had a bazillion items that you needed to search through. What if you stored all that information in an array? Arrays are simple, space-wise, and easy to use. But when you have to search for your thing, it's going to take O(n) time to find it. And that's not just the worst case; that's the expected runtime. That's bad, especially when you're working with numbers like a bazillion. Instead, what if you got all fancy and used a B-tree? Clearly, you passed that homework assignment in 146. Instead of some multiple of a bazillion, your search is going to take O(log n), and log of a bazillion is a lot smaller than a bazillion.
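If you'd rather feel the difference than take my word for it, here's a quick Python sketch. Python doesn't ship a B-tree in the standard library, so binary search on a sorted list (via bisect) stands in for the tree-shaped lookup; the point is the O(n)-vs-O(log n) gap, not the exact container.

```python
# Toy comparison of O(n) vs O(log n) lookups over the same data.
import bisect
import random
import time

N = 1_000_000                      # not quite a bazillion, but enough to feel it
data = list(range(N))              # already sorted, so bisect can be used directly
targets = random.sample(range(N), 200)

start = time.perf_counter()
for t in targets:
    _ = t in data                  # linear scan: O(n) per lookup
linear = time.perf_counter() - start

start = time.perf_counter()
for t in targets:
    i = bisect.bisect_left(data, t)   # binary search: O(log n) per lookup
    _ = i < N and data[i] == t
logarithmic = time.perf_counter() - start

print(f"linear scans:    {linear:.3f}s")
print(f"binary searches: {logarithmic:.3f}s")
```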

If you need a good cheat sheet for data structure runtimes (expected and worst-case) under various situations, there's a website helpfully called Big-O Cheat Sheet.

You might want to bookmark this

If you take nothing else away from this blog (and let's face it, chances are pretty good that if you're reading this you don't actually care what I have to say about data structures), at least remember how important it is to have the right data structure for your programs. It can make or break your performance. This post on Forbes.com does a great job of showing that, even though you passed the class, you probably don't know your data structures as well as you think. To help with that, Coursera has an algorithms / data structures class and Berkeley has posted lectures online.

You have no excuse now. Go learn things. Since you'll be working so hard to get interviews, you might as well make sure you have a chance at passing them.