Sunday, January 17, 2010

Sample Size, Hip Hop and Herpes

I studied statistics in college. Not all of it stuck, of course, which might have had at least a little to do with my unwillingness to go to class at even semi-regular intervals, but one thing that did stick is sample size.

Essentially (up to a point), the more data you have, the more confidence you can have in the conclusion you draw from that data set.
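For anyone who wants to poke at that claim, here is a minimal sketch in Python. Everything in it is my own assumption for illustration: the "data" is just random draws from a bell curve centered on zero, and `spread_of_sample_means` is a made-up helper name. It takes lots of samples of a given size, averages each one, and measures how much those averages bounce around.

```python
import random
import statistics

def spread_of_sample_means(sample_size, trials=500, seed=0):
    """Standard deviation of the sample mean across many repeated samples.

    Assumes (purely for illustration) that each data point is a draw
    from a normal distribution with mean 0 and standard deviation 1.
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same numbers
    means = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)

if __name__ == "__main__":
    # Bigger samples should give averages that wander less.
    for n in (1, 10, 100, 1000):
        print(n, round(spread_of_sample_means(n), 3))
```

The spread should shrink as the sample size grows, but slowly (roughly with the square root of the sample size), which is about what "up to a point" means here.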

Let's look at it mathematically at a very high level before I apply it to some semi-real life situations and finally bring up why I bring it up (it actually has to do with the next blog entry I plan on writing).

Imagine that Truth is a line. You know it's drawn on a set of X-Y axes, but you don't know the function that determines the line. We're going to examine how easily (and how confidently) we can tell the "truth" from limited data. (I'm not going to apply numbers to any of this... it's all just abstract generalities.)

Let's start with a blank slate. No data points. We literally know nothing about the situation.




We COULD guess the Truth, right? ("Chocolate Babies".) But... yeah.

Let's try one step up from that. A big step, but still in a rough spot:




So we have one data point (the green dot). We know that the green dot is a part of the Truth (we're ignoring the chance that the data we have is bad and not part of Truth). But what can we do with that? There are still so many possibilities for Truth. Note that each of the gray lines represents a possible Truth.




Let's take a look at two data points.




At a glance? We've got it. After all, we can "connect the dots," right?




Of course we can. The shortest distance between two points is a straight line. Have we discovered Truth after only two data points? Does it only take two licks to get to the center of a Tootsie Roll Lollipop?

Well, what about this?




Or this?




There is, after all, no guarantee that the line is straight. So we're almost back to where we were with one data point:




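If you want to convince yourself that two dots don't pin anything down, here is a small Python sketch. The function name `curve_through` and the `bend` knob are my own inventions for illustration. It starts with the straight line through the two dots and adds a bump that happens to be zero at both of them, so every value of `bend` gives a different curve that still hits both dots.

```python
def curve_through(p1, p2, bend):
    """Return a function whose graph passes through p1 and p2.

    bend = 0 gives the straight line; any other value gives a parabola
    that still hits both points, because the extra term
    bend * (x - x1) * (x - x2) is zero at x1 and at x2.
    Assumes the two points have different x values.
    """
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)

    def f(x):
        return y1 + slope * (x - x1) + bend * (x - x1) * (x - x2)

    return f

if __name__ == "__main__":
    dot_a, dot_b = (1, 2), (4, 5)  # two made-up data points
    for bend in (-2, 0, 3.5):
        f = curve_through(dot_a, dot_b, bend)
        # Same values at the two dots, different values in between.
        print(bend, f(1), f(4), f(2.5))
```

Every one of those candidate Truths agrees perfectly with the data we have and disagrees everywhere else.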
After three points, we're a little better.




Four? Maybe closer.




Twenty? We start to feel better, right? The dots almost start to connect themselves.




If I took the time to do a hundred, we would probably arrive at Truth in this little graph exercise (ignoring the X axis extending infinitely to the right, of course).

Now what about bad data? What if we know there's a chance that one (or more) of our data points is no good? Maybe we misinterpreted, or were lied to, or whatever. How much damage that does sort of depends on which data point it is. If we make a few data points bad (red)... it changes how we see Truth.
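Here is a rough Python sketch of how much one red dot can hurt. All the specifics are assumptions of mine for illustration: I'm pretending Truth is the straight line y = 2x + 1, the red dot is the last point knocked upward by 10, and `fit_line` is a plain least-squares line fit written out by hand.

```python
def fit_line(points):
    """Ordinary least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def truth(x):
    """Hypothetical Truth, chosen only for this example."""
    return 2 * x + 1

def dots_with_one_bad(n, error=10):
    """n points on the Truth line, with the last one pushed off by `error`."""
    points = [(x, truth(x)) for x in range(n)]
    last_x, last_y = points[-1]
    points[-1] = (last_x, last_y + error)  # the red dot
    return points

if __name__ == "__main__":
    for n in (5, 50):
        slope, intercept = fit_line(dots_with_one_bad(n))
        print(n, round(slope, 3), round(intercept, 3))
```

With only a handful of dots, the one bad dot drags the fitted line a long way from the true slope of 2. With fifty dots, the same bad dot barely registers, which is the sample-size argument all over again.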




As humans, we're always getting more data. We might meet another single person, and the first data point is initial attraction. Second one is whether she's interested in you. Third is a first date. Fourth is a second date. Twentieth is making out (unless you're me, then it's the third data point). One millionth is having a second child.

How do you know when you can trust her? How do you know when you want to be with her for the rest of your life? How do you know when you still want to be with her for the rest of your life?

Data comes in, and data goes bad, because things keep moving. People change and circumstances change.

Anyway. That is some nonsense about sample size.

With all of this said... it seems unfair/unwise to generalize after a single data point, right?

It might not be possible, for example, to write a blog about all hip hop shows in Seattle based on my attendance of a single evening at one, right?

Well... I don't think I'm going to let any of that stand in my way. Prepare your bad selves for "Seattle Hip Hop Show Generator 1.0" when I get the time and energy.

It's gonna be the Truth. With a high probability of lots of little red dots.

(That reads like a herpes joke, for some reason.)
