Wednesday, August 28, 2013

unit circle under different norms

Using the p-norm defined as
\[||x||_{p} = \left(\sum_{i=1}^{n} |x_{i}|^{p}\right)^{1/p}\]
for \(p>0\), we can look at what the unit circle looks like under each norm. Note that for \(0<p<1\) the formula above is not actually a norm, since it violates the triangle inequality. Regardless, let's look at what the unit circle looks like in 2D for a few values of \(p\):
Isn't it cool how the corners of the unit circle get pushed out as \(p\) increases?
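If you want to draw these yourself, here's a small sketch (Python with NumPy; the parametrization and point count are my own choices, not the post's original code) that walks around the circle and rescales each direction to have p-norm 1:

```python
import numpy as np

def unit_circle(p, n_points=360):
    """Points with p-norm 1: scale each direction (cos t, sin t)
    so that (|x|^p + |y|^p)^(1/p) = 1."""
    theta = np.linspace(0, 2 * np.pi, n_points, endpoint=False)
    d = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    norms = (np.abs(d) ** p).sum(axis=1) ** (1.0 / p)
    return d / norms[:, None]

# sanity check: every returned point has p-norm 1
for p in (0.5, 1, 2, 4):
    pts = unit_circle(p)
    norms = (np.abs(pts) ** p).sum(axis=1) ** (1.0 / p)
    assert np.allclose(norms, 1.0)
```

Plotting the returned points for increasing p shows the pinched star at p = 0.5 rounding out through the diamond and circle toward a square.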


Friday, August 16, 2013

The importance of high spike rates


My lab uses the Neural Engineering Framework (NEF) as a principled method for implementing computations on our neuromorphic hardware. The NEF uses the spike rates of neurons to represent values and to calculate the weights connecting neurons. A fundamental question is then: how do we estimate the spike rate of a neuron? Real neurons can't just tell you their spike rate (although simulated ones certainly can), so you have to estimate the neural spike rate from the spikes that the neurons output.

A simple, reasonable first approach to estimating spike rate is to count spikes over a time window and divide the count by the window length. This works in the simple case where you can observe for as long as you want and the spike rate doesn't change. As you make the observation window longer and longer, the spike rate estimate converges to the actual spike rate.

Unfortunately, we don't have the luxury of time, and our systems work in dynamic environments where variables are constantly changing. Using a box window is also limited by quantization: a spike is either in the window or not, so your spike rate estimate is quantized to multiples of \(1/T\), where \(T\) is the window length. As \(T\to\infty\), our estimate could take on any value. However, we typically limit \(T\) to 100ms in operation, so if we used a box window, our spike rate estimates would be quantized to multiples of 10Hz, which is not very precise.
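As a tiny concrete example of that quantization (a Python sketch; the 43 Hz rate is made up), a uniform spike train observed through a 100 ms box window can only report a multiple of 10 Hz:

```python
T = 0.1      # 100 ms observation window
rate = 43.0  # true rate (Hz), deliberately not a multiple of 10
spike_times = [(k + 1) / rate for k in range(100)]  # uniform spiking
count = sum(1 for s in spike_times if s <= T)
estimate = count / T
# the estimate is forced onto a multiple of 1/T = 10 Hz
assert count == 4 and abs(estimate - 40.0) < 1e-9
```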

A better way to estimate spike rates for calculating the weights to connect neurons in our network would be to see spike rates in a manner similar to how the neurons will see them. 

How does a neuron see a spike rate? Well, spikes interact with a neuron through the neuron's synapse, which is a low pass filter in our neuron model. The synapse converts the spikes (impulses) into a current that is injected into the neuron's soma. It is this current that can be interpreted as the spike rate the neuron sees.

Therefore, to put ourselves in the perspective of the neuron when collecting spike rate data to train the network, we can use a low pass filter, with its time constant matched to the neural synapses, to estimate the spike rate instead of counting spikes over a time window. For Poisson and uniform spike inputs, this method looks something like this:
Spike rate estimation for various rates over time. Blue traces are filter estimate when input spikes are generated uniformly. Red traces are filter estimate when input spikes are generated from a Poisson process. Filter time constant is 100ms. Spike rate estimates are normalized to the actual firing rate, so the correct estimate on the y-axis is 1. Vertical dotted lines denote the first 5 time constants of time. Horizontal dotted lines denote exponential progression towards steady state value.
A few things to note from the plot:
  • As spike rates increase, our estimate gets tighter to the actual spike rate
  • A uniform spike input is understandably less noisy than a Poisson spike input.
  • It takes time for the synapse to reach its steady state spike rate estimate. Note how the traces follow the saturating exponential function indicated by the dotted line. 
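Here's a minimal sketch of the filter-based estimate (Python; the time step, time constant, and rate are my own example values, and I use a uniform spike train):

```python
import numpy as np

dt, tau = 0.001, 0.1  # 1 ms steps, 100 ms synaptic time constant
rate = 100.0          # true firing rate (Hz)
t = np.arange(0, 2.0, dt)

# uniform spike train: one spike every 1/rate seconds
period = round(1 / (rate * dt))  # steps between spikes
spikes = np.zeros_like(t)
spikes[::period] = 1.0

# first-order low pass (the synapse): y' = (spike/dt - y) / tau
y = np.zeros_like(t)
for i in range(1, len(t)):
    y[i] = y[i - 1] + (dt / tau) * (spikes[i] / dt - y[i - 1])

# after ~5 time constants the estimate hovers around the true rate
steady = y[t > 5 * tau]
assert abs(steady.mean() - rate) / rate < 0.05
```

Swapping in Poisson-distributed spike times gives the noisier red-trace behavior.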
Why does it make sense that as the spike rate increases, the estimate gets tighter to the actual spike rate? Consider the Poisson spike process, where the number of spikes seen in a window of time is described by a Poisson distribution, in which the mean equals the variance, \(\mu=\sigma^{2}\). As the rate of the Poisson distribution increases, it looks more and more like a Gaussian distribution. For a Gaussian distribution, about 68% of the probability is within 1 standard deviation of the mean. Looking at the ratio between the standard deviation and the mean,
\[\frac{\sqrt{\sigma^{2}}}{\mu}=\frac{\sqrt{\mu}}{\mu}\]
\[ = \frac{1}{\sqrt{\mu}}\]
we see that as the mean rate increases, the spread around the mean drops with the square root of the mean, which explains why the estimate gets tighter to the actual spike rate with higher rates.
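A quick numerical check of this \(1/\sqrt{\mu}\) scaling (Python; the sample counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for mu in (10.0, 100.0, 1000.0):
    counts = rng.poisson(mu, size=200_000)
    ratio = counts.std() / counts.mean()
    # relative spread falls off as 1 / sqrt(mean rate)
    assert abs(ratio - 1 / np.sqrt(mu)) < 0.2 / np.sqrt(mu)
```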

It's intuitive that a Poisson spike input would be more noisy than a uniform input. But how much more noisy? Looking at the distribution of spike rate estimate values collected over time after allowing the system to settle (i.e. run for 5 time constants) gives us an idea of what the synapse estimates as the rate:
Distribution (i.e. histogram) of spike rate estimates with a uniform spike input. x-axis is the spike rate estimate, and y-axis is normalized counts so that the area of the histogram is 1.
Distribution (i.e. histogram) of spike rate estimates with a Poisson spike input. x-axis is the spike rate estimate, and y-axis is normalized counts so that the area of the histogram is 1.
Of note from these plots:
  • The estimate of the Poisson spike rate begins to look Gaussian as the rate increases.
  • The estimate of the uniform spike rate begins to look uniform as the rate increases. Surprise surprise!
  • The x-axis scaling gets tighter with high firing rates, reflecting that our estimate gets better and better.
Code for this post found here.

Monday, July 29, 2013

Noisy Dynamical Systems

Today we're going to play with a nonlinear dynamical system and noise. Code for singing along is posted on github. Our system is described, in general, by
\[\dot{x} = f(x),\]
which says that the rate of change of \(x\) is just some arbitrary function \(f(\cdot)\) of \(x\). In block diagram format, the system looks like this:

block diagram of system
System block diagram. \(1/s\) is an integrator.  The derivative of \(x\) is its input, and \(x\) is its output.
Specifically, let's let our nonlinear function take the form 
\[f(x) = e^{-x}-1.\]

To start analyzing this system, we'll plot the phase portrait, which shows us how \(\dot{x}\) depends on \(x\).
Phase portrait of our dynamical system
The phase portrait for our nonlinear system tells us how the system changes. We'll be running this system from several different initial conditions, indicated with the black-to-red dots. Their positions along the x-axis are their initial conditions, and their positions along the y-axis are their initial derivatives.
From the phase portrait, we can tell that the system's state, or value of \(x\), increases when the state is negative and decreases when the state is positive. Therefore, we can predict that the system will stabilize at 0.
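We can also check this prediction numerically. Here's a forward-Euler sketch in Python (the step size and initial conditions are my own choices):

```python
import numpy as np

def f(x):
    return np.exp(-x) - 1.0

dt, steps = 0.01, 2000  # integrate out to t = 20
for x0 in (-1.5, -0.5, 0.5, 1.5):
    x = x0
    for _ in range(steps):  # forward Euler: x <- x + dt * f(x)
        x += dt * f(x)
    assert abs(x) < 1e-3    # every start ends up at the fixed point 0
```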

Do we see this stable point?
System state trajectory over time for initial states scattered between -1.5 and 1.5.

Yes! Great!

What if we add some noise? 
\[\dot{x} = f(x) + u \]
\[u \sim \mathcal{N}(0, \sigma^{2})\]
That is, we've added Gaussian white noise of variance \(\sigma^{2}\) to our system.
Our system block diagram now looks like this: 
Block diagram of system with noise added to input of integrator.
Do we still see the system go to the same stable point?

System simulation with noise added to the input of the integrator.  Noise is  Gaussian white noise with variance indicated by \(\sigma\) above each subplot.

It looks like the system goes to the same stable point at low levels of noise, but the system gets pretty swamped at higher variances, which makes it hard to tell whether the system settles around the predicted stable point. Running the simulation many times and averaging across trials helps us see the long term behavior more clearly:
Each line is the average of 25 trials.
That's better. It looks like the system settles, on average, at the predicted state for low noise levels, but at higher levels of noise, the system seems to be averaging at a higher state value. If we do the same averaging over trials but with much larger noise, we see this effect more clearly:

The system no longer averages out anywhere close to the predicted stable point without noise. The system doesn't even look like it reaches its average stable value by the end of the simulation for very high levels of noise. Why does the stable point shift?

Does the average stable point change because we are now inputting a low pass filtered version of the noise into the nonlinearity? I'm not sure of an intuitive way of understanding this yet.

Now let's tweak the system a little:

\[\dot{x} = f(x+u)\]

Now, noise is passed through the nonlinearity before integration.

Block diagram of system with noise added to the input to the nonlinearity.
Running the simulation for this system yields:

System simulation with noise added to the input to the nonlinearity.  Each line is a single trial, not averaged. 
Look carefully at the value that the system settles at for each level of noise. The system is less noisy, but the fixed point position is far more sensitive to the level of noise!

Think about it this way: if you are the system, and you are at a state \(x\), what do you see? Because of the noise, you see everything around you. So if we average each point of the phase portrait over the noise (i.e. convolve our system function \(f(\cdot)\) with the pdf (probability density function) of our noise), we get a better sense of the fixed points of our system:
Phase portrait of the system after convolving the system function, \(f\), with the noise distribution pdf.
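To watch this averaging effect shift the fixed point in simulation, here's a Python sketch (the step size and \(\sigma\) are my own choices, and drawing a fresh noise sample each Euler step is a discrete-time stand-in for the noise). For our \(f\), averaging \(e^{-(x+u)}\) over Gaussian \(u\) multiplies it by \(e^{\sigma^{2}/2}\), so the averaged drift crosses zero at \(x=\sigma^{2}/2\) instead of 0:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, sigma = 0.01, 1.0
n = int(50.0 / dt)

x = 0.0
trace = np.empty(n)
for i in range(n):
    u = rng.normal(0.0, sigma)           # fresh noise sample each step
    x += dt * (np.exp(-(x + u)) - 1.0)   # noise enters the nonlinearity
    trace[i] = x

settled = trace[int(10.0 / dt):]         # discard the transient
# averaged drift exp(-x + sigma^2/2) - 1 is zero at x = sigma^2/2, not 0
assert abs(settled.mean() - sigma**2 / 2) < 0.1
```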

Another way of analyzing the effect of noise is via the Taylor series expansion of \(f(\cdot)\). With noise, \(\dot{x}=f(x+u)\), and
\[f(x+u)=f(x)+f'(x)u+\frac{1}{2}f''(x)u^{2}+\frac{1}{3!}f^{(3)}(x)u^3+...\]
Taking the expected value,
\[E[f(x+u)]=f(x)+f'(x)E[u]+\frac{1}{2}f''(x)E[u^{2}]+\frac{1}{3!}f^{(3)}(x)E[u^3]+...\]
Since we're assuming \(u\) is symmetric about 0, the expected values of the odd powers of \(u\) are 0, so
\[E[f(x+u)]=f(x)+\frac{1}{2}f''(x)E[u^{2}]+\frac{1}{4!}f^{(4)}(x)E[u^4]+...\].
Before moving forward, we can see here why the phase portrait changes with noise. When noise enters the nonlinearity, \(\dot{x}\) is no longer just \(f(\cdot)\), but \(f(\cdot)\) summed with its higher derivatives weighted by the moments of \(u\). In our example case,
\[f(x)=e^{-x}-1\]
\[f'(x)=-e^{-x}\]
\[f''(x)=e^{-x}\]
\[f^{(3)}(x)=-e^{-x}\]
\[f^{(4)}(x)=e^{-x}\]
\[\vdots\]
Notice a pattern? All of the odd ordered derivatives are the same, and all of the even ordered derivatives are the same! So for our example,
\[E[f(x+u)]=e^{-x}-1+\frac{1}{2}e^{-x}E[u^{2}]+\frac{1}{4!}e^{-x}E[u^4]+...\]
\[E[f(x+u)]=e^{-x}-1+e^{-x}\left(\frac{\sigma^{2}}{2}+\frac{3\sigma^{4}}{4!}+...\right)\]

This is neat! We can analytically see that noise changes the phase portrait.
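In fact, since the Gaussian even moments are \(E[u^{2k}]=(2k-1)!!\,\sigma^{2k}\), the parenthesized series sums to \(e^{\sigma^{2}/2}-1\), giving the closed form \(E[f(x+u)]=e^{-x+\sigma^{2}/2}-1\). A quick Monte Carlo sanity check in Python (the particular \(x\) and \(\sigma\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.exp(-x) - 1.0

sigma, x = 0.5, 0.3  # arbitrary noise level and state
u = rng.normal(0.0, sigma, size=500_000)
mc = f(x + u).mean()

# closed form from summing the series: E[f(x+u)] = exp(-x + sigma^2/2) - 1
analytic = np.exp(-x + sigma**2 / 2) - 1.0
assert abs(mc - analytic) < 5e-3
```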


What's different between the two noise cases?
Both noise cases change the dynamics of the system, but the system is more sensitive to noise input to the nonlinearity.

Keep in mind that in reality we can see both kinds of noise:
Block diagram of system with noise added to both the input of the integrator and the input of the nonlinearity.
If you can control its distribution, noise can even become another knob to tune your system!

Sunday, July 14, 2013

A multiply built with adds and bitshifts

Let's be multiplication machines! In our toolbox, let's say we only have adds and bitshifts. We're wonderful machines that can add any two numbers together just as easily as any other two numbers. It's just as easy to do \(1+1\) as it is to do \(5391 + 213492\). We're also binary thinkers and can easily move bits left and right to effectively multiply and divide by 2. Isn't life wonderful? But we don't want to just multiply by 2's. We want to be able to multiply any two arbitrary numbers.
How do we calculate something like \(m \times n\) for any \(m\) and \(n?\)
Or maybe since using concrete numbers is nice, let's say \(m=13\) and \(n=127\).

Going way back to when we were but incy wincy adders just learning how to multiply, we learned that multiplication is shorthand for repeated addition, so to calculate \(13 \times 127 \), we could either add 13 to itself 127 times, or add 127 to itself 13 times
\(13 \times 127 = 13 + 13 + 13 + ... \) repeat until we've added 127 13s.
or
\(13 \times 127 = 127 + 127 + 127 + ... \) repeat until we've added 13 127s.
The first way of calculating \(13 \times 127 \) takes 126 operations, and the second way takes 12 operations, so we'll prefer the second method. So now we know that \(m \times n \) can be done in \(\min{(m,n)}\) time, that is, linear in either \(m\) or \(n\), whichever is smaller. An implementation of this in c would look something like this:


#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* repeated addition; assumes m, n >= 0 */
int multiply_by_adds(int m, int n)
{
   int iterations = MIN(m, n);
   int base = MAX(m, n);
   int result = 0;
   for (int i = 0; i < iterations; i++) {
      result += base;
   }
   return result;
}


But can we do better?

Yes we can! We have bitshifts in our toolbox. What do we know about bitshifts? We know that shifting left is the same as multiplying by powers of two. Each shift left increases the power of two by which we multiply.

In binary,
\[127 = 001111111.\]
\[127 \times 1 = 127 = 001111111 = 127<<0\]
\[127 \times 2 = 254 = 011111110 = 127<<1\]
\[127 \times 4 = 508 = 111111100 = 127<<2\]
etc...

Conveniently, binary representation exactly expresses a number as a sum of powers of two! This representation is really cool because it decomposes our weird numbers into factors that the computer can easily manipulate.
\[13 = 8 + 4 + 1 = 2^3 + 2^2 + 2^0 = 1000 + 0100 + 0001 = 1101.\]

Therefore, we can rewrite
\[13 \times 127 = (8 + 4 + 1) \times 127 = \]
\[(8 \times 127) + (4 \times 127) + (1 \times 127) =\]
\[(127<<3) + (127<<2) + (127<<0)  = \]
\[1016 + 508 + 127  = 1651 \]
and we're done! No need for silly repeated additions of \(127\).

By leveraging the binary representation of our numbers, we've written an algorithm that requires one shift and one addition for each power of two composing the smaller of \(m\) and \(n\). There is a slight nuance given the way integers are represented with a fixed number of bits on our computers: in practice we can iterate through each bit of the representation to do this in constant time. An implementation in c would look something like this:

/* BITS_IN_INT is the number of bits in an int, e.g. 8 * sizeof(int) */
int multiply_by_shifts_adds(int m, int n)
{
   int multiplier = MIN(m, n);
   int base = MAX(m, n);
   int result = 0;
   for (int i = 0; i < BITS_IN_INT; i++) {
      /* if bit i of the multiplier is set, add base * 2^i */
      if ((0x1u << i) & multiplier) result += base << i;
   }
   return result;
}

However, iterating through all of the bits is pretty wasteful. If many of the high bits are 0, we should iterate through only the bits up to the highest set bit. Then we have an algorithm that runs in \(\log(\min{(m,n)})\) time, roughly bounded above by the constant time it takes to iterate through all the bits of the integer representation. An implementation in c, using the gcc compiler, would look something like this:

int multiply_by_shifts_adds_opt(int m, int n)
{
   int multiplier = MIN(m, n);
   int base = MAX(m, n);
   int result = 0;
   if (multiplier == 0) return 0;
   int lead_zeros = __builtin_clz(multiplier); /* builtin gcc function */
   int highest_pow = BITS_IN_INT - lead_zeros - 1;
   for (int i = highest_pow; i >= 0; i--) {
      if ((0x1u << i) & multiplier) result += base << i;
   }
   return result;
}

Full github code here.

Timing results for the three multiplier implementations discussed above. As predicted, the implementation with just adds scales linearly, the implementation with shifts and adds that works through all the bits of our integer representation runs in constant time, and the optimized implementation using shifts and adds that works through the minimum number of bits runs in logarithmic time.

Of course we would never implement something like this in practice at the level of c. All higher level languages have multiplication as a basic operator, but this exercise gave us a glimpse of the power of the binary representation that allows computers to do computations quickly. Multiply away, computers!

Sunday, April 14, 2013

Mounting Stanford AFS Ubuntu

Stanford uses AFS to host filesystem space for students. Stanford also uses Kerberos for authentication. The combination of the two can be maddening to work with remotely.

These directions are for people running Ubuntu who would like to mount Stanford AFS to access files as if they were local. Restating 1nfty's directions:
  1. Install kerberos and afs related components
    sudo apt-get install krb5-user krb5-clients openafs-krb5 openafs-client
  2. Make sure /afs exists. If not, create it.
  3. Reboot to find /afs directory filled with pointers to afs volumes all over the world.
  4. Authenticate through kerberos and obtain AFS token by sequentially running
    kinit
    aklog
  5. Make soft link of your user directory to somewhere more convenient
    ln -s /afs/ir.stanford.edu/[path_to_afs_home_directory] ~/afs
These directions did not work straight out of the box for me. I'm pretty sure Narasimham's directions were all I needed, but I'm not sure. Stanford's directions were pretty bloated, and I couldn't make much sense of them. If (when) I have to go through the procedure again, I'll document the procedure more thoroughly and clean this post up.

Once everything is all set up, don't forget to authenticate before actually trying to access your files:
kinit
aklog


Tuesday, March 5, 2013

g++ compiler options

Four stages of creating a program
  1. Preprocess
  2. Compile
  3. Assemble
  4. Link
Example: Creating and using a shared library (lib*.so - filename starts with prefix "lib" and ends with suffix ".so"):
First, we'll create the object file from the source.
g++ -c -fPIC <filename.cpp> -o <filename.o>
-c specifies that we're only compiling, not linking. We do this when filename.cpp doesn't have a main function, so there's nothing to link.
-fPIC indicates that we're generating position independent code, which code destined for a shared library needs. It's a code generation option, so it belongs on this compile step rather than the link step.
-o specifies the output file name. Note that it doesn't have to end in .o, but the .o ending is convention for object files.
Next, we'll create the shared library from the object file
g++ -shared <filename.o> -o lib<filename>.so
-shared indicates that we're creating a shared library
Excellent, we've now created a shared library!

Now for the fun part: linking, the biggest headache of them all. Let's say we want to use the library in <appname.cpp>.
g++ <appname.cpp> -L. -l<filename> -Wl,-rpath,. -o <appname>
-L. tells g++ to look in the current directory (hence the ".") at link-time for the library.
-l<filename> tells which library we'll be using (although the library file will look something like lib<filename>.so, the convention is -l<filename>).
-Wl,<option> passes <option> through to the linker; multiple options can be comma separated, as we do here. By passing "-rpath,." to the linker, we tell it to use the current directory "." when searching for libraries at run-time. Note the separation of link-time and run-time paths when looking for libraries.

Putting it all together looks something like

g++ -c -fPIC <filename.cpp> -o <filename.o>
g++ -shared <filename.o> -o lib<filename>.so
g++ <appname.cpp> -L. -l<filename> -Wl,-rpath,. -o <appname>



Sunday, February 3, 2013

Adding printer, Ubuntu 12.04

Adding a printer is one of the most annoying tasks I've dealt with. It doesn't help that there are dozens of printing protocols and ways to connect to a printer. Here is a step-by-step guide to remind myself of this process.
  1. From terminal, run sudo system-config-printer
  2. In the dialog box that opens, press Add
  3. Enter the device URI
    • e.g. socket://172.25.102.32
  4. It'll ask for a driver. You can see if yours is available from the drop-down menus, but in my experience that's unlikely. Usually, you'll just have to find and provide the .ppd file yourself.
  5. Accept the installable options and device name etc...
  6. print a test page! Hopefully it works. If not, get angry and continue your quest across the internets.
Happy printing! 
(I'm going to hate myself for the cheery ending when this doesn't work)

Sunday, January 13, 2013

Amazing Fact in Probability! Record Setting and Symmetry of Independence

Consider this scenario:
You've been playing your favorite game for a while (I've recently been enjoying Temple Run!), so your skills have practically leveled off and you're basically just playing for fun and to see how high a score you can achieve!

Lo and behold, after many hours of playing late into the night, you achieve a high score on a particular run! Flushed with success but tired as well, you wonder whether it is worth it to try another run. Intuitively, one would think that achieving a high score on the next run is going to be even more difficult, so the probability of the next run being a record setting run is lower now that you've just set a record on the current run.

Is this intuition correct? Is the probability of the next run being record setting affected by the fact that the current run is record setting?
Stop and think for a bit...
...
Maybe a bit more...
...
Okay now let's take a closer look.
Another way to look at the question is to turn it around: is the probability of the current run being record setting affected if it is fated that the next run will be record setting? Now, our intuition would say no, the current run does not care what the next run does; it is independent of the next run. Here is the amazing fact: since setting a record on the current run is independent of setting a record on the next run, so too setting a record on the next run is independent of setting a record on the current run! Independence is symmetric! It doesn't matter whether you just set a record or not!

But be careful: I'm not saying that the chance of setting a record on the next run is the same as setting a record on the current run. Think of it this way: to be a record, a run must beat all previous runs. For the current run to be a record, it must beat all previous runs, but for the next run to be a record, it must beat all previous runs and the current run. Therefore, setting a record on the next run will indeed have lower probability than setting a record on the current run, but only because it has to beat more runs, not because it's affected by the outcome of the current run.
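If you don't trust the symmetry argument, a quick simulation backs it up. This Python sketch (the number of runs per trial is my own choice) estimates both the unconditional probability that run 5 sets a record and the same probability conditioned on run 4 having set one:

```python
import random

random.seed(0)
N = 200_000
k = 5  # compare run k-1 ("current") to run k ("next")
rec_cur = rec_next = rec_both = 0
for _ in range(N):
    scores = [random.random() for _ in range(k)]
    cur = scores[k - 2] == max(scores[:k - 1])  # run 4 beats runs 1-4
    nxt = scores[k - 1] == max(scores)          # run 5 beats runs 1-5
    rec_cur += cur
    rec_next += nxt
    rec_both += cur and nxt

p_next = rec_next / N                  # unconditional: should be ~1/5
p_next_given_cur = rec_both / rec_cur  # conditioned on a current record
assert abs(p_next - 1 / k) < 0.01
assert abs(p_next_given_cur - 1 / k) < 0.02
```

Both probabilities come out near 1/5: conditioning on the current record doesn't move the next run's chances.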

My worldview has just been recalibrated.

Saturday, January 5, 2013

Notes from CS 222 - Differential Equations in Action

In units 1 and 2, we worked towards simulating Apollo 13's flight from the earth, around the moon, and back to earth. 

$h$ - step size
$x$ - position
$v$ - velocity
$F$ - net force 
$m$ - mass

Euler's method:

Euler's method is the most basic and easy to understand method for simulating a differential equation.  At each step in time, we simply increment the state variables linearly along the line tangent to their current values.
$x[i + 1] = x[i] + h \cdot v[i]$
$v[i + 1] = v[i] + h \cdot \frac{F(x[i],v[i])}{m}$

In the limit \(h\to0\), Euler's method captures the dynamics exactly, but smaller steps make the method more computationally intensive. Any \(h>0\) leads to a certain amount of error at each step, known as the Local Truncation Error (LTE).

Heun's method

Heun's method is an improvement on Euler's method. Rather than increment along the line tangent to the state variables at the current point, we use the average of that tangent and the tangent at the next point had we used Euler's method.
$x_E = x[i] + h \cdot v[i]$
$v_E = v[i] + h \cdot \frac{F(x[i],v[i])}{m}$
$x[i + 1] = x[i] + h \cdot \frac{v[i] + v_E}{2}$
$v[i + 1] = v[i] + h \cdot \frac{F(x[i], v[i]) + F(x_E, v_E)}{2m}$
where $x_E$ and $v_E$ are simply the Euler estimates of the next $x$ and $v$.
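Both updates can be sketched in Python. Here I use a unit mass on a unit spring, $F(x, v) = -x$, so the exact solution from $x(0)=1$, $v(0)=0$ is $x(t) = \cos t$ (the force and step size are my own test choices):

```python
import math

def simulate(method, h, n_steps):
    # unit mass on a unit spring: F(x, v) = -x, exact x(t) = cos(t)
    x, v = 1.0, 0.0
    for _ in range(n_steps):
        if method == "euler":
            x, v = x + h * v, v + h * (-x)
        else:  # heun: average the slopes at the current and Euler points
            xE, vE = x + h * v, v + h * (-x)
            x, v = x + h * (v + vE) / 2, v + h * ((-x) + (-xE)) / 2
    return x

h, n = 0.01, 500  # integrate out to t = 5
exact = math.cos(n * h)
err_euler = abs(simulate("euler", h, n) - exact)
err_heun = abs(simulate("heun", h, n) - exact)
assert err_heun < err_euler  # the extra slope evaluation pays off
```

The Heun error comes out substantially smaller, consistent with its higher-order local truncation error.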

Adaptive step size

[Needs work]
$h_{new} = h_{old}\sqrt{\frac{tolerance}{LTE}}$
Note that you'll need to track the current time with this method since our time steps will not be evenly spaced.
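A minimal sketch of just the step-size update (Python; the numbers are made up):

```python
def adapt_step(h_old, lte, tolerance):
    # grow h when the local error is under tolerance, shrink when over
    return h_old * (tolerance / lte) ** 0.5

h = adapt_step(0.1, lte=4e-4, tolerance=1e-4)   # error 4x too big
assert abs(h - 0.05) < 1e-12                    # h is halved
h = adapt_step(h, lte=2.5e-5, tolerance=1e-4)   # error 4x under budget
assert abs(h - 0.1) < 1e-12                     # h doubles back
```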

Symplectic method

[Needs work]
A plot of the state variables is called a phase space plot. The area of the phase space plot is the energy? of the system. The symplectic method preserves the energy of the system. In the symplectic method, we alternate between incrementing state variables.
$x[i + 1] = x[i] + h \cdot v[i]$
$v[i + 1] = v[i] + h \cdot \frac{F(x[i+1], v[i])}{m}$
There are a few alternatives to this:
  •  $x$ then $v$ then $v$ then $x$ ... etc. This is known as leapfrogging
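A small Python comparison of explicit Euler against the alternate-update scheme above (unit mass on a unit spring; the step size is my own choice), checking the energy $\frac{1}{2}(x^2 + v^2)$ after many steps:

```python
def energy(x, v):
    return 0.5 * (x * x + v * v)  # unit mass on a unit spring: F = -x

h, steps = 0.1, 1000
xe, ve = 1.0, 0.0  # explicit Euler state
xs, vs = 1.0, 0.0  # symplectic (alternating-update) state
for _ in range(steps):
    xe, ve = xe + h * ve, ve + h * (-xe)  # explicit: both use old values
    xs = xs + h * vs                      # symplectic: update x first...
    vs = vs + h * (-xs)                   # ...then v using the NEW x

e0 = energy(1.0, 0.0)
assert abs(energy(xs, vs) - e0) < 0.1 * e0  # energy stays near its start
assert energy(xe, ve) > 10 * e0             # explicit Euler's energy blows up
```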

Friday, January 4, 2013

Visiting Penelope at 6 months

Penelope "Pom Pom" is my niece!
Successful diaper change. Penelope is really happy when you lift her up like this.


Here are some vignettes from my visit:

Pom Pom had a cold during my visit. It's really a terrible thing to have a cold as an infant. You don't even know how to wipe your nose! In fact, rather than wipe your nose after sneezing out a bunch of mucus, you'd rather open your mouth for a taste. Truly a curious being.
This leaves it to the nearest adult to wipe your nose for you. However, Pom Pom is a free-willed spirit. She could be smiling happily with snot all over her face one second, but the moment you touch her face with a Kleenex to wipe it off, insanity follows. Head turns away, back arches, arms reach out, legs kick away, and Pom Pom screams. Every muscle she can control is activated to escape that terrible, terrible Kleenex. Pom Pom has her own designs for cleaning her face. If you hold her on your shoulder, her trick is to happily bury her runny face in the crook of your shoulder and then wipe all that wonderful juice in a perfect arc across your shirt.
The solution: put a Kleenex on your shoulder.

PP is learning to crawl. Not deliberately, of course. I doubt babies purposely and single-mindedly pursue activities the way an adult will, say, try to learn to swim. PP is motivated by the toy or computer immediately in front of her, and she'll try anything she can to move towards the object of her motivation. When PP tries to move forward, she pushes out with her hands and legs. Only her hands gain traction, though, so she ends up pushing herself backwards.

Babies have no memories. How can PP be entertained by a toy for 20 min, get bored, then have the same toy returned an hour later and be entertained by it for another 20 min!