Computation in data pipeline

Welcome to the last (but certainly not the least!) section of the tutorial! Now that we have some recorded neural data imported into a table (Neuron), we are going to perform some computations and processing, and look at how we can store these results naturally into tables. In doing so, we are going to meet two remaining major table tiers: lookup table (dj.Lookup) and computed table (dj.Computed).

Performing computation on data

Now we have the “raw” electric activities from neurons, we would like to detect spikes automatically. Before we tackle this challenge, let’s first get a better idea about our data by computing a few statistics on the data such as mean and standard deviation of the electric activity.

Defining computed tables

A crticial feature of a data pipeline is that not only the raw data (i.e. entered manually or imported from external source) but also the processed data and computation results reside in the same data pipeline. A computed table is defined by specifying the computation to be performed on data found in other table(s), and the results are stored inside the table itself.

Just like the imported tables (dj.Imported), computed tables (dj.Computed) provide the populate method and makeTuples method, and therefore provides us with a mechanism to perform computation on every combination of parent/depended tables!

Let’s now define our first computed table ActivityStatistics that computes the statistics of electric activity for each neuron in ``Neuron``.

%{
  # statistics about the activity
  -> tutorial.Neuron
  ---
  mean: float    # mean activity
  stdev: float   # standard deviation of activity
  max: float     # maximum activity
%}

classdef ActivityStatistics < dj.Computed

    methods(Access=protected)
        function makeTuples(self,key)

            activity = fetch1(tutorial.Neuron & key,'activity');    % fetch activity as Matlab array

            % compute various statistics on activity
            key.mean = mean(activity); % compute mean
            key.stdev = std(activity); % compute standard deviation
            key.max = max(activity);    % compute max
            self.insert(key);
            sprintf('Computed statistics for for %d experiment on %s',key.mouse_id,key.session_date)

        end
    end
end

As usual, let’s take a look at it step by step!

%{
  # statistics about the activity
  -> tutorial.Neuron
  ---
  mean: float    # mean activity
  stdev: float   # standard deviation of activity
  max: float     # maximum activity
%}

classdef ActivityStatistics < dj.Computed

    methods(Access=protected)
        function makeTuples(self,key)

            activity = fetch1(tutorial.Neuron & key,'activity');    % fetch activity as Matlab array

            % compute various statistics on activity
            key.mean = mean(activity); % compute mean
            key.stdev = std(activity); % compute standard deviation
            key.max = max(activity);    % compute max
            self.insert(key);
            sprintf('Computed statistics for for %d experiment on %s',key.mouse_id,key.session_date)

        end
    end
end

As you might have guessed, you subclass from dj.Computed to defined a computed table in DataJoint.

%{
  # statistics about the activity
  -> tutorial.Neuron
  ---
  mean: float    # mean activity
  stdev: float   # standard deviation of activity
  max: float     # maximum activity
%}

classdef ActivityStatistics < dj.Computed

    methods(Access=protected)
        function makeTuples(self,key)

            activity = fetch1(tutorial.Neuron & key,'activity');    % fetch activity as Matlab array

            % compute various statistics on activity
            key.mean = mean(activity); % compute mean
            key.stdev = std(activity); % compute standard deviation
            key.max = max(activity);    % compute max
            self.insert(key);
            sprintf('Computed statistics for for %d experiment on %s',key.mouse_id,key.session_date)

        end
    end
end

Here each ActivityStatistics entry depends on Neuron. Because the ActivityStatistics table does not define any additional primary key attribute (i.e. no other attribute entries above --- separator), each row in the ActivityStatistics table is uniquely identified by a single neuron in the Neuron table. Each entry in the ActivityStatistics table has non-primary key attributes mean, stdev and max to hold the mean, standard deviation and maximum value of the electric activity, respectively.

Just like imported table (dj.Imported), computed tables are equipped with populate method which would call the makeTuples for every combination of dependent/parent tables. In this case, ActivityStatistics’s makeTuples will be called for every neuron in the Neuron table.

Here, for each neuron in the Neuron table (pointed to by key), we 1) get the value of column activity storing the neuron’s electric activity as NumPy array, 2) compute various statistics and store the values into the key dictionary and 3) insert the dictionary into self (ActivityStatistics).

Note

fetch method will always return a structure, and fetchn will always return a cell array even if there is only one element. When you know that there is only going to be one entry, you can get the attribute value directly by using fetch1 instead, as was done here.

With this computation defined, we can trigger activity statistics to be computed for all entries in Neuron by simply instantiating and calling populate method on ActivityStatistics:

>> avg  =tutorial.ActivityStatistics

avg =


Object tutorial.ActivityStatistics

:: statistics about the activity ::

0 tuples (0.0104 s)
>> populate(tutorial.ActivityStatistics)

**tutorial.ActivityStatistics: Found 5 unpopulated keys

Computed statistics for for 0 experiment on 2017-05-15
Computed statistics for for 0 experiment on 2017-05-19
Computed statistics for for 5 experiment on 2017-01-05
Computed statistics for for 100 experiment on 2017-05-25
Computed statistics for for 100 experiment on 2017-06-01
>> avg

avg =


Object tutorial.ActivityStatistics

:: statistics about the activity ::

  MOUSE_ID    SESSION_DATE      mean       stdev      max
  ________    ____________    ________    _______    ______

    0         '2017-05-15'     0.20736    0.40107    2.4816
    0         '2017-05-19'     0.13274    0.29161    1.8281
    5         '2017-01-05'    0.089179    0.23653    1.3739
  100         '2017-05-25'     0.21907    0.32895    1.7638
  100         '2017-06-01'    0.087327    0.23798    1.3245

5 tuples (0.0575 s)

Great! We just successfully computed various neuronal activity statistics for all neurons in the Neuron table with a single method call to populate. Computation couldn’t really be easier than that!

Detecting spikes from neural activity

Now we have a better idea of our neuronal activity data, let’s try performing the more challenging computation - the spike detection. As you may know, spike detection is a very challenging (and exciting) subject and is a very active area of research! However, Rather than attempting to implement the state-of-the-art spike detection, we are going to implement a very simple algorithm where we register a “spike” every time the activity rises above a certain threshold value.

Importantly, this means that the result of our computation (i.e. detected spikes) will depend a lot on the chosen value of the threshold, and we would like to be able to try a few different value of threshold to see what works well. In other words, we would like to be able to run the spike detection algorithm with few different values of threshold and compare the results side-by-side.

This can actually be achieve rather easily by preparing a lookup table to store different values of computation paramters (i.e. threshold values), and compute spikes for every combination of neurons and parameter value set.

Defining Lookup tables

Let’s go ahead and define a lookup table called SpikeDetectionParam to contain the parameters for spike detection, namely the threshold value. As you might have guessed, you can define a lookup table by subclassing dj.Lookup. A lookup table is almost identical to a manual table (dj.Manual) but signifies that this table contains values like parameters for computation.

%{
# Spike detection thresholds
sdp_id: int      # unique id for spike detection parameter set
---
threshold: float   # threshold for spike detection
%}

classdef SpikeDetectionParam < dj.Lookup
end

Note

Notice that we used a field sdp_id to serve as the primary key for the SpikeDetectionParam rather than using threshold as the primary key, despite the fact that threshold is the only attribute of interest in this table. This is because threshold is of data type float and exact comparison is difficult for float values. In general, it is recommended that you avoid using float data type attribute in your primary key.

Defining Spikes table

Now that we have defined SpikeDetectionParam, let’s go ahead and define the computed table for spike detection and call it Spikes!

Note

It’s usually a good diea to call a computed table based on what would be the results of the computation rather than naming it based on the computation process (i.e. so Spikes rather than SpikeDetection). This would make sense if you consider the fact that the table will be housing the results of the computation, here the detected neuronal spikes.

%{
  # spikes
  -> tutorial.Neuron
  -> tutorial.SpikeDetectionParam
  ---
  spikes: longblob     # detected spikes
  count: int           # total number of detected spikes
%}

classdef Spikes < dj.Computed
    methods(Access=protected)
        function makeTuples(self, key)
            activity = fetch1(tutorial.Neuron & key, 'activity');
            threshold = fetch1(tutorial.SpikeDetectionParam & key, 'threshold');

            above_thrs = activity > threshold;   % find activity above threshold
          rising = diff(above_thrs) > 0; % find rising edge of crossing threshold
          spikes = [0 rising];    % prepend 0 to account for shortening due to np.diff
          count = sum(spikes);   % compute total spike counts

          % save results and insert
          key.spikes = spikes;
          key.count = count;
          self.insert(key);

          sprintf('Detected %d spikes for mouse_id %d session_date %s using threshold=%2.2f',...
              count, key.mouse_id, key.session_date, threshold)
        end
    end
end

Alright, let’s go through this code step-by-step, first focusing on the definition!

%{
  # spikes
  -> tutorial.Neuron
  -> tutorial.SpikeDetectionParam
  ---
  spikes: longblob     # detected spikes
  count: int           # total number of detected spikes
%}

Notice that the Spikes table depends on both ``Neuron`` and ``SpikeDetectionParam``! What does this mean? This means that each entry (detected spikes) in the Spikes table is uniquely identified not only by the identify of a neuron (from Neuron) but by the combination of the neuron identity and the particular spike detection parameter (from SpikeDetectionParam). As you will see, this allows for this table to house results of spike detection under more than one values for the parameter (i.e. threshold)!

%{
  # spikes
  -> tutorial.Neuron
  -> tutorial.SpikeDetectionParam
  ---
  spikes: longblob     # detected spikes
  count: int           # total number of detected spikes
%}

As the non-primary-key attributes, Spikes will contain both the detected spikes as an array of 0’s and 1’s with 1’s at the position of a spike, and the total count of detected spikes.

Now let’s move onto the mean of the computed table - makeTuples:

function makeTuples(self, key)
   activity = fetch1(tutorial.Neuron & key, 'activity');
   threshold = fetch1(tutorial.SpikeDetectionParam & key, 'threshold');

   above_thrs = activity > threshold;   % find activity above threshold
   rising = diff(above_thrs) > 0; % find rising edge of crossing threshold
   spikes = [0 rising];    % prepend 0 to account for shortening due to np.diff
   count = sum(spikes);   % compute total spike counts

   % save results and insert
   key.spikes = spikes;
   key.count = count;
   self.insert(key);

   sprintf('Detected %d spikes for mouse_id %d session_date %s using threshold=%2.2f',...
      count, key.mouse_id, key.session_date, threshold)
end

One of the first thing we do in _make_tuples is to fetch relevant data from other tables. This is a very standard practice when defining _make_tuples in computed tables, as was also performed in the ActivityStatistics table. Here we fetch the neuron’s electric activity Matlab array from the Neuron table, and the value of the threshold from the SpikeDetectionParam table.

function makeTuples(self, key)
   activity = fetch1(tutorial.Neuron & key, 'activity');
   threshold = fetch1(tutorial.SpikeDetectionParam & key, 'threshold');

   above_thrs = activity > threshold;   % find activity above threshold
   rising = diff(above_thrs) > 0; % find rising edge of crossing threshold
   spikes = [0 rising];    % prepend 0 to account for shortening due to np.diff
   count = sum(spikes);   % compute total spike counts

   % save results and insert
   key.spikes = spikes;
   key.count = count;
   self.insert(key);

   sprintf('Detected %d spikes for mouse_id %d session_date %s using threshold=%2.2f',...
      count, key.mouse_id, key.session_date, threshold)
end

Using activity and threshold, we first find and label where activity value is above the threshold. This returns an array of 0’s and 1’s where 1’s corresponds to time bins where neuron’s activity was above the threshold, storing this as above_thrs.

We then find out all the timebins at which the activity goes from 0 to 1, signifying times at which the neuron’s activity raised above the threshold, storing this into rising! We then adjust this array so that it has the same length as the original activity, and store the result as our detected spikes! We store the computed spikes and count by inserting into this (Spieks) table!!

function makeTuples(self, key)
   activity = fetch1(tutorial.Neuron & key, 'activity');
   threshold = fetch1(tutorial.SpikeDetectionParam & key, 'threshold');

   above_thrs = activity > threshold;   % find activity above threshold
   rising = diff(above_thrs) > 0; % find rising edge of crossing threshold
   spikes = [0 rising];    % prepend 0 to account for shortening due to np.diff
   count = sum(spikes);   % compute total spike counts

   % save results and insert
   key.spikes = spikes;
   key.count = count;
   self.insert(key);

   sprintf('Detected %d spikes for mouse_id %d session_date %s using threshold=%2.2f',...
      count, key.mouse_id, key.session_date, threshold)
end

Populating Spikes

Ok, after our hard work putting implemeting the spike detection algorithm, it’s time for us to run it! Let’s instantiate the Spikes table and populate it away!

>> spikes=tutorial.Spikes;
>> spikes

spikes =


Object tutorial.Spikes


<SQL>
CREATE TABLE `tutorial`.`__spikes` (
`mouse_id` int NOT NULL COMMENT "unique mouse id",
`session_date` date NOT NULL COMMENT "session date",
CONSTRAINT `rVWU4dxf` FOREIGN KEY (`mouse_id`,`session_date`)   REFERENCES `tutorial`.`_neuron` (`mouse_id`,`session_date`) ON   UPDATE CASCADE ON DELETE RESTRICT,
`sdp_id` int NOT NULL COMMENT "unique id for spike detection   parameter set",
CONSTRAINT `gKoJB0lO` FOREIGN KEY (`sdp_id`) REFERENCES   `tutorial`.`#spike_detection_param` (`sdp_id`) ON UPDATE CASCADE   ON DELETE RESTRICT,
`spikes` longblob      NOT NULL COMMENT "detected spikes",
`count` int            NOT NULL COMMENT "total number of detected spikes",
PRIMARY KEY (`mouse_id`,`session_date`,`sdp_id`)
) ENGINE = InnoDB, COMMENT "spikes"
</SQL>

:: spikes ::

0 tuples (0.372 s)

>> populate(tutorial.Spikes) %populate it away!

Sadly nothing seems to be happening. Why could this be the case? The answer lies in the SpikeDetectionParam table:

>> tutorial.SpikeDetectionParam

ans =


Object tutorial.SpikeDetectionParam

:: Spike detection thresholds ::

0 tuples (0.00604 s)

Aha! Because Spikes table performs computation on the every combination of Neuron and SpikeDetectionParam, when there is no entry in SpikeDetectionParam, there was nothing to be populated!

Filling in Lookup table

Let’s fix this but creating an entry in SpikeDetectionParam. Consulting the statistics computed for neurons in Populating neuron statistics, let’s pick a value that is at least 1-2 standard deviation above the mean value. Let’s try 0.9 as our threshold! You would fill in values into a Lookup table just how you would for a Manual table:

>> sdp = tutorial.SpikeDetectionParam;
>> insert(sdp,{0, 0.9});

Here we have assigned the threshold of 0.9 the sdp_id of 0.

Running spike detection with multiple parameter values

Alright, now with SpikeDetectionParam populated with a parameter, let’s try to populate the Spikes table once again:

>> populate(tutorial.Spikes)

**tutorial.Spikes: Found 5 unpopulated keys

Populating tutorial.Spikes for:
      mouse_id: 0
  session_date: '2017-05-15'
        sdp_id: 0


ans =

Detected 27 spikes for mouse_id 0 session_date 2017-05-15 using threshold=0.90
ans =

Detected 21 spikes for mouse_id 0 session_date 2017-05-19 using threshold=0.90
ans =

Detected 14 spikes for mouse_id 5 session_date 2017-01-05 using threshold=0.90
ans =

Detected 35 spikes for mouse_id 100 session_date 2017-05-25 using threshold=0.90
ans =

Detected 15 spikes for mouse_id 100 session_date 2017-06-01 using threshold=0.90

Woohoo! This time the algorithm ran, reporting us how the detected spike counts!!

Let’s now try running this same algorithm but under different parameter configuration - that is different values of threshold! Let’s try a much smaller threshold value of say 0.1! Go ahead and insert this new parameter value into the SpikeDetectionParam table:

>> sdp = tutorial.SpikeDetectionParam;
>> insert(sdp,{1, 0.1});

…and re-trigger the populate:

>> populate(tutorial.Spikes)

**tutorial.Spikes: Found 5 unpopulated keys

ans =

Detected 128 spikes for mouse_id 0 session_date 2017-05-15 using threshold=0.10
ans =

Detected 135 spikes for mouse_id 0 session_date 2017-05-19 using threshold=0.10
ans =

Detected 132 spikes for mouse_id 5 session_date 2017-01-05 using threshold=0.10
ans =

Detected 151 spikes for mouse_id 100 session_date 2017-06-01 using threshold=0.10

Wow, that gave rise to a lot more spikes, most likely because the algorithm is now picking up some noise us spikes!

For fun, let’s try slightly bigger value - maybe 1.3?

>> sdp = tutorial.SpikeDetectionParam;
>> insert(sdp,{2, 1.3});
>> populate(tutorial.Spikes)

You’ll find that appears to have been a bit too big for threshold, causing us to have very few spikes!

Seeing them all together

Finally, we can look at all of our hard earned spikes under different threshold values by inspecting the Spikes table:

>> tutorial.Spikes

ans =


Object tutorial.Spikes

:: spikes ::

  MOUSE_ID    SESSION_DATE    SDP_ID    count     spikes
  ________    ____________    ______    _____    ________

    0         '2017-05-15'    0          27      '=BLOB='
    0         '2017-05-15'    1         128      '=BLOB='
    0         '2017-05-15'    2          13      '=BLOB='
    0         '2017-05-19'    0          21      '=BLOB='
    0         '2017-05-19'    1         135      '=BLOB='
    0         '2017-05-19'    2           5      '=BLOB='
    5         '2017-01-05'    0          14      '=BLOB='
    5         '2017-01-05'    1         132      '=BLOB='
    5         '2017-01-05'    2           1      '=BLOB='
  100         '2017-05-25'    0          35      '=BLOB='
  100         '2017-05-25'    1         142      '=BLOB='
  100         '2017-05-25'    2           9      '=BLOB='

        ...

15 tuples (0.0409 s)

Even better, we can see the values of SpikeDetectionParam together by joining the two tables together:

>> spikes * sdp

ans =


Object dj.internal.GeneralRelvar

  MOUSE_ID    SESSION_DATE    SDP_ID    count    threshold     spikes
  ________    ____________    ______    _____    _________    ________

    0         '2017-05-15'    0          27      0.9          '=BLOB='
    0         '2017-05-19'    0          21      0.9          '=BLOB='
    5         '2017-01-05'    0          14      0.9          '=BLOB='
  100         '2017-05-25'    0          35      0.9          '=BLOB='
  100         '2017-06-01'    0          15      0.9          '=BLOB='
    0         '2017-05-15'    1         128      0.1          '=BLOB='
    0         '2017-05-19'    1         135      0.1          '=BLOB='
    5         '2017-01-05'    1         132      0.1          '=BLOB='
  100         '2017-05-25'    1         142      0.1          '=BLOB='
  100         '2017-06-01'    1         151      0.1          '=BLOB='
    0         '2017-05-15'    2          13      1.3          '=BLOB='
    0         '2017-05-19'    2           5      1.3          '=BLOB='

        ...

15 tuples (0.0533 s)

What’s next?

Congratulations!! You have now reached the end of the Building your first data pipeline tutorial!! You have learned a lot throughout this tutorial, and I hope that you now see the strengths of DataJoint in buliding data pipeline! Before moving forward, go ahead and spend some more time playing with the simple but effective data pipeline that you have built! Try to see if you can improve the algorithm for spike detection or even start defining a new computation all togehter!

Furthermore your journey doesn’t end here! Although we have covered the major topics of DataJoint, there are still a lot of cool features to be explored! Be sure to checkout our documentation and stay tuned for upcoming tutorials covering advanced topics in DataJoint!