E Bird’s Eye View of Motus Data

This document has been adapted from a version written by John Brzustowski in 2017. It provides an overview of the fundamentals of how Motus data processing works. The document has been adapted to incorporate more recent developments, particularly the addition of new digital tags and base stations manufactured by Cellular Tracking Technologies (CTT), who took over the development from Cornell University.

E.1 What data look like

Here’s a segment of data from a receiver (with a single antenna):

Receiver R
                                                 Time ->
   Tag A: /         1-----1--1----1-----1-----1            4---4-----4--4-------4--4-/  <- antenna #1
          \. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . \
   Tag B: /                       3-----3--3--3--3--3-------3----3--3                /  <- antenna #1
          \. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . \
   Tag C: /            2--2---2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2   /  <- antenna #1
          \. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . \
   Tag C: /                      5--------5--5--5--5--5--5-----5--------5--------5   /  <- antenna #2
  • time increases from left (earlier) to right (later)

  • horizontal lanes, separated by . . . correspond to individual tags, labelled at left

  • hits (detections) of a tag are plotted as single digits within its lane

  • in the diagram above, runs are labelled by digits. So all hits of tag A above are either in run 1 or in run 4.

  • the same tag can be detected simultaneously on multiple antennas, and therefore be part of runs that overlap in time. For example, tag C is detected on antennas 1 and 2.

  • In principle, the same tag should never be part of overlapping runs on the SAME antenna, but in practice this can happen in noisy environments with Lotek tags. Such runs should be assumed to come from sources other than actual tags (false positives).

  • Overlapping runs should not occur with CTT tags because of the way runs are built, but the rate of hits would still be expected to be around 1 per second or lower.

  • for Lotek coded ID tags, individual hits are not sufficient to determine which tag is being detected, because multiple tags transmit the same ID. However, the period (spacing between consecutive transmissions) is precise, and differs among tags sharing the same ID.

  • the fundamental unit of detection for Lotek coded ID tags is the run: a sequence of hits of a given ID, with spacing compatible with the period of that tag. That is, two consecutive hits might differ by the tag’s period, or by twice its period, or three times its period, …, up to a limit imposed by drift between clocks on the transmit and receive sides.

    • for Lotek ID tags, a run can have gaps (missing hits) up to a certain size. Beyond that size, measurement error and clock drift are large enough that we can’t be sure that the next detection of the same ID code is after a compatible time interval. So for tag A above, we’re not sure the gap between the last detection in run 1 and the first detection in run 4 is really compatible with tag A, so run 1 is ended and a new run is begun. (We could link runs 1 and 4 post hoc, once we saw that run 4 was being built with gaps compatible with tag A, but at present, the run-building algorithm doesn’t backtrack.)

    • for CTT tags, there are 2^32 possible codes, so each tag can be given its own unique ID. Unlike Lotek tags, runs are not limited to consecutive detections matching a precise period, in part because the period may vary depending on the energy available from the photovoltaic cell. Runs are still assembled from consecutive detections on a single antenna (or node, if applicable), as long as they are not separated by more than an arbitrary maximum gap (600 seconds).
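
The two run-building rules above can be sketched in a few lines. This is an illustrative Python sketch, not the Motus tag finder itself: the function names, the `tolerance` value, and the `max_skipped` limit are assumptions chosen for the example. Only the 600-second maximum gap for CTT tags comes from the text above. The Lotek version handles a single tag ID and period at a time and, like the real algorithm, does not backtrack.

```python
from typing import List


def build_lotek_runs(times: List[float], period: float,
                     tolerance: float = 0.05,
                     max_skipped: int = 3) -> List[List[float]]:
    """Group sorted hit times of one Lotek ID into runs for one period.

    A hit joins the current run when its gap from the previous hit is
    within `tolerance` seconds of k * period, for k = 1 .. max_skipped + 1.
    Otherwise the current run ends and a new one begins (no backtracking).
    `tolerance` and `max_skipped` are hypothetical values.
    """
    runs: List[List[float]] = []
    for t in times:
        if runs:
            gap = t - runs[-1][-1]
            k = round(gap / period)  # nearest whole number of periods
            if 1 <= k <= max_skipped + 1 and abs(gap - k * period) <= tolerance:
                runs[-1].append(t)
                continue
        runs.append([t])
    return runs


def build_ctt_runs(times: List[float], max_gap: float = 600.0) -> List[List[float]]:
    """Group sorted hit times of one CTT tag on one antenna into runs.

    No period is required; a run continues as long as consecutive hits
    are no more than `max_gap` seconds apart.
    """
    runs: List[List[float]] = []
    for t in times:
        if runs and t - runs[-1][-1] <= max_gap:
            runs[-1].append(t)
        else:
            runs.append([t])
    return runs


lotek = build_lotek_runs([0.0, 4.5, 9.0, 18.0, 200.0, 204.5], period=4.5)
ctt = build_ctt_runs([0.0, 30.0, 500.0, 1200.0, 1210.0])
print(len(lotek), len(ctt))
```

In the Lotek example, the gap of 9 seconds is accepted as one missed burst, while the gap of 182 seconds exceeds the allowed number of skipped bursts and starts a second run, much like runs 1 and 4 of tag A in the diagram.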

E.2 False positives

False positives (apparent detections of tags actually caused by other sources) exist in all technologies and need to be taken into account. They can happen for a number of reasons, and will sometimes affect Lotek and CTT tags to different degrees. False positives are difficult to identify, but there are ways to recognize the conditions in which they are more likely to occur, and potential ways to mitigate them.

  • Noisy environments: radio noise from interference can create bursts that look like tags. As the number of radio pulses from other sources in the environment increases, so does the expected number of false positives.

  • Bit errors: actual tag signals may be incorrectly transmitted. These errors should rarely produce valid IDs, but they are more likely with CTT tags, mostly because there is a very large number of possible codes (about 4 billion). Even so, they should rarely produce the ID of a tag that was actually manufactured. In CTT tags, bit errors have been found to lead more often to certain patterns (e.g. the trailing digits being all 7s or Fs, such as xxxFFFF, because only a partial signal was received), and those patterns have since been excluded from the list of valid tags.

  • Aliasing: aliasing happens when signals from multiple tags combine to create the appearance of a tag that is not present but matches another known tag. This can happen in at least two ways: 1) two tags with distinct IDs but the same period (both Lotek and CTT); 2) two tags with the same ID but distinct periods (Lotek only). In the first case, if the bursts of the two tags overlap, they may interfere with each other and create the appearance of a new tag for a while. Assuming that the two tags do not have exactly the same period, their bursts should eventually drift apart, so the alias is unlikely to persist over a very long run. In the second case, multiple tags with the same ID but distinct periods may produce apparent new periods, but these would rarely exceed 2 consecutive hits (run length = 2). Both types of aliasing are mostly problematic in environments where many tags are present simultaneously, which increases the incidence of overlapping bursts (e.g. colonies).

For both Lotek and CTT tags, their manufacturers have integrated various methods to reduce the incidence of false detections into their proprietary technologies. Data collected by Sensor Gnomes are processed in Motus by the “Tag finder”, which looks at properties of the signal to exclude likely false positives (e.g. higher deviation in the frequency of pulses within a burst). The parameters can potentially be adjusted, but an aggressive approach aimed at reducing false positives will also result in an increase in false negatives.

  • Run length: With both tag technologies, the likelihood of false positives should decrease as run length increases. For Lotek tags, we generally recommend ignoring runs comprising only 2 or 3 hits. Some (many?) probably relate to real tag detections, but the vast majority probably do not. For CTT tags, we do not yet have a suggested minimum threshold. In most cases, given that the tag period is short, one would expect true single-hit runs to be quite rare, so those should likely be excluded to be safe. The likelihood of obtaining the same false ID in consecutive detections is probably very low, except for some specific tag IDs that are more prone to error. Runs of 2 detections or more are probably safe in most instances.

  • Measures of noise: A higher number of radio pulses is likely a good predictor of false positives, though other factors are also at play, and not all radio noise is problematic in the same way. We recommend that you use the number of runs comprising only 2 hits, which is provided by the activity table (see details here). In noisy environments, shorter runs are much more likely to be false positives and should be excluded.

  • Missed detections: False positives should more often lead to gaps in detections during a run. A run that contains several gaps in detection would therefore be deemed less reliable. A simple metric for this is the number of hits in a run (the run length) divided by the duration of the run (tsEnd - tsStart). Longer runs with few or no gaps in detection would be optimal.

  • Overlapping runs. Runs of the same tag on the same antenna should also be an indication of false positives, but this should be highly correlated with the number and/or ratio of short runs described above.

  • Spatio-temporal models. State-space models and other approaches can be used to assess whether movements make sense from a biological point of view, and can help assign probabilities that individual detections are valid.

  • If you have any suggestions about techniques you are using to separate true detections from false positives, we encourage you to share them with us!
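
The run-length and hit-rate heuristics in the list above can be combined into a simple filter. The following is a minimal Python sketch for illustration only: the `Run` record, the function names, and the `min_rate` parameter are assumptions, not part of the motus R package. The `min_hits` values used in the example follow the recommendations above (ignore Lotek runs of 2 or 3 hits; CTT runs of 2 or more hits are probably safe).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    run_id: int
    n_hits: int       # run length
    ts_start: float   # tsStart, in seconds
    ts_end: float     # tsEnd, in seconds


def hit_rate(run: Run) -> Optional[float]:
    """Hits per second over the run (run length / (tsEnd - tsStart)).

    Returns None when the duration is zero, e.g. for a single-hit run,
    because the ratio is undefined in that case.
    """
    duration = run.ts_end - run.ts_start
    return run.n_hits / duration if duration > 0 else None


def keep_run(run: Run, min_hits: int, min_rate: float = 0.0) -> bool:
    """Keep runs that are long enough and not too sparse.

    `min_rate` is a hypothetical threshold; a sensible value depends on
    the tag's period, which this sketch does not model.
    """
    if run.n_hits < min_hits:
        return False
    rate = hit_rate(run)
    return rate is None or rate >= min_rate


long_run = Run(run_id=1, n_hits=10, ts_start=100.0, ts_end=145.0)
short_run = Run(run_id=2, n_hits=3, ts_start=0.0, ts_end=9.0)
print(keep_run(long_run, min_hits=4), keep_run(short_run, min_hits=4))
```

Because the expected hit rate depends on each tag's period, comparing raw hit rates is only meaningful among runs of tags with similar periods.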

E.3 False negatives

False negatives are usually even more difficult to detect.

  • Faulty equipment or installation is always possible, of course. Please refer to the installation guidelines to make sure you follow all the recommendations.

  • DO NOT FORGET to register your tags! (details here)

  • DO NOT FORGET to activate your tags before deploying them!

  • Make sure that you report your tag and receiver deployment details BEFORE uploading your data. The Motus tag finder only seeks tags that are known to be deployed.

E.4 Complication: data are processed in batches

The picture above is complicated by several facts:

  • receivers are often deployed to isolated areas so that we can only obtain raw data from them occasionally

  • receivers are not aware of the full set of currently-active tags

  • sensorgnome receivers do not “know” the full Lotek code set; they record pulses thought to be from Lotek coded ID tags, but are only able to assemble them into tag hits for a small set of tags for which they have an on-board database, built from the user’s own recordings of their tags. This limitation is due to restrictions in the agreement between Lotek and Acadia University for our use of their codeset.

  • Lotek receivers report each detected tagID independently, and do not assemble them into runs. This means a raw Lotek .DTA file does not distinguish between tag 123:4.5 and tag 123:9.1 (i.e. between tags with ID 123 and burst intervals 4.5 seconds and 9.1 seconds).

It follows that:

  • raw receiver data must be processed on a server with full knowledge of:

    • which tags are deployed and likely active

    • the Lotek codeset(s)

  • raw data should be processed in incremental batches

  • processed data should be distributed to users in incremental batches, especially if they wish to obtain results “as they arrive”, rather than in one lump after all their tags have expired.
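
To illustrate why the server needs to know which tags are deployed, the sketch below checks which registered burst intervals could explain the gap between two raw hits of the same Lotek ID. It is a simplified Python illustration, not the actual tag finder: the function name, the `tolerance`, and the `max_multiple` limit are assumptions, and the tag labels reuse the 123:4.5 and 123:9.1 example from above.

```python
from typing import Dict, List


def compatible_tags(gap: float, burst_intervals: Dict[str, float],
                    tolerance: float = 0.02,
                    max_multiple: int = 5) -> List[str]:
    """Return the deployed tags whose burst interval can explain `gap`.

    A tag is compatible when the gap is within `tolerance` seconds of a
    whole multiple (1 .. max_multiple) of its burst interval. Both
    thresholds are hypothetical.
    """
    matches = []
    for tag, interval in burst_intervals.items():
        k = round(gap / interval)
        if 1 <= k <= max_multiple and abs(gap - k * interval) <= tolerance:
            matches.append(tag)
    return matches


# Deployed tags sharing ID 123, keyed by "ID:burst interval"
deployed = {"123:4.5": 4.5, "123:9.1": 9.1}
print(compatible_tags(9.1, deployed))    # ['123:9.1']
print(compatible_tags(13.5, deployed))   # ['123:4.5']
```

Note that a 9.1-second gap is close to twice 4.5 seconds, so the result depends on how tight the tolerance is. This is one reason the precision of the burst interval matters for telling apart tags that share an ID.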

E.5 Batches

A batch is the result of processing one collection of raw data files from a receiver. Batches are not inherent in the data, but instead reflect how data were processed. Batches arise in these ways:

  • a user visits an isolated receiver, downloads raw data files from it, then uploads the files to motus.org once they are back from the field

  • a receiver connected via WiFi, ethernet, or cell modem is polled for new data files; this typically happens every hour, with random jitter to smooth the processing load on the motus server

  • an archive of data files from a receiver is re-processed on the motus server, because important metadata have changed (e.g. new or changed tag deployment records), or because a significant change has been made to processing algorithms.

Batches are artificial divisions in the data stream, so runs of hits will often cross batch boundaries. Adding this complication to the picture above gives this:

Receiver R
                                                 Time ->
   Tag A: /         1-----1--1-|---1-----1-----1            4---4|-----4--4-------4--4-/
          \. . . . . . . . . . | . . . . . . . . . . . . . . . . | . . . . . . . . . . \
   Tag B: /                    |   3-----3--3--3--3--3-------3---|-3--3                /
          \. . . . . . . . . . | . . . . . . . . . . . . . . . . | . . . . . . . . . . \
   Tag C: /            2--2---2|--2--2--2--2--2--2--2--2--2--2--2|--2--2--2--2--2--2   /
                               |                                 |
          <---- Batch N ------>|<------- Batch N+1 ------------->|<----- Batch N+2 ---->

E.5.1 Receiver Reboots

A receiver reboots when it is powered down and then (possibly much later) powered back up. Reboots often correspond to a receiver:

  • being redeployed
  • having its software updated
  • or having a change made to its attached radios,

so motus treats receiver reboots in a special way:

  • a reboot always begins a new batch; i.e. batches never extend across reboots. This simplifies determination of data ownership. For example, all data in a boot session (time period between consecutive reboots) are deemed to belong to the same motus project. This reflects the fact that a receiver is (almost?) always turned off between the time it is deployed by one project, and the time it is redeployed by another project.

  • any active tag runs are ended when a receiver reboots. Even if the same tag is present and broadcasting, and even if the reboot takes only a few minutes, hits of a tag before and after the reboot will belong to separate runs. This is partly for convenience in determining data ownership, as mentioned above. It is also necessary because sometimes receiver clocks are not properly set by the GPS after a reboot, and so the timestamps for that boot session will revert to a machine default, e.g. 1 Jan 2000. Although runs from these boot sessions could in principle be re-assembled post hoc if the system clock can be pinned from information in field notes, this is not done automatically at present.

  • parameters to the tag-finding algorithm are set on a per-batch basis. At some field sites, we want to allow more lenient filtering because there is very little radio noise. At other sites, filtering should be more strict, because there is considerable noise and high false-positive rates for tags. motus allows projects to set parameter overrides for individual receivers, and these overrides are applied by boot session, because redeployments (always?) cause a reboot.

  • when reprocessing data (see below) from an archive of data files, each boot session is processed as a batch.
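
One way to picture the reboot rules above is to key every detection lane by boot session before any runs are built, so that a run can never span a reboot. The Python sketch below is illustrative only: the tuple layout and the `boot_num` field are hypothetical and do not reflect the actual motus schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (boot_num, antenna, tag_id, ts): a hypothetical record layout
Hit = Tuple[int, int, str, float]
LaneKey = Tuple[int, int, str]


def lanes_by_boot_session(hits: Iterable[Hit]) -> Dict[LaneKey, List[float]]:
    """Split hits into lanes keyed by (boot session, antenna, tag).

    Runs are then built within each lane separately, so hits recorded
    before and after a reboot always end up in different runs.
    """
    lanes: Dict[LaneKey, List[float]] = defaultdict(list)
    for boot_num, antenna, tag_id, ts in hits:
        lanes[(boot_num, antenna, tag_id)].append(ts)
    return {key: sorted(times) for key, times in lanes.items()}


# In boot session 2 the clock was not set by the GPS, so timestamps
# restart near the machine default and cannot be compared with session 1.
hits = [
    (1, 1, "A", 100.0), (1, 1, "A", 105.0),
    (2, 1, "A", 3.0), (2, 1, "A", 8.0),
]
lanes = lanes_by_boot_session(hits)
print(sorted(lanes))
```

Keeping the boot session in the key also sidesteps the clock problem described above, since timestamps are only ever compared within a single boot session.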

E.5.2 Incremental Distribution of Data

The Motus R package allows users to build a local copy of the database of all their tags’ (or receivers’) hits incrementally. A user can regularly call the tagme() function to obtain any new hits of their tags. Because data are processed in batches, tagme() either does nothing, or downloads one or more new batches of data into the user’s local DB.

Each new batch corresponds to a set of files processed from a single receiver. A batch record includes these items:

  • receiver device ID
  • how many hits of the user’s tags occurred in the batch
  • first and last timestamp of the raw data processed in this batch

Each new batch downloaded will include hits of one or more of the user’s tags (or someone’s tags, if the batch is for a “receiver” database).

A new batch might also include some GPS fixes, so that the user knows where the receiver was when the tags were detected.

A new batch will include information about runs. This information comes in three versions:

  • information about a new run; i.e. one that begins in this batch
  • information about a continuing run; i.e. a run that began in a previous batch, has some hits in this batch, and is not known to have ended
  • information about an ending run; i.e. a run that began in a previous batch, might have some hits in this batch, but which is also known to end in this batch (because a sufficiently long time has elapsed since the last detection of its tag)

Although the unique runID identifier for a run doesn’t change when the user calls tagme(), the number of hits in that run and its status (done or not), might change.
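
The three kinds of run information (new, continuing, ending) can be handled by a single update rule on the client side. The following Python sketch is a hypothetical illustration of that bookkeeping, not the code used by tagme(): apart from runID, the field names (`hits`, `done`, `len`) are assumptions.

```python
from typing import Dict, List


def apply_batch(runs: Dict[int, dict], batch_runs: List[dict]) -> Dict[int, dict]:
    """Merge the run records of one batch into a local run table.

    `runs` maps runID -> {"len": total hits so far, "done": ended or not}.
    Each record in `batch_runs` carries the runID, the number of hits the
    run gained in this batch, and whether the run is known to have ended.
    The runID never changes; only the length and status are updated.
    """
    for rec in batch_runs:
        run = runs.setdefault(rec["runID"], {"len": 0, "done": False})
        run["len"] += rec["hits"]
        run["done"] = run["done"] or rec["done"]
    return runs


local_runs: Dict[int, dict] = {}
# Batch N: run 1 begins
apply_batch(local_runs, [{"runID": 1, "hits": 3, "done": False}])
# Batch N+1: run 1 continues and ends; run 4 begins
apply_batch(local_runs, [
    {"runID": 1, "hits": 3, "done": True},
    {"runID": 4, "hits": 2, "done": False},
])
print(local_runs)
```

A new run is simply a runID not yet in the table, so the same rule covers all three cases.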

E.6 Reprocessing Data

motus will occasionally need to reprocess raw files from receivers. There are several reasons:

  • new or modified tag deployment records. The tag detection code relies on knowing the active life of each tag it looks for, to control rates of false positive and false negative hits. If the deployment record for a tag only reaches the server after it has already processed raw files overlapping the tag’s deployment, then those files will need to be reprocessed in order to (potentially) find the tag therein. Similarly, if a tag was mistakenly sought during a period when it was not deployed, it will have “used up” signals that could instead have come from other tags, thereby causing both its own false positives, and false negatives on other tags. (This is only true for Lotek ID tags; CTT should be unaffected, provided deployed tags are well dispersed in the ID codespace.)

  • bug fixes or improvements in the tag finding algorithm

  • corrections of mis-filed data from receivers. Sometimes, duplication among receiver serial numbers (a rare event) is only noticed after data from them has already been processed. Those data will likely have to be reprocessed so that hits are assigned to the correct station. Interleaved data from two receivers having the same serial number will typically prevent hits from at least one of them, as the tag finder ignores data where the clock seems to have jumped backwards significantly.

E.7 The (eventual) Reprocessing Contract

Reprocessing can be very disruptive from the user’s point of view (“What happened to my hits?”), so motus reprocessing will be:

  1. optional: users should be able to obtain new data without having to accept reprocessed versions of data they already have.

  2. reversible: users should be able to “go back” to a previous version of any reprocessed data they have accepted.

  3. transparent: users will receive a record of what was reprocessed, why, when, what was done differently, and what changed

  4. all-or-nothing: for each receiver boot session for which users have data, these data must come entirely from either the original processing, or a subsequent single reprocessing event. The user must not end up with an undefined mix of data from original and reprocessed sources.

  5. in-band: the user’s copy of data will be updated to incorporate reprocessed data as part of the normal process of updating to obtain new data, unless they choose otherwise. We expect that most users will want to accept reprocessed data most of the time.

Initially, motus data processing might not adhere to this contract, but it is an eventual goal.

E.8 Reprocessing simplified: only by boot session

A general reprocessing scenario would look like this:

Receiver R
                                                 Time ->
   Tag A: /         1-----1-!1-|---1-----1-----1            4---4|-----4!-4-------4--4-/
          \. . . . . . . . .!. | . . . . . . . . . . . . . . . . | . . .!. . . . . . . \
   Tag B: /                 !  |   3-----3--3--3--3--3-------3---|-3--3 !              /
          \. . . . . . . . .!. | . . . . . . . . . . . . . . . . | . . .!. . . . . . . \
   Tag C: /            2--2-!-2|--2--2--2--2--2--2--2--2--2--2--2|--2--2!-2--2--2--2   /
                            !  |                                 |      !
          <---- Batch N ----!->|<------- Batch N+1 ------------->|<-----!Batch N+2 ---->
                            !                                           !
                            !<- Reprocess this period (no, too hard!) ->!

This is how reprocessing would look if raw data records from an arbitrary stretch of time could be reprocessed. However, this is complicated because runs like 1, 2, and 4 above might lose or gain hits within the reprocessing period, but not outside of it. This might even break an existing run into distinct new runs.

This situation is challenging (NB: not impossible; might be a TODO) to formalize and represent in the database if we want to maintain a full history of processing. For example, if reprocessing deletes some hits from run 2, how do we represent both the old and the new versions of that run?

The complications arise due to runs crossing the reprocessing period boundaries, so for simplicity we should choose a reprocessing period that no runs cross. Currently, that means a boot session, as discussed above.
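
The condition "no runs cross the reprocessing period" is easy to state as a check. The Python sketch below is illustrative only: the function name and the (first hit, last hit) representation of a run are assumptions, and the timestamps are made-up values arranged like the diagram above.

```python
from typing import Dict, List, Tuple


def crossing_runs(runs: Dict[int, Tuple[float, float]],
                  start: float, end: float) -> List[int]:
    """Return IDs of runs that straddle either boundary of [start, end].

    `runs` maps run ID -> (time of first hit, time of last hit). A run
    crosses a boundary when it has hits on both sides of it.
    """
    return sorted(
        run_id
        for run_id, (first, last) in runs.items()
        if first < start <= last or first <= end < last
    )


# Made-up timestamps laid out like the diagram: runs 1, 2 and 4 straddle
# a boundary of the period 20..70, while run 3 lies entirely inside it.
runs = {1: (10.0, 50.0), 2: (13.0, 80.0), 3: (27.0, 66.0), 4: (60.0, 85.0)}
print(crossing_runs(runs, 20.0, 70.0))   # [1, 2, 4]
print(crossing_runs(runs, 0.0, 100.0))   # []
```

The second call plays the role of a boot session: since runs are ended at every reboot, a period covering the whole boot session is never crossed by any run.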

E.9 Distributing reprocessed data

The previous section shows why we only reprocess data by boot session. Given that, how do we get reprocessed data to users while fulfilling the reprocessing contract?

Note that a reprocessed boot session will fully replace one or more existing batches and one or more runs, because batches and runs both nest within boot sessions.

Replacement of data by reprocessed versions should happen in-band (item 5 of the reprocessing contract above), so one approach is this:

  • the batches_for_XXX API entries should mark new batches which result from reprocessing, so that the client can deal with them appropriately. This can be done by adding a field called reprocessID with these semantics:
    • reprocessID == 0: data in this batch are from new raw files; the normal situation
    • reprocessID == X > 0: data in this batch are from reprocessing existing raw files.
    • X is the ID of the reprocessing event, and a new API entry reprocessing_info(X) can be called to obtain details about it.
    • if the user chooses to accept the reprocessed version, then existing batches, runs, hits and GPS fixes from the same receiver and boot session are retired before adding the new batches.
    • if the user chooses to reject the reprocessed version, then X is added to a client-side blacklist, and the user will not receive any data from batches whose reprocessID is on the blacklist.
    • later, if a user decides to accept a reprocessing event they had earlier declined, then the IDs of new batches for that event can be fetched from another new API entry, reprocessing_batches(X), and the original batches will be deleted.
  • to let users more efficiently fetch the “best” version of their dataset (i.e. accepting all reprocessing events), we should also mark batches which are subsequently replaced by a reprocessing event. For this, we add the field replacedIn with these semantics:
    • replacedIn == 0: data in this batch have not been replaced by any reprocessing event
    • replacedIn == X > 0: data in this batch have been replaced by reprocessing event X. Then the client can ignore any batches for which replacedIn > 0. We could also add a new boolean parameter unreplacedOnly to the batches_for_XXX API entries. It defaults to false, but if true, then only batches which have not been replaced by subsequent reprocessing events are returned.
  • users can choose a policy for how reprocessed data are handled by setting the value of Motus$acceptReprocessedData in their workspace before calling tagme():
    • Motus$acceptReprocessedData <- TRUE; always accept batches of data from reprocessing events
    • Motus$acceptReprocessedData <- FALSE; never accept batches of data from reprocessing events
    • Motus$acceptReprocessedData <- NA (default); ask about each reprocessing event
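
The proposed semantics of reprocessID, replacedIn, the client-side blacklist, and the acceptance policy can be sketched as a single selection step. This Python sketch illustrates the proposal above and is not an existing motus API: the function name, the `batchID` field, and the `ask` callback are assumptions, and Python's None stands in for R's NA.

```python
from typing import Callable, List, Optional, Set


def select_batches(batches: List[dict],
                   policy: Optional[bool] = None,
                   blacklist: Optional[Set[int]] = None,
                   ask: Callable[[int], bool] = lambda event_id: True) -> List[int]:
    """Return the batchIDs a client would keep under the proposed scheme.

    policy True  -> always accept reprocessing events
    policy False -> never accept them
    policy None  -> ask about each event via the `ask` callback
    Rejected event IDs are added to `blacklist` (modified in place).
    """
    blacklist = set() if blacklist is None else blacklist
    # Decide once per reprocessing event not already rejected.
    events = {b["reprocessID"] for b in batches if b["reprocessID"] > 0}
    for event_id in sorted(events - blacklist):
        accepted = ask(event_id) if policy is None else policy
        if not accepted:
            blacklist.add(event_id)
    kept = []
    for b in batches:
        if b["reprocessID"] in blacklist:
            continue  # reprocessed version the user rejected
        if b["replacedIn"] > 0 and b["replacedIn"] not in blacklist:
            continue  # original batch superseded by an accepted event
        kept.append(b["batchID"])
    return kept


batches = [
    {"batchID": 1, "reprocessID": 0, "replacedIn": 7},  # original, replaced by event 7
    {"batchID": 2, "reprocessID": 0, "replacedIn": 0},  # original, untouched
    {"batchID": 3, "reprocessID": 7, "replacedIn": 0},  # produced by event 7
]
print(select_batches(batches, policy=True))    # [2, 3]
print(select_batches(batches, policy=False))   # [1, 2]
```

Either outcome satisfies the all-or-nothing rule of the contract: the user ends up with the original batch or its reprocessed replacement, never both.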