A Python implementation of the interval binning scheme

These are some utility functions for working with the interval binning scheme as used in the UCSC Genome Browser. This scheme can be used to implement fast overlap-based querying of intervals, essentially mimicking an R-tree index.

Note

Some database systems natively support spatial index methods such as R-trees. See for example the PostGIS extension for PostgreSQL.

Although in principle the method can be used for binning any kind of intervals, be aware that the largest position supported by this implementation is \(2^{29}\) (which covers the longest human chromosome).

Usage

Let’s say you have a set of intervals I in a database system without support for spatial indexing. Querying I on overlap with an interval q can be done as:

\[\{ i \in I ~|~ \mathrm{overlapping}(i, q) \}\]\[\text{where}\]\[\begin{split}\mathrm{overlapping}(i, q) = i.start < q.stop \wedge i.stop > q.start\end{split}\]

But this will be slow, even with normal B-tree indexes on start and stop.

If for each interval i, we also store its bin as given by assign_bin() (and we index it), we can get the same result much faster by first filtering on overlapping_bins():

\[\{ i \in I ~|~ i.bin \in \mathrm{overlapping\_bins}(q) \wedge \mathrm{overlapping}(i, q) \}\]

Similarly, if i must completely contain q (or vice versa), you can use containing_bins() (or contained_bins()).

Installation

To install the latest release via PyPI using pip:

pip install interval-binning

The latest development version can be found on GitHub.

API documentation

binning.assign_bin(start, stop)[source]

Given an interval start:stop, return the smallest bin in which it fits.

Parameters:start, stop (int) – Interval positions (zero-based, open-ended).
Returns:Smallest bin containing start:stop.
Return type:int
Raises OutOfRangeError:
 If start:stop exceeds the range of the binning scheme.
binning.overlapping_bins(start, stop=None)[source]

Given an interval start:stop, return bins for intervals overlapping start:stop by at least one position. The order is according to the bin level (starting with the smallest bins), and within a level according to the bin number (ascending).

Parameters:start, stop (int) – Interval positions (zero-based, open-ended). If stop is not provided, the interval is assumed to be of length 1 (equivalent to stop = start + 1).
Returns:All bins for intervals overlapping start:stop, ordered first according to bin level (ascending) and then according to bin number (ascending).
Return type:list(int)
Raises OutOfRangeError:
 If start:stop exceeds the range of the binning scheme.
binning.containing_bins(start, stop=None)[source]

Given an interval start:stop, return bins for intervals completely containing start:stop. The order is according to the bin level (starting with the smallest bins), and within a level according to the bin number (ascending).

Parameters:start, stop (int) – Interval positions (zero-based, open-ended). If stop is not provided, the interval is assumed to be of length 1 (equivalent to stop = start + 1).
Returns:All bins for intervals containing start:stop, ordered first according to bin level (ascending) and then according to bin number (ascending).
Return type:list(int)
Raises OutOfRangeError:
 If start:stop exceeds the range of the binning scheme.
binning.contained_bins(start, stop=None)[source]

Given an interval start:stop, return bins for intervals completely contained by start:stop. The order is according to the bin level (starting with the smallest bins), and within a level according to the bin number (ascending).

Parameters:start, stop (int) – Interval positions (zero-based, open-ended). If stop is not provided, the interval is assumed to be of length 1 (equivalent to stop = start + 1).
Returns:All bins for intervals contained by start:stop, ordered first according to bin level (ascending) and then according to bin number (ascending).
Return type:list(int)
Raises OutOfRangeError:
 If start:stop exceeds the range of the binning scheme.
binning.covered_interval(bin)[source]

Given a bin number bin, return the interval covered by this bin.

Parameters:bin (int) – Bin number.
Returns:Tuple of start, stop being the zero-based, open-ended interval covered by bin.
Return type:tuple(int)
Raises OutOfRangeError:
 If bin number bin exceeds the maximum bin number.