Parameter Selection

mellon.parameters.compute_L(x, cov_func, gp_type=None, landmarks=None, Lp=None, rank=None, sigma=0, jitter=1e-06)

Compute a low-rank matrix \(L\) such that \(L L^\top \approx K\), where \(K\) is the full-rank covariance matrix on x.

Parameters:
  • x (array-like) – The training instances.

  • cov_func (function) – The Gaussian process covariance function.

  • gp_type (str or GaussianProcessType) –

    The type of sparsification used for the Gaussian Process:
    • ‘full’ Non-sparse Gaussian Process.

    • ‘full_nystroem’ Sparse GP with Nyström rank reduction without landmarks, which lowers the computational complexity.

    • ‘sparse_cholesky’ Sparse GP using landmarks/inducing points, typically employed to enable scalable GP models.

    • ‘sparse_nystroem’ Sparse GP using landmarks or inducing points, along with an improved Nyström rank reduction method that balances accuracy with efficiency.

  • landmarks (array-like, optional) – The landmark points. If None, computes a full rank decomposition. Defaults to None.

  • rank (int or float, optional) – The rank of the approximate covariance matrix. If rank is an int, an \(n \times\) rank matrix \(L\) is computed such that \(L L^\top \approx K\), the exact \(n \times n\) covariance matrix. If rank is a float 0.0 \(\le\) rank \(\le\) 1.0, the rank/size of \(L\) is selected such that the included eigenvalues of the covariance between landmark points account for the specified percentage of the sum of eigenvalues. Defaults to 0.99 if gp_type indicates Nyström.

  • sigma (float, array-like, optional) – Noise standard deviation of the data we condition on. Defaults to 0.

  • jitter (float, optional) – A small amount to add to the diagonal. Defaults to 1e-6.

  • Lp (array-like, optional) – Prespecified matrix \(L_p\) such that \(L_p L_p^\top = \Sigma_p\), where \(\Sigma_p\) is the full-rank covariance matrix on the inducing points. Defaults to None.

Returns:

\(L\) - Matrix such that \(L L^\top \approx K\).

Return type:

array-like

Raises:

ValueError – If the Gaussian Process type is unknown or if the shape of Lp is incorrect.
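
To make the float-valued rank described for compute_L concrete, here is a minimal NumPy sketch that builds a low-rank factor from an eigendecomposition of a small dense covariance matrix. It is not mellon's implementation: the kernel `rbf` and the helper `low_rank_factor` are hypothetical names, and the sketch decomposes the full matrix rather than the covariance between landmark points.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Hypothetical squared-exponential kernel, used only for illustration."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ls**2)

def low_rank_factor(K, rank=0.99):
    """Return L with L @ L.T approximately K, keeping the leading eigenpairs."""
    vals, vecs = np.linalg.eigh(K)            # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]    # reorder to descending
    vals = np.clip(vals, 0.0, None)           # guard against tiny negative values
    if isinstance(rank, float):
        # Smallest number of eigenvalues whose sum reaches the requested fraction.
        frac = np.cumsum(vals) / vals.sum()
        rank = int(np.searchsorted(frac, rank) + 1)
    return vecs[:, :rank] * np.sqrt(vals[:rank])

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
K = rbf(x, x)
L = low_rank_factor(K, rank=0.99)
rel_err = np.linalg.norm(L @ L.T - K) / np.linalg.norm(K)
```

Passing an int for `rank` in this sketch keeps exactly that many columns, mirroring the int/float distinction in the parameter description.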

mellon.parameters.compute_Lp(x, cov_func, gp_type=None, landmarks=None, sigma=0, jitter=1e-06)

Compute a matrix \(L_p\) such that \(L_p L_p^\top = \Sigma_p\), where \(\Sigma_p\) is the full-rank covariance matrix on the inducing points. If a full Nyström or sparse Nyström method is used, None is returned instead.

Parameters:
  • x (array-like) – The training instances.

  • cov_func (function) – The Gaussian process covariance function.

  • gp_type (str or GaussianProcessType) –

    The type of sparsification used for the Gaussian Process:
    • ‘full’ Non-sparse Gaussian Process.

    • ‘sparse_cholesky’ Sparse GP using landmarks/inducing points, typically employed to enable scalable GP models.

  • landmarks (array-like) – The landmark points.

  • sigma (float, optional) – Noise standard deviation of the data we condition on. Defaults to 0.

  • jitter (float, optional) – A small amount to add to the diagonal. Defaults to 1e-6.

Returns:

\(L_p\) - A matrix such that \(L_p L_p^\top = \Sigma_p\), or None if using full or sparse Nyström.

Return type:

array-like or None
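
The relation \(L_p L_p^\top = \Sigma_p\) can be illustrated with a Cholesky factorization of the landmark covariance. The sketch below is an assumption-laden illustration, not the library's code: `rbf` and `landmark_factor` are hypothetical helpers, and only the jitter term is added to the diagonal (the role of sigma is left out).

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Hypothetical squared-exponential kernel, used only for illustration."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ls**2)

def landmark_factor(landmarks, cov_func, jitter=1e-6):
    """Lower-triangular Lp with Lp @ Lp.T equal to the jittered landmark covariance."""
    sigma_p = cov_func(landmarks, landmarks)
    # A small diagonal term keeps the matrix numerically positive definite.
    sigma_p = sigma_p + jitter * np.eye(len(landmarks))
    return np.linalg.cholesky(sigma_p)

rng = np.random.default_rng(1)
landmarks = rng.normal(size=(20, 2))
Lp = landmark_factor(landmarks, rbf)
sigma_p = rbf(landmarks, landmarks) + 1e-6 * np.eye(20)
```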

mellon.parameters.compute_average_cell_count(x, normalize)

Compute the average cell count based on the normalize parameter and the input data x.

Parameters:
  • x (jax.numpy.ndarray) – Input array with shape (n_samples, n_features). The last column is assumed to contain the time identifiers.

  • normalize (bool, list, jax.numpy.ndarray, dict, or None) –

    The parameter controlling the normalization.

    • If True or None, returns the average cell count computed from x.

    • If a list or jax.numpy.ndarray, returns the average of the list or array.

    • If a dict, returns the average of the dict values.

Returns:

The average cell count computed based on the normalize parameter and x.

Return type:

float

Raises:

ValueError – If the type of normalize is not recognized.
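
The type dispatch described for normalize can be sketched as follows. The helper name `average_cell_count` is hypothetical, and the True/None branch assumes that "computed from x" means the mean number of rows per unique time identifier in the last column; the source does not spell this out.

```python
import numpy as np

def average_cell_count(x, normalize):
    """Hypothetical sketch of the dispatch on the type of `normalize`."""
    if normalize is None or normalize is True:
        # Assumption: count rows per unique time (last column), then average.
        _, counts = np.unique(x[:, -1], return_counts=True)
        return float(counts.mean())
    if isinstance(normalize, dict):
        return float(np.mean(list(normalize.values())))
    if isinstance(normalize, (list, np.ndarray)):
        return float(np.mean(normalize))
    # Anything else (including False, which the description above does not
    # cover) is treated as unrecognized in this sketch.
    raise ValueError(f"Unrecognized type for normalize: {type(normalize)}")

x = np.array([[0.1, 0.0], [0.2, 0.0], [0.3, 0.0], [0.4, 1.0]])
from_x = average_cell_count(x, True)               # three cells at t=0, one at t=1
from_dict = average_cell_count(x, {0: 10, 1: 30})
from_list = average_cell_count(x, [4, 6])
```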

mellon.parameters.compute_cov_func(cov_func_curry, ls, ls_time=None)

Computes the Gaussian process covariance function from its generator and length scales.

Parameters:
  • cov_func_curry (function or type) – The covariance function generator.

  • ls (float) – The length scale of the covariance function.

  • ls_time (float, optional) – The time-specific length scale of the covariance function. If provided, the returned covariance function will account for the time-specific dimension of the input data (last dimension assumed to be time). Defaults to None.

Returns:

cov_func – The resulting Gaussian process covariance function. If ls_time is provided, the covariance function is a product of two covariance functions, one for the feature dimensions and one for the time dimension. Otherwise, it’s a single covariance function considering only the feature dimensions.

Return type:

mellon.Covariance instance
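
The product structure used when ls_time is given can be illustrated without mellon's covariance classes. In this sketch, `make_rbf` is a hypothetical generator standing in for cov_func_curry, and `make_cov_func` is a hypothetical helper that multiplies a kernel on the feature columns by a kernel on the last (time) column.

```python
import numpy as np

def make_rbf(ls):
    """Hypothetical covariance generator: returns an RBF kernel with length scale ls."""
    def k(a, b):
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * sq / ls**2)
    return k

def make_cov_func(cov_func_curry, ls, ls_time=None):
    """Single kernel on all columns, or a feature kernel times a time kernel."""
    if ls_time is None:
        return cov_func_curry(ls)
    k_feat = cov_func_curry(ls)
    k_time = cov_func_curry(ls_time)
    def k(a, b):
        # The last column is assumed to hold the time coordinate.
        return k_feat(a[:, :-1], b[:, :-1]) * k_time(a[:, -1:], b[:, -1:])
    return k

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 2.0]])
cov = make_cov_func(make_rbf, ls=1.0, ls_time=2.0)
K = cov(pts, pts)
```

With these inputs the off-diagonal entry is exp(-0.5) from the feature kernel times exp(-0.5) from the time kernel.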

mellon.parameters.compute_d(x)

Computes the dimensionality of the data, equal to the size of axis 1.

Parameters:

x (array-like) – The training instances.

mellon.parameters.compute_d_factal(x, k=10, n=500, seed=432)

Computes the dimensionality of the data based on the average fractal dimension around n randomly selected cells.

Parameters:
  • x (array-like) – The training instances. Shape must be (n_samples, n_features).

  • k (int, optional) – Number of nearest neighbors to use in the algorithm. Defaults to 10.

  • n (int, optional) – Number of samples to randomly select. Defaults to 500.

  • seed (int, optional) – Random seed for sampling. Defaults to 432.

Returns:

The average fractal dimension of the data.

Return type:

float

Warning

If k is greater than the number of samples in x, a warning will be logged, and k will be set to the number of samples.

mellon.parameters.compute_density_diffusion(predictor, x, times=None)
mellon.parameters.compute_density_gradient(predictor, x, times=None)
mellon.parameters.compute_distances(x, k)

Computes the distances to the k nearest neighbors of each training instance.

Parameters:
  • x (array-like) – The training instances.

  • k (int) – The number of nearest neighbors to consider.

Returns:

distances - The k observed nearest neighbor distances.

Return type:

array-like
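
For small inputs, k-nearest-neighbor distances of the kind compute_distances returns can be reproduced by brute force. This is a sketch under stated assumptions: `knn_distances` is a hypothetical helper, it uses Euclidean distance, and it excludes each point from its own neighbor list. It needs O(n²) memory, so it is only suitable for illustration.

```python
import numpy as np

def knn_distances(x, k):
    """Distances from each row of x to its k nearest other rows, sorted ascending."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff**2).sum(-1))
    np.fill_diagonal(dist, np.inf)       # a point is not its own neighbor
    return np.sort(dist, axis=1)[:, :k]

x = np.array([[0.0], [1.0], [3.0]])
distances = knn_distances(x, k=2)
```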

mellon.parameters.compute_gp_type(n_landmarks, rank, n_samples)

Determines the type of Gaussian Process based on the landmarks, rank, and number of samples.

Parameters:
  • n_landmarks (int) – The number of landmark points for sparse computation.

  • rank (int or float) – The rank of the approximate covariance matrix.

  • n_samples (int) – The number of samples/cells.

Returns:

One of the Gaussian Process types defined in the GaussianProcessType Enum.

Return type:

GaussianProcessType

mellon.parameters.compute_initial_dimensionalities(x, mu_dim, mu_dens, L, nn_distances, d)

Computes an initial guess for the log dimensionality and log density at every cell state with Ridge regression.

Parameters:
  • x (array-like) – The cell states.

  • mu_dim, mu_dens (int) – The Gaussian Process means for the log dimensionality and the log density, respectively.

  • L (array-like) – A matrix such that \(L L^\top \approx K\), where \(K\) is the covariance matrix.

  • nn_distances (array-like) – The observed nearest neighbor distances.

  • d (array-like) – The local dimensionality of the data.

Returns:

initial_value

Return type:

array-like

mellon.parameters.compute_initial_ones(x, L)
mellon.parameters.compute_initial_value(nn_distances, d, mu, L)

Computes the initial value for Maximum A Posteriori optimization with Ridge regression, such that the initial value \(z\) minimizes \(||Lz + mu - mle(nn\text{_}distances, d)|| + ||z||\).

Parameters:
  • nn_distances (array-like) – The observed nearest neighbor distances.

  • d (int) – The local dimensionality of the data.

  • mu (int) – The Gaussian Process mean.

  • L (array-like) – A matrix such that \(L L^\top \approx K\), where \(K\) is the covariance matrix.

Returns:

initial_value - The argmin \(z\).

Return type:

array-like
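
The Ridge objective behind compute_initial_value has a closed-form minimizer when both norms are squared, which is the usual Ridge formulation and an assumption here (the description above writes the norms unsquared). In that case \(z = (L^\top L + I)^{-1} L^\top (t - \mu)\), where \(t\) stands for \(mle(nn\text{_}distances, d)\). The helper `ridge_initial_value` and the random test data are hypothetical.

```python
import numpy as np

def ridge_initial_value(L, mu, target):
    """Minimize ||L z + mu - target||^2 + ||z||^2 over z via the normal equations."""
    r = L.shape[1]
    return np.linalg.solve(L.T @ L + np.eye(r), L.T @ (target - mu))

rng = np.random.default_rng(2)
L = rng.normal(size=(30, 5))        # stand-in for a low-rank factor
target = rng.normal(size=30)        # stand-in for mle(nn_distances, d)
mu = -10.0
z = ridge_initial_value(L, mu, target)

# At the minimizer, the gradient of the squared objective vanishes.
gradient = L.T @ (L @ z + mu - target) + z
```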

mellon.parameters.compute_initial_zeros(x, L)
mellon.parameters.compute_landmarks(x, n_landmarks=5000)

Computes the landmark points as k-means centroids.

Landmark points are used to approximate the underlying structure of the input space. If the number of landmarks (n_landmarks) is zero or exceeds the number of available data points, the function will return None.

Parameters:
  • x (array-like) – The input data for which landmarks should be computed. Shape must be (n_samples, n_features).

  • n_landmarks (int, optional) – The desired number of landmark points. If less than 2 or greater than the number of data points, the function will return None. Defaults to DEFAULT_N_LANDMARKS.

Returns:

landmark_points – The coordinates of the computed landmark points, represented as k-means centroids. If no landmarks are computed, the function returns None. Shape is (n_landmarks, n_features).

Return type:

array-like or None

mellon.parameters.compute_landmarks_rescale_time(x, ls, ls_time, times=None, n_landmarks=5000)

Computes landmark points for time-rescaled input data using k-means centroids.

This function first rescales the temporal dimension of the input data by a factor derived from the spatial and temporal length scales (ls and ls_time). It then computes landmark points from the rescaled data. The last dimension of the landmarks is re-scaled back to the original time scale before being returned.

Parameters:
  • x (array-like) – The input data for which landmarks should be computed. If ‘times’ is None, the last column of ‘x’ is interpreted as the times. Shape must be (n_samples, n_features).

  • ls (float) – Length scale of the spatial covariance kernel. Must be positive.

  • ls_time (float) – Length scale of the temporal covariance kernel. Must be positive.

  • times (array-like, optional) – An array encoding the time points associated with each sample in ‘x’. If provided, it overrides the last column of ‘x’ as the times. Shape must be either (n_samples,) or (n_samples, 1).

  • n_landmarks (int, optional) – The desired number of landmark points. Defaults to DEFAULT_N_LANDMARKS.

Returns:

landmark_points – The coordinates of the computed landmark points, represented as k-means centroids in the original space, including the re-scaled temporal dimension. If no landmarks are computed, the function returns None. Shape is (n_landmarks, n_features).

Return type:

array-like or None

mellon.parameters.compute_ls(nn_distances)

Computes a length scale (ls) equal to the geometric mean of the positive nearest neighbor distances times a constant.

Parameters:

nn_distances (array-like) – The observed nearest neighbor distances. Must be non-empty.

Returns:

ls - The geometric mean of the nearest neighbor distances (after adjustment) times a constant.

Return type:

float
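
A geometric mean of positive distances is most safely computed in log space. The sketch below illustrates the compute_ls heuristic under assumptions: `length_scale` is a hypothetical helper, non-positive distances are simply dropped, and the multiplicative constant is exposed as a hypothetical `scale` argument because its value is not stated above.

```python
import numpy as np

def length_scale(nn_distances, scale=1.0):
    """Geometric mean of the positive distances, times a constant."""
    d = np.asarray(nn_distances, dtype=float)
    d = d[d > 0]                               # keep positive distances only
    return float(scale * np.exp(np.mean(np.log(d))))

ls_plain = length_scale([1.0, 4.0, 0.0])       # geometric mean of 1 and 4
ls_scaled = length_scale([1.0, 4.0, 0.0], scale=3.0)
```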

mellon.parameters.compute_mu(nn_distances, d)

Computes mu equal to the 1st percentile of \(mle(nn\text{_}distances, d) - 10\), where \(mle = \log(\text{gamma}(d/2 + 1)) - (d/2) \cdot \log(\pi) - d \cdot \log(nn\text{_}distances)\).

Parameters:
  • nn_distances (array-like) – The observed nearest neighbor distances.

  • d (int) – The local dimensionality of the data.

Returns:

mu - The 1st percentile of \(mle(nn\text{_}distances, d) - 10\).

Return type:

float
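
The formula for mu can be evaluated directly with the standard library's log-gamma function. This is a minimal sketch of the stated expression for a scalar d, with hypothetical helper names `log_density_mle` and `mean_from_percentile`; it is not taken from mellon's source.

```python
import math
import numpy as np

def log_density_mle(nn_distances, d):
    """log(gamma(d/2 + 1)) - (d/2) * log(pi) - d * log(nn_distances)."""
    r = np.asarray(nn_distances, dtype=float)
    return math.lgamma(d / 2 + 1) - (d / 2) * math.log(math.pi) - d * np.log(r)

def mean_from_percentile(nn_distances, d):
    """1st percentile of the MLE values, shifted down by 10."""
    return float(np.percentile(log_density_mle(nn_distances, d), 1) - 10)

# With d = 2 and all distances equal to 1/sqrt(pi), every MLE value is 0.
r = np.full(100, 1.0 / math.sqrt(math.pi))
mu = mean_from_percentile(r, d=2)
```

Subtracting 10 after taking the percentile gives the same result as taking the percentile of the shifted values, since a constant shift moves every percentile by the same amount.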

mellon.parameters.compute_n_landmarks(gp_type, n_samples, landmarks)

Compute the number of landmarks based on the given Gaussian Process type and landmarks.

Parameters:
  • gp_type (GaussianProcessType) – The type of the Gaussian Process. It helps to decide the number of landmarks.

  • n_samples (int) – The number of samples/cells.

  • landmarks (array-like or None) – The given landmarks. If specified, its shape determines the number of landmarks, unless conflicting with n_landmarks.

Returns:

computed_n_landmarks – The computed number of landmarks based on the gp_type, n_landmarks, shape of x, and landmarks.

Return type:

int

Raises:

ValueError – If the given number of landmarks, Gaussian Process type, and landmarks conflict with each other.

mellon.parameters.compute_nn_distances(x, save=True)

Compute the distance to the nearest neighbor for each instance in the provided training dataset.

This function calculates the Euclidean distance between each instance in the dataset and its closest neighbor. If save=True, any non-positive distances will be replaced with the minimum positive distance.

Parameters:
  • x (array-like of shape (n_samples, n_features)) – An array-like object representing the training instances.

  • save (bool, optional) – Whether to replace non-positive distances with the minimum positive distance. Default is True.

Returns:

nn_distances – An array of the Euclidean distances from each instance to its nearest neighbor in the input dataset. The ordering of the distances in this array corresponds to the ordering of the instances in the input data.

Return type:

array-like of shape (n_samples,)

Raises:

ValueError – If all distances are non-positive and save=True.
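
The effect of save=True on duplicated points can be seen in a brute-force sketch. The helper `nn_distances` below is hypothetical and uses O(n²) memory; it only mirrors the behavior described for compute_nn_distances, namely Euclidean nearest neighbor distances with non-positive values replaced by the minimum positive one.

```python
import numpy as np

def nn_distances(x, save=True):
    """Euclidean distance from each row of x to its nearest other row."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff**2).sum(-1))
    np.fill_diagonal(dist, np.inf)             # ignore self-distances
    nn = dist.min(axis=1)
    if save:
        positive = nn > 0
        if not positive.any():
            raise ValueError("All nearest neighbor distances are non-positive.")
        # Duplicated points get the smallest positive distance instead of 0.
        nn = np.where(positive, nn, nn[positive].min())
    return nn

x = np.array([[0, 0], [0, 0], [3, 4], [3, 6], [10, 0]])
raw = nn_distances(x, save=False)
saved = nn_distances(x, save=True)
```

Here the first two rows are identical, so their raw distance is 0 and the saved distance is 2, the smallest positive value in the data.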

mellon.parameters.compute_nn_distances_within_time_points(x, times=None, d=None, normalize=False)

Computes the distance to the nearest neighbor for each training instance within the same time point group. It retains the original order of instances in x.

Parameters:
  • x (array-like) – The training instances. If ‘times’ is None, the last column of ‘x’ is interpreted as the times. Shape must be (n_samples, n_features).

  • times (array-like, optional) – An array encoding the time points associated with each cell/row in ‘x’. If provided, it overrides the last column of ‘x’ as the times. Shape must be either (n_samples,) or (n_samples, 1).

  • d (int, array-like or None) – The intrinsic dimensionality of the data, i.e., the dimensionality of the embedded manifold. Only required for the normalization. Defaults to None.

  • normalize (bool, list, array-like, or dict, optional) –

    Controls the normalization for varying cell counts across time points to adjust for sampling bias by modifying the nearest neighbor distances.

    • If True, normalizes to simulate a constant total cell count divided by the number of time points.

    • If False, the raw cell counts per time point are reflected in the nearest neighbor distances.

    • If a list or array-like, assumes total cell counts for time points, ordered from earliest to latest.

    • If a dict, maps each time point to its total cell count. Must cover all unique time points.

    Default is False.

Returns:

nn_distances – The observed nearest neighbor distances within the same time point group, preserving the order of instances in x.

Return type:

array-like
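
Grouping by time point while preserving the input order can be sketched with index arrays. The helper `nn_within_time_points` is hypothetical: it covers only the case where times sit in the last column of x, it skips the normalization step entirely, and a time point with a single cell yields an infinite distance.

```python
import numpy as np

def nn_within_time_points(x):
    """Nearest neighbor distance of each row among rows sharing its time value."""
    x = np.asarray(x, dtype=float)
    times = x[:, -1]                           # last column holds the time
    out = np.empty(len(x))
    for t in np.unique(times):
        idx = np.flatnonzero(times == t)       # original positions of this group
        pts = x[idx, :-1]
        diff = pts[:, None, :] - pts[None, :, :]
        dist = np.sqrt((diff**2).sum(-1))
        np.fill_diagonal(dist, np.inf)
        out[idx] = dist.min(axis=1)            # write back in the original order
    return out

# One feature column followed by a time column; the groups are interleaved.
x = np.array([[0, 0], [5, 1], [1, 0], [7, 1], [4, 0]])
nn = nn_within_time_points(x)
```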

mellon.parameters.compute_rank(gp_type)

Compute the appropriate rank reduction based on the given Gaussian Process type.

Parameters:

gp_type (GaussianProcessType) – The type of the Gaussian Process. It helps to decide the rank.

Returns:

computed_rank – The computed rank value based on the gp_type, rank, and shape of x.

Return type:

float or int or None

Raises:

ValueError – If the given rank and Gaussian Process type conflict with each other.

mellon.parameters.compute_time_derivatives(predictor, x, times=None)