Time-Continuous Density Estimation¶
This tutorial aims to guide you through the process of Time-Continuous Density Estimation using Mellon. The density estimate helps in understanding how the probability of a specific cell state existing in the tissue varies over time.
Let’s start by importing the necessary modules:
[8]:
import numpy as np
import pandas as pd
import mellon
import scanpy as sc
import matplotlib
import matplotlib.pyplot as plt
Here’s how you can set up your notebook to display matplotlib graphics and configure some aesthetic parameters.
[9]:
# This line enables the interactive display of figures in Jupyter notebook.
%matplotlib inline
# Customize matplotlib settings
matplotlib.rcParams["figure.dpi"] = 100 # Adjusts the resolution of the figure.
matplotlib.rcParams["image.cmap"] = "Spectral_r" # Sets the default colormap.
matplotlib.rcParams["figure.figsize"] = [6, 6] # Defines the default figure size.
# Disabling the axes for a more minimalistic plot look:
matplotlib.rcParams["axes.spines.bottom"] = False
matplotlib.rcParams["axes.spines.top"] = False
matplotlib.rcParams["axes.spines.left"] = False
matplotlib.rcParams["axes.spines.right"] = False
Step 1: Reading and Displaying the Dataset¶
Initially, we will load the RNA AnnData file from a provided URL, saving it to a local disk for further analysis. If you prefer to store this file at a different location, modify the rna_annData_file path accordingly. The dataset contains preprocessed information pertaining to the mouse gastrulation atlas.
Note: The dataset with all the information is ~20G.
[10]:
url = "https://fh-pi-setty-m-eco-public.s3.amazonaws.com/mellon-tutorial/preprocessed_mouse-gastrulation-atlas.h5ad"
annData_file = "data/preprocessed_mouse-gastrulation-atlas.h5ad"
ad = sc.read(annData_file, backup_url=url)
sc.pl.scatter(ad, basis="umap", color="colored_celltypes")
[11]:
ad
[11]:
AnnData object with n_obs × n_vars = 98192 × 16695
obs: 'barcode', 'sample', 'stage', 'sequencing.batch', 'theiler', 'doub.density', 'doublet', 'cluster', 'cluster.sub', 'cluster.stage', 'cluster.theiler', 'stripped', 'celltype', 'colour', 'umapX', 'umapY', 'haem_gephiX', 'haem_gephiY', 'haem_subclust', 'endo_gephiX', 'endo_gephiY', 'endo_trajectoryName', 'endo_trajectoryDPT', 'endo_gutX', 'endo_gutY', 'endo_gutDPT', 'endo_gutCluster', 'TP_density', 'custom_exclude', 'exclude_mask', 'exclude', 'colored_celltypes', 'palantir_pseudotime', 'palantir_entropy'
var: 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'DM_EigenValues', 'celltype_colors', 'colored_celltypes_colors', 'exclude_colors', 'hvg', 'neighbors', 'palantir_waypoints', 'pca', 'stage_colors', 'umap'
obsm: 'DM_EigenVectors', 'DM_EigenVectors_multiscaled', 'X_pca', 'X_umap', 'X_umap_paper', 'branch_masks', 'palantir_fate_probabilities'
varm: 'PCs'
layers: 'MAGIC_imputed_data', 'local_variability'
obsp: 'DM_Kernel', 'DM_Similarity', 'connectivities', 'distances'
Step 2: Numerical Representation of Time¶
The time at which the samples were taken must be present as a numerical value. This will help us assess the similarity of different samples based on their time of collection.
[12]:
stage_to_numerical = {
"E6.5": 6.5,
"E6.75": 6.75,
"E7.0": 7.0,
"E7.25": 7.25,
"E7.5": 7.5,
"E7.75": 7.75,
"E8.0": 8.0,
"E8.25": 8.25,
"E8.5": 8.5,
}
numerical_to_stage = {n: s for s, n in stage_to_numerical.items()}
ad.obs["stage_numerical"] = pd.Series(
ad.obs["stage"].map(stage_to_numerical), dtype=float
)
sampled_times = ad.obs[["stage", "stage_numerical"]].drop_duplicates(ignore_index=True)
sampled_times
[12]:
stage | stage_numerical | |
---|---|---|
0 | E6.5 | 6.50 |
1 | E7.5 | 7.50 |
2 | E6.75 | 6.75 |
3 | E7.75 | 7.75 |
4 | E7.0 | 7.00 |
5 | E8.0 | 8.00 |
6 | E8.5 | 8.50 |
7 | E7.25 | 7.25 |
8 | E8.25 | 8.25 |
Step 3: Time-Continuous Density Estimation¶
In this step, we utilize the Time-Sensitive Density Estimator to examine the evolution of densities within our dataset over time.
We set the intrinsic dimensionality to 2. This parameter setting doesn’t significantly affect the final density ranking, but it plays a role in facilitating the conversion from log-density to linear density later in our analysis.
The time length scale ls_time
is set using a heuristic defined as \(1.5\cdot\Delta_t=0.375\), where \(\Delta_t\) represents the distance between consecutive time steps. This parameter adjusts the smoothness of the density over time, and our heuristic choice ensures a smooth interpolation between the available samples.
If you prefer a more data-driven approach, albeit computationally intensive, you can either leave the ls_time
parameter unspecified or set it to None
. In this case Mellon will compute and compare the density for each time point individually, thus inferring the rate of change in similarity over time.
[13]:
%%time
X = ad.obsm["DM_EigenVectors"]
X_times = ad.obs["stage_numerical"]
# Initialize the time-sensitive density estimator with an intrinsic dimensionality of 2
t_est = mellon.TimeSensitiveDensityEstimator(d=2, ls_time=0.375)
# Fit the estimator to the data
t_est.fit(X, X_times)
# Save the predictor for later density evaluations
density_predictor = t_est.predict
[2023-07-03 12:21:10,863] [INFO ] Computing nearest neighbor distances within time points.
[2023-07-03 12:21:21,749] [INFO ] Using covariance function (Matern52(ls=0.0028995675014713263, active_dims=slice(None, -1, None)) * Matern52(ls=0.375, active_dims=-1)).
[2023-07-03 12:21:21,888] [INFO ] Computing 5,000 landmarks with k-means clustering.
[2023-07-03 12:22:05,270] [INFO ] Doing low-rank Cholesky decomposition for 98,192 samples and 5,000 landmarks.
[2023-07-03 12:23:18,595] [INFO ] Estimating approximation accuracy since 98,192 samples are more than 10 x 5,000 landmarks.
[2023-07-03 12:24:11,710] [WARNING ] High approx. rank fraction (91.2%). Consider increasing 'n_landmarks'.
[2023-07-03 12:24:11,713] [INFO ] Using rank 5,000 covariance representation.
[2023-07-03 12:24:19,101] [INFO ] Running inference using L-BFGS-B.
[2023-07-03 12:26:36,361] [INFO ] Computing predictive function.
CPU times: user 21min 14s, sys: 8min 44s, total: 29min 59s
Wall time: 5min 29s
Step 4: Density Evaluation and Plotting¶
The density_predictor
is a subclass of the mellon.base_predictor.PredictorTime. Thus the cell-state density can be evaluated for any time point. For instance, we can evaluate the density for all cell states at stage E7.10 wich is notably not covered by a sample.
[14]:
%%time
time = 7.10
ad.obs[f"log_density_at_E{time}"] = density_predictor(X, time)
sc.pl.embedding(ad, basis="umap", color=f"log_density_at_E{time}")
CPU times: user 45 s, sys: 1min 23s, total: 2min 8s
Wall time: 47.7 s
Similarly, we can compute the rate of density change for any time point and cell state.
[15]:
%%time
ad.obs[f"log_density_change_at_E{time}"] = density_predictor.time_derivative(X, time)
sc.pl.embedding(
ad,
basis="umap",
color=f"log_density_change_at_E{time}",
color_map="RdBu_r",
vcenter=0,
)
CPU times: user 29 s, sys: 27.8 s, total: 56.8 s
Wall time: 16.8 s
Step 5: Generating a Continuous Trajectory¶
We can use mellon’s FunctionEstimator to determine a continuous differentiation trajectory through the selected cell-state space. For illustration, we will start by finding a continuous trajectory through the UMAP state representation of cells.
[16]:
%%time
idx = ad.obsm["branch_masks"]["Erythroid3"]
pseudotime = ad.obs.loc[idx, "palantir_pseudotime"].values
pseudotime_grid = np.linspace(np.min(pseudotime), np.max(pseudotime), 200)
umap_est = mellon.FunctionEstimator(ls=1, sigma=5, n_landmarks=10)
umap_trajectory = umap_est.fit_predict(
pseudotime, ad.obsm["X_umap"][idx, :], pseudotime_grid
)
[2023-07-03 12:27:44,288] [INFO ] Using covariance function Matern52(ls=1.0).
[2023-07-03 12:27:44,309] [INFO ] Computing 10 landmarks with k-means clustering.
CPU times: user 2.75 s, sys: 806 ms, total: 3.56 s
Wall time: 1.55 s
This trajectory can be visualized using matplotlib.
[17]:
fig, ax = plt.subplots(1, 1)
ax.scatter(ad.obsm["X_umap"][~idx, 0], ad.obsm["X_umap"][~idx, 1], c="#CFD5E2", s=1)
ax.scatter(ad.obsm["X_umap"][idx, 0], ad.obsm["X_umap"][idx, 1], c="#377EB8", s=1)
ax.plot(umap_trajectory[:, 0], umap_trajectory[:, 1], color="black")
ax.set_xticks([])
ax.set_yticks([])
plt.show()
Generating a Trajectory in State Space¶
Similar to how we created a trajectory through the UMAP state representation, we are now going to create a trajectory within the more complex, higher-dimensional space of the diffusion map of the cells. However, given the high-dimensionality of this space, we will not attempt to provide a comprehensive visualization. Additionally, to address the reduced variability observed within the diffusion components, we modulate the sigma
parameter to 0.1.
[18]:
%%time
diffcomp_est = mellon.FunctionEstimator(ls=1, sigma=0.1, n_landmarks=100)
diffcomp_trajectory = diffcomp_est.fit_predict(
pseudotime, ad.obsm["DM_EigenVectors"][idx, :], pseudotime_grid
)
[2023-07-03 12:27:46,218] [INFO ] Using covariance function Matern52(ls=1.0).
[2023-07-03 12:27:46,219] [INFO ] Computing 100 landmarks with k-means clustering.
CPU times: user 4.1 s, sys: 2.31 s, total: 6.41 s
Wall time: 1.47 s
Step 6: Evaluating Density Along the Trajectory Over Time and Visualization¶
We can evaluate the density predictor along the trajectory represented by diffcomp_trajectory
across various time points denoted by time_grid_raw
using the multi_time
parameter.
[19]:
%%time
time_grid_raw = np.linspace(
X_times.astype(float).min(), X_times.astype(float).max(), 200
)
densities = density_predictor(diffcomp_trajectory, multi_time=time_grid_raw)
CPU times: user 16.9 s, sys: 29.8 s, total: 46.7 s
Wall time: 15.4 s
To illustrate temporal changes in density along the trajectory, a 2D density map can be produced using pyplot.matshow
. Grey vertical lines are added to denote the specific time points when the samples were collected.
[20]:
aspect = densities.shape[1] / densities.shape[0]
plt.matshow(densities, origin="lower", aspect=aspect)
# Function to convert time to corresponding matrix position
def time_to_matrix_position(times):
return (
len(time_grid_raw) * (times - np.min(times)) / (np.max(times) - np.min(times))
)
plt.xlabel("Time")
plt.ylabel("Erythroid trajectory")
num_time = sampled_times["stage_numerical"].astype(float).values
plt.xticks(time_to_matrix_position(num_time), sampled_times["stage"])
plt.vlines(time_to_matrix_position(num_time), 0, len(pseudotime_grid), color="grey")
plt.gca().xaxis.tick_bottom()
plt.yticks(
np.quantile(np.arange(len(pseudotime_grid)), np.array([0.2, 0.8])),
np.array([0.2, 0.8]),
)
plt.show()
In this section, we generate a trajectory in the state space, which offers a more nuanced view of the cell states. The density predictor is evaluated along this trajectory for multiple time points. Subsequently, we visualize the resulting 2D density map to observe how cell-state densities vary along the trajectory and across different time points.
Step 7: Saving the Predictor¶
The density predictor can be serialized and saved for future use. It can be stored in the AnnData uns
attribute, and the AnnData can be saved with the write
method. The predictor can be reconstructed using the from_dict method of mellon.Predictor.
[21]:
ad.uns["density_predictor"] = density_predictor.to_dict()
To save the AnnData:
[22]:
ad.write("data/preprocessed_mouse-gastrulation-atlas_with_density_predictor.h5ad")
To reconstruct the predictor:
[23]:
density_predictor = mellon.Predictor.from_dict(ad.uns["density_predictor"])
By the end of this tutorial, you should have a basic understanding of time-continuous density estimation with Mellon. This knowledge can be valuable in many areas of single-cell analysis of developmental and regenerative tissue, where understanding the dynamics of cell states over time is crucial.