genetools package
Submodules
genetools.helpers module
Pandas/Numpy common recipes.
- genetools.helpers.barcode_split(obs_names, separator='-', colname_barcode='barcode', colname_library='library_id')
Split single cell barcodes such as ATGC-1 into a barcode column with value “ATGC” and a library ID column with value 1.
Recommended usage with scanpy:
    adata.obs = genetools.helpers.horizontal_concat(
        adata.obs,
        genetools.helpers.barcode_split(adata.obs_names)
    )
- Parameters
obs_names (pandas.Series or pandas.Index) – Cell barcodes with a library ID suffix.
separator (str, optional) – library ID separator, defaults to ‘-’
colname_barcode (str, optional) – output column name containing barcode without library ID suffix, defaults to ‘barcode’
colname_library (str, optional) – output column name containing library ID suffix as an int, defaults to ‘library_id’
- Returns
Two-column dataframe containing barcode prefix and library ID suffix.
- Return type
pandas.DataFrame
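The split described above can be sketched with plain pandas. This is a minimal illustration, not the genetools source: the helper name `split_barcodes_sketch` is hypothetical, and the real function may validate its input differently.

```python
import pandas as pd


def split_barcodes_sketch(obs_names, separator="-"):
    """Hypothetical re-implementation for illustration only."""
    names = pd.Series(obs_names, dtype=str)
    # Split once from the right so only the trailing library ID is removed.
    parts = names.str.rsplit(separator, n=1, expand=True)
    return pd.DataFrame(
        {
            "barcode": parts[0].to_numpy(),
            "library_id": parts[1].astype(int).to_numpy(),
        },
        index=pd.Index(obs_names),
    )


split = split_barcodes_sketch(pd.Index(["ATGC-1", "GGCA-2"]))
```

Keeping the original barcodes as the index is what lets the result be glued back onto `adata.obs` column-wise.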
- genetools.helpers.get_off_diagonal_values(arr)
Get off-diagonal values of a numpy 2d array as a flattened 1d array.
- Parameters
arr (numpy.ndarray) – input numpy 2d array
- Returns
flattened 1d array of non-diagonal values only
- Return type
numpy.ndarray
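One way to obtain the same kind of result is a boolean mask that excludes the main diagonal. The sketch below is an assumption about the behavior, not the genetools implementation, and `off_diagonal_sketch` is a hypothetical name; the order of the returned values in the real function may differ.

```python
import numpy as np


def off_diagonal_sketch(arr):
    """Illustrative equivalent using a boolean mask."""
    arr = np.asarray(arr)
    # True everywhere except on the main diagonal.
    mask = ~np.eye(arr.shape[0], arr.shape[1], dtype=bool)
    # Boolean indexing flattens the selection in row-major order.
    return arr[mask]


values = off_diagonal_sketch(np.array([[1, 2], [3, 4]]))
```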
- genetools.helpers.horizontal_concat(df_left, df_right)
Concatenate df_right horizontally to df_left, with no checks for whether the indexes match, but confirming final shape.
- Parameters
df_left (pandas.DataFrame or pandas.Series) – Left data
df_right (pandas.DataFrame or pandas.Series) – Right data
- Returns
Copied dataframe with df_right’s columns glued onto the right side of df_left’s columns
- Return type
pandas.DataFrame
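A concatenate-then-verify pattern like the one described could look as follows. This is a sketch under the assumption that the shape check compares against the left row count and the summed column counts; `horizontal_concat_sketch` is a hypothetical name and the genetools source may check differently.

```python
import pandas as pd


def horizontal_concat_sketch(df_left, df_right):
    """Illustrative only: concatenate column-wise, then verify the shape."""
    left = pd.DataFrame(df_left)
    right = pd.DataFrame(df_right)
    combined = pd.concat([left, right], axis=1)
    expected_shape = (left.shape[0], left.shape[1] + right.shape[1])
    # pd.concat aligns on the index, so labels present only in df_right
    # add rows and are caught by this check.
    if combined.shape != expected_shape:
        raise ValueError(f"Unexpected shape {combined.shape}, expected {expected_shape}")
    return combined


left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"b": [3, 4]}, index=["x", "y"])
combined = horizontal_concat_sketch(left, right)
```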
- genetools.helpers.make_slurm_command(script, job_name, log_path, env=None, options={}, job_group_name='', wrap_script=True)
Generate slurm sbatch command. Should be pipe-able straight to bash.
- Automatic log filenames will take the format:
{{ log_path }}/{{ job_group_name (optional) }}/{{ job_name }}.out for stdout
{{ log_path }}/{{ job_group_name (optional) }}/{{ job_name }}.err for stderr
You can override automatic log filenames by manually supplying “output” and “error” values in the options dict.
- Parameters
script (str) – path to an executable script, or inline script (if wrap_script is True)
job_name (str) – job name, used for naming log files
log_path (str) – destination for log files.
env (dict, optional) – any environment variables to pass to script, defaults to None
options (dict, optional) – any CLI options for sbatch, defaults to {}
job_group_name (str, optional) – optional group name for this job and related jobs, used for naming log files, defaults to “”
wrap_script (bool, optional) – whether the script is inline as opposed to a file on disk, defaults to True
- Returns
an sbatch command
- Return type
str
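The log-naming rule and the override through the options dict can be illustrated with a short sketch. Everything here is an assumption made for illustration: `slurm_command_sketch` is a hypothetical name, the env and wrap_script parameters are left out, and the exact command string that genetools produces may differ. Only the standard sbatch flags `--job-name`, `--output`, `--error`, and `--wrap` are used.

```python
import shlex


def slurm_command_sketch(script, job_name, log_path, options=None, job_group_name=""):
    """Hypothetical sketch of the log-naming rule; not the genetools source."""
    # Skip the group directory when no job_group_name is given.
    log_dir = "/".join(part for part in (log_path, job_group_name) if part)
    sbatch_options = {
        "job-name": job_name,
        "output": f"{log_dir}/{job_name}.out",
        "error": f"{log_dir}/{job_name}.err",
    }
    # Caller-supplied "output" / "error" values replace the automatic names.
    sbatch_options.update(options or {})
    flags = " ".join(
        f"--{key}={shlex.quote(str(value))}" for key, value in sbatch_options.items()
    )
    # --wrap passes an inline script to sbatch.
    return f"sbatch {flags} --wrap={shlex.quote(script)}"


command = slurm_command_sketch("echo hello", "job1", "logs", job_group_name="batch")
overridden = slurm_command_sketch("echo hi", "j", "logs", options={"output": "custom.log"})
```

Using `options=None` instead of a mutable `{}` default is a choice of this sketch, not a statement about the library.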
- genetools.helpers.merge_into_left(left, right, **kwargs)
Defensively merge right series or dataframe into left by index, preserving left’s index exactly. right data will be reordered to match left index.
- Parameters
left (pandas.DataFrame or pandas.Series) – left data whose index will be preserved
right (pandas.DataFrame or pandas.Series) – right data which will be reordered based on left index.
**kwargs – passed to pandas.merge
- Returns
left-merged DataFrame with left’s index
- Return type
pandas.DataFrame
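An index-preserving left merge of this kind can be sketched with `pandas.merge`. The function name `merge_into_left_sketch` and the specific check are assumptions made for illustration; the genetools implementation may be stricter.

```python
import pandas as pd


def merge_into_left_sketch(left, right, **kwargs):
    """Illustrative sketch of an index-preserving left merge."""
    merged = pd.merge(
        left, right, how="left", left_index=True, right_index=True, sort=False, **kwargs
    )
    # Defensive check: the result must keep left's index, in left's order.
    # Duplicate labels in right would multiply rows and fail here.
    if not merged.index.equals(left.index):
        raise ValueError("Merge changed the left index")
    return merged


left = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
right = pd.DataFrame({"b": [30, 10]}, index=["z", "x"])
merged = merge_into_left_sketch(left, right)
```

Rows of `left` with no match in `right` (here `"y"`) keep their place and receive missing values.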
- genetools.helpers.rename_duplicates(series, delim='-')
Rename duplicate values to be unique. ['a', 'a'] will become ['a', 'a-1'], for example.
- Parameters
series (pandas.Series) – series with values to rename
delim (str, optional) – delimiter before duplicate-number index, defaults to “-”
- Returns
series where original duplicates have been renamed to -1, -2, etc.
- Return type
pandas.Series
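The renaming rule can be shown on a plain Python list. This is a sketch of the rule as documented, with a hypothetical helper name `rename_duplicates_sketch`; the library operates on a pandas Series, and this sketch does not handle the case where a generated name such as 'a-1' already exists in the input.

```python
from collections import Counter


def rename_duplicates_sketch(values, delim="-"):
    """Append delim + occurrence number to every repeat of a value."""
    seen = Counter()
    renamed = []
    for value in values:
        count = seen[value]
        # The first occurrence is kept as-is; later ones get -1, -2, ...
        renamed.append(value if count == 0 else f"{value}{delim}{count}")
        seen[value] += 1
    return renamed


renamed = rename_duplicates_sketch(["a", "a", "b", "a"])
```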
- genetools.helpers.vertical_concat(df_top, df_bottom, reset_index=False)
Concatenate df_bottom vertically to df_top, with no checks for whether the columns match, but confirming final shape.
- Parameters
df_top (pandas.DataFrame) – Top data
df_bottom (pandas.DataFrame) – Bottom data
reset_index (bool, optional) – Reset index values after concat, defaults to False
- Returns
Copied dataframe with df_bottom’s rows glued onto the bottom of df_top’s rows
- Return type
pandas.DataFrame
genetools.palette module
- class genetools.palette.HueValueStyle(color: str, marker: Optional[str] = None, marker_size_scale_factor: float = 1.0, legend_size_scale_factor: float = 1.0, facecolors: Optional[str] = None, edgecolors: Optional[str] = None, linewidths: Optional[float] = None, zorder: int = 1, alpha: Optional[float] = None, hatch: Optional[str] = None)
Bases: object
Describes how to style a particular value (category) of a categorical hue column.
Use palettes mapping hue values to HueValueStyles to make plots with different marker shapes, transparencies, z-orders, etc. for different groups.
The plotting functions accept a hue_key, which identifies a dataframe column that contains hue values. They also accept a palette mapping each hue value to a HueValueStyle that defines not just the color to use for that hue value, but also other styles:
Scatterplot marker shape, primary color, face color, edge color, line width, and transparency.
Rectangle/barplot color and hatch pattern.
Size scale factor for scatterplot markers and legend entries. (The palette of HueValueStyles is defined separately from choosing marker size, and can be plotted at any selected base marker size.)
Here’s an example of assigning a custom HueValueStyle to a hue value in a color palette. This defines a custom unfilled shape, a custom z-order, and more:
    palette = {
        "group_A": genetools.palette.HueValueStyle(
            color=sns.color_palette("bright")[0],
            edgecolors=sns.color_palette("bright")[0],
            facecolors="none",
            marker="^",
            marker_size_scale_factor=1.5,
            linewidths=1.5,
            zorder=10,
        ),
        ...
    }
For face and edge colors, None is the default value; to disable them, set to string 'none'.
- apply_defaults(defaults: genetools.palette.HueValueStyle)
Returns a new HueValueStyle in which any missing values of this style are filled with the values from another HueValueStyle.
Use case: supply global style defaults for an entire scatterplot, then override with customizations in any individual hue value style.
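The fill-in-the-blanks behavior can be sketched with a small dataclass. `StyleSketch` is a hypothetical three-field stand-in for HueValueStyle, and the sketch assumes that "missing" means a field is None; how the real class treats fields with non-None defaults is not covered here.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class StyleSketch:
    """Hypothetical stand-in for HueValueStyle with only three fields."""

    color: str
    marker: Optional[str] = None
    alpha: Optional[float] = None

    def apply_defaults(self, defaults: "StyleSketch") -> "StyleSketch":
        # Take the value from `defaults` for every field that is None here.
        missing = {
            f.name: getattr(defaults, f.name)
            for f in fields(self)
            if getattr(self, f.name) is None
        }
        # Return a new object and leave `self` untouched.
        return replace(self, **missing)


global_defaults = StyleSketch(color="black", marker="o", alpha=0.5)
custom = StyleSketch(color="red", marker="^")
resolved = custom.apply_defaults(global_defaults)
```

Values set on the individual style (`color`, `marker`) win, and only the unset `alpha` is taken from the global defaults.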
- classmethod from_color(s)
Construct from color string only; keep all other marker parameters set to defaults. If already a HueValueStyle, pass through without modification.
- static huestyles_to_colors_dict(d: dict) -> dict
Cast any HueValueStyle values in dict to be color strings.
- render_scatter_continuous_props(marker_size=None)
Returns kwargs to pass to ax.scatter() to apply this style, in the context of continuous cmap scatterplots.
genetools.plots module
- genetools.plots.add_sample_size_to_labels(labels: list, data: pandas.core.frame.DataFrame, hue_key: str) -> list
Add sample size to tick labels on any plot with categorical groups.
Sample size for each label is extracted from the hue_key column of dataframe data.
Pairs well with genetools.plots.wrap_tick_labels(ax).
Example usage:
    ax.set_xticklabels(
        genetools.plots.add_sample_size_to_labels(ax.get_xticklabels(), df, "Group")
    )
- Parameters
labels (list) – list of tick labels corresponding to groups in data[hue_key]
data (pd.DataFrame) – dataset with categorical groups
hue_key (str) – column name specifying categorical groups in dataset data
- Returns
modified tick labels with group sample sizes attached
- Return type
list
- genetools.plots.savefig(fig: matplotlib.figure.Figure, *args, **kwargs)
Save figure with smart defaults:
Tight bounding box – necessary for legends outside of figure
Deterministic PDF output by fixing SOURCE_DATE_EPOCH to Jan 1, 2000
Editable text objects when outputting a vector PDF
Example usage: genetools.plots.savefig(fig, "my_plot.png", dpi=300).
Any positional or keyword arguments are passed to matplotlib.pyplot.savefig.
- Parameters
fig (matplotlib.figure.Figure) – Figure to save.
- genetools.plots.scatterplot(data, x_axis_key, y_axis_key, hue_key, continuous_hue=False, continuous_cmap='viridis', discrete_palette: Optional[Union[Dict[str, Union[genetools.palette.HueValueStyle, str]], List[Union[genetools.palette.HueValueStyle, str]]]] = None, ax: Optional[matplotlib.axes._axes.Axes] = None, figsize=(8, 8), marker_size=25, alpha=1.0, na_color='lightgray', marker='o', marker_edge_color='none', enable_legend=True, legend_hues=None, legend_title=None, sort_legend_hues=True, autoscale=True, equal_aspect_ratio=False, plotnonfinite=False, label_key=None, label_z_order=100, label_color='k', label_alpha=0.8, label_size=15, remove_x_ticks=False, remove_y_ticks=False, tight_layout=True, despine=True, **kwargs) -> Tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]
Scatterplot colored by a discrete or continuous “hue” grouping variable.
For discrete hues, pass continuous_hue=False and a dictionary of colors and/or HueValueStyle objects in discrete_palette.
Figure size will grow beyond the figsize parameter setting, because the legend is pulled out of figure. So you must use fig.savefig('filename', bbox_inches='tight'). This is provided automatically by genetools.plots.savefig(fig, 'filename').
If using with scanpy, to get umap data from adata.obsm into adata.obs, try:
    data = adata.obs.assign(umap_1=adata.obsm["X_umap"][:, 0], umap_2=adata.obsm["X_umap"][:, 1])
- Parameters
data (pandas.DataFrame) – Input data, e.g. anndata.obs
x_axis_key (str) – Column name to plot on X axis
y_axis_key (str) – Column name to plot on Y axis
hue_key (str) – Column name with hue groups that will be used to color points
continuous_hue (bool, optional) – Whether the hue column takes continuous or discrete/categorical values, defaults to False.
continuous_cmap (str, optional) – Colormap to use for plotting continuous hue grouping variable, defaults to “viridis”
discrete_palette (Union[Dict[str, Union[HueValueStyle, str]], List[Union[HueValueStyle, str]]], optional) – Palette of colors and/or HueValueStyle objects to use for plotting discrete/categorical hue groups, defaults to None. Supply a matplotlib palette name, list of colors, or dict mapping hue values to colors or to HueValueStyle objects (or a mix of the two).
ax (matplotlib.axes.Axes, optional) – Existing matplotlib Axes to plot on, defaults to None
figsize (tuple, optional) – Size of figure to generate if no existing ax was provided, defaults to (8, 8)
marker_size (int, optional) – Base marker size. May be scaled by individual HueValueStyles. Defaults to 25
alpha (float, optional) – Default point transparency, unless overridden by a HueValueStyle, defaults to 1.0
na_color (str, optional) – Fallback color to use for discrete hue categories that do not have an assigned style in discrete_palette, defaults to “lightgray”
marker (str, optional) – Default marker style, unless overridden by a HueValueStyle, defaults to “o”. For plots with many points, try “.” instead.
marker_edge_color (str, optional) – Default marker edge color, unless overridden by a HueValueStyle, defaults to “none” (no edge border drawn). Another common choice is “face”, so the edge color matches the face color.
enable_legend (bool, optional) – Whether legend (or colorbar if continuous_hue) should be drawn. Defaults to True. May want to disable if plotting multiple subplots/panels.
legend_hues (list, optional) – Optionally override the list of hue values to include in legend, e.g. to add any hue values missing from the plotted subset of data; defaults to None
legend_title (str, optional) – Specify a title for the legend. Defaults to None, in which case the hue_key is used.
sort_legend_hues (bool, optional) – Enable sorting of legend hues, defaults to True
autoscale (bool, optional) – Enable automatic zoom in, defaults to True
equal_aspect_ratio (bool, optional) – Plot with equal aspect ratio, defaults to False
plotnonfinite (bool, optional) – For continuous hues, whether to plot points with inf or nan value, defaults to False
label_key (str, optional) – Optional column name specifying group text labels to superimpose on plot, defaults to None
label_z_order (int, optional) – Z-index for superimposed group text labels, defaults to 100
label_color (str, optional) – Color for superimposed group text labels, defaults to “k”
label_alpha (float, optional) – Opacity for superimposed group text labels, defaults to 0.8
label_size (int, optional) – Text size of superimposed group labels, defaults to 15
remove_x_ticks (bool, optional) – Remove X axis tick marks and labels, defaults to False
remove_y_ticks (bool, optional) – Remove Y axis tick marks and labels, defaults to False
tight_layout (bool, optional) – whether to format the figure with tight_layout, defaults to True
despine (bool, optional) – whether to despine (remove the top and right figure borders), defaults to True
- Raises
ValueError – Must specify correct number of colors if supplying a custom palette
- Returns
Matplotlib Figure and Axes
- Return type
Tuple[matplotlib.figure.Figure, matplotlib.axes.Axes]
- genetools.plots.stacked_bar_plot(data, index_key, hue_key, value_key=None, ax: Optional[matplotlib.axes._axes.Axes] = None, figsize=(8, 8), normalize=True, vertical=False, palette: Optional[Union[Dict[str, Union[genetools.palette.HueValueStyle, str]], List[Union[genetools.palette.HueValueStyle, str]]]] = None, na_color='lightgray', hue_order=None, axis_label='Frequency', enable_legend=True, legend_title=None) -> Tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]
Stacked bar chart.
The index_key groups form the bars, and the hue_key groups subdivide the bars. The value_key determines the subdivision sizes, and is computed automatically if not provided.
See https://observablehq.com/@d3/stacked-normalized-horizontal-bar for inspiration and colors.
Figure size will grow beyond the figsize parameter setting, because the legend is pulled out of figure. So you must use fig.savefig('filename', bbox_inches='tight'). This is provided automatically by genetools.plots.savefig(fig, 'filename').
- Parameters
data (pandas.DataFrame) – Plot data containing at minimum the columns identified by index_key, hue_key, and optionally value_key.
index_key (str) – Column name defining the rows.
hue_key (str) – Column name defining the horizontal bar categories.
value_key (str, optional) – Column name defining the bar sizes. If not supplied, this method will calculate group frequencies automatically.
ax (matplotlib.axes.Axes, optional) – Existing matplotlib Axes to plot on, defaults to None
figsize (tuple, optional) – Size of figure to generate if no existing ax was provided, defaults to (8, 8)
normalize (bool, optional) – Normalize each row’s frequencies to sum to 1, defaults to True
vertical (bool, optional) – Plot stacked bars vertically, defaults to False (horizontal)
palette (Union[Dict[str, Union[HueValueStyle, str]], List[Union[HueValueStyle, str]]], optional) – Palette of colors and/or HueValueStyle objects to style the bars corresponding to each hue value, defaults to None (in which case default palette used). Supply a matplotlib palette name, list of colors, or dict mapping hue values to colors or to HueValueStyle objects (or a mix of the two).
na_color (str, optional) – Fallback color to use for hue values that do not have an assigned style in palette, defaults to “lightgray”
hue_order (list, optional) – Optionally specify order of bar subdivisions. This order is applied from the beginning (bottom or left) to the end (top or right) of the bar. Defaults to None
axis_label (str, optional) – Label for the axis along which the frequency values are drawn, defaults to “Frequency”
enable_legend (bool, optional) – Whether legend should be drawn. Defaults to True. May want to disable if plotting multiple subplots/panels.
legend_title (str, optional) – Specify a title for the legend. Defaults to None, in which case the hue_key is used.
- Raises
ValueError – Must specify correct number of colors if supplying a custom palette
- Returns
Matplotlib Figure and Axes
- Return type
(matplotlib.figure.Figure, matplotlib.axes.Axes)
- genetools.plots.wrap_tick_labels(ax: matplotlib.axes._axes.Axes, wrap_x_axis=True, wrap_y_axis=True, wrap_amount=20) -> matplotlib.axes._axes.Axes
Add text wrapping to tick labels on x and/or y axes on any plot.
May override existing line breaks in tick labels.
- Parameters
ax (matplotlib.axes.Axes) – existing plot with tick labels to be wrapped
wrap_x_axis (bool, optional) – whether to wrap x-axis tick labels, defaults to True
wrap_y_axis (bool, optional) – whether to wrap y-axis tick labels, defaults to True
wrap_amount (int, optional) – length of each line of text, defaults to 20
- Returns
plot with modified tick labels
- Return type
matplotlib.axes.Axes
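The wrapping step itself can be illustrated with the standard library's `textwrap` on plain label strings. This is a sketch that assumes wrap_amount is a maximum line width in characters; `wrap_labels_sketch` is a hypothetical name, and whether genetools uses `textwrap` internally is not confirmed by this page. Note that `textwrap.fill` replaces existing newlines with spaces by default, which is one way pre-existing line breaks can be overridden.

```python
import textwrap


def wrap_labels_sketch(labels, wrap_amount=20):
    """Wrap each label string to lines of at most wrap_amount characters."""
    return [textwrap.fill(label, width=wrap_amount) for label in labels]


wrapped = wrap_labels_sketch(["a fairly long categorical group name", "short"])
```

With a matplotlib Axes, the label strings would typically come from `label.get_text()` for each entry of `ax.get_xticklabels()`.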
genetools.scanpy_helpers module
Scanpy common recipes.
- genetools.scanpy_helpers.clr_normalize(adata, axis=0, inplace=True)
Centered log ratio transformation for CITE-seq data, normalizing:
each protein’s count vectors across cells (axis=0, normalizing each column of the cells x proteins matrix, default)
or the antibody count vector for each cell (axis=1, normalizing each row of the cells x proteins matrix)
This is a wrapper of genetools.stats.clr_normalize(matrix, axis).
- Parameters
adata (anndata.AnnData) – Protein counts anndata
axis (int, optional) – normalize each antibody independently (axis=0) or normalize each cell independently (axis=1), defaults to 0
inplace (bool, optional) – whether to modify input anndata, defaults to True
- Returns
Transformed anndata
- Return type
anndata.AnnData
- genetools.scanpy_helpers.find_all_markers(adata, cluster_key, pval_cutoff=0.05, log2fc_min=0.25, key_added='rank_genes_groups', test='wilcoxon', use_raw=True)
Find differentially expressed marker genes for each group of cells.
- Parameters
adata (anndata.AnnData) – Scanpy/anndata object
cluster_key (str) – The adata.obs column name that defines groups for finding distinguishing marker genes.
pval_cutoff (float, optional) – Only return markers that have an adjusted p-value below this threshold. Defaults to 0.05. Set to None to disable filtering.
log2fc_min (float, optional) – Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. Defaults to 0.25. Set to None to disable filtering.
key_added (str, optional) – The key in adata.uns information is saved to, defaults to “rank_genes_groups”
test (str, optional) – Statistical test to use, defaults to “wilcoxon” (Wilcoxon rank-sum test), see scanpy.tl.rank_genes_groups documentation for other options
use_raw (bool, optional) – Use raw attribute of adata if present, defaults to True
- Returns
Dataframe with ranked marker genes for each cluster. Important columns: gene, rank, [cluster_key] (same as argument value)
- Return type
pandas.DataFrame
genetools.stats module
- genetools.stats.accept_series(func)
Decorator to seamlessly accept a pandas Series in place of a numpy array, returning the result with the original Series index.
- genetools.stats.clr_normalize(mat, axis=0)
Centered log ratio transformation for CITE-seq data, normalizing:
each protein’s count vectors across cells (axis=0, normalizing each column of the cells x proteins matrix, default)
or the antibody count vector for each cell (axis=1, normalizing each row of the cells x proteins matrix)
To use with anndata:
genetools.scanpy_helpers.clr_normalize(adata, axis)
Notes:
Output will be densified.
We use Seurat’s modified CLR implementation (https://github.com/satijalab/seurat/issues/2624) to handle pseudocounts:
    log1p(x = x / (exp(x = sum(log1p(x = x[x > 0]), na.rm = TRUE) / length(x = x))))
This is almost the same as log(x) - 1/D * log(product of x's), which is the same as log(x) - log([product of x's] ^ (1/D)), where D = len(x).
The general definition is:
    import numpy as np
    from scipy.stats.mstats import gmean
    clr = np.log(x) - np.log(gmean(x))
But geometric mean only applies to positive numbers (otherwise the inner product will be 0). So you want to use pseudocounts or drop 0 counts. That’s what Seurat’s modification does.
See also https://github.com/theislab/scanpy/pull/1117 for other approaches.
Do you run this normalization cell-wise or gene-wise (i.e. protein-wise)? See discussion here:
Unfortunately there is not a single answer. In some cases, cell-based normalization fails. This is because cell-normalization makes an assumption that the total ADT counts should be constant across cells. That can become a significant issue if you have cell populations in your data, but did not add protein markers for them (this is also an issue for scRNA-seq, but is significantly mitigated because at least you measure many genes). However, gene-based normalization can fail when there is significant heterogeneity in sequencing depth, or cell size. The optimal strategy depends on the AB panel, and heterogeneity of your sample.
In this implementation, protein-wise is axis=0 and cell-wise is axis=1. Seurat’s default is protein-wise, i.e. axis=0.
The default is “protein-wise” (axis=0), i.e. normalize each protein independently.
- Parameters
mat (numpy array or scipy sparse matrix) – Counts matrix (cells x proteins)
axis (int, optional) – normalize each antibody independently (axis=0) or normalize each cell independently (axis=1), defaults to 0
- Returns
Transformed counts matrix
- Return type
numpy array
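The Seurat-style expression quoted in the notes can be translated to NumPy as a sketch. The names `clr_vector_sketch` and `clr_normalize_sketch` are hypothetical, the sketch handles dense arrays only, and it is a direct transcription of the R formula rather than the genetools source.

```python
import numpy as np


def clr_vector_sketch(x):
    """Seurat-style CLR of one count vector, following the R expression."""
    x = np.asarray(x, dtype=float)
    # Sum log1p over positive entries only, but divide by the full length.
    log_geo_mean = np.log1p(x[x > 0]).sum() / len(x)
    return np.log1p(x / np.exp(log_geo_mean))


def clr_normalize_sketch(mat, axis=0):
    """Transform each column (axis=0) or each row (axis=1) independently."""
    return np.apply_along_axis(clr_vector_sketch, axis, np.asarray(mat, dtype=float))


counts = np.array([[0.0, 10.0], [5.0, 20.0], [15.0, 30.0]])  # cells x proteins
protein_wise = clr_normalize_sketch(counts, axis=0)
cell_wise = clr_normalize_sketch(counts, axis=1)
```

Because of the `log1p` wrappers, zero counts stay at zero after the transformation instead of producing negative infinity.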
- genetools.stats.coclustering(cluster_ids_1, cluster_ids_2)
Compute coclustering percentage between two sets of cluster IDs for the same cells: Of the cell pairs clustered together by either or both methods, what percentage are clustered together by both methods?
(The clusters are allowed to have different names across methods, and don’t necessarily need to be ints.)
- Parameters
cluster_ids_1 (numpy array-like) – One set of cluster IDs.
cluster_ids_2 (numpy array-like) – Another set of cluster IDs.
- Returns
Percentage of cell pairs clustered together by at least one method that are clustered together by both methods.
- Return type
float
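The definition above amounts to a Jaccard index over co-clustered cell pairs, which can be sketched by brute force. `coclustering_sketch` is a hypothetical name; this version is quadratic in the number of cells and returns a fraction between 0 and 1, whereas the scale and the algorithm used by genetools are not stated on this page.

```python
from itertools import combinations


def coclustering_sketch(cluster_ids_1, cluster_ids_2):
    """Brute-force illustration over all cell pairs."""
    together_in_both = 0
    together_in_either = 0
    for i, j in combinations(range(len(cluster_ids_1)), 2):
        same_1 = cluster_ids_1[i] == cluster_ids_1[j]
        same_2 = cluster_ids_2[i] == cluster_ids_2[j]
        together_in_both += bool(same_1 and same_2)
        together_in_either += bool(same_1 or same_2)
    if together_in_either == 0:
        # No pair is co-clustered by either method, so the ratio is undefined.
        return float("nan")
    return together_in_both / together_in_either


score = coclustering_sketch([0, 0, 1, 1], ["a", "a", "a", "b"])
```

Only equality within each labeling is compared, so cluster names never need to match across the two methods.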
- genetools.stats.intersect_marker_genes(reference_data, query_data, low_confidence_threshold=0.035, low_confidence_suffix='?')
Map cluster marker genes against reference lists to find top hits.
- query_data and reference_data should both be dictionaries where:
keys are cluster names or IDs
values are lists of genes associated with that cluster
Or if you have a dataframe where each row contains a cluster ID and a gene name, you can convert to dict with df.groupby('cluster')['gene'].apply(list).to_dict()
Usage with an anndata/scanpy object on groups defined by adata.obs['louvain']:
    # find marker genes for all clusters
    cluster_markers_df = genetools.scanpy_helpers.find_all_markers(adata, cluster_key='louvain')
    # convert to dict of clusters -> gene lists mapping
    cluster_marker_lists = cluster_markers_df.groupby('louvain')['gene'].apply(list).to_dict()
    # intersect with known marker gene lists
    results, label_map, low_confidence_percentage = genetools.stats.intersect_marker_genes(reference_marker_lists, cluster_marker_lists)
    # rename clusters in your anndata/scanpy object
    adata.obs['louvain_annotated'] = adata.obs['louvain'].copy().cat.rename_categories(label_map)
- Behavior:
Intersection scores are normalized to marker gene list sizes.
Resulting duplicate cluster names are renamed, ensuring that N original query clusters will map to N renamed clusters.
- Parameters
reference_data (dict) – reference marker gene lists
query_data (dict) – query marker gene lists
low_confidence_threshold (float, optional) – Minimal difference between top and subsequent hits for a confident call, defaults to 0.035
low_confidence_suffix (str, optional) – Suffix for low-confidence cluster renamings, defaults to “?”
- Returns
dataframe with cluster mapping details, a dictionary for renaming query cluster names, and percentage of low-confidence calls.
- Return type
(pandas.DataFrame, dict, float) tuple
- genetools.stats.normalize_columns(df)
Make columns sum to 1.
- Parameters
df (pandas.DataFrame) – dataframe
- Returns
column-normalized dataframe
- Return type
pandas.DataFrame
- genetools.stats.normalize_rows(df)
Make rows sum to 1.
- Parameters
df (pandas.DataFrame) – dataframe
- Returns
row-normalized dataframe
- Return type
pandas.DataFrame
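Both normalize_columns and normalize_rows describe a division by a sum along one axis. In plain pandas this can be sketched as follows; it is an illustration of the arithmetic, not the genetools source, and it does not address rows or columns that sum to zero.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 6.0]}, index=["r1", "r2"])

# Divide every row by its own total so that each row sums to 1.
row_normalized = df.div(df.sum(axis=1), axis=0)

# Divide every column by its own total so that each column sums to 1.
col_normalized = df.div(df.sum(axis=0), axis=1)
```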
Module contents
Top-level package for genetools.