genetools package#
Submodules#
genetools.arrays module#
- genetools.arrays.convert_matrix_to_one_element_per_row(arr: ndarray) DataFrame [source]#
Record each element of a 2D matrix as one row in a dataframe, storing the row and column IDs along with the value.
- genetools.arrays.get_top_n(df: DataFrame, col: str, n: int) DataFrame [source]#
Get the top n rows of dataframe df, ranked by column col.
- genetools.arrays.get_top_n_percent(df: DataFrame, col: str, fraction: float) DataFrame [source]#
Get the top fraction of rows of dataframe df, ranked by column col.
- genetools.arrays.get_trim_both_sides_mask(a: ndarray | DataFrame, proportiontocut: float, axis: int = 0) ndarray [source]#
Returns a mask that applies a consistent trim-both-sides, learned on one array.
Suppose you have a data array and a weights array. You want to trimboth() the data array but keep the element weights aligned.
Solution:
trimming_mask = genetools.arrays.get_trim_both_sides_mask(data, proportiontocut=0.1)
return data[trimming_mask], weights[trimming_mask]
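To make the idea of a reusable trimming mask concrete, here is a minimal 1D sketch in NumPy. It is not the library's implementation: the helper name `trim_both_sides_mask_1d` is hypothetical, and it assumes the same convention as `scipy.stats.trimboth`, cutting `int(proportiontocut * n)` elements from each end of the sorted data.

```python
import numpy as np


def trim_both_sides_mask_1d(values, proportiontocut):
    """Hypothetical helper: boolean mask keeping the central part of a 1D array."""
    values = np.asarray(values)
    n = values.shape[0]
    # Assumption: same rounding as scipy.stats.trimboth (truncate toward zero).
    n_cut = int(proportiontocut * n)
    order = np.argsort(values, kind="stable")
    mask = np.zeros(n, dtype=bool)
    # Keep everything except the n_cut smallest and n_cut largest entries.
    mask[order[n_cut : n - n_cut]] = True
    return mask


data = np.array([5.0, 1.0, 9.0, 3.0, 7.0, 100.0, -50.0, 4.0, 6.0, 2.0])
weights = np.arange(10)

mask = trim_both_sides_mask_1d(data, proportiontocut=0.1)
# The same mask trims both arrays, so data and weights stay aligned.
trimmed_data, trimmed_weights = data[mask], weights[mask]
```

Because the mask is computed once from the data, any number of parallel arrays can be trimmed consistently with it.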
- genetools.arrays.groupby_apply_weighted_value_counts(df: DataFrame, *groupby_args, category_column_name: str, weight_column_name: str, normalize: bool = False, **groupby_kwargs) Series [source]#
This is to be used instead of:
df.groupby(["columnA", "columnB"], observed=True).apply(
    lambda grp: malid.external.genetools_arrays.weighted_value_counts(
        grp,
        "category_column",
        "weight_column",
        normalize=True,
    )
)
The preferred call is:
genetools.arrays.groupby_apply_weighted_value_counts(
    df,
    ["columnA", "columnB"],
    observed=True,
    category_column_name="category_column",
    weight_column_name="weight_column",
    normalize=True,
)
It’s the same behavior with extra checks to make sure the output format has the expected shape: it should be a Series, like in standard groupby-value_counts.
(Sometimes Pandas returns a DataFrame instead of a Series.)
- genetools.arrays.make_consensus_sequence(sequences: ndarray | Series | List[str], frequencies: ndarray | Series | List[int]) str [source]#
Get weighted mode for each character across a set of equal-length input strings.
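The per-position weighted mode can be illustrated with a small standard-library sketch. This is an illustration of the technique only, not the library's code: `consensus_sequence_sketch` is a hypothetical name, and ties are broken in favor of the character seen first, which is an assumption.

```python
from collections import defaultdict


def consensus_sequence_sketch(sequences, frequencies):
    """Hypothetical helper: weighted mode per character position."""
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("All sequences must have the same length.")
    consensus = []
    # zip(*sequences) yields the characters found at each position.
    for position_chars in zip(*sequences):
        totals = defaultdict(float)
        for char, weight in zip(position_chars, frequencies):
            totals[char] += weight
        # Assumption: on ties, the first character encountered wins.
        consensus.append(max(totals, key=totals.get))
    return "".join(consensus)


heavy_first = consensus_sequence_sketch(["ACGT", "ACGA", "TCGA"], [5, 2, 2])
light_first = consensus_sequence_sketch(["ACGT", "ACGA", "TCGA"], [1, 2, 2])
```

With weights `[5, 2, 2]` the first sequence dominates every position; with `[1, 2, 2]` the last position flips to the character shared by the other two sequences.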
- genetools.arrays.make_consensus_vector(matrix: ndarray, frequencies: ndarray | List[int] | Series) ndarray [source]#
Get weighted mode for each position across a set of equal-length vectors.
- genetools.arrays.make_dummy_variables_in_specific_order(values: Series | List[str], expected_list: List[str], allow_missing_entries: bool) DataFrame [source]#
Create dummy variables in a defined order. All the values are confirmed to be in the “expected order” list. If an entry from the “expected order” list is not present, it is still included as a dummy variable with all 0s if allow_missing_entries is True; otherwise an error is raised.
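One way to get ordered dummy variables with pandas is to cast to a categorical dtype with explicit categories before calling `pd.get_dummies`. The sketch below is an assumption about how such a helper could work, not the library's implementation; `dummies_in_order_sketch` is a hypothetical name.

```python
import pandas as pd


def dummies_in_order_sketch(values, expected_list, allow_missing_entries):
    """Hypothetical helper: one-hot columns in the order given by expected_list."""
    values = pd.Series(values)
    observed = set(values)
    unexpected = observed - set(expected_list)
    if unexpected:
        raise ValueError(f"Unexpected values: {sorted(unexpected)}")
    missing = [entry for entry in expected_list if entry not in observed]
    if missing and not allow_missing_entries:
        raise ValueError(f"Missing expected entries: {missing}")
    # Explicit categories fix the column order and keep unused categories.
    as_categorical = values.astype(pd.CategoricalDtype(categories=expected_list))
    return pd.get_dummies(as_categorical).astype(int)


dummies = dummies_in_order_sketch(
    ["B", "A", "B"], expected_list=["A", "B", "C"], allow_missing_entries=True
)
```

Here column `C` is present but all zeros, because no input value equals `"C"`.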
- genetools.arrays.masked_argmax(masked_arr: MaskedArray, axis: int | None = None) ndarray | float | int [source]#
argmax on masked array. return nan for row/column (depending on axis setting) of all nans
- genetools.arrays.masked_argmin(masked_arr: MaskedArray, axis: int | None = None) ndarray | float | int [source]#
argmin on masked array. return nan for row/column (depending on axis setting) of all nans
- genetools.arrays.numeric_vectors_to_character_arrays(arr: ndarray) ndarray [source]#
Reverse operation of strings_to_numeric_vectors
- genetools.arrays.strings_to_character_arrays(strs: ndarray | List[str] | Series, validate_equal_lengths: bool = True) ndarray [source]#
Create character matrix by “viewing” strings as 1-character string arrays, then reshaping
- genetools.arrays.strings_to_numeric_vectors(strs: ndarray | List[str] | Series, validate_equal_lengths: bool = True) ndarray [source]#
Convert strings to numeric vectors (one entry per character)
- genetools.arrays.weighted_median(values: ndarray, weights: ndarray) int | float [source]#
Weighted median: factor in the weights when finding center of the array.
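As an illustration of what "factoring in the weights" means, the sketch below computes a lower weighted median: the smallest value at which the cumulative weight reaches half of the total. The function name is hypothetical, and the library may interpolate or break ties differently.

```python
import numpy as np


def weighted_median_sketch(values, weights):
    """Hypothetical helper: smallest value whose cumulative weight reaches 50%."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cumulative = np.cumsum(weights[order])
    # First sorted position where cumulative weight >= half the total weight.
    idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return values[order][idx]


skewed = weighted_median_sketch([1, 2, 3, 4], [1, 1, 1, 5])
uniform = weighted_median_sketch([1, 2, 3, 4], [1, 1, 1, 1])
```

In the first call the heavy weight on `4` pulls the median to `4`; with equal weights this sketch returns the lower of the two middle values.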
- genetools.arrays.weighted_mode(arr: list | ndarray | Series, weights: List[int] | ndarray | Series) Any [source]#
Get weighted mode (most common value) in array. Faster than sklearn.utils.extmath.weighted_mode but does not support axis vectorization.
- genetools.arrays.weighted_value_counts(df: DataFrame, category_column_name: str, weight_column_name: str, normalize: bool = False, **groupby_kwargs) Series [source]#
Weighted value counts. Basically a sum of weight_column_name within each category_column_name group.
If normalize is True (default False), the returned value counts will sum to 1.
The optional groupby_kwargs are passed to the groupby procedure. For example, if category_column_name is a Categorical, passing observed=True will make sure that any unused categories are not included in the value counts.
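The description above (a sum of weights within each category, optionally normalized) can be sketched in a few lines of pandas. This is an illustrative sketch, not the library's code; the function name and the column names `isotype` and `num_reads` are hypothetical.

```python
import pandas as pd


def weighted_value_counts_sketch(
    df, category_column_name, weight_column_name, normalize=False, **groupby_kwargs
):
    """Hypothetical helper: sum the weights within each category."""
    counts = df.groupby(category_column_name, **groupby_kwargs)[
        weight_column_name
    ].sum()
    if normalize:
        # Rescale so that the returned counts sum to 1.
        counts = counts / counts.sum()
    return counts


df = pd.DataFrame(
    {"isotype": ["IgG", "IgA", "IgG", "IgM"], "num_reads": [10, 5, 30, 5]}
)
raw_counts = weighted_value_counts_sketch(df, "isotype", "num_reads")
normalized_counts = weighted_value_counts_sketch(
    df, "isotype", "num_reads", normalize=True
)
```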
genetools.helpers module#
Pandas/Numpy common recipes.
- genetools.helpers.apply_pipeline_transforms(fit_pipeline: Pipeline, data: ndarray | DataFrame) ndarray | DataFrame [source]#
apply all transformations in an already-fit sklearn Pipeline, except the final estimator
only use for pipelines that have an estimator (e.g. classifier) at the last step. in these cases, we cannot call pipeline.transform(). use this method instead to apply all steps except the final classifier/regressor/estimator.
otherwise, if you have a pipeline of all transformers, you should just call .transform() on the pipeline.
- genetools.helpers.barcode_split(obs_names, separator='-', colname_barcode='barcode', colname_library='library_id')[source]#
Split single cell barcodes such as ATGC-1 into a barcode column with value “ATGC” and a library ID column with value 1.
Recommended usage with scanpy:
adata.obs = genetools.helpers.horizontal_concat(
    adata.obs, genetools.helpers.barcode_split(adata.obs_names)
)
- Parameters:
obs_names (pandas.Series or pandas.Index) – Cell barcodes with a library ID suffix.
separator (str, optional) – library ID separator, defaults to ‘-’
colname_barcode (str, optional) – output column name containing barcode without library ID suffix, defaults to ‘barcode’
colname_library (str, optional) – output column name containing library ID suffix as an int, defaults to ‘library_id’
- Returns:
Two-column dataframe containing barcode prefix and library ID suffix.
- Return type:
pandas.DataFrame
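The barcode splitting described above can be sketched with pandas string methods. This is a sketch under assumptions, not the library's implementation: `barcode_split_sketch` is a hypothetical name, it splits on the last occurrence of the separator, and it indexes the output by the original names.

```python
import pandas as pd


def barcode_split_sketch(
    obs_names, separator="-", colname_barcode="barcode", colname_library="library_id"
):
    """Hypothetical helper: split 'ATGC-1' into barcode 'ATGC' and library ID 1."""
    names = pd.Series(list(obs_names), index=list(obs_names))
    # Assumption: the library ID is whatever follows the last separator.
    parts = names.str.rsplit(separator, n=1, expand=True)
    parts.columns = [colname_barcode, colname_library]
    parts[colname_library] = parts[colname_library].astype(int)
    return parts


split = barcode_split_sketch(pd.Index(["ATGC-1", "GGCA-2"]))
```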
- genetools.helpers.get_off_diagonal_values(arr)[source]#
Get off-diagonal values of a numpy 2d array as a flattened 1d array.
- Parameters:
arr (numpy.ndarray) – input numpy 2d array
- Returns:
flattened 1d array of non-diagonal values only
- Return type:
numpy.ndarray
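Selecting off-diagonal entries is commonly done with a boolean mask built from `np.eye`. The following is a minimal sketch of that technique with a hypothetical function name; the library may implement it differently.

```python
import numpy as np


def off_diagonal_values_sketch(arr):
    """Hypothetical helper: flatten all entries that are not on the main diagonal."""
    arr = np.asarray(arr)
    # True everywhere except on the main diagonal.
    off_diagonal = ~np.eye(arr.shape[0], arr.shape[1], dtype=bool)
    return arr[off_diagonal]


matrix = np.arange(9).reshape(3, 3)
off_diag = off_diagonal_values_sketch(matrix)
```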
- genetools.helpers.horizontal_concat(df_left, df_right)[source]#
Concatenate df_right horizontally to df_left, with no checks for whether the indexes match, but confirming final shape.
- Parameters:
df_left (pandas.DataFrame or pandas.Series) – Left data
df_right (pandas.DataFrame or pandas.Series) – Right data
- Returns:
Copied dataframe with df_right’s columns glued onto the right side of df_left’s columns
- Return type:
pandas.DataFrame
- genetools.helpers.make_slurm_command(script, job_name, log_path, env=None, options={}, job_group_name='', wrap_script=True)[source]#
Generate slurm sbatch command. Should be pipe-able straight to bash.
- Automatic log filenames will take the format:
{{ log_path }}/{{ job_group_name (optional) }}/{{ job_name }}.out for stdout
{{ log_path }}/{{ job_group_name (optional) }}/{{ job_name }}.err for stderr
You can override automatic log filenames by manually supplying “output” and “error” values in the options dict.
- Parameters:
script (str) – path to an executable script, or inline script (if wrap_script is True)
job_name (str) – job name, used for naming log files
log_path (str) – destination for log files.
env (dict, optional) – any environment variables to pass to script, defaults to None
options (dict, optional) – any CLI options for sbatch, defaults to {}
job_group_name (str, optional) – optional group name for this job and related jobs, used for naming log files, defaults to “”
wrap_script (bool, optional) – whether the script is inline as opposed to a file on disk, defaults to True
- Returns:
an sbatch command
- Return type:
str
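The log-naming and override rules above can be illustrated with a small standard-library sketch. The exact command string produced by the library is not documented here, so the output format below is an assumption, the function name is hypothetical, and the `env` handling is omitted. The sbatch flags used (`--job-name`, `--output`, `--error`, `--time`, `--wrap`) are standard Slurm options.

```python
import posixpath
import shlex


def make_sbatch_command_sketch(
    script, job_name, log_path, options=None, job_group_name="", wrap_script=True
):
    """Hypothetical helper: build an sbatch command with automatic log filenames."""
    log_dir = posixpath.join(log_path, job_group_name) if job_group_name else log_path
    sbatch_options = {
        "job-name": job_name,
        "output": posixpath.join(log_dir, f"{job_name}.out"),
        "error": posixpath.join(log_dir, f"{job_name}.err"),
    }
    # User-supplied options win, so "output" and "error" can be overridden.
    sbatch_options.update(options or {})
    flags = " ".join(
        f"--{name}={shlex.quote(str(value))}" for name, value in sbatch_options.items()
    )
    # Inline scripts go through --wrap; otherwise the script path is passed directly.
    payload = f"--wrap={shlex.quote(script)}" if wrap_script else shlex.quote(script)
    return f"sbatch {flags} {payload}"


command = make_sbatch_command_sketch(
    "echo hello",
    job_name="job1",
    log_path="logs",
    job_group_name="groupA",
    options={"error": "custom/job1.err", "time": "01:00:00"},
)
```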
- genetools.helpers.merge_into_left(left, right, **kwargs)[source]#
Defensively merge right series or dataframe into left by index, preserving left’s index exactly. right data will be reordered to match left index.
- Parameters:
left (pandas.DataFrame or pandas.Series) – left data whose index will be preserved
right (pandas.DataFrame or pandas.Series) – right data which will be reordered based on left index.
**kwargs – passed to pandas.merge
- Returns:
left-merged DataFrame with left’s index
- Return type:
pandas.DataFrame
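A defensive index-based left merge could look roughly like the sketch below. It is not the library's implementation: the function name is hypothetical, and the choice of `validate="m:1"` plus a final index check are assumptions about what "defensive" might involve.

```python
import pandas as pd


def merge_into_left_sketch(left, right, **kwargs):
    """Hypothetical helper: left-merge on index and verify left's index survives."""
    merged = pd.merge(
        left,
        right,
        how="left",
        left_index=True,
        right_index=True,
        validate="m:1",  # assumption: right index must be unique, so no rows multiply
        **kwargs,
    )
    if not merged.index.equals(left.index):
        raise ValueError("Merge changed the left index.")
    return merged


left = pd.DataFrame({"x": [1, 2, 3]}, index=["c", "a", "b"])
right = pd.DataFrame({"y": [10, 20, 30]}, index=["a", "b", "c"])
merged = merge_into_left_sketch(left, right)
```

The right-hand rows are reordered to follow the left index, so `y` comes out as 30, 10, 20.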
- genetools.helpers.parallel_groupby_apply(df_grouped: DataFrameGroupBy, func: Callable, **kwargs) Series [source]#
Parallelize apply() on a pandas groupby object.
Each subprocess is given one group to process. This approach isn’t appropriate if your applied function is very fast but you have many, many groups. In that scenario, the parallelization of groups will simply introduce a lot of unnecessary overhead. Make sure to benchmark with and without parallelization. You may want to first split the full dataframe into big chunks containing many groups, then run groupby-apply on each chunk in parallel.
Also, transferring big groups to subprocesses can be slow. Again consider chunking the dataframe first.
Func cannot be a lambda, since lambda functions can’t be pickled for subprocesses.
Kwargs are passed to joblib.Parallel(...).
- genetools.helpers.rename_duplicates(series, delim='-')[source]#
Rename duplicate values to be unique.
['a', 'a'] will become ['a', 'a-1'], for example.
- Parameters:
series (pandas.Series) – series with values to rename
delim (str, optional) – delimiter before duplicate-number index, defaults to “-”
- Returns:
series where original duplicates have been renamed to -1, -2, etc.
- Return type:
pandas.Series
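The renaming rule can be sketched on a plain list with a running counter per value. This is an illustration only: the function name is hypothetical, and unlike a production implementation it does not guard against collisions with values that already look like `a-1`.

```python
def rename_duplicates_sketch(values, delim="-"):
    """Hypothetical helper: append -1, -2, ... to repeated values."""
    seen = {}
    renamed = []
    for value in values:
        count = seen.get(value, 0)
        # The first occurrence keeps its name; later ones get a numeric suffix.
        renamed.append(value if count == 0 else f"{value}{delim}{count}")
        seen[value] = count + 1
    return renamed


renamed = rename_duplicates_sketch(["a", "a", "b", "a"])
```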
- genetools.helpers.vertical_concat(df_top, df_bottom, reset_index=False)[source]#
Concatenate df_bottom vertically to df_top, with no checks for whether the columns match, but confirming final shape.
- Parameters:
df_top (pandas.DataFrame) – Top data
df_bottom (pandas.DataFrame) – Bottom data
reset_index (bool, optional) – Reset index values after concat, defaults to False
- Returns:
Copied dataframe with df_bottom’s rows glued onto the bottom of df_top’s rows
- Return type:
pandas.DataFrame
genetools.palette module#
- class genetools.palette.HueValueStyle(color: str, marker: str | None = None, marker_size_scale_factor: float = 1.0, legend_size_scale_factor: float = 1.0, facecolors: str | None = None, edgecolors: str | None = None, linewidths: float | None = None, zorder: int = 1, alpha: float | None = None, hatch: str | None = None)[source]#
Bases: object
Describes how to style a particular value (category) of a categorical hue column.
Use palettes mapping hue values to HueValueStyles to make plots with different marker shapes, transparencies, z-orders, etc. for different groups.
The plotting functions accept a hue_key, which identifies a dataframe column that contains hue values. They also accept a palette mapping each hue value to a HueValueStyle that defines not just the color to use for that hue value, but also other styles:
Scatterplot marker shape, primary color, face color, edge color, line width, and transparency.
Rectangle/barplot color and hatch pattern.
Size scale factor for scatterplot markers and legend entries. (The palette of HueValueStyles is defined separately from choosing marker size, and can be plotted at any selected base marker size.)
Here’s an example of assigning a custom HueValueStyle to a hue value in a color palette. This defines a custom unfilled shape, a custom z-order, and more:
palette = {
    "group_A": genetools.palette.HueValueStyle(
        color=sns.color_palette("bright")[0],
        edgecolors=sns.color_palette("bright")[0],
        facecolors="none",
        marker="^",
        marker_size_scale_factor=1.5,
        linewidths=1.5,
        zorder=10,
    ),
    ...
}
For face and edge colors, None is the default value; to disable them, set to string 'none'.
- apply_defaults(defaults: HueValueStyle)[source]#
Returns a new HueValueStyle with defaults applied: any values missing from this style are filled in from another HueValueStyle.
Use case: supply global style defaults for an entire scatterplot, then override with customizations in any individual hue value style.
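The "fill missing values from defaults" pattern can be sketched with a small dataclass. `StyleSketch` below is a hypothetical stand-in with only three fields, not the real HueValueStyle, and treating `None` as "missing" is an assumption.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class StyleSketch:
    """Hypothetical, simplified stand-in for a style object."""

    color: str
    marker: Optional[str] = None
    alpha: Optional[float] = None

    def apply_defaults(self, defaults: "StyleSketch") -> "StyleSketch":
        # Assumption: a field counts as "missing" when it is None.
        missing = {
            f.name: getattr(defaults, f.name)
            for f in fields(self)
            if getattr(self, f.name) is None
        }
        # replace() returns a new object; the original style is left untouched.
        return replace(self, **missing)


global_defaults = StyleSketch(color="blue", marker="o", alpha=0.5)
custom = StyleSketch(color="red", marker="^")
resolved = custom.apply_defaults(global_defaults)
```

The customized marker and color survive, while the unset `alpha` is taken from the global defaults.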
- classmethod from_color(s)[source]#
Construct from color string only; keep all other marker parameters set to defaults. If already a HueValueStyle, pass through without modification.
- static huestyles_to_colors_dict(d: dict) dict [source]#
Cast any HueValueStyle values in dict to be color strings.
- render_scatter_continuous_props(marker_size=None)[source]#
Returns kwargs to pass to ax.scatter() to apply this style, in the context of continuous cmap scatterplots.
genetools.plots module#
- genetools.plots.add_sample_size_to_labels(labels: list, data: DataFrame, hue_key: str) list [source]#
Add sample size to tick labels on any plot with categorical groups.
Sample size for each label is extracted from the hue_key column of dataframe data.
Pairs well with genetools.plots.wrap_tick_labels(ax).
Example usage:
ax.set_xticklabels(
    genetools.plots.add_sample_size_to_labels(
        ax.get_xticklabels(), df, "Group"
    )
)
- Parameters:
labels (list) – list of tick labels corresponding to groups in data[hue_key]
data (pd.DataFrame) – dataset with categorical groups
hue_key (str) – column name specifying categorical groups in dataset data
- Returns:
modified tick labels with group sample sizes attached
- Return type:
list
- genetools.plots.add_sample_size_to_legend(ax: Axes, data: DataFrame, hue_key: str) Axes [source]#
Add sample size to legend labels on any plot with categorical hues.
Sample size for each label is extracted from the hue_key column of dataframe data.
Example usage:
fig, ax = genetools.plots.scatterplot(
    data=df, x_axis_key="x", y_axis_key="y", hue_key="Group"
)
genetools.plots.add_sample_size_to_legend(
    ax=ax, data=df, hue_key="Group"
)
- Parameters:
ax (matplotlib.axes.Axes) – matplotlib Axes for existing plot
data (pd.DataFrame) – dataset with categorical groups
hue_key (str) – column name specifying categorical groups in dataset data
- Returns:
matplotlib Axes with modified legend labels with group sample sizes attached
- Return type:
matplotlib.axes.Axes
- genetools.plots.get_point_size(sample_size: int, maximum_size: float = 100) float [source]#
get scatterplot point size based on sample size (from scanpy), but cut off at maximum_size
- genetools.plots.plot_color_and_size_dotplot(data: DataFrame, x_axis_key: str, y_axis_key: str, value_key: str, color_cmap: str | Colormap | None = None, color_and_size_vmin: float | None = None, color_and_size_vmax: float | None = None, color_and_size_vcenter: float | None = None, figsize: Tuple[float, float] | None = None, legend_text: str | None = None, extend_legend_to_vmin_vmax: bool = False, representative_values_for_legend: List[float] | None = None, min_marker_size: int = 1, marker_size_scale_factor: int = 100, grid: bool = True) Tuple[Figure, Axes] [source]#
Plot dotplot heatmap showing a key as both color and size.
- genetools.plots.plot_confusion_matrix(df: DataFrame, ax: Axes | None = None, figsize: Tuple[float, float] | None = None, outside_borders=True, inside_border_width=0.5, wrap_labels_amount: int | None = 15, wrap_x_axis_labels=True, wrap_y_axis_labels=True, draw_colorbar=False, cmap='Blues') Tuple[Figure, Axes] [source]#
- genetools.plots.plot_triangular_heatmap(df: DataFrame, cmap='Blues', colorbar_label='Value', figsize=(8, 6), vmin=None, vmax=None, annot=True, fmt='.2g') Tuple[Figure, Axes] [source]#
Plot lower triangular heatmap.
Often followed with:
genetools.plots.wrap_tick_labels(
    ax, wrap_x_axis=True, wrap_y_axis=True, wrap_amount=10
)
- genetools.plots.plot_two_key_color_and_size_dotplot(data: DataFrame, x_axis_key: str, y_axis_key: str, color_key: str, size_key: str, color_cmap: str | Colormap | None = None, color_vmin: float | None = None, color_vmax: float | None = None, color_vcenter: float | None = None, figsize: Tuple[float, float] | None = None, size_vmin: float | None = None, size_vmax: float | None = None, size_vcenter: float | None = None, extend_size_legend_to_vmin_vmax: bool = False, representative_sizes_for_legend: List[float] | None = None, inverse_size: bool = False, color_legend_text: str | None = None, size_legend_text: str | None = None, shared_legend_title: str | None = None, min_marker_size: int = 1, marker_size_scale_factor: int = 100, grid: bool = True) Tuple[Figure, Axes] [source]#
Plot dotplot heatmap showing two keys together.
Example with mean and standard deviation: Circle color represents the mean. Circle size represents stability (inverse of standard deviation). Suggestions for this use case:
Pass mean key as color_key and standard deviation key as size_key.
Set inverse_size=True. Big circles are trustworthy/stable across the average, while little circles aren’t.
Set color_legend_text="Mean", size_legend_text="Inverse std. dev."
Set min_marker_size=20 so that the smallest circle for zero standard deviation is still visible.
With a diverging colormap (e.g. color_cmap='RdBu_r', color_vcenter=0), bold circles are strong effects, while near-white circles are weak effects.
- genetools.plots.savefig(fig: Figure, *args, **kwargs)[source]#
Save figure with smart defaults:
Tight bounding box – necessary for legends outside of figure
Deterministic PDF output by fixing SOURCE_DATE_EPOCH to Jan 1, 2000
Editable text objects when outputting a vector PDF
Example usage:
genetools.plots.savefig(fig, "my_plot.png", dpi=300)
Any positional or keyword arguments are passed to matplotlib.pyplot.savefig.
- Parameters:
fig (matplotlib.figure.Figure) – Figure to save.
- genetools.plots.scatterplot(data: DataFrame, x_axis_key: str, y_axis_key: str, hue_key: str | None = None, continuous_hue=False, continuous_cmap='viridis', discrete_palette: Dict[str, HueValueStyle | str] | List[HueValueStyle | str] | None = None, ax: Axes | None = None, figsize=(8, 8), marker_size=25, alpha: float = 1.0, na_color='lightgray', marker: str = 'o', marker_edge_color: str = 'none', marker_zorder: int = 1, marker_size_scale_factor: float = 1.0, legend_size_scale_factor: float = 1.0, marker_face_color: str | None = None, marker_linewidths: float | None = None, enable_legend=True, legend_hues: List[str] | None = None, legend_title: str | None = None, sort_legend_hues=True, autoscale=True, equal_aspect_ratio=False, plotnonfinite=False, remove_x_ticks=False, remove_y_ticks=False, tight_layout=True, despine=True, **kwargs) Tuple[Figure, Axes] [source]#
Scatterplot colored by a discrete or continuous “hue” grouping variable.
For discrete hues, pass continuous_hue = False and a dictionary of colors and/or HueValueStyle objects in discrete_palette.
Figure size will grow beyond the figsize parameter setting, because the legend is pulled out of figure. So you must use fig.savefig('filename', bbox_inches='tight'). This is provided automatically by genetools.plots.savefig(fig, 'filename').
If using with scanpy, to join umap data from adata.obsm with other plot data in adata.obs, try:
data = adata.obs.assign(umap_1=adata.obsm["X_umap"][:, 0], umap_2=adata.obsm["X_umap"][:, 1])
If hue_key = None, then all points will be colored by na_color and styled with parameters alpha, marker, marker_size, marker_zorder, and marker_edge_color. The legend will be disabled.
- Parameters:
data (pandas.DataFrame) – Input data, e.g. anndata.obs
x_axis_key (str) – Column name to plot on X axis
y_axis_key (str) – Column name to plot on Y axis
hue_key (str, optional) – Column name with hue groups that will be used to color points. defaults to None to color all points consistently.
continuous_hue (bool, optional) – Whether the hue column takes continuous or discrete/categorical values, defaults to False.
continuous_cmap (str, optional) – Colormap to use for plotting continuous hue grouping variable, defaults to “viridis”
discrete_palette (Union[Dict[str, Union[HueValueStyle, str]], List[Union[HueValueStyle, str]]], optional) – Palette of colors and/or HueValueStyle objects to use for plotting discrete/categorical hue groups, defaults to None. Supply a matplotlib palette name, list of colors, or dict mapping hue values to colors or to HueValueStyle objects (or a mix of the two).
ax (matplotlib.axes.Axes, optional) – Existing matplotlib Axes to plot on, defaults to None
figsize (tuple, optional) – Size of figure to generate if no existing ax was provided, defaults to (8, 8)
marker_size (int, optional) – Base marker size. May be scaled by individual HueValueStyles. Defaults to 25
alpha (float, optional) – Default point transparency, unless overridden by a HueValueStyle, defaults to 1.0
na_color (str, optional) – Fallback color to use for discrete hue categories that do not have an assigned style in discrete_palette, defaults to “lightgray”
marker (str, optional) – Default marker style, unless overridden by a HueValueStyle, defaults to “o”. For plots with many points, try “.” instead.
marker_edge_color (str, optional) – Default marker edge color, unless overridden by a HueValueStyle, defaults to “none” (no edge border drawn). Another common choice is “face”, so the edge color matches the face color.
marker_zorder (int, optional) – Default marker z-order, unless overridden by a HueValueStyle, defaults to 1
marker_size_scale_factor (float, optional) – Default marker size scale factor, unless overridden by a HueValueStyle, defaults to 1.0
legend_size_scale_factor (float, optional) – Default legend size scale factor, unless overridden by a HueValueStyle, defaults to 1.0
marker_face_color (str, optional) – Default marker face color, unless overridden by a HueValueStyle, defaults to None (uses point color).
marker_linewidths (float, optional) – Default marker line widths, unless overridden by a HueValueStyle, defaults to None
enable_legend (bool, optional) – Whether legend (or colorbar if continuous_hue) should be drawn. Defaults to True. May want to disable if plotting multiple subplots/panels.
legend_hues (list, optional) – Optionally override the list of hue values to include in legend, e.g. to add any hue values missing from the plotted subset of data; defaults to None
legend_title (str, optional) – Specify a title for the legend. Defaults to None, in which case the hue_key is used.
sort_legend_hues (bool, optional) – Enable sorting of legend hues, defaults to True
autoscale (bool, optional) – Enable automatic zoom in, defaults to True
equal_aspect_ratio (bool, optional) – Plot with equal aspect ratio, defaults to False
plotnonfinite (bool, optional) – For continuous hues, whether to plot points with inf or nan value, defaults to False
remove_x_ticks (bool, optional) – Remove X axis tick marks and labels, defaults to False
remove_y_ticks (bool, optional) – Remove Y axis tick marks and labels, defaults to False
tight_layout (bool, optional) – whether to format the figure with tight_layout, defaults to True
despine (bool, optional) – whether to despine (remove the top and right figure borders), defaults to True
- Raises:
ValueError – Must specify correct number of colors if supplying a custom palette
- Returns:
Matplotlib Figure and Axes
- Return type:
Tuple[matplotlib.figure.Figure, matplotlib.axes.Axes]
- genetools.plots.stacked_bar_plot(data, index_key, hue_key, value_key: str | None = None, ax: Axes | None = None, figsize=(8, 8), normalize=True, vertical=False, palette: Dict[str, HueValueStyle | str] | List[HueValueStyle | str] | None = None, na_color='lightgray', hue_order=None, axis_label='Frequency', enable_legend=True, legend_title=None) Tuple[Figure, Axes] [source]#
Stacked bar chart.
The index_key groups form the bars, and the hue_key groups subdivide the bars. The value_key determines the subdivision sizes, and is computed automatically if not provided.
See https://observablehq.com/@d3/stacked-normalized-horizontal-bar for inspiration and colors.
Figure size will grow beyond the figsize parameter setting, because the legend is pulled out of figure. So you must use fig.savefig('filename', bbox_inches='tight'). This is provided automatically by genetools.plots.savefig(fig, 'filename').
- Parameters:
data (pandas.DataFrame) – Plot data containing at minimum the columns identified by index_key, hue_key, and optionally value_key.
index_key (str) – Column name defining the rows.
hue_key (str) – Column name defining the horizontal bar categories.
value_key (str, optional) – Column name defining the bar sizes. If not supplied, this method will calculate group frequencies automatically.
ax (matplotlib.axes.Axes, optional) – Existing matplotlib Axes to plot on, defaults to None
figsize (tuple, optional) – Size of figure to generate if no existing ax was provided, defaults to (8, 8)
normalize (bool, optional) – Normalize each row’s frequencies to sum to 1, defaults to True
vertical (bool, optional) – Plot stacked bars vertically, defaults to False (horizontal)
palette (Union[Dict[str, Union[HueValueStyle, str]], List[Union[HueValueStyle, str]]], optional) – Palette of colors and/or HueValueStyle objects to style the bars corresponding to each hue value, defaults to None (in which case default palette used). Supply a matplotlib palette name, list of colors, or dict mapping hue values to colors or to HueValueStyle objects (or a mix of the two).
na_color (str, optional) – Fallback color to use for hue values that do not have an assigned style in palette, defaults to “lightgray”
hue_order (list, optional) – Optionally specify order of bar subdivisions. This order is applied from the beginning (bottom or left) to the end (top or right) of the bar. Defaults to None
axis_label (str, optional) – Label for the axis along which the frequency values are drawn, defaults to “Frequency”
enable_legend (bool, optional) – Whether legend should be drawn. Defaults to True. May want to disable if plotting multiple subplots/panels.
legend_title (str, optional) – Specify a title for the legend. Defaults to None, in which case the hue_key is used.
- Raises:
ValueError – Must specify correct number of colors if supplying a custom palette
- Returns:
Matplotlib Figure and Axes
- Return type:
(matplotlib.figure.Figure, matplotlib.axes.Axes)
- genetools.plots.superimpose_group_labels(ax: Axes, data: DataFrame, x_axis_key: str, y_axis_key: str, label_key: str, label_z_order=100, label_color='k', label_alpha=0.8, label_size=15) Axes [source]#
Add group (cluster) labels to existing plot.
- Parameters:
ax (matplotlib.axes.Axes) – matplotlib Axes for existing plot
data (pd.DataFrame) – dataset containing the columns named by x_axis_key, y_axis_key, and label_key
x_axis_key (str) – Column name to plot on X axis
y_axis_key (str) – Column name to plot on Y axis
label_key (str) – Column name specifying categorical group text labels to superimpose on plot
label_z_order (int, optional) – Z-index for superimposed group text labels, defaults to 100
label_color (str, optional) – Color for superimposed group text labels, defaults to “k”
label_alpha (float, optional) – Opacity for superimposed group text labels, defaults to 0.8
label_size (int, optional) – Text size of superimposed group labels, defaults to 15
- Returns:
matplotlib Axes with superimposed group labels
- Return type:
matplotlib.axes.Axes
- genetools.plots.two_class_relative_density_plot(data: DataFrame, x_key: str, y_key: str, hue_key: str, positive_class: str, colorbar_label: str | None = None, quantile: float | None = 0.5, figsize=(8, 8), n_bins=50, range=None, continuous_cmap: str = 'RdBu_r', cmap_vcenter: float | None = 0.5, balanced_class_weights=True) Tuple[Figure, Axes, str] [source]#
Two-class relative density plot. For alternatives, see contour KDEs in seaborn’s displot function. (For general 2D density plots, see plt.hexbin, sns.jointplot, and plt.hist2d.)
- genetools.plots.wrap_tick_labels(ax: Axes, wrap_x_axis=True, wrap_y_axis=True, wrap_amount=20, break_characters=['/']) Axes [source]#
Add text wrapping to tick labels on x and/or y axes on any plot.
May override existing line breaks in tick labels.
- Parameters:
ax (matplotlib.axes.Axes) – existing plot with tick labels to be wrapped
wrap_x_axis (bool, optional) – whether to wrap x-axis tick labels, defaults to True
wrap_y_axis (bool, optional) – whether to wrap y-axis tick labels, defaults to True
wrap_amount (int, optional) – length of each line of text, defaults to 20
break_characters (list, optional) – characters at which to encourage breaking text into lines, defaults to ['/']. Set to None or [] to disable.
- Returns:
plot with modified tick labels
- Return type:
matplotlib.axes.Axes
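The wrapping behavior for a single label can be sketched with the standard library's `textwrap`. This is a hypothetical helper showing one plausible way to "encourage" breaks at certain characters (by inserting a space after them); it is not the library's implementation and does not touch a matplotlib Axes.

```python
import textwrap


def wrap_label_sketch(label, wrap_amount=20, break_characters=("/",)):
    """Hypothetical helper: wrap one tick label to lines of at most wrap_amount."""
    # Assumption: adding a space after each break character gives textwrap
    # an opportunity to break the line there.
    for char in break_characters or ():
        label = label.replace(char, char + " ")
    return textwrap.fill(label, width=wrap_amount)


wrapped = wrap_label_sketch("naive B cells/memory B cells", wrap_amount=15)
```

In practice the wrapped strings would then be applied back to the axis, for example with `ax.set_xticklabels(...)`.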
genetools.scanpy_helpers module#
Scanpy common recipes.
- genetools.scanpy_helpers.clr_normalize(adata, axis=0, inplace=True)[source]#
Centered log ratio transformation for Cite-seq data, normalizing:
each protein’s count vectors across cells (axis=0, normalizing each column of the cells x proteins matrix, default)
or the antibody count vector for each cell (axis=1, normalizing each row of the cells x proteins matrix)
This is a wrapper of genetools.stats.clr_normalize(matrix, axis).
- Parameters:
adata (anndata.AnnData) – Protein counts anndata
axis (int, optional) – normalize each antibody independently (axis=0) or normalize each cell independently (axis=1), defaults to 0
inplace (bool, optional) – whether to modify input anndata, defaults to True
- Returns:
Transformed anndata
- Return type:
anndata.AnnData
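For orientation, here is one common formulation of the centered log ratio with a pseudocount, written in NumPy. The exact formula used by genetools.stats.clr_normalize is not stated in this section, so treat this as an assumption; the function name is hypothetical.

```python
import numpy as np


def clr_sketch(matrix, axis=0):
    """Hypothetical helper: log1p counts minus their mean log along the axis."""
    log_counts = np.log1p(np.asarray(matrix, dtype=float))
    # Subtracting the mean log is dividing by the geometric mean of (1 + x).
    return log_counts - log_counts.mean(axis=axis, keepdims=True)


# Rows are cells, columns are proteins (hypothetical counts).
counts = np.array([[0.0, 10.0], [5.0, 20.0], [50.0, 30.0]])
per_protein = clr_sketch(counts, axis=0)
per_cell = clr_sketch(counts, axis=1)
```

With `axis=0` each protein column is centered across cells; with `axis=1` each cell row is centered across proteins.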
- genetools.scanpy_helpers.find_all_markers(adata, cluster_key, pval_cutoff=0.05, log2fc_min=0.25, key_added='rank_genes_groups', test='wilcoxon', use_raw=True)[source]#
Find differentially expressed marker genes for each group of cells.
- Parameters:
adata (anndata.AnnData) – Scanpy/anndata object
cluster_key (str) – The adata.obs column name that defines groups for finding distinguishing marker genes.
pval_cutoff (float, optional) – Only return markers that have an adjusted p-value below this threshold. Defaults to 0.05. Set to None to disable filtering.
log2fc_min (float, optional) – Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. Defaults to 0.25. Set to None to disable filtering.
key_added (str, optional) – The key in adata.uns under which the results are saved, defaults to “rank_genes_groups”
test (str, optional) – Statistical test to use, defaults to “wilcoxon” (Wilcoxon rank-sum test), see scanpy.tl.rank_genes_groups documentation for other options
use_raw (bool, optional) – Use raw attribute of adata if present, defaults to True
- Returns:
Dataframe with ranked marker genes for each cluster. Important columns: gene, rank, [cluster_key] (same as argument value)
- Return type:
pandas.DataFrame
- genetools.scanpy_helpers.pca_anndata(adata: AnnData, pca_transformer: PCA | IncrementalPCA | None = None, n_components=None, inplace=True, **kwargs) Tuple[AnnData, PCA | IncrementalPCA] [source]#
PCA anndata, like with scanpy.pp.pca. Accepts pre-computed PCA transformer, so you can apply the same PCA to multiple anndatas.
Args:
pca_transformer: pre-defined preprocessing transformer to run PCA on adata.X
n_components: number of PCA components
inplace: whether to modify input adata in place
Returns:
adata, pca_transformer
- genetools.scanpy_helpers.pca_train_and_test_anndatas(adata_train, adata_test=None, n_components=None, inplace=True, **kwargs)[source]#
PCA train anndata (like with scanpy.pp.pca), then apply same PCA to test anndata – as opposed to PCAing them independently.
If adata_test isn’t supplied, this just runs PCA on adata_train independently.
- genetools.scanpy_helpers.scale_anndata(adata: AnnData, scale_transformer: StandardScaler | None = None, inplace=False, set_raw=False, **kwargs) Tuple[AnnData, StandardScaler] [source]#
Scale anndata, like with scanpy.pp.scale. Accepts pre-computed StandardScaler preprocessing transformer, so you can apply the same scaling to multiple anndatas.
Args:
scale_transformer: pre-defined preprocessing transformer to scale adata.X
inplace: whether to modify input adata in place
set_raw: whether to set adata.raw equal to input adata
Returns:
adata, scale_transformer
- genetools.scanpy_helpers.scale_train_and_test_anndatas(adata_train, adata_test=None, inplace=False, set_raw=False, **kwargs)[source]#
Scale train anndata (like with scanpy.pp.scale), then apply same scaling to test anndata – as opposed to scaling them independently.
If adata_test isn’t supplied, this just scales adata_train independently.
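The fit-on-train, apply-to-test idea can be sketched with plain NumPy. This is a minimal illustration of the principle, not genetools' implementation: the helper names `fit_scaler` and `apply_scaler` are hypothetical, and leaving zero-variance columns unscaled is an assumption made here.

```python
import numpy as np


def fit_scaler(X_train):
    """Learn per-column mean and standard deviation from the training matrix."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Assumption: columns with zero variance are left unscaled (divide by 1).
    std = np.where(std == 0, 1.0, std)
    return mean, std


def apply_scaler(X, mean, std):
    """Scale any matrix with statistics that were learned elsewhere."""
    return (X - mean) / std


X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_test = np.array([[3.0, 30.0], [7.0, 70.0]])

mean, std = fit_scaler(X_train)
X_train_scaled = apply_scaler(X_train, mean, std)
# The test set reuses the training statistics instead of computing its own.
X_test_scaled = apply_scaler(X_test, mean, std)
```

Scaling the test set with its own statistics would put the two matrices on different coordinate systems, which is the situation the train-and-test helpers are meant to avoid.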
- genetools.scanpy_helpers.umap_anndata(adata, umap_transformer=None, n_neighbors: int | None = None, n_components: int | None = None, inplace=True, use_rapids=False, use_pca=False, **kwargs)[source]#
UMAP anndata, like with scanpy.tl.umap. Accepts pre-computed UMAP transformer, so you can apply the same UMAP to multiple anndatas.
Args:
umap_transformer: pre-defined preprocessing transformer to run UMAP on adata.X
n_components: number of UMAP components
inplace: whether to modify input adata in place
Anndata should already be scaled.
Returns:
adata, umap_transformer
- genetools.scanpy_helpers.umap_train_and_test_anndatas(adata_train, adata_test=None, n_neighbors: int | None = None, n_components: int | None = None, inplace=True, use_rapids=False, use_pca=False, **kwargs)[source]#
UMAP train anndata (like with scanpy.tl.umap), then apply same UMAP to test anndata – as opposed to running UMAP on them independently.
If adata_test isn’t supplied, this just runs UMAP on adata_train independently.
genetools.stats module#
- class genetools.stats.ConfusionMatrixValues(true_positives: float, false_negatives: float, false_positives: float, true_negatives: float)[source]#
Bases: object
- genetools.stats.accept_series(func)[source]#
Decorator to seamlessly accept pandas Series in place of a numpy array, and returns with original Series index.
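A decorator of this kind can be sketched as follows. This is an illustration under assumptions, not the library's code: the name `accept_series_sketch` is hypothetical, the array is assumed to be the first positional argument, and the wrapped function is assumed to return one value per input element.

```python
import functools

import numpy as np
import pandas as pd


def accept_series_sketch(func):
    """Let an array-based function take a pandas Series and return one."""

    @functools.wraps(func)
    def wrapper(values, *args, **kwargs):
        if isinstance(values, pd.Series):
            # Run on the underlying array, then reattach the original index.
            result = func(values.to_numpy(), *args, **kwargs)
            return pd.Series(result, index=values.index)
        # Anything else is passed through untouched.
        return func(values, *args, **kwargs)

    return wrapper


@accept_series_sketch
def double(arr):
    return arr * 2


series_result = double(pd.Series([1, 2], index=["a", "b"]))
array_result = double(np.array([1, 2]))
```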
- genetools.stats.clr_normalize(mat, axis=0)[source]#
Centered log ratio transformation for Cite-seq data, normalizing:
each protein’s count vectors across cells (axis=0, normalizing each column of the cells x proteins matrix, default)
or the antibody count vector for each cell (axis=1, normalizing each row of the cells x proteins matrix)
To use with anndata:
genetools.scanpy_helpers.clr_normalize(adata, axis)
Notes:
Output will be densified.
We use Seurat’s [modified CLR implementation](https://github.com/satijalab/seurat/issues/2624) to handle pseudocounts:
log1p(x = x / (exp(x = sum(log1p(x = x[x > 0]), na.rm = TRUE) / length(x = x))))
This is almost the same as log(x) - 1/D * log(product of x's), which is the same as log(x) - log([product of x's] ^ (1/D)), where D = len(x).
The general definition is:
import numpy as np
from scipy.stats.mstats import gmean
np.log(x) - np.log(gmean(x))
But geometric mean only applies to positive numbers (otherwise the inner product will be 0). So you want to use pseudocounts or drop 0 counts. That’s what Seurat’s modification does.
See also https://github.com/theislab/scanpy/pull/1117 for other approaches.
Do you run this normalization cell-wise or gene-wise (i.e. protein-wise)? See discussion here:
Unfortunately there is not a single answer. In some cases, cell-based normalization fails. This is because cell-normalization makes an assumption that the total ADT counts should be constant across cells. That can become a significant issue if you have cell populations in your data, but did not add protein markers for them (this is also an issue for scRNA-seq, but is significantly mitigated because at least you measure many genes). However, gene-based normalization can fail when there is significant heterogeneity in sequencing depth, or cell size. The optimal strategy depends on the AB panel, and heterogeneity of your sample.
In this implementation, protein-wise is axis=0 and cell-wise is axis=1. The default is protein-wise (axis=0), i.e. each protein is normalized independently, which matches Seurat’s default.
- Parameters:
mat (numpy array or scipy sparse matrix) – Counts matrix (cells x proteins)
axis (int, optional) – normalize each antibody independently (axis=0) or normalize each cell independently (axis=1), defaults to 0
- Returns:
Transformed counts matrix
- Return type:
numpy array
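The Seurat-style formula quoted in the notes can be transcribed into NumPy as a rough sketch. The function names `clr_seurat_style` and `clr_matrix` are hypothetical, the input is assumed to be a dense array, and this is a direct reading of the quoted R expression rather than the genetools source.

```python
import numpy as np


def clr_seurat_style(x):
    """Apply the modified CLR formula to one 1-D count vector."""
    x = np.asarray(x, dtype=float)
    # Sum log1p over positive entries only, but divide by the full length,
    # mirroring sum(log1p(x[x > 0])) / length(x) in the R expression.
    log_scale = np.sum(np.log1p(x[x > 0])) / len(x)
    return np.log1p(x / np.exp(log_scale))


def clr_matrix(mat, axis=0):
    """axis=0 transforms each column (protein); axis=1 each row (cell)."""
    mat = np.asarray(mat, dtype=float)
    return np.apply_along_axis(clr_seurat_style, axis, mat)


# Cells x proteins toy matrix.
counts = np.array([[0, 10], [5, 10], [20, 10]])
protein_wise = clr_matrix(counts, axis=0)
cell_wise = clr_matrix(counts, axis=1)
```

Zero counts stay at zero after the transform because log1p(0) is 0, which is how the pseudocount handling avoids the undefined log of zero.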
- genetools.stats.coclustering(cluster_ids_1, cluster_ids_2)[source]#
Compute coclustering percentage between two sets of cluster IDs for the same cells: Of the cell pairs clustered together by either or both methods, what percentage are clustered together by both methods?
(The clusters are allowed to have different names across methods, and don’t necessarily need to be ints.)
- Parameters:
cluster_ids_1 (numpy array-like) – One set of cluster IDs.
cluster_ids_2 (numpy array-like) – Another set of cluster IDs.
- Returns:
Percentage of cell pairs clustered together by one or both methods that are also clustered together by the other method.
- Return type:
float
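The pairwise definition above can be written out directly. The following is a brute-force sketch for small inputs, with a hypothetical function name; it returns a fraction between 0 and 1, which is an assumption, since the text does not say whether the library reports 0 to 1 or 0 to 100.

```python
from itertools import combinations


def coclustering_sketch(cluster_ids_1, cluster_ids_2):
    """Fraction of cell pairs grouped by either method that both methods group."""
    if len(cluster_ids_1) != len(cluster_ids_2):
        raise ValueError("Both label lists must describe the same cells.")
    together_either = 0
    together_both = 0
    # Visit every unordered pair of cells once: O(n^2), fine for illustration.
    for i, j in combinations(range(len(cluster_ids_1)), 2):
        same_1 = cluster_ids_1[i] == cluster_ids_1[j]
        same_2 = cluster_ids_2[i] == cluster_ids_2[j]
        if same_1 or same_2:
            together_either += 1
            if same_1 and same_2:
                together_both += 1
    if together_either == 0:
        # Assumption: no pair is co-clustered by either method, so undefined.
        return float("nan")
    return together_both / together_either


partial = coclustering_sketch([0, 0, 1, 1], ["x", "x", "y", "z"])
renamed = coclustering_sketch([0, 0, 1, 1], ["a", "a", "b", "b"])
```

Only equality within each labeling is compared, so cluster names never need to match across the two methods.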
- genetools.stats.interpolate_prc(y_true, y_score, n_points=1000)[source]#
Interpolate precision-recall curve.
Returns:
interpolated recall
interpolated precision
original recall
original precision
To plot:
import matplotlib.pyplot as plt
# recall, precision, interpolated_recall, interpolated_precision are the values returned by interpolate_prc
fig, ax = plt.subplots(figsize=(4, 4))
plt.plot(
    recall,
    precision,
    label="Real",
    color="k",
    alpha=0.5,
    drawstyle="steps-post",
)
plt.plot(
    interpolated_recall,
    interpolated_precision,
    label="Interpolated",
    color="r",
    alpha=0.5,
    drawstyle="steps-post",
)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend(bbox_to_anchor=(1, 1))
- genetools.stats.interpolate_roc(y_true, y_score, n_points=1000)[source]#
Interpolate receiver operating characteristic curve.
Returns:
interpolated FPR
interpolated TPR
original FPR
original TPR
To plot:
import matplotlib.pyplot as plt
# fpr, tpr, interpolated_fpr, interpolated_tpr are the values returned by interpolate_roc
fig, ax = plt.subplots(figsize=(4, 4))
plt.plot(
    fpr,
    tpr,
    label="Real",
    color="k",
    alpha=0.5,
    drawstyle="steps-post",
)
plt.plot(
    interpolated_fpr,
    interpolated_tpr,
    label="Interpolated",
    color="r",
    alpha=0.5,
    drawstyle="steps-post",
)
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend(bbox_to_anchor=(1, 1))
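One way to resample an ROC curve onto a fixed grid is linear interpolation with `np.interp`. The sketch below assumes that scheme and a grid spanning 0 to 1; the text does not state which interpolation method genetools uses, and `interpolate_curve` is a hypothetical name.

```python
import numpy as np


def interpolate_curve(x, y, n_points=1000):
    """Resample a curve onto n_points evenly spaced x values in [0, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.interp expects increasing x coordinates, so sort first.
    order = np.argsort(x, kind="stable")
    x_grid = np.linspace(0.0, 1.0, n_points)
    y_grid = np.interp(x_grid, x[order], y[order])
    return x_grid, y_grid


# Toy ROC curve with three operating points.
fpr = np.array([0.0, 0.5, 1.0])
tpr = np.array([0.0, 1.0, 1.0])
grid_fpr, grid_tpr = interpolate_curve(fpr, tpr, n_points=5)
```

A shared grid like this makes it possible to average curves from several cross-validation folds point by point.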
- genetools.stats.intersect_marker_genes(reference_data, query_data, low_confidence_threshold=0.035, low_confidence_suffix='?')[source]#
Map cluster marker genes against reference lists to find top hits.
- query_data and reference_data should both be dictionaries where:
keys are cluster names or IDs
values are lists of genes associated with that cluster
Or if you have a dataframe where each row contains a cluster ID and a gene name, you can convert to dict with
df.groupby('cluster')['gene'].apply(list).to_dict()
Usage with an anndata/scanpy object on groups defined by adata.obs['louvain']:
# find marker genes for all clusters
cluster_markers_df = genetools.scanpy_helpers.find_all_markers(adata, cluster_key='louvain')
# convert to dict of clusters -> gene lists mapping
cluster_marker_lists = cluster_markers_df.groupby('louvain')['gene'].apply(list).to_dict()
# intersect with known marker gene lists
results, label_map, low_confidence_percentage = genetools.stats.intersect_marker_genes(reference_marker_lists, cluster_marker_lists)
# rename clusters in your anndata/scanpy object
adata.obs['louvain_annotated'] = adata.obs['louvain'].copy().cat.rename_categories(label_map)
- Behavior:
Intersection scores are normalized to marker gene list sizes.
Resulting duplicate cluster names are renamed, ensuring that N original query clusters will map to N renamed clusters.
- Parameters:
reference_data (dict) – reference marker gene lists
query_data (dict) – query marker gene lists
low_confidence_threshold (float, optional) – Minimal difference between top and subsequent hits for a confident call, defaults to 0.035
low_confidence_suffix (str, optional) – Suffix for low-confidence cluster renamings, defaults to “?”
- Returns:
dataframe with cluster mapping details, a dictionary for renaming query cluster names, and percentage of low-confidence calls.
- Return type:
(pandas.DataFrame, dict, float) tuple
- genetools.stats.make_confusion_matrix(y_true: ndarray | list, y_pred: ndarray | list, true_label: str, pred_label: str, label_order: List[str] | None = None) DataFrame [source]#
Make a confusion matrix. Pairs with genetools.plots.plot_confusion_matrix.
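A labeled confusion matrix of this shape can be built with `pandas.crosstab`. The sketch below is not the library's implementation: the function name is hypothetical, and applying `label_order` to both rows and columns is an assumption.

```python
import pandas as pd


def confusion_matrix_sketch(y_true, y_pred, true_label, pred_label, label_order=None):
    """Count (true, predicted) pairs; rows are true labels, columns predictions."""
    matrix = pd.crosstab(
        pd.Series(y_true, name=true_label),
        pd.Series(y_pred, name=pred_label),
    )
    if label_order is not None:
        # Assumption: the same ordering applies to both axes; labels that
        # never occur are filled with zero counts.
        matrix = matrix.reindex(index=label_order, columns=label_order, fill_value=0)
    return matrix


y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
cm = confusion_matrix_sketch(y_true, y_pred, "Ground truth", "Predicted")
cm_ordered = confusion_matrix_sketch(
    y_true, y_pred, "Ground truth", "Predicted", label_order=["b", "a", "c"]
)
```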
- genetools.stats.normalize_columns(df)[source]#
Make columns sum to 1. If a column is all zeroes, this will return NaNs for the column entries, since there is no way to make the column sum to 1.
- Parameters:
df (pandas.DataFrame) – dataframe
- Returns:
column-normalized dataframe
- Return type:
pandas.DataFrame
- genetools.stats.normalize_rows(df)[source]#
Make rows sum to 1. If a row is all zeroes, this will return NaNs for the row entries, since there is no way to make the row sum to 1.
- Parameters:
df (pandas.DataFrame) – dataframe
- Returns:
row-normalized dataframe
- Return type:
pandas.DataFrame
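Row and column normalization, including the all-zero NaN behavior described above, can be reproduced with plain pandas division. This is a sketch of the arithmetic on a toy dataframe, not the genetools source.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 0.0, 2.0], "b": [3.0, 0.0, 2.0]})

# Divide each row by its own total; axis=0 aligns the totals with the row index.
row_normalized = df.div(df.sum(axis=1), axis=0)

# Divide each column by its own total; axis=1 aligns the totals with the columns.
col_normalized = df.div(df.sum(axis=0), axis=1)
```

The middle row sums to zero, so its entries become NaN (0 / 0) in `row_normalized`, while every column here has a nonzero total and normalizes cleanly.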
- genetools.stats.percentile_normalize(values)[source]#
Percentile normalize.
- Parameters:
values (numpy.ndarray or pandas.Series) – values to normalize
- Returns:
percentile-normalized values
- Return type:
numpy.ndarray or pandas.Series
- genetools.stats.rank_normalize(values)[source]#
Rank normalize, starting with rank 1. All ranks must be unique.
- Parameters:
values (numpy.ndarray or pandas.Series) – values to normalize
- Returns:
rank-normalized values
- Return type:
numpy.ndarray or pandas.Series
- genetools.stats.run_sigmoid_if_binary_and_softmax_if_multiclass(arr: ndarray) ndarray [source]#
Convert logits to probabilities using sigmoid if binary, or softmax if multiclass.
Binary is standard logistic regression. Input should have shape (n_samples, ) (as computed by decision_function() in sklearn), but the output will follow predict_proba’s conventional output shape of (n_samples, 2).
Multiclass approach is softmax regression as in R glmnet and sklearn. See https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_set_of_independent_binary_regressions
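The shape convention described above can be sketched in NumPy: 1-D logits go through a sigmoid and are expanded to two columns, while 2-D logits go through a row-wise softmax. The function name is hypothetical, and subtracting the row maximum for numerical stability is a choice made here, not something the text specifies.

```python
import numpy as np


def logits_to_probabilities(logits):
    """Sigmoid for 1-D (binary) logits, softmax for 2-D (multiclass) logits."""
    logits = np.asarray(logits, dtype=float)
    if logits.ndim == 1:
        # Binary case: probability of the positive class, then stack as
        # [P(negative), P(positive)] to match the (n_samples, 2) layout.
        p_positive = 1.0 / (1.0 + np.exp(-logits))
        return np.column_stack([1.0 - p_positive, p_positive])
    # Multiclass case: subtract the row maximum before exponentiating so
    # large logits do not overflow; the result is unchanged mathematically.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exponentiated = np.exp(shifted)
    return exponentiated / exponentiated.sum(axis=1, keepdims=True)


binary_probs = logits_to_probabilities([0.0, 2.0, -2.0])
multiclass_probs = logits_to_probabilities([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
```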
Module contents#
Top-level package for genetools.