Python - kallisto | bustools (2024)

Python - kallisto | bustools (1)

In this notebook, we will perform pre-processing and analysis of 10x Genomics pbmc_1k_protein_v3 feature barcoding dataset using the Kallisto Indexing and Tag Extraction (KITE) workflow, implemented with a wrapper called kb. It was developed by Kyung Hoi (Joseph) Min and A. Sina Booeshaghi.

In Feature Barcoding assays, cellular data are recorded as short DNA sequences using procedures adapted from single-cell RNA-seq. The KITE workflow generates a "Mismatch Map" containing the sequences of all Feature Barcodes used in the experiment as well as all of their single-base mismatches. The Mismatch Map is used to produce transcipt-to-gene (.t2g) and fasta (.fa) files to be used as inputs for kallisto. An index is made with kallisto index, then kallisto | bustools effectively searches the sequencing data for the sequences in the Mismatch Map.

Pre-processing

Download the data

Note: We use the -O option for wget to rename the files to easily identify them.

123456
%%time!wget -q https://caltech.box.com/shared/static/asmj4nu90ydhsrk3pm7aaxu00cnnfige.txt -O checksums.txt!wget -q https://caltech.box.com/shared/static/mp2vr3p6dztdyatuag8ir3cektmrztg8.gz -O pbmc_1k_protein_v3_antibody_S2_L001_R1_001.fastq.gz!wget -q https://caltech.box.com/shared/static/f3payi1za7mn0jfai7vm10sy3yqwgpqh.gz -O pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz!wget -q https://caltech.box.com/shared/static/e112bbczh9o1rl6gfin36bqp0ga7uvdy.gz -O pbmc_1k_protein_v3_antibody_S2_L002_R1_001.fastq.gz!wget -q https://caltech.box.com/shared/static/3ve2axc8dr8v5nnrhmynrdgpqj6xg42k.gz -O pbmc_1k_protein_v3_antibody_S2_L002_R2_001.fastq.gz
CPU times: user 477 ms, sys: 86.2 ms, total: 563 msWall time: 1min

Then, we verify the integrity of the files we downloaded to make sure they were not corrupted during the download.

1
!md5sum -c checksums.txt --ignore-missing
pbmc_1k_protein_v3_antibody_S2_L001_R1_001.fastq.gz: OKpbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz: OKpbmc_1k_protein_v3_antibody_S2_L002_R1_001.fastq.gz: OKpbmc_1k_protein_v3_antibody_S2_L002_R2_001.fastq.gz: OK

Install kb

Install kb for running the kallisto|bustools workflow.

1
!pip install --quiet git+https://github.com/pachterlab/kb_python@devel
[K |████████████████████████████████| 51kB 2.6MB/s [K |████████████████████████████████| 13.2MB 314kB/s [K |████████████████████████████████| 112kB 43.4MB/s [?25h Building wheel for kb-python (setup.py) ... [?25l[?25hdone Building wheel for loompy (setup.py) ... [?25l[?25hdone Building wheel for numpy-groupies (setup.py) ... [?25l[?25hdone

Build the feature barcode mismatch index

kb is able to generate a FASTA file containing all hamming distance < 2 variants of the feature barcodes and create a kallisto index of these sequences. But it in order to do so, we first need to prepare a TSV containing feature barcode sequences in the first column and the feature barcode names in the second.

First, we download the feature reference file provided by 10x Genomics.

1
!wget -q http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_feature_ref.csv

Let's load it in as a Pandas DataFrame.

1234
import pandas as pddf = pd.read_csv('pbmc_1k_protein_v3_feature_ref.csv')df
id name read pattern sequence feature_type
0 CD3 CD3_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN AACAAGACCCTTGAG Antibody Capture
1 CD4 CD4_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN TACCCGTAATAGCGT Antibody Capture
2 CD8a CD8a_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN ATTGGCACTCAGATG Antibody Capture
3 CD14 CD14_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN GAAAGTCAAAGCACT Antibody Capture
4 CD15 CD15_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN ACGAATCAATCTGTG Antibody Capture
5 CD16 CD16_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN GTCTTTGTCAGTGCA Antibody Capture
6 CD56 CD56_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN GTTGTCCGACAATAC Antibody Capture
7 CD19 CD19_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN TCAACGCTTGGCTAG Antibody Capture
8 CD25 CD25_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN GTGCATTCAACAGTA Antibody Capture
9 CD45RA CD45RA_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN GATGAGAACAGGTTT Antibody Capture
10 CD45RO CD45RO_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN TGCATGTCATCGGTG Antibody Capture
11 PD-1 PD-1_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN AAGTCGTGAGGCATG Antibody Capture
12 TIGIT TIGIT_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN TGAAGGCTCATTTGT Antibody Capture
13 CD127 CD127_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN ACATTGACGCAACTA Antibody Capture
14 IgG2a IgG2a_control_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN CTCTATTCAGACCAG Antibody Capture
15 IgG1 IgG1_control_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN ACTCACTGGAGTCTC Antibody Capture
16 IgG2b IgG2b_control_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN ATCACATCGTTGCCA Antibody Capture

We'll convert this dataframe into a TSV format that kb requires.

12
df[['sequence', 'id']].to_csv('features.tsv', index=None, header=None, sep='\t')!cat features.tsv
AACAAGACCCTTGAG CD3TACCCGTAATAGCGT CD4ATTGGCACTCAGATG CD8aGAAAGTCAAAGCACT CD14ACGAATCAATCTGTG CD15GTCTTTGTCAGTGCA CD16GTTGTCCGACAATAC CD56TCAACGCTTGGCTAG CD19GTGCATTCAACAGTA CD25GATGAGAACAGGTTT CD45RATGCATGTCATCGGTG CD45ROAAGTCGTGAGGCATG PD-1TGAAGGCTCATTTGT TIGITACATTGACGCAACTA CD127CTCTATTCAGACCAG IgG2aACTCACTGGAGTCTC IgG1ATCACATCGTTGCCA IgG2b

Finally, we use kb to generate the mismatch kallisto index.

1
!kb ref -i mismatch.idx -f1 mismatch.fa -g t2g.txt --workflow kite features.tsv
[2021-03-31 23:30:15,970] INFO Generating mismatch FASTA at mismatch.fa[2021-03-31 23:30:15,982] INFO Creating transcript-to-gene mapping at t2g.txt[2021-03-31 23:30:15,986] INFO Indexing mismatch.fa to mismatch.idx

Generate a feature count matrix in H5AD format

The following command will generate an RNA count matrix of cells (rows) by genes (columns) in H5AD format, which is a binary format used to store Anndata objects. Notice we are providing the index and transcript-to-gene mapping we generated in the previous step to the -i and -g arguments respectively. Also, these reads were generated with the 10x Genomics Chromium Single Cell v3 Chemistry, hence the -x 10xv3 argument. To view other supported technologies, run kb --list.

Note: If you would like a Loom file instead, replace the --h5ad flag with --loom. If you want to use the raw matrix output by kb instead of their H5AD or Loom converted files, omit these flags.

123456
%%time!kb count --h5ad -i mismatch.idx -g t2g.txt -x 10xv3 --workflow kite -t 2 \pbmc_1k_protein_v3_antibody_S2_L001_R1_001.fastq.gz \pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz \pbmc_1k_protein_v3_antibody_S2_L002_R1_001.fastq.gz \pbmc_1k_protein_v3_antibody_S2_L002_R2_001.fastq.gz
[2021-03-31 23:30:17,474] INFO Using index mismatch.idx to generate BUS file to . from[2021-03-31 23:30:17,474] INFO pbmc_1k_protein_v3_antibody_S2_L001_R1_001.fastq.gz[2021-03-31 23:30:17,474] INFO pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz[2021-03-31 23:30:17,474] INFO pbmc_1k_protein_v3_antibody_S2_L002_R1_001.fastq.gz[2021-03-31 23:30:17,474] INFO pbmc_1k_protein_v3_antibody_S2_L002_R2_001.fastq.gz[2021-03-31 23:32:02,567] INFO Sorting BUS file ./output.bus to ./tmp/output.s.bus[2021-03-31 23:32:18,183] INFO Whitelist not provided[2021-03-31 23:32:18,183] INFO Copying pre-packaged 10XV3 whitelist to .[2021-03-31 23:32:19,157] INFO Inspecting BUS file ./tmp/output.s.bus[2021-03-31 23:32:31,284] INFO Correcting BUS records in ./tmp/output.s.bus to ./tmp/output.s.c.bus with whitelist ./10xv3_whitelist.txt[2021-03-31 23:32:49,746] INFO Sorting BUS file ./tmp/output.s.c.bus to ./output.unfiltered.bus[2021-03-31 23:33:01,416] INFO Generating count matrix ./counts_unfiltered/cells_x_features from BUS file ./output.unfiltered.bus[2021-03-31 23:33:03,924] INFO Reading matrix ./counts_unfiltered/cells_x_features.mtx[2021-03-31 23:33:05,698] INFO Writing matrix to h5ad ./counts_unfiltered/adata.h5adCPU times: user 1.3 s, sys: 180 ms, total: 1.48 sWall time: 2min 49s

Analysis

In this part of the tutorial, we will load the RNA count matrix generated by kb count into Python and cluster the cells with Leiden.

Install packages

Google Colab does not come with Scanpy, python-igraph, or louvain (but comes with matplotlib, numpy, pandas, and scipy).

1
!pip --quiet install leidenalg scanpy MulticoreTSNE

Import packages

123
import anndataimport numpy as npimport scanpy as sc
1
adata = anndata.read_h5ad('counts_unfiltered/adata.h5ad')
1
adata
AnnData object with n_obs × n_vars = 124716 × 17 var: 'feature_name'
1
adata.obs.head()
barcode
AAACCCAAGAAACCCA
AAACCCAAGACGAGGA
AAACCCAAGAGTGTGT
AAACCCAAGAGTGTTG
AAACCCAAGATAGCAC
1
adata.var
feature_name
feature_id
CD3 CD3
CD4 CD4
CD8a CD8a
CD14 CD14
CD15 CD15
CD16 CD16
CD56 CD56
CD19 CD19
CD25 CD25
CD45RA CD45RA
CD45RO CD45RO
PD-1 PD-1
TIGIT TIGIT
CD127 CD127
IgG2a IgG2a
IgG1 IgG1
IgG2b IgG2b

Plot counts

1
sc.pp.filter_cells(adata, min_counts=0)
1
sc.pp.filter_genes(adata, min_counts=0)
1
sc.pl.violin(adata, keys='n_counts')

Python - kallisto | bustools (2)

1
adata.obs['n_countslog'] = np.log1p(adata.obs['n_counts'])
1
sc.pl.violin(adata, keys='n_countslog')

Python - kallisto | bustools (3)

1
adata.obs.index
Index(['AAACCCAAGAAACCCA', 'AAACCCAAGACGAGGA', 'AAACCCAAGAGTGTGT', 'AAACCCAAGAGTGTTG', 'AAACCCAAGATAGCAC', 'AAACCCAAGATGAGTC', 'AAACCCAAGATGGTAC', 'AAACCCAAGATTCGTT', 'AAACCCAAGATTTGGG', 'AAACCCAAGCAAGCAT', ... 'TTTGTTGTCGTCGCCT', 'TTTGTTGTCGTTGACG', 'TTTGTTGTCTAACCGG', 'TTTGTTGTCTATGTAG', 'TTTGTTGTCTCAACAA', 'TTTGTTGTCTCACTCA', 'TTTGTTGTCTCTTCGA', 'TTTGTTGTCTCTTGGT', 'TTTGTTGTCTGCACTT', 'TTTGTTGTCTGCGACA'], dtype='object', name='barcode', length=124716)
123
sc.pp.filter_cells(adata, min_counts=1000)sc.pl.violin(adata, keys='n_countslog', title="kallisto UMI counts")adata

Python - kallisto | bustools (4)

AnnData object with n_obs × n_vars = 725 × 17 obs: 'n_counts', 'n_countslog' var: 'feature_name', 'n_counts'

Here are violin plots for each Feature Barcode (antibody-oligo conjugates, x-axis) across all cells.

1
sc.pl.violin(adata, keys=list(adata.var.index)[-17:], xlabel='kallisto')

Python - kallisto | bustools (5)

Cluster with Leiden

1
sc.pp.normalize_per_cell(adata, counts_per_cell_after=10000)
12
sc.pp.neighbors(adata)sc.tl.umap(adata)
1
sc.tl.leiden(adata, resolution=0.05)
1
sc.pl.umap(adata, color='leiden', palette='tab10')

Python - kallisto | bustools (6)

Embedding and Antibody Quantification

1
sc.pl.umap(adata, color=adata.var.index)

Python - kallisto | bustools (7)

1
sc.pl.violin(adata, keys=list(adata.var.index[:2]), groupby='leiden')

Python - kallisto | bustools (8)

1
Python - kallisto | bustools (2024)
Top Articles
Latest Posts
Article information

Author: Madonna Wisozk

Last Updated:

Views: 5490

Rating: 4.8 / 5 (68 voted)

Reviews: 91% of readers found this page helpful

Author information

Name: Madonna Wisozk

Birthday: 2001-02-23

Address: 656 Gerhold Summit, Sidneyberg, FL 78179-2512

Phone: +6742282696652

Job: Customer Banking Liaison

Hobby: Flower arranging, Yo-yoing, Tai chi, Rowing, Macrame, Urban exploration, Knife making

Introduction: My name is Madonna Wisozk, I am a attractive, healthy, thoughtful, faithful, open, vivacious, zany person who loves writing and wants to share my knowledge and understanding with you.