Python - kallisto | bustools (2024)

Python - kallisto | bustools (1)

This Python notebook demonstrates the use of the kallisto and bustools programs for pre-processing single-cell RNA-seq data (also available as an R notebook). It streams in 1 million C. elegans reads, pseudoaligns them, and produces a cells x genes count matrix in about a minute. The notebook then performs some basic QC. It expands on a notebook prepared by Sina Booeshaghi for the Genome Informatics 2019 meeting, where he ran it in under 60 seconds during a 1 minute "lightning talk".

The kallistobus.tools tutorials site has a extensive list of follow-up tutorials and vignettes on single-cell RNA-seq.

123
#@title from IPython.display import HTMLHTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/x-rNofr88BM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

The notebook was written by A. Sina Booeshaghi and Lior Pachter. If you use the methods in this notebook for your analysis please cite the following publication, on which it is based:

  • Melsted, P., Booeshaghi, A.S. et al. Modular and efficient pre-processing of single-cell RNA-seq. bioRxiv (2019). doi:10.1101/673285

Setup

123
# This is used to time the running of the notebookimport timestart_time = time.time()

Install python packages

123456
# These packages are pre-installed on Google Colab, but are included here to simplify running this notebook locally%%capture!pip install matplotlib!pip install scikit-learn!pip install numpy!pip install scipy
 1 2 3 4 5 6 7 8 910
# Install packages for analysis and plottingfrom scipy.io import mmreadfrom sklearn.decomposition import TruncatedSVDimport numpy as npimport matplotlib.pyplot as pltimport matplotlibfrom scipy.sparse import csr_matrixmatplotlib.rcParams.update({'font.size': 22})%config InlineBackend.figure_format = 'retina'
1234
%%time%%capture# `kb` is a wrapper for the kallisto and bustools program, and the kb-python package contains the kallisto and bustools executables.!pip install kb-python==0.24.1
CPU times: user 81.7 ms, sys: 19.2 ms, total: 101 msWall time: 10 s

Download required files

1234567
%%time# The quantification of single-cell RNA-seq with kallisto requires an index. # Indices are species specific and can be generated or downloaded directly with `kb`. # Here we download a pre-made index for C. elegans (the idx.idx file) along with an auxillary file (t2g.txt) # that describes the relationship between transcripts and genes.!wget -O idx.idx https://caltech.box.com/shared/static/82yv415pkbdixhzi55qac1htiaph9ng4.idx!wget -O t2g.txt https://caltech.box.com/shared/static/cflxji16171skf3syzm8scoxkcvbl97x.txt
--2021-07-07 18:38:00-- https://caltech.box.com/shared/static/82yv415pkbdixhzi55qac1htiaph9ng4.idxResolving caltech.box.com (caltech.box.com)... 107.152.25.197Connecting to caltech.box.com (caltech.box.com)|107.152.25.197|:443... connected.HTTP request sent, awaiting response... 301 Moved PermanentlyLocation: /public/static/82yv415pkbdixhzi55qac1htiaph9ng4.idx [following]--2021-07-07 18:38:00-- https://caltech.box.com/public/static/82yv415pkbdixhzi55qac1htiaph9ng4.idxReusing existing connection to caltech.box.com:443.HTTP request sent, awaiting response... 301 Moved PermanentlyLocation: https://caltech.app.box.com/public/static/82yv415pkbdixhzi55qac1htiaph9ng4.idx [following]--2021-07-07 18:38:00-- https://caltech.app.box.com/public/static/82yv415pkbdixhzi55qac1htiaph9ng4.idxResolving caltech.app.box.com (caltech.app.box.com)... 107.152.25.201Connecting to caltech.app.box.com (caltech.app.box.com)|107.152.25.201|:443... connected.HTTP request sent, awaiting response... 302 FoundLocation: https://public.boxcloud.com/d/1/b1!nAdT7HbMqSV2-u8uTTCudCYkwKYLV2tsQJOqE7FGwSMUnjwZZXY9VF2lnh-slpFJJ9ZfKUM1NoetaCvcUemBOKSnqcwcy9lLgSH3GsstN4twPjC45-6Ukm43LjqPm_0dHFEafE28tFmQJ2BnovfBjxKMgbKPDcA09Ec5jpLVqgstxCBldBBA5uxQ6hMq6PtjUvjanuzjLdTdUiI-UAJkcpX_BChWREi3PesYVk8j-wM69vI1tJ3iXYdnMduqXrjep_pa6Pr4j7SIDo4RTSj2jyrcGXE8z77dgTSKg5ZV-JqxO-moGLOappHAiWBsbebHn8dCTuHx9oZlqBauLv0DhvSkKtmSzwqYiwdUaBLDyI0AxbGY96u-lxIWoEVxYQBQQBNdnDR-W76f4WoLiV2qK9zK5veum2KluNaMYdXpg_CI1RbPw9Hb5vE7nubzBc6Fa_XvWRnbtqz1_GehvCwRGNM9Pac8PznO6pJz66nxN0qyd5wDw16ti_iQIDNndeOOlZx78nBLHO_VaS1GZH-bkzpHMrd8K4hha-M-BJ5o5RpsYZcHy8StB8R-kJNkmrddQvkIpuWb2vwl0FqaQOlx2BlWLLINMrXCWyRppGvdAf4pimgdLgzq-To-rpPfXJ89vqXW79fttpe044QAcQTAOFCsLSBqm3KQiwfR2oxnlJe744o0Qp0kRKQgEIANMHRnr9NRBgWTmkZ-vyeT2X0YEJh-3w3HLo8XzeMuz_95a89zf-dPWhkbTGRHlRTftYxgUQTo1kkL0D_5BXGWvnLgHhgRcdKFNW0pjd4Z0i3n4NYBrxIKb6XNnN5kprV_4Q3W9-M1spIWR0r4m-nuLMyjU7_LCtLH8VIEKlT1vBgDSdkb79k_EKdoiO1-DdhJU3WrCTt9h9_VYsg9j3dO5aC5FDDv49TkBOVGs-yYw73bjtcT7r1MPXKh6HCErASPktoijVtCJOf5aoOIRNnXdRlusfVZRCdJh5PQloCBeMgCbtBRznpq-oOvjCK-XcJZIVH8FE5zrFc-3u0qbTUc3HBQ1WTGXZ2NGp_OzaCzkfW0FtNhA-cPit3-b6YGCatwdeCg83O6hUsZ4bpuCSfJIdX6a32_wv9xOdMRzwbEUVuR0DCTJQFkl2H-jb6VHPmnnVTTUwon8q8Csqk0OP5xK53TxRwge9QAM-wojp4KVTaI59IWguqXuFnEH66vv8-c8hlYeoHUkL7dJ-j7bhvyXHdZfZXHGDDBYNSjghPddkvp2Hr856-K5r1muA1qFEZ5eRhPNFqlJIbs3SZTtY8Sis2PiSRZNWT0vZ2oOfALEGZHpzd4q29ieLMI4rwMGQ22099h6Up0JeXERR-t-jMSMCN1_eNMz34amOgsnGpX-J5OlZzDVUB0uJ2Uf3vdtRKaa97OjfzwmtN9NYbPh_u87DYUUsfKh7PVSZzoCU0f8l8NdVeZlA../download [following]--2021-07-07 18:38:00-- https://public.boxcloud.com/d/1/b1!nAdT7HbMqSV2-u8uTTCudCYkwKYLV2tsQJOqE7FGwSMUnjwZZXY9VF2lnh-slpFJJ9ZfKUM1NoetaCvcUemBOKSnqcwcy9lLgSH3GsstN4twPjC45-6Ukm43LjqPm_0dHFEafE28tFmQJ2BnovfBjxKMgbKPDcA09Ec5jpLVqgstxCBldBBA5uxQ6hMq6PtjUvjanuzjLdTdUiI-UAJkcpX_BChWREi3PesYVk8j-wM69vI1tJ3iXYdnMduqXrjep_pa6Pr4j7SIDo4RTSj2jyrcGXE8z77dgTSKg5ZV-JqxO-moGLOappHAiWBsbebHn8dCTuHx9oZlqBauLv0DhvSkKtmSzwqYiwdUaBLDyI0AxbGY96u-lxIWoEVxYQBQQBNdnDR-W76f4WoLiV2qK9zK5veum2KluNaMYdXpg_CI1RbPw9Hb5vE7nubzBc6Fa_XvWRnbtqz1_GehvCwRGNM9Pac8PznO6pJz66nxN0qyd5wDw16ti_iQIDNndeOOlZx78nBLHO_VaS1GZH-bkzpHMrd8K4hha-M-BJ5o5RpsYZcHy8StB8R-kJNkmrddQvkIpuWb2vwl0FqaQOlx2BlWLLINMrXCWyRppGvdAf4pimgdLgzq-To-rpPfXJ89vqXW79fttpe044QAcQTAOFCsLSBqm3KQiwfR2oxnlJe744o0Qp0kRKQgEIANMHRnr9NRBgWTmkZ-vyeT2X0YEJh-3w3HLo8XzeMuz_95a89zf-dPWhkbTGRHlRTftYxgUQTo1kkL0D_5BXGWvnLgHhgRcdKFNW0pjd4Z0i3n4NYBrxIKb6XNnN5kprV_4Q3W9-M1spIWR0r4m-nuLMyjU7_LCtLH8VIEKlT1vBgDSdkb79k_EKdoiO1-DdhJU3WrCTt9h9_VYsg9j3dO5aC5FDDv49TkBOVGs-yYw73bjtcT7r1MPXKh6HCErASPktoijVtCJOf5aoOIRNnXdRlusfVZRCdJh5PQloCBeMgCbtBRznpq-oOvjCK-XcJZIVH8FE5zrFc-3u0qbTUc3HBQ1WTGXZ2NGp_OzaCzkfW0FtNhA-cPit3-b6YGCatwdeCg83O6hUsZ4bpuCSfJIdX6a32_wv9xOdMRzwbEUVuR0DCTJQFkl2H-jb6VHPmnnVTTUwon8q8Csqk0OP5xK53TxRwge9QAM-wojp4KVTaI59IWguqXuFnEH66vv8-c8hlYeoHUkL7dJ-j7bhvyXHdZfZXHGDDBYNSjghPddkvp2Hr856-K5r1muA1qFEZ5eRhPNFqlJIbs3SZTtY8Sis2PiSRZNWT0vZ2oOfALEGZHpzd4q29ieLMI4rwMGQ22099h6Up0JeXERR-t-jMSMCN1_eNMz34amOgsnGpX-J5OlZzDVUB0uJ2Uf3vdtRKaa97OjfzwmtN9NYbPh_u87DYUUsfKh7PVSZzoCU0f8l8NdVeZlA../downloadResolving public.boxcloud.com (public.boxcloud.com)... 107.152.24.200Connecting to public.boxcloud.com (public.boxcloud.com)|107.152.24.200|:443... connected.HTTP request sent, awaiting response... 200 OKLength: 625579580 (597M) [application/octet-stream]Saving to: idx.idxidx.idx 100%[===================>] 596.60M 18.3MB/s in 36s2021-07-07 18:38:37 (16.8 MB/s) - idx.idx saved [625579580/625579580]--2021-07-07 18:38:37-- https://caltech.box.com/shared/static/cflxji16171skf3syzm8scoxkcvbl97x.txtResolving caltech.box.com (caltech.box.com)... 107.152.25.197Connecting to caltech.box.com (caltech.box.com)|107.152.25.197|:443... connected.HTTP request sent, awaiting response... 301 Moved PermanentlyLocation: /public/static/cflxji16171skf3syzm8scoxkcvbl97x.txt [following]--2021-07-07 18:38:37-- https://caltech.box.com/public/static/cflxji16171skf3syzm8scoxkcvbl97x.txtReusing existing connection to caltech.box.com:443.HTTP request sent, awaiting response... 301 Moved PermanentlyLocation: https://caltech.app.box.com/public/static/cflxji16171skf3syzm8scoxkcvbl97x.txt [following]--2021-07-07 18:38:37-- https://caltech.app.box.com/public/static/cflxji16171skf3syzm8scoxkcvbl97x.txtResolving caltech.app.box.com (caltech.app.box.com)... 107.152.25.201Connecting to caltech.app.box.com (caltech.app.box.com)|107.152.25.201|:443... connected.HTTP request sent, awaiting response... 302 FoundLocation: https://public.boxcloud.com/d/1/b1!TwnLlpzLPezE9R8L30b_LgO_uHuxPS8VDrBNDz_zAbovkjbHWuXi8l8DVnDitnwQBxpu9UsNRTJjhHuGtf0lAWqWJ4yCyh1VKOVy2w6Nfcol3DGLho4ynnZFtL3kXIRgm91nRVsXEAe96HSf8PtmkNMQ99kFT74uchBiDtS7wcSZZzwaEMxJbaMhSUchoelmxesP3QeOIKfb--kuYOqovncop33DB8WttbBUNDi-vYBZXMRKn3zoPVcDCYqJxG9llT2DpQ_JIegvNsa0X9rvzrQgqZrzQcsYyByQIWzTE63JxbGd0yUnKrTWN3jzkeXaXBepKUaGgtTNxkC0CKBjyQskBB3zMBhaPv713cFQrW5HoqrccKKRq488DIjfs2mLTs0yRjGkoYakhSIPnxglE9PTNslfKjwJfemxvf7DM7bCDUi4Q8h9jv8ClAbtJ-d5rVq4E65PUqNqv9ofsYn2dSPC9KWWTmaeVTV9BtHm7cHz4sBeG_kNG-Nph72y3XI9sl3cCWBlgfNwVbLg8hO3yFJIjUR4jtN0VvbYjTWux7A-NW-vCsn-bAa7A9SqZnL2b96ipj_NXB9jhC4iZ3zuWX467PR6F0iYGUACbP7ioGTiFM7cD5uVAZGDjvZN-K4JSe3Qq0cknEJpWLyh6vlWP2hN_N0nSlCTMnue3N9_kxsIEzDlq7Q47Qbtbhe4KrnFDWvL_0T13ckBkeQOxBYD7X2sLTJxbghpBjJbYTiO7eSCc4CJoiXIzb0BbeWHSZc9HueW6N_B0bC9HHnGttOqHPGOZiU1-H4MKlMvS7Zz612SJ4rLz0k71piAcSzfKRcwraflYqg4QdXll5byM7rt_YGfKp9uCtq-uh54ie8Nw0AUxLkI38xb18UOXCZT2wa3vL2eCsMwJLjg-yXyuBZ6jVE0DXQrPP7rWbbOTD3UHe_jfSo3S_H_Eor7s8_Z7aR70KBsmd-r4NXwIA7Rt7F1GSHBGE7UxtVgrSQ2zXvI8eQPwIXVGrsnqzu9MGzpt_qH_U0Pj9jRsR7vwajXzDq-YoMWwRU5ad3wX0gQyzA_8B1_AQ-6xtgpA0_WaY5kz01s4w86Dswu6kQO35RzXeKEU43bl9zs_SPvSI9KaOJ4uDAAc7mdKVmeG6mugD78-13YeN1tfDVt15-AWfQijJkBZIfnPzTd-WhKcgomn6MfxGuP9VDEGso5Y1i_sMQfPvxChuKIjoPSbbmqdDyX0ml4lC4nQ9NFAaFTvlsViSvoNaHiQImpD-viVJ3tUtF9L_ojonKM1YVA8Y1um6RUFW7v19TrNaw_kTDoSYygZgcug76QN8DRPwOjtBm0mVU64V7VwJXu3PvWqUn4mmcoIDpTNms_exKJA_p80qHmDJYYvghGMD4KejrtAA3bRiN9UrGLugpsaY338bU./download [following]--2021-07-07 18:38:37-- https://public.boxcloud.com/d/1/b1!TwnLlpzLPezE9R8L30b_LgO_uHuxPS8VDrBNDz_zAbovkjbHWuXi8l8DVnDitnwQBxpu9UsNRTJjhHuGtf0lAWqWJ4yCyh1VKOVy2w6Nfcol3DGLho4ynnZFtL3kXIRgm91nRVsXEAe96HSf8PtmkNMQ99kFT74uchBiDtS7wcSZZzwaEMxJbaMhSUchoelmxesP3QeOIKfb--kuYOqovncop33DB8WttbBUNDi-vYBZXMRKn3zoPVcDCYqJxG9llT2DpQ_JIegvNsa0X9rvzrQgqZrzQcsYyByQIWzTE63JxbGd0yUnKrTWN3jzkeXaXBepKUaGgtTNxkC0CKBjyQskBB3zMBhaPv713cFQrW5HoqrccKKRq488DIjfs2mLTs0yRjGkoYakhSIPnxglE9PTNslfKjwJfemxvf7DM7bCDUi4Q8h9jv8ClAbtJ-d5rVq4E65PUqNqv9ofsYn2dSPC9KWWTmaeVTV9BtHm7cHz4sBeG_kNG-Nph72y3XI9sl3cCWBlgfNwVbLg8hO3yFJIjUR4jtN0VvbYjTWux7A-NW-vCsn-bAa7A9SqZnL2b96ipj_NXB9jhC4iZ3zuWX467PR6F0iYGUACbP7ioGTiFM7cD5uVAZGDjvZN-K4JSe3Qq0cknEJpWLyh6vlWP2hN_N0nSlCTMnue3N9_kxsIEzDlq7Q47Qbtbhe4KrnFDWvL_0T13ckBkeQOxBYD7X2sLTJxbghpBjJbYTiO7eSCc4CJoiXIzb0BbeWHSZc9HueW6N_B0bC9HHnGttOqHPGOZiU1-H4MKlMvS7Zz612SJ4rLz0k71piAcSzfKRcwraflYqg4QdXll5byM7rt_YGfKp9uCtq-uh54ie8Nw0AUxLkI38xb18UOXCZT2wa3vL2eCsMwJLjg-yXyuBZ6jVE0DXQrPP7rWbbOTD3UHe_jfSo3S_H_Eor7s8_Z7aR70KBsmd-r4NXwIA7Rt7F1GSHBGE7UxtVgrSQ2zXvI8eQPwIXVGrsnqzu9MGzpt_qH_U0Pj9jRsR7vwajXzDq-YoMWwRU5ad3wX0gQyzA_8B1_AQ-6xtgpA0_WaY5kz01s4w86Dswu6kQO35RzXeKEU43bl9zs_SPvSI9KaOJ4uDAAc7mdKVmeG6mugD78-13YeN1tfDVt15-AWfQijJkBZIfnPzTd-WhKcgomn6MfxGuP9VDEGso5Y1i_sMQfPvxChuKIjoPSbbmqdDyX0ml4lC4nQ9NFAaFTvlsViSvoNaHiQImpD-viVJ3tUtF9L_ojonKM1YVA8Y1um6RUFW7v19TrNaw_kTDoSYygZgcug76QN8DRPwOjtBm0mVU64V7VwJXu3PvWqUn4mmcoIDpTNms_exKJA_p80qHmDJYYvghGMD4KejrtAA3bRiN9UrGLugpsaY338bU./downloadResolving public.boxcloud.com (public.boxcloud.com)... 107.152.25.200Connecting to public.boxcloud.com (public.boxcloud.com)|107.152.25.200|:443... connected.HTTP request sent, awaiting response... 200 OKLength: 1010392 (987K) [text/plain]Saving to: t2g.txtt2g.txt 100%[===================>] 986.71K --.-KB/s in 0.04s2021-07-07 18:38:38 (25.1 MB/s) - t2g.txt saved [1010392/1010392]CPU times: user 361 ms, sys: 96.2 ms, total: 457 msWall time: 38.2 s

Pseudoalignment and counting

In this notebook we pseudoalign 1 million C. elegans reads and count UMIs to produce a cells x genes matrix. Instead of being downloaded, the reads are streamed directly to the Google Colab notebook for quantification. See this blog post for more details on how the streaming works.

The data consists of a subset of reads from GSE126954 described in the paper:

Run kallisto and bustools

123456
%%time# This step runs `kb` to quantify the reads. `kb` can take as input URLs where the reads are located, and will stream the data # to Google Colab where it is quantified as it is downloaded. This allows for quantifying very large datasets without first # downloading them and saving them to disk. !kb count -i idx.idx -g t2g.txt --overwrite -t 2 -x 10xv2 https://caltech.box.com/shared/static/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz https://caltech.box.com/shared/static/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz
[2021-07-07 18:38:39,095] INFO Piping https://caltech.box.com/shared/static/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz to tmp/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz[2021-07-07 18:38:39,096] INFO Piping https://caltech.box.com/shared/static/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz to tmp/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz[2021-07-07 18:38:39,100] INFO Generating BUS file from[2021-07-07 18:38:39,100] INFO tmp/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz[2021-07-07 18:38:39,100] INFO tmp/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz[2021-07-07 18:38:55,315] INFO Sorting BUS file ./output.bus to tmp/output.s.bus[2021-07-07 18:38:57,856] INFO Whitelist not provided[2021-07-07 18:38:57,856] INFO Copying pre-packaged 10XV2 whitelist to .[2021-07-07 18:38:57,948] INFO Inspecting BUS file tmp/output.s.bus[2021-07-07 18:38:58,400] INFO Correcting BUS records in tmp/output.s.bus to tmp/output.s.c.bus with whitelist ./10xv2_whitelist.txt[2021-07-07 18:39:12,816] INFO Sorting BUS file tmp/output.s.c.bus to ./output.unfiltered.bus[2021-07-07 18:39:15,273] INFO Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file ./output.unfiltered.busCPU times: user 243 ms, sys: 29.9 ms, total: 273 msWall time: 37.4 s

Exercises

  • kb can quantify data that is streamed from a URL as in the example above, or can read in data from disk. Is it faster to stream data, or to download it first and then quantify it from disk?
1234
# %%time# !wget https://caltech.box.com/shared/static/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz # !wget https://caltech.box.com/shared/static/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz# !kb count -i idx.idx -g t2g.txt --overwrite -t 2 -x 10xv2 fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz
  • The -t option in kb sets the numnber of threads to be used. The Google Colab machine you are running on has two threads. If you run this notebook locally you can increase the number of threads beyond 2. As the number of threads is increased the running time decreases proportionately, although eventually the speed at which reads can be loaded from disk is a limiting factor. Verify that running kb with 1 thread on Google Colab takes about twice as long as with 2 threads.
12
# %%time# !kb count -i idx.idx -g t2g.txt --overwrite -t 1 -x 10xv2 https://caltech.box.com/shared/static/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz https://caltech.box.com/shared/static/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz

Basic QC

Represent the cells in 2D

12
# Read in the count matrix that was output by `kb`.mtx = mmread("/content/counts_unfiltered/cells_x_genes.mtx")
1234
# Perform SVDtsvd = TruncatedSVD(n_components=2)tsvd.fit(mtx)X = tsvd.transform(mtx)
1234567
# Plot the cells in the 2D PCA projectionfig, ax = plt.subplots(figsize=(10, 7))ax.scatter(X[:,0], X[:,1], alpha=0.5, c="green")plt.axis('off')plt.show()

Python - kallisto | bustools (2)

While the PCA plot shows the overall structure of the data, a visualization highlighting the density of points reveals a large number of droplets represented in the lower left corner.

 1 2 3 4 5 6 7 8 910111213141516171819202122232425262728293031
# density display for PCA plotfrom scipy.interpolate import interpndef density_scatter( x , y, ax = None, sort = True, bins = 20, **kwargs ) : """ Scatter plot colored by 2d histogram """ if ax is None : fig , ax = plt.subplots() data , x_e, y_e = np.histogram2d( x, y, bins = bins) z = interpn( ( 0.5*(x_e[1:] + x_e[:-1]) , 0.5*(y_e[1:]+y_e[:-1]) ) , data , np.vstack([x,y]).T , method = "splinef2d", bounds_error = False ) # Sort the points by density, so that the densest points are plotted last if sort : idx = z.argsort() x, y, z = x[idx], y[idx], z[idx] sc = ax.scatter( x, y, c=z, **kwargs ) return scfig, ax = plt.subplots(figsize=(7,7))x = X[:,0]y = X[:,1]sc = density_scatter(x, y, ax=ax, cmap="Greens")fig.colorbar(sc, ax=ax)plt.axis('off')plt.show()

Python - kallisto | bustools (3)

The following plot helps clarify the reason for the concentrated points in the lower-left corner of the PCA plot.

12
# Create sparse matrix representation of the count matrixmtx = csr_matrix(mtx)

Test for library saturation

 1 2 3 4 5 6 7 8 91011121314
# Create a plot showing genes detected as a function of UMI counts.fig, ax = plt.subplots(figsize=(10, 7))ax.scatter(np.asarray(mtx.sum(axis=1))[:,0], np.asarray(np.sum(mtx>0, axis=1))[:,0], color="green", alpha=0.01)ax.set_xlabel("UMI Counts")ax.set_ylabel("Genes Detected")ax.set_xscale('log')ax.set_yscale('log', nonposy='clip')ax.set_xlim((0.5, 4500))ax.set_ylim((0.5,2000))plt.show()

Python - kallisto | bustools (4)

Here we see that there are a large number of near empty droplets. A useful approach to filtering out such data is the "knee plot" shown below.

Examine the knee plot

The "knee plot" was introduced in the Drop-seq paper: - Macosko et al., Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets, 2015. DOI:10.1016/j.cell.2015.05.002

In this plot cells are ordered by the number of UMI counts associated to them (shown on the x-axis), and the fraction of droplets with at least that number of cells is shown on the y-axis:

 1 2 3 4 5 6 7 8 91011
# Create the "knee plot"knee = np.sort((np.array(mtx.sum(axis=1))).flatten())[::-1]fig, ax = plt.subplots(figsize=(10, 7))ax.loglog(knee, range(len(knee)),linewidth=5, color="g")ax.set_xlabel("UMI Counts")ax.set_ylabel("Set of Barcodes")plt.grid(True, which="both")plt.show()

Python - kallisto | bustools (5)

1234
# An option is to filter the cells and genes by a threshold# row_mask = np.asarray(mtx.sum(axis=1)>30).reshape(-1)# col_mask = np.asarray(mtx.sum(axis=0)>0).reshape(-1)# mtx_filtered = mtx[row_mask,:][:,col_mask]

Exercises

  • The "knee plot" is sometimes shown with the UMI counts on the y-axis instead of the x-axis, i.e. flipped and rotated 90 degrees. Make the flipped and rotated plot. Is there a reason to prefer one orientation over the other?
 1 2 3 4 5 6 7 8 91011
# # Create the flipped and rotated "knee plot"# knee = np.sort((np.array(mtx.sum(axis=1))).flatten())[::-1]# fig, ax = plt.subplots(figsize=(10, 7))# # ax.loglog(range(len(knee)), knee,linewidth=5, color="g")# # ax.set_xlabel("Set of Barcodes")# ax.set_ylabel("UMI Counts")# # plt.grid(True, which="both")# plt.show()

For more information on this exercise see Rotating the knee (plot) and related yoga.

  • The PCA subspaces form a [flag](https://en.wikipedia.org/wiki/Flag_(linear_algebra). This means, for example, that regardless of the number of dimensions chosen for the PCA dimensionality reduction, the 2D subspace remains the same. Verify this empirically.

  • As you increase the number of dimensions for the PCA reduction, you can also view the relationship between different suspaces. Explore this by changing the subspace dimensions visualized.

 1 2 3 4 5 6 7 8 910111213141516
#@title Exploring PCA subspsaces { run: "auto", vertical-output: true, display-mode: "both" }n_components = 2#@param {type:"integer"}dimension_A = 1#@param {type:"integer"}dimension_B = 2#@param {type:"integer"}# Perform SVDtsvd = TruncatedSVD(n_components)tsvd.fit(mtx)X = tsvd.transform(mtx)# Plot the cells in the 2D PCA projectionfig, ax = plt.subplots(figsize=(10, 7))ax.scatter(X[:,dimension_A-1], X[:,dimension_B-1], alpha=0.5, c="green")plt.axis('off')plt.show()

Python - kallisto | bustools (6)

Discussion

This notebook has demonstrated the pre-processing required for single-cell RNA-seq analysis. kb is used to pseudoalign reads and to generate a cells x genes matrix. Following generation of a matrix, basic QC helps to assess the quality of the data.

12
# Running time of the notebookprint("{:.2f} seconds".format((time.time()-start_time)))
102.33 seconds

Feedback: please report any issues, or submit pull requests for improvements, in the Github repository where this notebook is located.

1
Python - kallisto | bustools (2024)
Top Articles
Latest Posts
Article information

Author: Zonia Mosciski DO

Last Updated:

Views: 5496

Rating: 4 / 5 (71 voted)

Reviews: 94% of readers found this page helpful

Author information

Name: Zonia Mosciski DO

Birthday: 1996-05-16

Address: Suite 228 919 Deana Ford, Lake Meridithberg, NE 60017-4257

Phone: +2613987384138

Job: Chief Retail Officer

Hobby: Tai chi, Dowsing, Poi, Letterboxing, Watching movies, Video gaming, Singing

Introduction: My name is Zonia Mosciski DO, I am a enchanting, joyous, lovely, successful, hilarious, tender, outstanding person who loves writing and wants to share my knowledge and understanding with you.