![Python - kallisto | bustools (1) Python - kallisto | bustools (1)](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
This Python notebook demonstrates the use of the kallisto and bustools programs for pre-processing single-cell RNA-seq data (also available as an R notebook). It streams in 1 million C. elegans reads, pseudoaligns them, and produces a cells x genes count matrix in about a minute. The notebook then performs some basic QC. It expands on a notebook prepared by Sina Booeshaghi for the Genome Informatics 2019 meeting, where he ran it in under 60 seconds during a 1 minute "lightning talk".
The kallistobus.tools tutorials site has a extensive list of follow-up tutorials and vignettes on single-cell RNA-seq.
| #@title from IPython.display import HTMLHTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/x-rNofr88BM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')
|
The notebook was written by A. Sina Booeshaghi and Lior Pachter. If you use the methods in this notebook for your analysis please cite the following publication, on which it is based:
- Melsted, P., Booeshaghi, A.S. et al. Modular and efficient pre-processing of single-cell RNA-seq. bioRxiv (2019). doi:10.1101/673285
Setup¶
| # This is used to time the running of the notebookimport timestart_time = time.time()
|
Install python packages¶
| # These packages are pre-installed on Google Colab, but are included here to simplify running this notebook locally%%capture!pip install matplotlib!pip install scikit-learn!pip install numpy!pip install scipy
|
| # Install packages for analysis and plottingfrom scipy.io import mmreadfrom sklearn.decomposition import TruncatedSVDimport numpy as npimport matplotlib.pyplot as pltimport matplotlibfrom scipy.sparse import csr_matrixmatplotlib.rcParams.update({'font.size': 22})%config InlineBackend.figure_format = 'retina'
|
| %%time%%capture# `kb` is a wrapper for the kallisto and bustools program, and the kb-python package contains the kallisto and bustools executables.!pip install kb-python==0.24.1
|
CPU times: user 81.7 ms, sys: 19.2 ms, total: 101 msWall time: 10 s
Download required files¶
| %%time# The quantification of single-cell RNA-seq with kallisto requires an index. # Indices are species specific and can be generated or downloaded directly with `kb`. # Here we download a pre-made index for C. elegans (the idx.idx file) along with an auxillary file (t2g.txt) # that describes the relationship between transcripts and genes.!wget -O idx.idx https://caltech.box.com/shared/static/82yv415pkbdixhzi55qac1htiaph9ng4.idx!wget -O t2g.txt https://caltech.box.com/shared/static/cflxji16171skf3syzm8scoxkcvbl97x.txt
|
--2021-07-07 18:38:00-- https://caltech.box.com/shared/static/82yv415pkbdixhzi55qac1htiaph9ng4.idxResolving caltech.box.com (caltech.box.com)... 107.152.25.197Connecting to caltech.box.com (caltech.box.com)|107.152.25.197|:443... connected.HTTP request sent, awaiting response... 301 Moved PermanentlyLocation: /public/static/82yv415pkbdixhzi55qac1htiaph9ng4.idx [following]--2021-07-07 18:38:00-- https://caltech.box.com/public/static/82yv415pkbdixhzi55qac1htiaph9ng4.idxReusing existing connection to caltech.box.com:443.HTTP request sent, awaiting response... 301 Moved PermanentlyLocation: https://caltech.app.box.com/public/static/82yv415pkbdixhzi55qac1htiaph9ng4.idx [following]--2021-07-07 18:38:00-- https://caltech.app.box.com/public/static/82yv415pkbdixhzi55qac1htiaph9ng4.idxResolving caltech.app.box.com (caltech.app.box.com)... 107.152.25.201Connecting to caltech.app.box.com (caltech.app.box.com)|107.152.25.201|:443... connected.HTTP request sent, awaiting response... 302 FoundLocation: https://public.boxcloud.com/d/1/b1!nAdT7HbMqSV2-u8uTTCudCYkwKYLV2tsQJOqE7FGwSMUnjwZZXY9VF2lnh-slpFJJ9ZfKUM1NoetaCvcUemBOKSnqcwcy9lLgSH3GsstN4twPjC45-6Ukm43LjqPm_0dHFEafE28tFmQJ2BnovfBjxKMgbKPDcA09Ec5jpLVqgstxCBldBBA5uxQ6hMq6PtjUvjanuzjLdTdUiI-UAJkcpX_BChWREi3PesYVk8j-wM69vI1tJ3iXYdnMduqXrjep_pa6Pr4j7SIDo4RTSj2jyrcGXE8z77dgTSKg5ZV-JqxO-moGLOappHAiWBsbebHn8dCTuHx9oZlqBauLv0DhvSkKtmSzwqYiwdUaBLDyI0AxbGY96u-lxIWoEVxYQBQQBNdnDR-W76f4WoLiV2qK9zK5veum2KluNaMYdXpg_CI1RbPw9Hb5vE7nubzBc6Fa_XvWRnbtqz1_GehvCwRGNM9Pac8PznO6pJz66nxN0qyd5wDw16ti_iQIDNndeOOlZx78nBLHO_VaS1GZH-bkzpHMrd8K4hha-M-BJ5o5RpsYZcHy8StB8R-kJNkmrddQvkIpuWb2vwl0FqaQOlx2BlWLLINMrXCWyRppGvdAf4pimgdLgzq-To-rpPfXJ89vqXW79fttpe044QAcQTAOFCsLSBqm3KQiwfR2oxnlJe744o0Qp0kRKQgEIANMHRnr9NRBgWTmkZ-vyeT2X0YEJh-3w3HLo8XzeMuz_95a89zf-dPWhkbTGRHlRTftYxgUQTo1kkL0D_5BXGWvnLgHhgRcdKFNW0pjd4Z0i3n4NYBrxIKb6XNnN5kprV_4Q3W9-M1spIWR0r4m-nuLMyjU7_LCtLH8VIEKlT1vBgDSdkb79k_EKdoiO1-DdhJU3WrCTt9h9_VYsg9j3dO5aC5FDDv49TkBOVGs-yYw73bjtcT7r1MPXKh6HCErASPktoijVtCJOf5aoOIRNnXdRlusfVZRCdJh5PQloCBeMgCbtBRznpq-oOvjCK-XcJZIVH8FE5zrFc-3u0qbTUc3HBQ1WTGXZ2NGp_OzaCzkfW0FtNhA-cPit3-b6YGCatwdeCg83O6hUsZ4bpuCSfJIdX6a32_wv9xOdMRzwbEUVuR0DCTJQFkl2H-jb6VHPmnnVTTUwon8q8Csqk0OP5xK53TxRwge9QAM-wojp4KVTaI59IWguqXuFnEH66vv8-c8hlYeoHUkL7dJ-j7bhvyXHdZfZXHGDDBYNSjghPddkvp2Hr856-K5r1muA1qFEZ5eRhPNFqlJIbs3SZTtY8Sis2PiSRZNWT0vZ2oOfALEGZHpzd4q29ieLMI4rwMGQ22099h6Up0JeXERR-t-jMSMCN1_eNMz34amOgsnGpX-J5OlZzDVUB0uJ2Uf3vdtRKaa97OjfzwmtN9NYbPh_u87DYUUsfKh7PVSZzoCU0f8l8NdVeZlA../download [following]--2021-07-07 18:38:00-- https://public.boxcloud.com/d/1/b1!nAdT7HbMqSV2-u8uTTCudCYkwKYLV2tsQJOqE7FGwSMUnjwZZXY9VF2lnh-slpFJJ9ZfKUM1NoetaCvcUemBOKSnqcwcy9lLgSH3GsstN4twPjC45-6Ukm43LjqPm_0dHFEafE28tFmQJ2BnovfBjxKMgbKPDcA09Ec5jpLVqgstxCBldBBA5uxQ6hMq6PtjUvjanuzjLdTdUiI-UAJkcpX_BChWREi3PesYVk8j-wM69vI1tJ3iXYdnMduqXrjep_pa6Pr4j7SIDo4RTSj2jyrcGXE8z77dgTSKg5ZV-JqxO-moGLOappHAiWBsbebHn8dCTuHx9oZlqBauLv0DhvSkKtmSzwqYiwdUaBLDyI0AxbGY96u-lxIWoEVxYQBQQBNdnDR-W76f4WoLiV2qK9zK5veum2KluNaMYdXpg_CI1RbPw9Hb5vE7nubzBc6Fa_XvWRnbtqz1_GehvCwRGNM9Pac8PznO6pJz66nxN0qyd5wDw16ti_iQIDNndeOOlZx78nBLHO_VaS1GZH-bkzpHMrd8K4hha-M-BJ5o5RpsYZcHy8StB8R-kJNkmrddQvkIpuWb2vwl0FqaQOlx2BlWLLINMrXCWyRppGvdAf4pimgdLgzq-To-rpPfXJ89vqXW79fttpe044QAcQTAOFCsLSBqm3KQiwfR2oxnlJe744o0Qp0kRKQgEIANMHRnr9NRBgWTmkZ-vyeT2X0YEJh-3w3HLo8XzeMuz_95a89zf-dPWhkbTGRHlRTftYxgUQTo1kkL0D_5BXGWvnLgHhgRcdKFNW0pjd4Z0i3n4NYBrxIKb6XNnN5kprV_4Q3W9-M1spIWR0r4m-nuLMyjU7_LCtLH8VIEKlT1vBgDSdkb79k_EKdoiO1-DdhJU3WrCTt9h9_VYsg9j3dO5aC5FDDv49TkBOVGs-yYw73bjtcT7r1MPXKh6HCErASPktoijVtCJOf5aoOIRNnXdRlusfVZRCdJh5PQloCBeMgCbtBRznpq-oOvjCK-XcJZIVH8FE5zrFc-3u0qbTUc3HBQ1WTGXZ2NGp_OzaCzkfW0FtNhA-cPit3-b6YGCatwdeCg83O6hUsZ4bpuCSfJIdX6a32_wv9xOdMRzwbEUVuR0DCTJQFkl2H-jb6VHPmnnVTTUwon8q8Csqk0OP5xK53TxRwge9QAM-wojp4KVTaI59IWguqXuFnEH66vv8-c8hlYeoHUkL7dJ-j7bhvyXHdZfZXHGDDBYNSjghPddkvp2Hr856-K5r1muA1qFEZ5eRhPNFqlJIbs3SZTtY8Sis2PiSRZNWT0vZ2oOfALEGZHpzd4q29ieLMI4rwMGQ22099h6Up0JeXERR-t-jMSMCN1_eNMz34amOgsnGpX-J5OlZzDVUB0uJ2Uf3vdtRKaa97OjfzwmtN9NYbPh_u87DYUUsfKh7PVSZzoCU0f8l8NdVeZlA../downloadResolving public.boxcloud.com (public.boxcloud.com)... 107.152.24.200Connecting to public.boxcloud.com (public.boxcloud.com)|107.152.24.200|:443... connected.HTTP request sent, awaiting response... 200 OKLength: 625579580 (597M) [application/octet-stream]Saving to: ‘idx.idx’idx.idx 100%[===================>] 596.60M 18.3MB/s in 36s2021-07-07 18:38:37 (16.8 MB/s) - ‘idx.idx’ saved [625579580/625579580]--2021-07-07 18:38:37-- https://caltech.box.com/shared/static/cflxji16171skf3syzm8scoxkcvbl97x.txtResolving caltech.box.com (caltech.box.com)... 107.152.25.197Connecting to caltech.box.com (caltech.box.com)|107.152.25.197|:443... connected.HTTP request sent, awaiting response... 301 Moved PermanentlyLocation: /public/static/cflxji16171skf3syzm8scoxkcvbl97x.txt [following]--2021-07-07 18:38:37-- https://caltech.box.com/public/static/cflxji16171skf3syzm8scoxkcvbl97x.txtReusing existing connection to caltech.box.com:443.HTTP request sent, awaiting response... 301 Moved PermanentlyLocation: https://caltech.app.box.com/public/static/cflxji16171skf3syzm8scoxkcvbl97x.txt [following]--2021-07-07 18:38:37-- https://caltech.app.box.com/public/static/cflxji16171skf3syzm8scoxkcvbl97x.txtResolving caltech.app.box.com (caltech.app.box.com)... 107.152.25.201Connecting to caltech.app.box.com (caltech.app.box.com)|107.152.25.201|:443... connected.HTTP request sent, awaiting response... 302 FoundLocation: https://public.boxcloud.com/d/1/b1!TwnLlpzLPezE9R8L30b_LgO_uHuxPS8VDrBNDz_zAbovkjbHWuXi8l8DVnDitnwQBxpu9UsNRTJjhHuGtf0lAWqWJ4yCyh1VKOVy2w6Nfcol3DGLho4ynnZFtL3kXIRgm91nRVsXEAe96HSf8PtmkNMQ99kFT74uchBiDtS7wcSZZzwaEMxJbaMhSUchoelmxesP3QeOIKfb--kuYOqovncop33DB8WttbBUNDi-vYBZXMRKn3zoPVcDCYqJxG9llT2DpQ_JIegvNsa0X9rvzrQgqZrzQcsYyByQIWzTE63JxbGd0yUnKrTWN3jzkeXaXBepKUaGgtTNxkC0CKBjyQskBB3zMBhaPv713cFQrW5HoqrccKKRq488DIjfs2mLTs0yRjGkoYakhSIPnxglE9PTNslfKjwJfemxvf7DM7bCDUi4Q8h9jv8ClAbtJ-d5rVq4E65PUqNqv9ofsYn2dSPC9KWWTmaeVTV9BtHm7cHz4sBeG_kNG-Nph72y3XI9sl3cCWBlgfNwVbLg8hO3yFJIjUR4jtN0VvbYjTWux7A-NW-vCsn-bAa7A9SqZnL2b96ipj_NXB9jhC4iZ3zuWX467PR6F0iYGUACbP7ioGTiFM7cD5uVAZGDjvZN-K4JSe3Qq0cknEJpWLyh6vlWP2hN_N0nSlCTMnue3N9_kxsIEzDlq7Q47Qbtbhe4KrnFDWvL_0T13ckBkeQOxBYD7X2sLTJxbghpBjJbYTiO7eSCc4CJoiXIzb0BbeWHSZc9HueW6N_B0bC9HHnGttOqHPGOZiU1-H4MKlMvS7Zz612SJ4rLz0k71piAcSzfKRcwraflYqg4QdXll5byM7rt_YGfKp9uCtq-uh54ie8Nw0AUxLkI38xb18UOXCZT2wa3vL2eCsMwJLjg-yXyuBZ6jVE0DXQrPP7rWbbOTD3UHe_jfSo3S_H_Eor7s8_Z7aR70KBsmd-r4NXwIA7Rt7F1GSHBGE7UxtVgrSQ2zXvI8eQPwIXVGrsnqzu9MGzpt_qH_U0Pj9jRsR7vwajXzDq-YoMWwRU5ad3wX0gQyzA_8B1_AQ-6xtgpA0_WaY5kz01s4w86Dswu6kQO35RzXeKEU43bl9zs_SPvSI9KaOJ4uDAAc7mdKVmeG6mugD78-13YeN1tfDVt15-AWfQijJkBZIfnPzTd-WhKcgomn6MfxGuP9VDEGso5Y1i_sMQfPvxChuKIjoPSbbmqdDyX0ml4lC4nQ9NFAaFTvlsViSvoNaHiQImpD-viVJ3tUtF9L_ojonKM1YVA8Y1um6RUFW7v19TrNaw_kTDoSYygZgcug76QN8DRPwOjtBm0mVU64V7VwJXu3PvWqUn4mmcoIDpTNms_exKJA_p80qHmDJYYvghGMD4KejrtAA3bRiN9UrGLugpsaY338bU./download [following]--2021-07-07 18:38:37-- https://public.boxcloud.com/d/1/b1!TwnLlpzLPezE9R8L30b_LgO_uHuxPS8VDrBNDz_zAbovkjbHWuXi8l8DVnDitnwQBxpu9UsNRTJjhHuGtf0lAWqWJ4yCyh1VKOVy2w6Nfcol3DGLho4ynnZFtL3kXIRgm91nRVsXEAe96HSf8PtmkNMQ99kFT74uchBiDtS7wcSZZzwaEMxJbaMhSUchoelmxesP3QeOIKfb--kuYOqovncop33DB8WttbBUNDi-vYBZXMRKn3zoPVcDCYqJxG9llT2DpQ_JIegvNsa0X9rvzrQgqZrzQcsYyByQIWzTE63JxbGd0yUnKrTWN3jzkeXaXBepKUaGgtTNxkC0CKBjyQskBB3zMBhaPv713cFQrW5HoqrccKKRq488DIjfs2mLTs0yRjGkoYakhSIPnxglE9PTNslfKjwJfemxvf7DM7bCDUi4Q8h9jv8ClAbtJ-d5rVq4E65PUqNqv9ofsYn2dSPC9KWWTmaeVTV9BtHm7cHz4sBeG_kNG-Nph72y3XI9sl3cCWBlgfNwVbLg8hO3yFJIjUR4jtN0VvbYjTWux7A-NW-vCsn-bAa7A9SqZnL2b96ipj_NXB9jhC4iZ3zuWX467PR6F0iYGUACbP7ioGTiFM7cD5uVAZGDjvZN-K4JSe3Qq0cknEJpWLyh6vlWP2hN_N0nSlCTMnue3N9_kxsIEzDlq7Q47Qbtbhe4KrnFDWvL_0T13ckBkeQOxBYD7X2sLTJxbghpBjJbYTiO7eSCc4CJoiXIzb0BbeWHSZc9HueW6N_B0bC9HHnGttOqHPGOZiU1-H4MKlMvS7Zz612SJ4rLz0k71piAcSzfKRcwraflYqg4QdXll5byM7rt_YGfKp9uCtq-uh54ie8Nw0AUxLkI38xb18UOXCZT2wa3vL2eCsMwJLjg-yXyuBZ6jVE0DXQrPP7rWbbOTD3UHe_jfSo3S_H_Eor7s8_Z7aR70KBsmd-r4NXwIA7Rt7F1GSHBGE7UxtVgrSQ2zXvI8eQPwIXVGrsnqzu9MGzpt_qH_U0Pj9jRsR7vwajXzDq-YoMWwRU5ad3wX0gQyzA_8B1_AQ-6xtgpA0_WaY5kz01s4w86Dswu6kQO35RzXeKEU43bl9zs_SPvSI9KaOJ4uDAAc7mdKVmeG6mugD78-13YeN1tfDVt15-AWfQijJkBZIfnPzTd-WhKcgomn6MfxGuP9VDEGso5Y1i_sMQfPvxChuKIjoPSbbmqdDyX0ml4lC4nQ9NFAaFTvlsViSvoNaHiQImpD-viVJ3tUtF9L_ojonKM1YVA8Y1um6RUFW7v19TrNaw_kTDoSYygZgcug76QN8DRPwOjtBm0mVU64V7VwJXu3PvWqUn4mmcoIDpTNms_exKJA_p80qHmDJYYvghGMD4KejrtAA3bRiN9UrGLugpsaY338bU./downloadResolving public.boxcloud.com (public.boxcloud.com)... 107.152.25.200Connecting to public.boxcloud.com (public.boxcloud.com)|107.152.25.200|:443... connected.HTTP request sent, awaiting response... 200 OKLength: 1010392 (987K) [text/plain]Saving to: ‘t2g.txt’t2g.txt 100%[===================>] 986.71K --.-KB/s in 0.04s2021-07-07 18:38:38 (25.1 MB/s) - ‘t2g.txt’ saved [1010392/1010392]CPU times: user 361 ms, sys: 96.2 ms, total: 457 msWall time: 38.2 s
Pseudoalignment and counting¶
In this notebook we pseudoalign 1 million C. elegans reads and count UMIs to produce a cells x genes matrix. Instead of being downloaded, the reads are streamed directly to the Google Colab notebook for quantification. See this blog post for more details on how the streaming works.
The data consists of a subset of reads from GSE126954 described in the paper:
Run kallisto and bustools¶
| %%time# This step runs `kb` to quantify the reads. `kb` can take as input URLs where the reads are located, and will stream the data # to Google Colab where it is quantified as it is downloaded. This allows for quantifying very large datasets without first # downloading them and saving them to disk. !kb count -i idx.idx -g t2g.txt --overwrite -t 2 -x 10xv2 https://caltech.box.com/shared/static/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz https://caltech.box.com/shared/static/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz
|
[2021-07-07 18:38:39,095] INFO Piping https://caltech.box.com/shared/static/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz to tmp/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz[2021-07-07 18:38:39,096] INFO Piping https://caltech.box.com/shared/static/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz to tmp/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz[2021-07-07 18:38:39,100] INFO Generating BUS file from[2021-07-07 18:38:39,100] INFO tmp/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz[2021-07-07 18:38:39,100] INFO tmp/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz[2021-07-07 18:38:55,315] INFO Sorting BUS file ./output.bus to tmp/output.s.bus[2021-07-07 18:38:57,856] INFO Whitelist not provided[2021-07-07 18:38:57,856] INFO Copying pre-packaged 10XV2 whitelist to .[2021-07-07 18:38:57,948] INFO Inspecting BUS file tmp/output.s.bus[2021-07-07 18:38:58,400] INFO Correcting BUS records in tmp/output.s.bus to tmp/output.s.c.bus with whitelist ./10xv2_whitelist.txt[2021-07-07 18:39:12,816] INFO Sorting BUS file tmp/output.s.c.bus to ./output.unfiltered.bus[2021-07-07 18:39:15,273] INFO Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file ./output.unfiltered.busCPU times: user 243 ms, sys: 29.9 ms, total: 273 msWall time: 37.4 s
kb
can quantify data that is streamed from a URL as in the example above, or can read in data from disk. Is it faster to stream data, or to download it first and then quantify it from disk?
| # %%time# !wget https://caltech.box.com/shared/static/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz # !wget https://caltech.box.com/shared/static/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz# !kb count -i idx.idx -g t2g.txt --overwrite -t 2 -x 10xv2 fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz
|
- The -t option in
kb
sets the numnber of threads to be used. The Google Colab machine you are running on has two threads. If you run this notebook locally you can increase the number of threads beyond 2. As the number of threads is increased the running time decreases proportionately, although eventually the speed at which reads can be loaded from disk is a limiting factor. Verify that running kb
with 1 thread on Google Colab takes about twice as long as with 2 threads.
| # %%time# !kb count -i idx.idx -g t2g.txt --overwrite -t 1 -x 10xv2 https://caltech.box.com/shared/static/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz https://caltech.box.com/shared/static/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz
|
Basic QC¶
Represent the cells in 2D¶
| # Read in the count matrix that was output by `kb`.mtx = mmread("/content/counts_unfiltered/cells_x_genes.mtx")
|
| # Perform SVDtsvd = TruncatedSVD(n_components=2)tsvd.fit(mtx)X = tsvd.transform(mtx)
|
| # Plot the cells in the 2D PCA projectionfig, ax = plt.subplots(figsize=(10, 7))ax.scatter(X[:,0], X[:,1], alpha=0.5, c="green")plt.axis('off')plt.show()
|
![Python - kallisto | bustools (2) Python - kallisto | bustools (2)](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
While the PCA plot shows the overall structure of the data, a visualization highlighting the density of points reveals a large number of droplets represented in the lower left corner.
1 2 3 4 5 6 7 8 910111213141516171819202122232425262728293031 | # density display for PCA plotfrom scipy.interpolate import interpndef density_scatter( x , y, ax = None, sort = True, bins = 20, **kwargs ) : """ Scatter plot colored by 2d histogram """ if ax is None : fig , ax = plt.subplots() data , x_e, y_e = np.histogram2d( x, y, bins = bins) z = interpn( ( 0.5*(x_e[1:] + x_e[:-1]) , 0.5*(y_e[1:]+y_e[:-1]) ) , data , np.vstack([x,y]).T , method = "splinef2d", bounds_error = False ) # Sort the points by density, so that the densest points are plotted last if sort : idx = z.argsort() x, y, z = x[idx], y[idx], z[idx] sc = ax.scatter( x, y, c=z, **kwargs ) return scfig, ax = plt.subplots(figsize=(7,7))x = X[:,0]y = X[:,1]sc = density_scatter(x, y, ax=ax, cmap="Greens")fig.colorbar(sc, ax=ax)plt.axis('off')plt.show()
|
![Python - kallisto | bustools (3) Python - kallisto | bustools (3)](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
The following plot helps clarify the reason for the concentrated points in the lower-left corner of the PCA plot.
| # Create sparse matrix representation of the count matrixmtx = csr_matrix(mtx)
|
Test for library saturation¶
1 2 3 4 5 6 7 8 91011121314 | # Create a plot showing genes detected as a function of UMI counts.fig, ax = plt.subplots(figsize=(10, 7))ax.scatter(np.asarray(mtx.sum(axis=1))[:,0], np.asarray(np.sum(mtx>0, axis=1))[:,0], color="green", alpha=0.01)ax.set_xlabel("UMI Counts")ax.set_ylabel("Genes Detected")ax.set_xscale('log')ax.set_yscale('log', nonposy='clip')ax.set_xlim((0.5, 4500))ax.set_ylim((0.5,2000))plt.show()
|
![Python - kallisto | bustools (4) Python - kallisto | bustools (4)](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Here we see that there are a large number of near empty droplets. A useful approach to filtering out such data is the "knee plot" shown below.
Examine the knee plot¶
The "knee plot" was introduced in the Drop-seq paper: - Macosko et al., Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets, 2015. DOI:10.1016/j.cell.2015.05.002
In this plot cells are ordered by the number of UMI counts associated to them (shown on the x-axis), and the fraction of droplets with at least that number of cells is shown on the y-axis:
| # Create the "knee plot"knee = np.sort((np.array(mtx.sum(axis=1))).flatten())[::-1]fig, ax = plt.subplots(figsize=(10, 7))ax.loglog(knee, range(len(knee)),linewidth=5, color="g")ax.set_xlabel("UMI Counts")ax.set_ylabel("Set of Barcodes")plt.grid(True, which="both")plt.show()
|
![Python - kallisto | bustools (5) Python - kallisto | bustools (5)](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
| # An option is to filter the cells and genes by a threshold# row_mask = np.asarray(mtx.sum(axis=1)>30).reshape(-1)# col_mask = np.asarray(mtx.sum(axis=0)>0).reshape(-1)# mtx_filtered = mtx[row_mask,:][:,col_mask]
|
Exercises¶
- The "knee plot" is sometimes shown with the UMI counts on the y-axis instead of the x-axis, i.e. flipped and rotated 90 degrees. Make the flipped and rotated plot. Is there a reason to prefer one orientation over the other?
| # # Create the flipped and rotated "knee plot"# knee = np.sort((np.array(mtx.sum(axis=1))).flatten())[::-1]# fig, ax = plt.subplots(figsize=(10, 7))# # ax.loglog(range(len(knee)), knee,linewidth=5, color="g")# # ax.set_xlabel("Set of Barcodes")# ax.set_ylabel("UMI Counts")# # plt.grid(True, which="both")# plt.show()
|
For more information on this exercise see Rotating the knee (plot) and related yoga.
The PCA subspaces form a [flag](https://en.wikipedia.org/wiki/Flag_(linear_algebra). This means, for example, that regardless of the number of dimensions chosen for the PCA dimensionality reduction, the 2D subspace remains the same. Verify this empirically.
As you increase the number of dimensions for the PCA reduction, you can also view the relationship between different suspaces. Explore this by changing the subspace dimensions visualized.
1 2 3 4 5 6 7 8 910111213141516 | #@title Exploring PCA subspsaces { run: "auto", vertical-output: true, display-mode: "both" }n_components = 2#@param {type:"integer"}dimension_A = 1#@param {type:"integer"}dimension_B = 2#@param {type:"integer"}# Perform SVDtsvd = TruncatedSVD(n_components)tsvd.fit(mtx)X = tsvd.transform(mtx)# Plot the cells in the 2D PCA projectionfig, ax = plt.subplots(figsize=(10, 7))ax.scatter(X[:,dimension_A-1], X[:,dimension_B-1], alpha=0.5, c="green")plt.axis('off')plt.show()
|
![Python - kallisto | bustools (6) Python - kallisto | bustools (6)](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Discussion¶
This notebook has demonstrated the pre-processing required for single-cell RNA-seq analysis. kb
is used to pseudoalign reads and to generate a cells x genes matrix. Following generation of a matrix, basic QC helps to assess the quality of the data.
| # Running time of the notebookprint("{:.2f} seconds".format((time.time()-start_time)))
|
Feedback: please report any issues, or submit pull requests for improvements, in the Github repository where this notebook is located.