Single Embryo RNA-Sequencing Analysis

▻ Single embryo RNA-Sequencing

▻ Mapping sequencing reads to the smed_20140614 transcriptome

▻ RPKM normalization

▻ Identification of differentially expressed transcripts

▻ GO Analysis

▻ Download Figure 1 – source data 1

▻ References

Single embryo RNA-Sequencing

Live embryos were dissected out of egg capsules in 1x Holfreter’s buffer (3.5 g/L NaCl; 0.2 g/L NaHCO₃; 0.05 g/L KCl; 0.2 g/L MgSO₄; 0.1 g/L CaCl₂; 1.0 g/L dextrose, pH 7.0-7.5) for S2-S7 egg capsules, or 1x Montjuic water for S8 hatchlings. Yolk (Y) samples were obtained from 8 d egg capsules that contained neither spherical nor elongating embryos. Single embryos were imaged on a Leica M205 FA dissecting microscope, transferred into microfuge tubes containing 200 µl TRIzol reagent (Thermo Fisher, item #15596-018), homogenized by pipetting, and stored at -80˚C. Single animal samples of intact C4 adults and virgin, sexually mature adults were homogenized in 1.0 mL TRIzol using an IKA Ultra Turrax T 25 Basic tissue disruptor prior to storage at -80˚C. Total RNA extraction was performed in 1.0 mL TRIzol per sample according to the manufacturer’s protocol, following the recommendations for working with small amounts of tissue. Pellets were resuspended in 25 µl nuclease free water, and 5 µl aliquots were reserved for quality control testing. Total RNA concentration anbv d integrity were determined using Agilent Bioanalyzer 2100 Expert Total RNA Nano or Pico chips (Agilent Technologies, items # 5067-1511 and 5067-1513). Total RNA samples were prepared for ten biological replicates per time point, and total RNA quality and yield were considered along with embryo size and morphology when selecting samples for library construction.

PolyA-selected, single stranded RNA-Sequencing libraries were prepared for four biological replicates per stage using the Illumina TruSeq RNA Sample V2 kit (item # RS-122-2001 and RS-122-2002), starting with 500 ng total RNA per sample (C4, virgin sexual adult (SX), Y, S4, S5, S6, S7, S8), or 100 ng total RNA per sample (S2, S3). Library concentrations and insert sizes were determined using Agilent Bioanalyzer DNA 1000 chips (Agilent Technologies, item # 5067-1504), and diluted, pooled samples were reanalyzed with Agilent Bioanalyzer 2100 DNA High Sensitivity chips (Agilent Technologies, item # 5067-4626). Nine barcoded samples, one replicate per time point (S2-S8, Y, C4) were pooled and sequenced per flow cell lane. Single end, 50 bp reads were acquired on an Illumina Hi-Seq 2000 sequencer. Illumina Primary Analysis version RTA 1.13.48.0 and Secondary Analysis version CASAVA-1.8.2 were run to demultiplex reads and generate FASTQ files. Barcoded SX replicates were pooled and run on a single flow cell lane of a HiSeq 2500, and Illumina Primary Analysis version RTA 1.17.21.3 and Secondary Analysis version CASAVA-1.8.2 were used. The RNA-Seq data has been deposited in the GEO database under the accession number: GSE82280.

Mapping sequencing reads to the smed_20140614 transcriptome

Sequencing reads were mapped to the smed_20140614 reference transcriptome (n=36,035 transcripts) (Tu et al., 2015), which contains sequencing data from de novo Trinity assemblies from the C4 and sexual biotypes, staged embryo collections, sorted cycling (X1) cells, and previously published sources (Adler et al., 2014); the Dresden transcript collection at PlanMine (http://planmine.mpi-cbg.de)). Transcripts were consolidated and reduced to a unique set using the CD-HIT program (Fu et al., 2012). smed_20140614 sequences may be downloaded from http://smedgd.stowers.org. Reads were mapped using the Bowtie algorithm, Version 1.0.0, (Langmead et al., 2009), allowing for two mismatches and up to 5 multi-matches (--best --strata -v 2 -m 5). Read counts for transcripts were tabulated from SAM files using a custom script. Of 36,035 transcripts, 32,000 accumulated ≥ 1 CPM across all 40 samples. (40 samples). Samples were each sequenced to an average depth of 19 million reads, and exhibited an average map rate of 89% to the transcriptome.

Correlation matrix for single embryo sequencing replicates.
Total transcripts with a row sum > 1 CPM: 31,248.
Y: yolk. 2: S2. 3: S3. 4: S4. 5: S5. 6: S6. 7: S7. 8: S8.

Principal component analysis demonstrates clustering of replicates and separation of developmental time points in expression space.
Y: yolk. 2: S2. 3: S3. 4: S4. 5: S5. 6: S6. 7: S7. 8: S8.

RPKM normalization

RPKM normalization parameters
A: EdgeR normalization factors for RNA-Sequencing replicates.
B: Comparisons of total (gold), mapped (blue) and ribosome depleted (green) read counts for single embryo RNA Sequencing replicates.
A-B: Y: yolk. S2-S8: Stages 2-8. C4: asexual CIW4 adult. SX: virgin sexual adult.

RPKM (Reads Per Kilobase per Million) values were scaled using TMM normalization (Robinson et al., 2010) in edgeR to account for read depth across samples. In addition 16s ribosomal RNA transcripts (SMED30032887), which soak up a significant but variable fraction of reads per sample, were removed prior to calculating RPKM values.

Identification of differentially expressed transcripts

Figure 1 - figure supplement 2: MA plots for pairwise comparisons. A: S2 vs Y. B: S2 vs S3. C: S3 vs S4. D: S4 vs S5. E: S5 vs S6. F: S6 vs S7. G: S7 vs S8. Upregulated transcripts: red. Downregulated transcripts: green. Criteria for flagging differentially expressed transcripts: adjusted p-value < 1e-5, log2 ratio ≥ 2.322 (5-fold upregulation) or log2 ratio ≤ -2.322 (5 fold downregulation), average scaled RPKM value ≥ 1.0 at indicated time point, transcript has ≥ 1 ORF. Transcripts derived from transposase or retroviruses were omitted. S2-S8: Stages 2-8. Y: Yolk.

Figure 1 - figure supplement 3: MA plots for enriched transcripts identified in mixed stage reference comparisons. A-G: S2 (A), S3 (B), S4(C), S5 (D), S6 (E), S7 (F) and S8 (G) enriched transcripts (red) identified using the following criteria: S2-5 enriched transcripts: p adj < 1e-5. S6-8 enriched transcripts: p adj < 1e-20. All stages: log2 ratio ≥ 2.322 (5-fold upregulation), average scaled RPKM value ≥ 1.0 at indicated time point, transcript has≥ 1 ORF. Transcripts derived from transposase or retroviruses were omitted, as were S2, S3 and S4 transcripts also upregulated in yolk. S2-S8: Stages 2-8.

Differential gene expression was evaluated using the edgeR library (Robinson et al., 2010), and adjusted P-values were calculated as described in (Benjamini, 1995). Pairwise comparisons were performed between adjacent time points using edgeR: Y vs S2, S2 vs S3, S3 vs S4, S4 vs S5, S5 vs S6, S6 vs S7 and S7 vs S8. Mapped data was filtered to remove transcripts with less than a sum of 1 CPM across all 32 samples, resulting in 30,766 transcripts. The maximum read sum across samples for omitted transcripts was 14. In addition, transcripts for the 28S (SMED30027845), 18S (SMED30032663) and 16S (SMED30032887) ribosomal subunits were removed. Differentially expressed genes were identified in mixed stage reference comparisons using the GLM approach in edgeR to contrast each treatment group (i.e., developmental stage) to the average of the remaining groups (Y, S2-S8). Non-redundant lists of enriched transcripts from the pairwise and mixed stage reference comparisons were generated for S2-S8 (Figure 1 – source data 1).

GO Analysis

Gene Ontology (GO) terms (Gene Ontology, 2015) terms were assigned to smed_20140614 transcripts based on homologous PFAM domains (Finn et al., 2016) and significant Swiss-Prot hits (E-value ≤ 0.001), (UniProt, 2015). GO term enrichment queries were performed using the R software package topGO, version 2.20.0 (Alexa and Rahnenfuhrer 2010). GO Analysis was performed on the non-redundant lists of enriched transcripts for S2-S8 (Figure 1 – source data 1). Categories containing similar and/or related Biological Process (BP) GO ids enriched at one or more time point(s) were generated manually (Figure 1 – source data 1). Enriched BP GO ids selected for categorization had Benjamini-Hochberg corrected P-values ≤ 1e-10 (Benjamini, 1995), and must have been associated with ≥ 1% of the enriched transcripts for the developmental stage(s) in question. BP GO ids were only assigned to one category. BP GO ids that did not describe a cell and/or tissue type present in Smed (e.g., heart, lung, neural crest) were omitted. Using these categories as a guide, lists of enriched BP GO ids and non-redundant lists of associated transcripts were generated for S2-S8 (Figure 1 – source data 2-8 downloadable links available on individual stage pages). Transcripts may appear in more than one BP GO id category, just as transcripts may be associated with more than one GO term.

Download Figure 1 – source data 1

Figure 1 – source data 1: Molecular Staging Resource for Smed Embryogenesis

Tab 1 (Staging Overview): An overview of the Molecular Staging Resource materials for Stages (S) S1-S8: developmental time interval (days post-egg capsule deposition, 20˚C); names of single embryo RNA Sequencing replicates; references to representative images (live brightfield and histological cross-sections); number of enriched transcripts; references to Figure 1 supplement files containing mean centered expression and raw RPKM profiles across embryogenesis for enriched transcripts; references to Figure 1 source data files (excel spreadsheets) containing enriched transcripts, organized by cluster membership, with RPKM profiles across development, BLASTx-based annotations, GO analysis, and short written descriptions of each stage (S2-S8). Additional tabs are included for: 1) Pairwise comparison overview. 2) Mixed stage reference overview. 3) Lists of enriched transcripts for S2-S8, compiled from both the pairwise and mixed stage reference analysis. 4) GO triage criteria. 5) Categories and lists of biological process (BP) GO IDs, manually curated from the statistically significant hits for S2-S8 enriched transcripts. 6) Summary table containing the number and percentage of enriched transcripts (S2-S8) assigned to BP GO ID categories.

References

Adler, C.E., Seidel, C.W., McKinney, S.A., and Sánchez Alvarado, A. (2014). Selective amputation of the pharynx identifies a FoxA-dependent regeneration program in planaria. Elife 3, e02238.

Benjamini Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57, 289-300.

Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A. The Pfam protein families database: towards a more sustainable future. Nucleic acids research. 2016 Jan 04; 44(D1):D279-85.

Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152.

Gene Ontology, C. (2015). Gene Ontology Consortium: going forward. Nucleic Acids Res 43, D1049-1056.

Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25.

Robinson, M.D., McCarthy, D.J., and Smyth, G.K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140.

Tu, K.C., Cheng, L.C., Tk Vu, H., Lange, J.J., McKinney, S.A., Seidel, C.W., and Sánchez Alvarado, A. (2015). egr-5 is a post-mitotic regulator of planarian epidermal differentiation. Elife 4, e10501.

UniProt, C. (2015). UniProt: a hub for protein information. Nucleic Acids Res 43, D204-212.

Owlery

blazegraph-runner