The first step in the bioinformatics pipeline after completing a sequencing run is to convert the raw Illumina output to FASTQ files. We do that using the program **bcl2fastq**, which is pre-installed on the Helix cluster.

First, we have to make a sample sheet that lists the indices and the corresponding sample names. We use the indices provided with the Curio kit (Illumina Dual Indexing). The sequences are listed in the Curio Seeker protocol. However, the protocol lists both primers in the forward direction. To demultiplex properly, you have to reverse-complement the reverse primer.

For example, we used primers F1 and R1 for the gar.

Primer F1: TAAGGCGA
Primer R1: TATCCTCT

Our sample sheet (saved as SampleSheet.csv in the main folder from the sequencing output) is as follows:

[Data]
Lane,Sample_ID,index,index2
1,LepOcu8,TAAGGCGA,AGAGGATA
2,LepOcu8,TAAGGCGA,AGAGGATA
3,LepOcu8,TAAGGCGA,AGAGGATA
4,LepOcu8,TAAGGCGA,AGAGGATA

The bcl2fastq script, which I run from the main sequencing directory for the run, is as follows:


#!/bin/bash
#SBATCH --partition=single
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --time=06:00:00
#SBATCH --mem=80gb

module load bio/bcl2fastq/2.20

cd /mnt/sds-hd/sd17d003/NextSeq/240425_NB551333_0120_AHTMNLBGXV

bcl2fastq -p 24 -w 24 --no-lane-splitting --sample-sheet SampleSheet.csv
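
The index2 value in the sample sheet (AGAGGATA) is the reverse complement of primer R1 (TATCCTCT). A minimal sketch for deriving it on the command line follows; the function name `revcomp` is made up for illustration, and it assumes the `rev` utility (part of util-linux) is available on the machine.

```shell
# Hypothetical helper: reverse-complement a DNA sequence.
# Reverses the string with rev, then swaps A<->T and C<->G with tr.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}
```

Running `revcomp TATCCTCT` prints AGAGGATA, which is the value used in the index2 column.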

--------------------

When using non-model focal species, you need to generate a custom genome reference to map to. You do this using the genome FASTA and GTF files. The 9th column in the GTF file must include "gene_name" attributes. You can test whether these are present using pygtftk tools:

''gtftk count_key_values GCF_000242695.1_LepOcu1_genomic_gene_name.gtf''

For some GTF annotation files, the gene names are present but the field name is different; e.g., the gene name might instead be stored as "gene". If this is the case, you can simply use a text-editing tool like sed to rename the attribute:

''sed -i 's/gene \"/gene_name \"/g' GCF_019279795.1_PAN1.0_genomic_modified.gtf''

To run the Curio Seeker pipeline, I have been using curioseeker v3.0, the latest release as of July 2024, which I downloaded from the company portal (login required): https://knowledgebase.curiobioscience.com

I ran into issues with the pipeline on larger datasets. Specifically, it ran out of memory during the samtools sort step (sorting the BAM output from mapping the reads back to the genome with STAR). To fix this, I needed to modify two of the scripts.

The first script is stored at ''~/curioseeker-v3.0.0/modules/nf-core/samtools/sort/main.nf''. For this script, I changed line 26 by adding the -m argument (limiting the memory per thread to 1 GB):

''samtools sort $args -@ $task.cpus -m 1G -o ${prefix}.bam -T $prefix $bam''

The second script is stored at ''~/curioseeker-v3.0.0/modules/local/star/align.nf''. For this script, I added an argument to the STAR command starting at line 54 to limit the memory available for BAM sorting to < 64 GB (--limitBAMsortRAM 63209696236). This chunk of the script now reads as follows:

''
STAR \\
    $args \\
    --runThreadN $task.cpus \\
    --readFilesCommand zcat \\
    $limitOutSJcollapsed \\
    --genomeDir ${read2[2]} \\
    --readFilesIn ${read2[0]} \\
    --outFileNamePrefix $prefix. \\
    --sjdbGTFfile ${read2[1]} \\
    --limitBAMsortRAM 63209696236 \\
    $out_sam_type
''

After these modifications, I ran the curioseeker pipeline from the piggeldy server with the following code:

''
#!/bin/bash
conda activate curio-seeker
CURIO_DIR=/work/kgendreau/sds/sd17d003/VerteBrain/Curioseeker/Lepisosteus-oculatus
cd /work/kgendreau/curioseeker-v3.0.0
nohup nice -n 3 nextflow run main.nf \
    --input $CURIO_DIR/samplesheet_gar_piggeldy.csv \
    --outdir /work/kgendreau/sds/sd17d003/VerteBrain/Curioseeker/LepOcu8_piggdely_results/ \
    -work-dir /work/kgendreau/curio-tmp/ \
    --igenomes_base $CURIO_DIR/Reference-Genomes \
    -resume \
    -profile singularity \
    2>&1 &
''

The input sample sheet was as follows:

''
sample,experiment_date,barcode_file,fastq_1,fastq_2,genome,star_index,gtf
Locu8,2024-04-25,/work/kgendreau/sds/sd17d003/VerteBrain/Curioseeker/Lepisosteus-oculatus/B0103_002_BeadBarcodes.txt,/work/kgendreau/sds/sd17d003/VerteBrain/Curioseeker/Fastqs/Lepisosteus-oculatus/LepOcu8_S1_R1_001_combined.fastq.gz,/work/kgendreau/sds/sd17d003/VerteBrain/Curioseeker/Fastqs/Lepisosteus-oculatus/LepOcu8_S1_R2_001_combined.fastq.gz,LepOcu1,/work/kgendreau/sds/sd17d003/VerteBrain/Curioseeker/Reference-Genomes/LepOcu1/Sequence/STARIndex,/work/kgendreau/sds/sd17d003/VerteBrain/Curioseeker/Reference-Genomes/LepOcu1/Annotation/Genes/genes_edit_reduce_2.gtf
''
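
Because a mistyped path in this sample sheet only surfaces after the pipeline has started, it can be worth checking the sheet before launching. The following is a rough sketch, not part of curioseeker: the function name `check_samplesheet` is made up, and it assumes the eight-column layout shown above, with no commas inside any field. It reports paths that do not exist and GTF files that contain no "gene_name" attribute.

```shell
# Hypothetical pre-flight check for a Curio Seeker sample sheet.
# Assumed column order:
#   sample,experiment_date,barcode_file,fastq_1,fastq_2,genome,star_index,gtf
check_samplesheet() {
    sheet="$1"
    status=0
    {
        # Discard the header row.
        read -r _header
        # Read one sample per line; the "|| [ -n ... ]" part also handles
        # a final line that lacks a trailing newline.
        while IFS=, read -r sample date barcodes fq1 fq2 genome star gtf \
              || [ -n "$sample" ]; do
            [ -z "$sample" ] && continue
            # Every referenced file or directory must exist.
            for path in "$barcodes" "$fq1" "$fq2" "$star" "$gtf"; do
                if [ ! -e "$path" ]; then
                    echo "MISSING ($sample): $path"
                    status=1
                fi
            done
            # The GTF must carry gene_name attributes in column 9.
            if [ -f "$gtf" ] && ! grep -q 'gene_name "' "$gtf"; then
                echo "NO gene_name ($sample): $gtf"
                status=1
            fi
        done
    } < "$sheet"
    return "$status"
}
```

For example, `check_samplesheet $CURIO_DIR/samplesheet_gar_piggeldy.csv && echo "sample sheet OK"` prints the confirmation only if every path exists and the GTF has gene names.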