Skip to content

Minimap2 crashes for Wheat Scaffolding #1303

@neupanebpn63

Description

@neupanebpn63

We performed high-fidelity (HiFi) whole genome sequencing of two wheat cultivars, Madsen and Pritchett, using the PacBio Revio Circular Consensus Sequencing (CCS) platform. The high-accuracy long reads were first assembled into contigs using Hifiasm. Post-assembly, we conducted quality control and completeness assessments using tools such as BUSCO and Gfastats. For downstream scaffolding, we employed RagTag using the high-quality genome of the wheat cultivar ‘Attraktion’ as the reference assembly.

However, I’m facing challenges with my reference-guided scaffolding project using RagTag and could use your insights. Madsen and Pritchett has nearly identical BUSCO scores (C: 99.7% [S: 2.0%, D: 97.7%], F: 0.2%, M: 0.1%, n: 4896, E: 0.4%). Madsen has 4424 contigs, and Pritchett has 2754, both assembled with Hifiasm. The genomes are about 14Gb big.

I successfully scaffolded Madsen using RagTag, but Pritchett consistently fails with the same SLURM script and pipeline. For Pritchett, the job runs for ~7 days, reports as “completed,” but produces no ragtag.scaffold.fasta. The ragtag.scaffold.asm.paf.log is not complete and gets terminated at same point everytime.

Error says:
Traceback (most recent call last):
File “/home/…/bin/ragtag_scaffold.py”, line 577, in
main()
File “/home/…/bin/ragtag_scaffold.py”, line 420, in main
al.run_aligner()
File “/home/…/BPN/lib/python3.10/site-packages/ragtag_utilities/Aligner.py”, line 128, in run_aligner
run_oe(self.compile_command(), self.out_file, self.out_log)
File “/home/…/lib/python3.10/site-packages/ragtag_utilities/utilities.py”, line 73, in run_oe
raise RuntimeError(“Failed : minimap2 -x asm5 -t 24 … > ragtag.scaffold.asm.paf 2> ragtag.scaffold.asm.paf.log”)

The Slurm Job I gave was:

#SBATCH --partition=abc
#SBATCH --cpus-per-task=24
#SBATCH --mem=1500000
#SBATCH --time=14-00:00:00
ragtag.py scaffold “$REF” “$QUERY” -o “$OUT” -t 24 -u

Troubleshooting Steps:
Ran minimap2 manually on Pritchett’s reference (attraktion.fasta) and query (pt2_busco.fa); it generated a 442 MB .paf file in ~21 hours. Came to know that RagTag does not use pregenerated paf file.

Tested RagTag on a Pritchett subset (~409 Mbp, 10 contigs); it succeeded in ~10 hours, placing 9/10 sequences (~402 Mbp).

Someone suggested that with large genomes, minimap2 might struggle due to multi-indexing issues that can slow things down or cause memory overload. They recommended indexing the reference with minimap2 using -I 20G (which should be suitable for wheat) and then passing the prebuilt .mmi index directly to RagTag as if it were a FASTA file. I followed this approach — created the .mmi file and used it in RagTag — but unfortunately, it still didn’t resolve the issue with Pritchett.

Used SLURM settings: bigmem, 24 CPUs, 1.5 TB memory, 14-day limit, BPN environment (RagTag v2.1.0)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions