This page is a copy of research/scientific_computing/project/genomics/cnF2freq (Wed, 31 Aug 2022 15:01:11)
cnF2freq
cnF2freq (originally same-Chromosome N-loci F2 FREQuencies) is our experimental codebase for analyzing genotype data with a known pedigree structure in different ways. The code has been used to compute line-origin probabilities in outbred and inbred F2 lines, determine phasing (haplotypes) of markers in different pedigree structures, (re)compute sex-specific marker distances based on the Haldane mapping function, and compute the most probable genotype assignments for missing markers as a form of genotype imputation in pedigrees.
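For readers unfamiliar with the Haldane mapping function mentioned above, the following is a minimal sketch of the standard formula relating map distance to recombination fraction. The function names are our own illustrative choices and are not taken from the cnF2freq source; sex-specific maps would simply use a separate distance per sex.

```python
import math


def haldane_recombination_fraction(distance_morgans):
    """Recombination fraction r for a map distance d (in Morgans),
    using Haldane's function r = (1 - exp(-2 d)) / 2."""
    if distance_morgans < 0.0:
        raise ValueError("map distance must be non-negative")
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))


def haldane_map_distance(recombination_fraction):
    """Inverse mapping: d = -ln(1 - 2 r) / 2, defined for 0 <= r < 0.5."""
    if not 0.0 <= recombination_fraction < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * recombination_fraction)
```

Short distances give r close to d, while r approaches 0.5 (free recombination) as the distance grows.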
All versions of the code are available under a BSD-style license, making it freely available for commercial as well as non-commercial use. We naturally expect use in academic contexts to be accompanied by the proper references to the original work, though.
Unique features
The most crucial distinctive feature of cnF2freq is that a large pedigree is split into "focus pedigrees", each consisting of a single individual and one or two generations of its ancestors.
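As a rough illustration of the idea, a focus pedigree can be extracted from a parent lookup table as sketched below. This is only a sketch under our own assumptions: the `parents` dictionary layout and the function name are hypothetical and do not reflect the data structures used in the actual code.

```python
def focus_pedigree(individual, parents, generations=2):
    """Collect an individual plus its ancestors up to `generations` levels.

    `parents` maps an individual ID to a (father, mother) tuple; founders
    are simply absent from the mapping (hypothetical layout).
    """
    members = {individual}
    frontier = [individual]
    for _ in range(generations):
        next_frontier = []
        for ind in frontier:
            for parent in parents.get(ind, ()):
                # Skip unknown parents and ancestors already included
                # (the latter can happen with inbreeding).
                if parent is not None and parent not in members:
                    members.add(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return members


# Hypothetical three-generation pedigree with one extra, older generation.
example_parents = {
    "F2_1": ("F1_a", "F1_b"),
    "F1_a": ("P1", "P2"),
    "F1_b": ("P3", "P4"),
    "P1": ("GP1", "GP2"),
}
example_focus = focus_pedigree("F2_1", example_parents, generations=2)
```

Here `example_focus` contains F2_1, its parents and its grandparents, but not GP1 or GP2, which lie three generations back.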
Between focus pedigrees, all per-marker and per-individual parameters are shared. The main parameters are the skewness (phase) and sureness (probability of allele error). Other phasing schemes tend to treat phase as a binary variable, doing some kind of Markov sampling to explore different assignments. In our approach, we initialize the phase to 0.5 and then iteratively update it using a modified Baum-Welch algorithm, a standard expectation-maximization approach for Hidden Markov Models. This has been shown to give superior results in complex pedigrees.
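To make the idea of a continuous phase parameter concrete, here is a deliberately simplified toy, not the modified Baum-Welch algorithm used in cnF2freq. It assumes a single parent heterozygous at every marker, offspring whose transmitted alleles are observed without error, and a fixed recombination fraction `r` between adjacent markers. The skewness of the first marker is pinned to 1 to break the symmetry between the two haplotype labels; that anchoring, and all names below, are our own assumptions.

```python
def posteriors(obs, skew, r):
    """Forward-backward pass for one offspring.

    obs[m] in {0, 1} is the allele transmitted at marker m; the hidden
    state h in {0, 1} is which parental haplotype was inherited there.
    skew[m] is the probability that allele 0 sits on haplotype 0.
    Returns, per marker, [P(h=0 | obs), P(h=1 | obs)].
    """
    n = len(obs)
    trans = ((1.0 - r, r), (r, 1.0 - r))

    def emit(m, h):
        # Allele index equal to haplotype index is "concordant".
        return skew[m] if obs[m] == h else 1.0 - skew[m]

    def normalized(v):
        total = v[0] + v[1]
        return [v[0] / total, v[1] / total]

    fwd = [normalized([0.5 * emit(0, h) for h in (0, 1)])]
    for m in range(1, n):
        prev = fwd[-1]
        fwd.append(normalized([
            emit(m, h) * (prev[0] * trans[0][h] + prev[1] * trans[1][h])
            for h in (0, 1)]))

    bwd = [[1.0, 1.0] for _ in range(n)]
    for m in range(n - 2, -1, -1):
        nxt = bwd[m + 1]
        bwd[m] = normalized([
            sum(trans[g][h] * emit(m + 1, h) * nxt[h] for h in (0, 1))
            for g in (0, 1)])

    return [normalized([fwd[m][h] * bwd[m][h] for h in (0, 1)])
            for m in range(n)]


def em_phase(offspring, r, iterations=50):
    """EM updates of the per-marker skewness, starting from 0.5."""
    n = len(offspring[0])
    skew = [1.0] + [0.5] * (n - 1)  # marker 0 anchored (assumption)
    for _ in range(iterations):
        concordant = [0.0] * n
        for obs in offspring:
            post = posteriors(obs, skew, r)
            for m in range(n):
                # Expected count of "allele index == haplotype index".
                concordant[m] += post[m][obs[m]]
        skew = [1.0] + [c / len(offspring) for c in concordant[1:]]
    return skew


# Toy data: eight non-recombinant offspring and one recombinant.
offspring = [[0, 1, 0]] * 4 + [[1, 0, 1]] * 4 + [[0, 1, 1]]
skew = em_phase(offspring, r=0.1)
```

In this toy data set the second marker is in the opposite phase to the first, so `skew[1]` moves from 0.5 toward 0, while `skew[2]` moves toward 1 without ever being forced to a hard binary assignment.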

Available versions of the code
The code exists in several editions tailored for different datasets and experiments. A specific fork for only computing line genotype probabilities is in the process of being released as an R software package. The actively maintained branches are available on GitHub. The suggested branch for current use is plantimpute_modern.
The edition adapted for the 14th QTL-MAS workshop is available here (haplotyping with known genotypes in a multi-generational pedigree; support for recomputing marker maps has been tested but is disabled in this specific version; parallelized with OpenMP as well as MPI).
The edition adapted for the 15th QTL-MAS workshop is available here (haplotyping and genotype reconstruction with parental genotype data purposefully removed; includes code for comparing results against those from Merlin; parallelization with MPI is not enabled).
A recent release of the Boost library and a fairly recent C++ compiler are required. More recent branches of the code require C++14 support. The Intel C++ compiler is our main platform for large runs (for performance reasons), so that one always tends to work. Please contact Carl Nettelblad regarding specific use cases or any issues. Again, plantimpute_modern is in general the current stable branch. Future work (including great speedups) is found in the experimental Chaplink branch.
Publications
The following publications relate directly to this codebase:
- Imputation of single nucleotide polymorphism genotypes in biparental, backcross, and topcross populations with a hidden Markov model. In Crop Science, volume 55, pp 1934-1946, 2015.
- MAPfastR: Quantitative trait loci mapping in outbred line crosses. In G3: Genes, Genomes, Genetics, volume 3, pp 2147-2149, 2013.
- Inferring haplotypes and parental genotypes in larger full sib-ships and other pedigrees with missing or erroneous genotype data. In BMC Genetics, volume 13, pp 85:1-13, 2012.
- An improved method for estimating chromosomal line origin in QTL analysis of crosses between outbred lines. In G3: Genes, Genomes, Genetics, volume 1, pp 57-64, 2011.
- Haplotype inference based on hidden Markov models in the QTL–MAS 2010 multigenerational dataset. In Proc. 14th European Workshop on QTL Mapping and Marker Assisted Selection, volume 5:3 of BMC Proceedings, pp S10:1-7, BioMed Central, London, 2011.
- cnF2freq: Efficient determination of genotype and haplotype probabilities in outbred populations using Markov models. In Bioinformatics and Computational Biology, volume 5462 of Lecture Notes in Computer Science, pp 307-319, Springer-Verlag, Berlin, 2009.