plink2.0和plink1.9的忧伤笔记

虽然plink2.0已经存在好久了,但是一直用的都是plink1.9,因为语法熟悉。更主要是plink2.0语法变动太大,害怕步子迈得太大了……

今天看一下plink2.0的读入和输出数据常用参数,

plink2.0用是不会用的,2022年都不会用!!!但是碰到bgen,pgen数据进行转化为bed,bim,fam文件,然后用plink1.9使用的想法还是有的,而且很大!!!

本篇目的:使用plink2.0软件将下面格式随便输入、输出

  • plink1.9的ped和map数据,不如:a.ped, a.map
  • plink1.9的bed和bim和fam数据,比如:a.bim, a.bed, a.fam
  • plink2.0的bgen和sample数据,比如:a.bgen, a.sample
  • plink2.0的pgen和bim和fam数据,比如a.pgen,a.fam,a.bim
    • vcf数据,比如a.vcf

1,plink2.0的提升

plink2.0主要是从以下几个方面,相对于plink1.9有较大的提升:

  • 1,保留参考等位基因的信息,比如vcf格式的数据,不要添加参数 –keep-allele-order。这样vcf变为plink,plink变为vcf就可以不用指定ref和alt了,切换无障碍!

  • 2,新的.pgen文件,结合SNPack-style的压缩,可以节约80%的文件大小。比如1000个Genomes,比压缩的gzip文件小70%,且不丢失任何信息。压缩文件空间更小,速度更快。

  • 3,旧版的二进制文件(bed,bim和fam)文件,plink2.0依旧支持,输出文件包括两种:–make-bpgen 和 –make-bpfile文件。可以支持plink1.9的文件格式,无论是map和ped数据,还是bed,bim和fam格式。

  • 4,分析模块,进行了优化。标准的logistic回归分析失败产生NA或者无意义的结果,–glm比plink1.9的–linear速度提升1000倍。尤其是填充的剂量效应的基因型值(比如0.2,1.8这样的非整数型数据)。PCA分析汇总,增加了参数,当样本超过1万,这个参数可以不影响精度(影响不到1%),大大提升计算效率。样本量支持得更多,处理速度更快!

2,plink2.0 安装

plink2.0 网站:https://www.cog-genomics.org/plink/2.0/

二进制文件,直接执行,支持:

  • Intel
  • AMD
  • 苹果M1

plink2.0和plink1.9的忧伤笔记

想查看一下–export的用法,可以看到主要功能:

  • A,是0-1-2编码
  • ped,是map和ped格式
  • vcf,是vcf格式
  • bgen-1.x,包括1.1, 1.2, 1.3,都是bgen格式
$ plink2 --export --helpPLINK v2.00a3.7LM AVX2 Intel (24 Oct 2022)     www.cog-genomics.org/plink/2.0/(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3--help present, ignoring other flags.--export output format(s)...> [{01 | 12}] ['bgz'] ['id-delim='char>] ['id-paste='column set descriptor>] ['include-alt'] ['omit-nonmale-y'] ['spaces'] ['vcf-dosage='field>] ['ref-first'] ['bits='#>] ['sample-v2']  Create a new fileset with all filters applied.  The following output  formats are supported:  (actually, only A, AD, Av, bcf, bgen-1.x, haps, hapslegend, ind-major-bed,  oxford, ped, tped, and vcf are implemented for now)  * '23': 23andMe 4-column format.  This can only be used on a single  sample's data (--keep may be handy), and does not support  multicharacter allele codes.  * 'A': Sample-major additive (0/1/2) coding, suitable for loading from R. If you need uncounted alleles to be named in the header line, add the 'include-alt' modifier.  * 'AD': Sample-major additive (0/1/2) + dominant (het=1/hom=0) coding.  Also supports 'include-alt'.  * 'Av': Variant-major 0/1/2.  * 'beagle': Unphased per-autosome .dat and .map files, readable by early      BEAGLE versions.  * 'beagle-nomap': Single .beagle.dat file.  * 'bgen-1.x': Oxford-format .bgen + .sample.  For v1.2/v1.3, sample identifiers are stored in the .bgen (with id-delim and id-paste settings applied), and default precision is 16-bit (use the 'bits' modifier to reduce this).  * 'bimbam': Regular BIMBAM format.  * 'bimbam-1chr': BIMBAM format, with a two-column .pos.txt file.  Does not    support multiple chromosomes.  * 'fastphase': Per-chromosome fastPHASE files, with  .chr-chr #>.phase.inp filename extensions.  * 'fastphase-1chr': Single .phase.inp file.  Does not support  multiple chromosomes.  * 'haps', 'hapslegend': Oxford-format .haps + .sample[ + .legend].  All   data must be biallelic and phased.  When the 'bgz'   modifier is present, the .haps file is   block-gzipped.  * 'HV': Per-chromosome Haploview files, with .chr-chr #>{.ped,.info}  filename extensions.  * 'HV-1chr': Single Haploview .ped + .info file pair.  Does not supportmultiple chromosomes.  * 'ind-major-bed': PLINK 1 sample-major .bed (+ .bim + .fam).  * 'lgen': PLINK 1 long-format (.lgen + .fam + .map), loadable with --lfile.  * 'lgen-ref': .lgen + .fam + .map + .ref, loadable with --lfile + --reference.  * 'list': Single genotype-based list, up to 4 lines per variant.  To omit    nonmale genotypes on the Y chromosome, add the 'omit-nonmale-y'    modifier.  * 'rlist': .rlist + .fam + .map fileset, where the .rlist file is a      genotype-based list which omits the most common genotype for      each variant.  Also supports 'omit-nonmale-y'.  * 'oxford', 'oxford-v2': Oxford-format .gen + .sample.  When the 'bgz'    modifier is present, the .gen file is    block-gzipped.  'oxford' requests the original    .gen file format with 5 leading columns    (understood by older PLINK builds); 'oxford-v2'    requests the current 6-leading-column flavor.  * 'ped', 'compound-genotypes': PLINK 1 sample-major (.ped + .map),   loadable with --pedmap.  * 'structure': Structure-format.  * 'tped': PLINK 1 variant-major (.tped + .tfam), loadable with --tfile.  * 'vcf',     : VCF (default version 4.3).  If PAR1 and PAR2 are present,    'vcf-4.2',   they are automatically merged with chrX, with proper    'bcf',       handling of chromosome codes and male ploidy.    'bcf-4.2'    When the 'bgz' modifier is present, the VCF file is  block-gzipped.  (This always happens with BCF output.)  The 'id-paste' modifier controls which .psam columns are  used to construct sample IDs (choices are maybefid, fid,  iid, maybesid, and sid; default is maybefid,iid,maybesid),  while the 'id-delim' modifier sets the character between the  ID pieces (default '_').  Genotypes are always exported来源:育种数据分析之放飞自我
                                                        

声明:本站部分文章及图片转载于互联网,内容版权归原作者所有,如本站任何资料有侵权请您尽早请联系jinwei@zod.com.cn进行处理,非常感谢!

上一篇 2022年11月1日
下一篇 2022年11月1日

相关推荐