hdb cd53d00105 test hai 10 meses
..
1.txt cd53d00105 test hai 10 meses
KEGG.sh 6c107c513e first commit hai 10 meses
NR.sh 6c107c513e first commit hai 10 meses
PPI.sh 6c107c513e first commit hai 10 meses
Pathway.sh 6c107c513e first commit hai 10 meses
Readme.md 6c107c513e first commit hai 10 meses
Swissprot.sh 6c107c513e first commit hai 10 meses
TF.sh 6c107c513e first commit hai 10 meses
TF_Family.R 6c107c513e first commit hai 10 meses
abstract_lncRNA_gtf.pl 6c107c513e first commit hai 10 meses
bc 6c107c513e first commit hai 10 meses
blast_uniprot.sh 6c107c513e first commit hai 10 meses
create_conf.sh 6c107c513e first commit hai 10 meses
dat2tsv.pl 6c107c513e first commit hai 10 meses
downloan_ensembl_species.sh 6c107c513e first commit hai 10 meses
eggNOG_GO.sh 6c107c513e first commit hai 10 meses
ensembl.sh 6c107c513e first commit hai 10 meses
ensembl_gbk2tsv.pl 6c107c513e first commit hai 10 meses
extract_fa_by_list.pl 6c107c513e first commit hai 10 meses
geneGoSummary.pl 6c107c513e first commit hai 10 meses
genome_conf.ini 6c107c513e first commit hai 10 meses
get_longseq.pl 6c107c513e first commit hai 10 meses
gtf2id_table.pl 6c107c513e first commit hai 10 meses
kegg_org_species_taxid.txt 6c107c513e first commit hai 10 meses
main.sh 6c107c513e first commit hai 10 meses
main.test.sh 6c107c513e first commit hai 10 meses
merge_anno.v1.pl 6c107c513e first commit hai 10 meses
ncbi.sh 6c107c513e first commit hai 10 meses
pid2gid.awk 6c107c513e first commit hai 10 meses
protein2geneid.pl 6c107c513e first commit hai 10 meses
taxidToGeneNames.pl 6c107c513e first commit hai 10 meses
taxid_division.txt 6c107c513e first commit hai 10 meses
term2gene.R 6c107c513e first commit hai 10 meses
uniprot_dat2tsv.pl 6c107c513e first commit hai 10 meses
uniprot_info.py 6c107c513e first commit hai 10 meses
uniq_ppi_pair.awk 6c107c513e first commit hai 10 meses
基因组整理构想.sh 6c107c513e first commit hai 10 meses

Readme.md

真核有参参考基因组整理

  • 参考基因组目录命名规范:
  • 只包含字母数字. _
  • 有数据库:拉丁名_数据库名称_版本 例:mus_musculus_ensembl_111 Homo_sapiens_NCBI_GCF_000001405.40_GRCh38.p14
  • 无数据库(客户提供): 拉丁名_合同号_客户姓名全拼: 例: Pseudomonas_fluorescens_PN20240113005_liuyang

非常规gff纠错和矫正为ensembl格式gtf

  • 先修复gff为标准的gff3格式;然后使用genpred转为gtf

gff 转为 gtf

gff3ToGenePred $gff genpred
genePredToGtf file genpred example.gtf
# 下面这一步会自动加上gene_biotype "protein_coding";请根据gff文件里是否有基因类型的信息;去处理
perl /Business/psn_company/t04/hudabang/bin/yaogu/tools/add_gene_type_to_gtf.pl -i example.gtf -o  example.add_gene.gtf
# 注意这个里面是否有 gene_biotype 信息;没有的话需要对有cds的基因加上 gene_biotype "protein_coding"
sed '/\tgene\t/s/$/ gene_biotype "protein_coding";/' example.add_gene.gtf

方法一:使用工具AGAT

source /PERSONALBIO/work/transcriptome/t00/software/biosoftware/miniconda3/bin/activate AGAT_gff
export PERL5LIB=/PERSONALBIO/work/transcriptome/t00/software/biosoftware/miniconda3/envs/pfam_scan/share/pfam_scan-1.6-4
# 修复gff 缺少基因或mRNA;cds没有对应的外显子 
agat_convert_sp_gxf2gxf.pl  -g example.gff -o example.fixed.gff3
# gff3 转gtf ;参考上面

方法二:手动处理;具体问题具体分析

  • 常见问题:
  • gene cds exon mRNA 类型不全
  • CDS feature 第8列 不为 0 1 2
  • Parent和ID相同
  • 正负链为. 或者同一个基因下出现不同的 + -
  • 出现不同位置;相同的基因id
  • 。。。待补充

参考基因组整理自动化流程

/Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/main.sh -h

Usage: /Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/main.sh [-h|--help] [-i|--indir <indir>] [-o|--outdir <outdir>] [-c|--config <config>] [-n|--cpu <cpu>] [-M|--memory <memory>] [-v|--version] [-a|--allgene]
Options:
  -h, --help            Show this help message and exit.
  -i, --indir           Input directory containing the files that in config file
  -o, --outdir          Output directory
  -c, --config          Config file for the pipeline.
  -n, --cpu             Number of CPUs in bsub;default: 6
  -M, --memory          Memory limit  in bsub;default: 50G
  -v, --version         Show the version of the pipeline and exit.
  -a, --allgene         Include all genes, not just protein-coding genes.

#示例路径
/Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/example/mus_musculus_ensembl_111
#示例命令
/Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/main.sh -i /Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/example/mus_musculus_ensembl_111 -o /Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/example/mus_musculus_ensembl_111/genome -c  /Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/example/mus_musculus_ensembl_111/genome_conf.ini

参数说明

  • -i 输入路径;下面包含需要整理基因组用到的文件;基因组序列和基因结构注释gtf; 蛋白序列或cds序列(序列title必须是gtf里的基因id);其他功能注释文件等
  • -o 输出目录;
  • -c 基因组配置文件
  • -n 分析过程中比对任务投递到集群的 -n
  • -M 分析过程中比对任务投递到集群的 -M ;默认50G;因为eggnogmap 至少需要50G
  • -v 流程版本
  • -a 整理所有基因(医学需要所有基因)还是protein_coding(真核有参mRNA);默认 protein_coding

配置文件 genome_conf.ini 介绍

#详细解释见 :/Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/genome_conf.ini
latin_name=Mus_musculus                                             物种拉丁名
taxid=10090                                                         物种taxid
url=https://ftp.ensembl.org/pub/release-111/fasta/mus_musculus/dna/ 物种数据库下载链接
version=111                                                         数据库上对应物种版本
dna_fasta=Mus_musculus.GRCm39.dna.toplevel.fa                       基因组序列
gtf=Mus_musculus.GRCm39.111.gtf                                     基因组结构注释gtf
fasta=pep.fa                                                        蛋白序列或cds序列(序列title必须是gtf里的基因id)
fasta_type=proteins                                                 序列类型:参数 proteins 或 CDS
id_table=id_table.txt                                               gtf里的基因id locus_tag old_locus_tag entry_id 蛋白id 基因类型
kindom=animal                                                       物种大类:参数: animal plant fungi
org=mmu                                                             物种kegg三字符; 没有则写物种大类
go_file=GO.txt                                                      GO注释文件
kegg_file=none                                                      KEGG注释文件
eggnog_file=none                                                    eggnog注释文件
swissprot_file=Swissprot.txt                                        swissprot注释文件
nr_file=NR.txt                                                      NR注释文件
uniprot_tsv=none                                                    uniprot 本物种的 GO注释
uniprot_fasta=none                                                  uniprot 本物种的 蛋白序列

参考基因组整理分步流程

参考基因组需要整理的文件 refCfg.ini 说明

  • DNA=DNA/Mus_musculus.GRCm39.dna.primary_assembly.fa 基因组序列和索引
  • RefDatabase_info=DNA/RefDatabase_info 基因组信息
  • chr_list=DNA/chr_list 染色体
  • hrds=DNA/hrds.txt hrds
  • cytoBand=DNA/cytoband.txt 圈图使用的染色体信息
  • GTF=GTF/Mus_musculus.GRCm39.108.gtf 结构注释gtf
  • GTFbed=GTF/GTF.bed 结构注释bed格式
  • hisat2SS=GTF/genome.ss 比对和可变剪切使用
  • lncRNA_gtf=GTF/lncRNA.gtf lncRNA结构注释gtf
  • DEXSeq_gff=GTF/DEXSeq.gff 差异外显子使用的外显子结构注释gff
  • matureMiRNA=mmu_mature.fa 已知成熟miRNA序列
  • annoDB=annoDB SNP annova使用
  • ppi=PPI/PPI.txt 蛋白互作文件
  • Annotation=Annotation/Annotation.xls 基因功能注释
  • Anno_summary=Annotation/Annotation.stat 基因功能注释统计
  • 09_TFs 转录因子;真菌无转录因子

DNA相关文件

# 2.1 RefDatabase_info文件:
Genome	Homo_sapiens.GRCh38.dna.primary_assembly.fa
Path	http://asia.ensembl.org/Homo_sapiens/Info/Index
Genebuild by	Ensembl
Database version	104.38
Base Pairs	3,099,750,718

统计Base Pairs可用以下命令:
perl /Business/psn_company/t01/public/ScriptManager/lib/PrepRef/calculate_fasta_basepair/calculate_fasta_basepair.pl -i reference.fa

# 2.2 对DNA建立 hisat2 索引和fai索引
hisat2-build -p 12 reference.fa reference.fa
samtools faidx reference.fa

# 2.3 chr_list文件
chr_list 主要是用于画基因组圈图的,指定哪些染色体需要画到圈图中,一般为常染色体加性染色体,不包括没拼好的 scalfold,每行写一个染色体;若全是contig或scaffold, 则全部保留;从参考基因组序列文件中获取。
grep "^>" reference.fa | grep chromosome | cut -d " " -f1 | sed 's/^>//' > chr_list

# 2.4 生成 hrds 文件(SNP分析中用到)
该文件记录了物种和每个染色体长度。
# 参数 -org 是给物种拉丁名,下面以人为例:
perl /Business/psn_company/t01/public/ScriptManager/lib/PrepRef/create_hrds/create_hrds.pl -i reference.fa -o hrds.txt -org Homo_sapiens

# 2.5 cytoBand文件(用于基因组圈图绘制)
perl /Business/psn_company/t01/public/ScriptManager/lib/PrepRef/create_cytoband/create_cytoband.pl -i hrds.txt -c chr_list -o cytoband.txt

GTF相关文件

# 3.1 生成genome.ss文件 (hisat2 比对中用于寻找剪切位点的文件):
extract_splice_sites.py reference.gtf > genome.ss

# 3.2 生成 GTF.bed 文件 (bed 文件记录了基因的位置信息,主要用于 mapQC 的几个模块)
/Business/psn_company/t01/public/ScriptManager/lib/PrepRef/gtf2bed/gtf2bed.pl reference.gtf > GTF.bed

# 3.3 DEXSeq_gff (用于差异外显子分析)
python /Business/psn_company/t01/public/ScriptManager/lib/DE/exon_DEXSeq/dexseq_prepare_annotation.py reference.gtf DEXSeq.gff

# 3.4 lncRNA.gtf (此文件是 lncRNA 的结构注释文件,从GTF中提取,如果GTF中没有,下述命令会生成一个空文件)
perl /Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/abstract_lncRNA_gtf.pl -i reference.gtf -o lncRNA.gtf

annoDB文件夹

mkdir annoDB
gtfToGenePred -genePredExt reference.gtf annoDB/Anno_refGene.txt
perl /Business/psn_company/t01/public/software/annovar/retrieve_seq_from_fasta.pl --format refGene --seqfile reference.fa annoDB/Anno_refGene.txt --out annoDB/Anno_refGeneMrna.fa

基因功能注释

GO注释

# 优先级1:NCBI 和ensembl 等公共数据库上 下载 
# 优先级2:uniprot上下载 对应和比对
wget -c -t 3 -q https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.fasta.gz
wget -c -t 3 -q https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.dat.gz
perl /personalbio1/R_Report/Users/hudb/script/download_genome/uniprot_dat2tsv.pl UP000005640_9606.dat.gz > UP000005640_9606.tsv
## 对应:
UP000005640_9606.tsv里分别是Uniprot AC ;GeneID (NCBI Entryid); GO ID 
## 比对
/Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/blast_uniprot.sh -i pep.fa -d UP000005640_9606.fasta -g UP000005640_9606.tsv
Usage:
        /Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/blast_uniprot.sh -i cds.fa -o outdir -t CDS -d uniprot.fa  -g uniprot.go.tsv
OPTIONS:
        -h: help information
        -i: fasta file
        -o: output dir
        -t: proteins or CDS default:proteins
        -d: uniprot species fasta
        -g: uniprot species go annotation

# 优先级3:eggnog_map 注释
/Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/eggNOG_GO.sh -i pep.fa -s animal -o ./ -t proteins

Usage:
        /Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/eggNOG_GO.sh -i cds.fa -o outdir -t CDS -s animal
OPTIONS:
        -h: help information
        -i: fasta file
        -o: output dir
        -t: proteins or CDS default:proteins
        -s: species [ animal plant fungi prokaryotes ]

KEGG注释

# 优先级1:脚本内会判断是否匹配上已知数据库
Usage:
        /Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/KEGG.sh -i cds.fa -o outdir -t CDS -s animal -l idlist
OPTIONS:
        -h: help information
        -i: fasta file
        -o: output dir
        -l: idlist;col1-6 is gene_id  locus_tag    old_locus_tag  entry_id protein_id  gene_name
        -t: proteins or CDS default:proteins
        -s: kegg org [ animal plant fungi prokaryotes hsa mmu ...  ]
        -r: rate of Matched to known KO; default:0.5

eggnog注释

# 优先级1:NCBI 和ensembl 等公共数据库上 下载 
# 优先级2: eggnog_map 注释
同GO注释优先级3;会同时出eggnog 和GO注释结果

Swissprot注释

# 优先级1:NCBI 和ensembl 等公共数据库上 下载 
# 优先级2:比对
Usage:
        /Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/Swissprot.sh -i cds.fa -o outdir -t CDS
OPTIONS:
        -h: help information
        -i: fasta file
        -o: output dir;default:./
        -t: proteins or CDS default:proteins

NR注释

# 优先级1 NCBI 和ensembl 等公共数据库上 下载 
# 优先级2 比对本物种 
# 优先级3 比对大库
Usage:
        /Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/NR.sh -i cds.fa -o outdir -t CDS -s animal
OPTIONS:
        -h: help information
        -i: fasta file
        -o: output dir
        -t: proteins or CDS default:proteins
        -s: species [ animal plant fungi prokaryotes virus taxid(9606 10096 ..) ]

PPI注释

# 优先级1:脚本内会判断是否匹配上已知数据库
Usage:
        /Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/NR.sh -i cds.fa -o outdir -t CDS -s animal
OPTIONS:
        -h: help information
        -i: fasta file
        -o: output dir
        -t: proteins or CDS default:proteins
        -s: species [ animal plant fungi prokaryotes virus taxid(9606 10096 ..) ]

转录因子注释

# 优先级1:脚本内会判断是否匹配上已知数据库
Usage:
        /Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/NR.sh -i cds.fa -o outdir -t CDS -s animal
OPTIONS:
        -h: help information
        -i: fasta file
        -o: output dir
        -t: proteins or CDS default:proteins
        -s: species [ animal plant fungi prokaryotes virus taxid(9606 10096 ..) ]

合并注释文件

cd Annotation/tmp
perl /Business/psn_company/Work/Transcriptome/Datum/Public/Database/genome_prepare/script/merge_anno.v1.pl -g ../../GTF/Mus_musculus.GRCm39.111.gtf GO KEGG eggNOG eggNOG_Category Swissprot NR SwissprotName Entrez_geneID  BP CC MF Pathway -o ../Annotation.xls

# 统计:
perl /Business/psn_company/t01/public/ScriptManager/lib/PrepRef/stat_Annotation/stat_Annotation.pl -i Annotation.xls -o Annotation.stat

ncbi 参考基因组下载

  • 说明:下载使用的是Aspera软件;设置的速度上限在100M; 下载速度大概在 90M左右;下载完成后会整理好整理基因组流程需要的配置文件
  • 如果有文件下载失败;则重新运行下载脚本;会跳过已下载的文件

    /personalbio1/R_Report/Users/hudb/script/download_genome/ncbi.sh 
    Usage:
    	/personalbio1/R_Report/Users/hudb/script/download_genome/ncbi.sh -a [GCF.. GCA..] -o outdir 
    assembly information: 245: /personalbio1/R_Report/Users/hudb/script/download_genome/ncbi/assembly_summary*txt
    OPTIONS:
    	-h: help information
    	-a: assembly_accession
    	-o: output dir
    
    

参数说明

  • -a NCBI assembly_accession编号;对应唯一的基因组; 详细信息见 /personalbio1/R_Report/Users/hudb/script/download_genome/ncbi/assembly_summary*txt
  • -o 输出目录;

下载逻辑

。。。

ensembl 参考基因组下载

  • 说明:ensembl 数据均使用wget下载;下载均加了-c参数;如果日志里有某些文件下载失败;则重新运行整个下载脚本

    Usage: /personalbio1/R_Report/Users/hudb/script/download_genome/ensembl.sh -n homo_sapiens -v 111 -o ./ -r 3
    
    Description:
    Download ensembl genome;for all species to see /personalbio1/R_Report/Users/hudb/script/download_genome/ensembl/all_species.txt
    Options:
    -h, --help            Show this help message and exit.
    -v, --version         ensembl genome version;default: current version
    -o, --outdir          Output directory;default: ./
    -n, --name            species name 
    -r, --retry           curl and wget retry times;default: 3
    

参数说明

  • -v ensembl数据库基因组版本;默认数据库当前版本
  • -o 输出目录;
  • -n 物种名;是数据库ftp链接下的物种名;详细见/personalbio1/R_Report/Users/hudb/script/download_genome/ensembl/all_species.txt
  • -r 下载和请求网页的次数

下载逻辑

。。。

只有基因组序列文件和基因结构注释gff文件

其他数据库 下载

  • 待添加 。。。