Assignment 2 - Processing VCF file
1. Download the VCF file from
https://spliceatlas.s3.amazonaws.com/clinvar_20220227_10K.vcf
2. The file contains 10,000 lines.
3. It is a kind of TSV (tab-separated values) file.
4. The line starting with a single # is the header; it contains the column names.
5. The file contains the following 8 columns (tab-separated):
a. CHROM
b. POS
c. ID
d. REF
e. ALT
f. QUAL
g. FILTER
h. INFO
6. Two data lines from the file, preceded by the header line, are given here as a sample:
a. #CHROM POS ID REF ALT QUAL FILTER INFO
b. 1 861332 1019397 G A . . ALLELEID=1003021;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;CLNHGVS=NC_000001.10:g.861332G>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=SAMD11:148398;MC=SO:0001583|missense_variant;ORIGIN=1;RS=1640863258
c. 1 865519 1125147 C T . . ALLELEID=1110865;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;CLNHGVS=NC_000001.10:g.865519C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=SAMD11:148398;MC=SO:0001627|intron_variant;ORIGIN=1
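The file layout described above can be sketched in Python as follows. This is only a minimal illustration, assuming that metadata lines start with `##`, that the header starts with a single `#`, and that fields are separated by tabs. The function name `read_vcf_records` and the `##fileformat` line in the sample are hypothetical and not part of the assignment.

```python
import io


def read_vcf_records(handle):
    """Yield one dict per data line, keyed by the column names of the header."""
    columns = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and '##' metadata lines (assumed)
        if line.startswith("#"):
            # Header line: '#CHROM<TAB>POS<TAB>...'; drop the leading '#'.
            columns = line.lstrip("#").split("\t")
            continue
        if columns is None:
            raise ValueError("data line found before the header line")
        yield dict(zip(columns, line.split("\t")))


# Tiny in-memory stand-in for the downloaded file (INFO shortened for brevity).
sample = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t861332\t1019397\tG\tA\t.\t.\tALLELEID=1003021;ORIGIN=1\n"
)
records = list(read_vcf_records(sample))
print(records[0]["CHROM"], records[0]["POS"], records[0]["INFO"])
```

For the real file, the same function could be given a handle from `open(path)` instead of the `io.StringIO` object.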
7. Produce an output/result file (filename_of_your_choice.csv, comma-separated values)
containing 14 columns by processing the downloaded file
a. The result file should contain the following 14 columns:
CHROM,POS,ID,REF,ALT,ALLELEID,CLNHGVS,CLNSIG,CLNVC,ORIGIN,RS,Gene_ID,Gene_symbol,Consequence
b. CHROM,POS,ID,REF,ALT can be collected directly from the first 5 columns of the
downloaded VCF file
c. To collect the remaining data, use the last (8th) column, INFO
i. Values in the 8th column are separated by ';'
ii. The 8th column has the form Attribute=value;Attribute=value;Attribute=value; and so
on
iii. ALLELEID,CLNHGVS,CLNSIG,CLNVC,ORIGIN,RS can be collected
directly from the attributes of the 8th column
1. ALLELEID=
2. CLNHGVS=
3. CLNSIG=
4. CLNVC=
5. ORIGIN=
6. RS=
iv. Gene_ID and Gene_symbol can be collected from the GENEINFO attribute in the
8th column
1. GENEINFO=Gene_Symbol:Gene_ID
v. Consequence can be collected from the MC attribute of the 8th column
1. MC=SO:SO_ID|Consequence
vi. If any of the attributes isn't available, put '-' in the result file
d. Expected output for the first 2 lines:
i. 1,861332,1019397,G,A,1003021,NC_000001.10:g.861332G>A,Uncertain_significance,single_nucleotide_variant,1,1640863258,148398,SAMD11,missense_variant
ii. 1,865519,1125147,C,T,1110865,NC_000001.10:g.865519C>T,Likely_benign,single_nucleotide_variant,1,-,148398,SAMD11,intron_variant
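One possible way to turn a VCF data line into the 14-column CSV row is sketched below in Python. It is a sketch under stated assumptions, not a reference solution. The names `parse_info`, `to_row`, `DIRECT_KEYS` and `MISSING` are hypothetical. It also assumes that, when GENEINFO lists several genes separated by `|` or MC lists several consequences separated by `,`, only the first one is used; the assignment does not say how to handle those cases.

```python
import csv
import io

DIRECT_KEYS = ["ALLELEID", "CLNHGVS", "CLNSIG", "CLNVC", "ORIGIN", "RS"]
MISSING = "-"  # placeholder required when an attribute is absent


def parse_info(info):
    """Turn 'A=1;B=2' into {'A': '1', 'B': '2'}."""
    pairs = (item.split("=", 1) for item in info.split(";") if "=" in item)
    return {key: value for key, value in pairs}


def to_row(vcf_line):
    """Build the 14 output fields from one tab-separated VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])
    # CHROM, POS, ID, REF, ALT come straight from the first five columns.
    row = fields[:5] + [info.get(key, MISSING) for key in DIRECT_KEYS]
    # GENEINFO=Gene_Symbol:Gene_ID (first gene only; assumption)
    symbol, _, gene_id = info.get("GENEINFO", "").split("|")[0].partition(":")
    # MC=SO:SO_ID|Consequence (first consequence only; assumption)
    consequence = info.get("MC", "").split(",")[0].partition("|")[2]
    row += [gene_id or MISSING, symbol or MISSING, consequence or MISSING]
    return row


# Second sample line from the assignment; it has no RS attribute.
sample_line = (
    "1\t865519\t1125147\tC\tT\t.\t.\t"
    "ALLELEID=1110865;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;"
    "CLNHGVS=NC_000001.10:g.865519C>T;"
    "CLNREVSTAT=criteria_provided,_single_submitter;"
    "CLNSIG=Likely_benign;CLNVC=single_nucleotide_variant;"
    "CLNVCSO=SO:0001483;GENEINFO=SAMD11:148398;"
    "MC=SO:0001627|intron_variant;ORIGIN=1"
)

buffer = io.StringIO()  # stands in for open("result.csv", "w", newline="")
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(to_row(sample_line))
print(buffer.getvalue(), end="")
```

The sketch writes data rows only; whether the result file should also start with a header row of the 14 column names is not stated in the assignment.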
8. From the .csv file created in the above step, count the different types of origin (10th column)
a. Values of the Origin column have the following meanings:
i. 0 - unknown;
ii. 1 - germline;
iii. 2 - somatic;
iv. 4 - inherited;
v. 8 - paternal;
vi. 16 - maternal;
vii. 32 - de-novo;
viii. 64 - biparental;
ix. 128 - uniparental;
x. 256 - not-tested;
xi. 512 - tested-inconclusive;
b. If you get a number other than those listed above, classify it under 'Others'
c. Make a data structure (dictionary) that has the Origin type as key and its count
as value
i. Expected output
'Others' => 95,
'de-novo' => 38,
'inherited' => 32,
'somatic' => 18,
'maternal' => 19,
'paternal' => 14,
'germline' => 6648,
'uniparental' => 6,
'unknown' => 145
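The counting step can be sketched in Python with a lookup table and `collections.Counter`. This is an illustrative sketch only: the names `ORIGIN_NAMES` and `count_origins` are hypothetical, the code for not-tested is taken as 256 (the value given in ClinVar's description of the ORIGIN field), and a missing ORIGIN written as '-' is assumed to fall under 'Others'. The input values in the example are made up.

```python
from collections import Counter

# Origin code (as it appears in the CSV) -> origin type.
ORIGIN_NAMES = {
    "0": "unknown",
    "1": "germline",
    "2": "somatic",
    "4": "inherited",
    "8": "paternal",
    "16": "maternal",
    "32": "de-novo",
    "64": "biparental",
    "128": "uniparental",
    "256": "not-tested",
    "512": "tested-inconclusive",
}


def count_origins(origin_values):
    """Return {origin type: count}; unlisted codes are counted as 'Others'."""
    counts = Counter(
        ORIGIN_NAMES.get(value.strip(), "Others") for value in origin_values
    )
    return dict(counts)


# Made-up values of the 10th column, for illustration only.
print(count_origins(["1", "1", "0", "32", "3", "-"]))
# -> {'germline': 2, 'unknown': 1, 'de-novo': 1, 'Others': 2}
```

With the real result file, the values could be collected with `csv.reader`, for example `[row[9] for row in csv.reader(handle)]`, where index 9 is the 10th column.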
d. Convert the data structure from (c) into a JSON object.
Display the results as a table using HTML
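A minimal Python sketch of this last step is shown below, using only the standard library. The function names `counts_to_json` and `counts_to_html` and the table layout (a two-column table with a border) are assumptions; the assignment does not prescribe them. The example dictionary reuses three of the expected counts listed above.

```python
import json
from html import escape


def counts_to_json(counts):
    """Serialize the origin counts as a JSON object."""
    return json.dumps(counts, indent=2)


def counts_to_html(counts):
    """Render the origin counts as a simple two-column HTML table."""
    rows = "\n".join(
        f"  <tr><td>{escape(str(name))}</td><td>{count}</td></tr>"
        for name, count in counts.items()
    )
    return (
        '<table border="1">\n'
        "  <tr><th>Origin</th><th>Count</th></tr>\n"
        f"{rows}\n"
        "</table>"
    )


counts = {"germline": 6648, "unknown": 145, "Others": 95}
print(counts_to_json(counts))
print(counts_to_html(counts))
```

The HTML string could then be written to a file such as `origins.html` (a hypothetical name) and opened in a browser.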