Assignment 2 - Processing VCF file 1

Biology

1. Download the VCF file from

https://spliceatlas.s3.amazonaws.com/clinvar_20220227_10K.vcf

2. File contains 10,000 lines

3. This is a kind of TSV file (tab-separated values)

4. The line beginning with a single # is the header. It contains the column names

5. File will contain the following 8 columns (tab separated).

a. CHROM

b. POS

c. ID

d. REF

e. ALT

f. QUAL

g. FILTER

h. INFO

6. Two data lines from the file are given here as a sample, preceded by the header line

a. #CHROM POS ID REF ALT QUAL FILTER INFO

b. 1 861332 1019397 G A . . ALLELEID=1003021;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;CLNHGVS=NC_000001.10:g.861332G>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=SAMD11:148398;MC=SO:0001583|missense_variant;ORIGIN=1;RS=1640863258

c. 1 865519 1125147 C T . . ALLELEID=1110865;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;CLNHGVS=NC_000001.10:g.865519C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=SAMD11:148398;MC=SO:0001627|intron_variant;ORIGIN=1
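
The file layout described in items 3 to 6 can be read with a few lines of Python. This is a minimal sketch, not a reference solution: the function name `read_vcf_records` is hypothetical, and it assumes that any lines starting with `##` are metadata to be skipped and that the fields are separated by tabs, as stated above. The in-memory sample uses a shortened INFO value for illustration only.

```python
import io

def read_vcf_records(handle):
    """Yield one dict per data line, keyed by the header column names."""
    header = None
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # blank line or metadata line (assumed to start with ##)
        if line.startswith("#"):
            header = line.lstrip("#").split("\t")  # single-# line: column names
            continue
        if header is None:
            raise ValueError("data line found before the header line")
        yield dict(zip(header, line.split("\t")))

# Tiny in-memory stand-in for the downloaded file (illustrative data only).
sample = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t861332\t1019397\tG\tA\t.\t.\tALLELEID=1003021;ORIGIN=1\n"
)
records = list(read_vcf_records(sample))
print(records[0]["POS"], records[0]["INFO"])
```

For the real file, the same function can be applied to a handle returned by `open(path, encoding="utf-8")`.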

7. Produce an output/result file (filename_of_your_choice.csv, comma-separated values) containing 14 columns by processing the downloaded file

a. The result file should contain the following 14 columns

CHROM,POS,ID,REF,ALT,ALLELEID,CLNHGVS,CLNSIG,CLNVC,ORIGIN,RS,Gene_ID,Gene_symbol,Consequence

b. CHROM, POS, ID, REF and ALT can be taken directly from the first 5 columns of the downloaded VCF file

c. To collect the remaining data, use the last (8th) column, INFO

i. Values in the 8th column are separated by ';'

ii. The 8th column has the form Attribute=value;Attribute=value;Attribute=value; and so on

iii. ALLELEID, CLNHGVS, CLNSIG, CLNVC, ORIGIN and RS can be collected directly from the attributes of the 8th column

1. ALLELEID=

2. CLNHGVS=

3. CLNSIG=

4. CLNVC=

5. ORIGIN=

6. RS=

iv. Gene_ID and Gene_symbol can be collected from the GENEINFO attribute in the 8th column

1. GENEINFO=Gene_Symbol:Gene_ID

v. Consequence can be collected from the MC attribute of the 8th column

1. MC=SO:SO_ID|Consequence

vi. If any attribute isn't available, put '-' in the result file

d. Expected output for the first 2 lines

i. 1,861332,1019397,G,A,1003021,NC_000001.10:g.861332G>A,Uncertain_significance,single_nucleotide_variant,1,1640863258,148398,SAMD11,missense_variant

ii. 1,865519,1125147,C,T,1110865,NC_000001.10:g.865519C>T,Likely_benign,single_nucleotide_variant,1,-,148398,SAMD11,intron_variant
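
The rules in step 7 can be sketched in Python as follows. This is one possible approach under stated assumptions, not the required solution: the names `COLUMNS`, `DIRECT`, `parse_info` and `to_row` are hypothetical, and the sample line carries only the INFO attributes needed for the output. The assignment does not say how to treat GENEINFO or MC entries that list several values, so the sketch does not handle them specially.

```python
import csv
import io

COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "ALLELEID", "CLNHGVS", "CLNSIG",
           "CLNVC", "ORIGIN", "RS", "Gene_ID", "Gene_symbol", "Consequence"]
DIRECT = ["ALLELEID", "CLNHGVS", "CLNSIG", "CLNVC", "ORIGIN", "RS"]

def parse_info(info):
    """Split 'Attribute=value;Attribute=value' into a dict."""
    pairs = (item.split("=", 1) for item in info.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}

def to_row(fields):
    """Build the 14-column output row from the 8 tab-separated VCF fields."""
    info = parse_info(fields[7])
    row = list(fields[:5])                           # CHROM, POS, ID, REF, ALT
    row += [info.get(name, "-") for name in DIRECT]  # '-' when attribute missing
    # GENEINFO=Gene_Symbol:Gene_ID
    symbol, _, gene_id = info.get("GENEINFO", "").partition(":")
    # MC=SO_ID|Consequence
    _, _, consequence = info.get("MC", "").partition("|")
    row += [gene_id or "-", symbol or "-", consequence or "-"]
    return row

# Second sample line from the assignment, with INFO trimmed to the used attributes.
line = ("1\t865519\t1125147\tC\tT\t.\t.\t"
        "ALLELEID=1110865;CLNHGVS=NC_000001.10:g.865519C>T;CLNSIG=Likely_benign;"
        "CLNVC=single_nucleotide_variant;GENEINFO=SAMD11:148398;"
        "MC=SO:0001627|intron_variant;ORIGIN=1")

out = io.StringIO()  # stands in for open("result.csv", "w", newline="")
writer = csv.writer(out, lineterminator="\n")
writer.writerow(COLUMNS)
writer.writerow(to_row(line.split("\t")))
print(out.getvalue())
```

Writing through `csv.writer` rather than joining with commas by hand means any value that itself contains a comma is quoted automatically.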

8. From the .csv file created in the above step, count the different types of origin (10th column)

a. Values of the ORIGIN column have the following meanings.

i. 0 - unknown;

ii. 1 - germline;

iii. 2 - somatic;

iv. 4 - inherited;

v. 8 - paternal;

vi. 16 - maternal;

vii. 32 - de-novo;

viii. 64 - biparental;

ix. 128 - uniparental;

x. 256 - not-tested;

xi. 512 - tested-inconclusive;

b. If you get a number other than those listed above, classify it under 'Others'

c. Make a data structure (dictionary) that has Origin type as key and the count of

them as value

i. Expected output

'Others' => 95,

'de-novo' => 38,

'inherited' => 32,

'somatic' => 18,

'maternal' => 19,

'paternal' => 14,

'germline' => 6648,

'uniparental' => 6,

'unknown' => 145
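
Steps 8a to 8c amount to a lookup followed by a tally. A minimal Python sketch, assuming the result file has a header row whose 10th column is named ORIGIN and that a missing origin ('-') also falls under 'Others'; `ORIGIN_NAMES` and `count_origins` are hypothetical names, and the three data rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Code-to-name table from step 8a. The value 256 for not-tested follows
# ClinVar's ORIGIN encoding (an assumption where the listing is unclear).
ORIGIN_NAMES = {
    "0": "unknown", "1": "germline", "2": "somatic", "4": "inherited",
    "8": "paternal", "16": "maternal", "32": "de-novo", "64": "biparental",
    "128": "uniparental", "256": "not-tested", "512": "tested-inconclusive",
}

def count_origins(csv_handle):
    """Return {origin type: count} from the 10th column of the result file."""
    counts = Counter()
    for row in csv.reader(csv_handle):
        if len(row) < 10 or row[9] == "ORIGIN":
            continue  # skip short rows and the header row, if present
        counts[ORIGIN_NAMES.get(row[9].strip(), "Others")] += 1
    return dict(counts)

# Made-up rows in the 14-column layout; only the 10th column matters here.
rows = [
    "CHROM,POS,ID,REF,ALT,ALLELEID,CLNHGVS,CLNSIG,CLNVC,ORIGIN,RS,"
    "Gene_ID,Gene_symbol,Consequence",
    "1,100,1,G,A,11,-,-,-,1,-,-,-,-",
    "1,200,2,C,T,12,-,-,-,1,-,-,-,-",
    "1,300,3,A,G,13,-,-,-,3,-,-,-,-",
]
origin_counts = count_origins(io.StringIO("\n".join(rows)))
print(origin_counts)
```

Here the value 3 is not in the table, so it is counted under 'Others', as rule (b) asks.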

d. Convert the data structure from (c) into a JSON object and display the results as a table using HTML
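
One way to approach step 8d in Python is to serialise the dictionary with the standard `json` module and build the table markup as a string. This is a sketch only: the helper name `counts_to_html` is hypothetical, the table layout is a design choice the assignment leaves open, and the counts below are made-up illustrative numbers rather than results from the real file.

```python
import json
from html import escape

def counts_to_html(counts):
    """Render {origin type: count} as a two-column HTML table."""
    lines = ["<table>", "  <tr><th>Origin</th><th>Count</th></tr>"]
    for name, count in counts.items():
        # escape() guards against characters such as < or & in the key
        lines.append(f"  <tr><td>{escape(str(name))}</td><td>{int(count)}</td></tr>")
    lines.append("</table>")
    return "\n".join(lines)

counts = {"germline": 2, "Others": 1}       # illustrative numbers only
as_json = json.dumps(counts, indent=2)      # the JSON object, as text
html_table = counts_to_html(json.loads(as_json))
print(as_json)
print(html_table)
```

Saving `html_table` to a file with an .html extension and opening it in a browser displays the table.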
