Creative Commons License
Drekendrop | Blog of Tutorial by Mei Pakpahan is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at drekendrop.blogspot.com.
Permissions beyond the scope of this license may be available at http://softdadesign.co.nr.

Wednesday, August 7, 2013

Create a Boxplot with a Legend from a Text File and Color It in R

So I have been playing around with R, a statistical tool, today (this is my very first time using it), and I wanted to draw a boxplot for my research. I am not going to explain what a boxplot is here (you can find plenty of resources explaining it on the internet, for example on Wikipedia); this post is about how to create one in R. So if you already have R installed on your machine, let's start.

First of all, I used R Commander (Rcmdr) to import my txt file into an R dataset.
1st step ----

If you already have the Rcmdr package, you may skip to '2nd step ----'. If you don't, here are the instructions.
> Go to Packages > Install package(s) > choose your CRAN mirror. In my case I chose the mirror closest to where I am now. After you choose it, a Packages window will pop up; find Rcmdr in the list. It may ask you to install other packages that it needs; just do whatever the program says :P

2nd step ----

Go to Packages > Load package > select Rcmdr.
When the R Commander window comes up,
go to Data > Import data > from text, clipboard, or URL...
Enter your dataset name (in my case, I typed 'hydro').
I used tab as the field separator, since that is what my txt file uses. Then locate your txt file and open it. It should be loaded immediately.
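
If you prefer to stay in the R console, the same import can probably be done without R Commander. The following is only a sketch: the file contents and column names are invented so that the example runs on its own, and read.table() is base R's reader for delimited text.

```r
# Console equivalents of the menu steps (not run here):
#   install.packages("Rcmdr"); library(Rcmdr)

# Invented stand-in for the real txt file: a small tab-separated table
# written to a temporary file so this example is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(c("A_high\tB_high\tA_med",
             "12\t15\t9",
             "14\t13\t8",
             "11\t16\t10"), path)

# header = TRUE uses the first line as column names;
# sep = "\t" tells R the fields are separated by tabs.
hydro <- read.table(path, header = TRUE, sep = "\t")
str(hydro)
```

With your own data, replace path with the location of your txt file.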

3rd step ----

Get back to your R console and we are ready to draw our boxplot. First, we need to decide how many colors the boxplot needs, one per category. In my case, the data falls into three categories (high, medium, low) with two boxplots each, so there will be 6 boxplots (2 high, 2 medium, and 2 low).

Define the colors first, one entry per box, in the same order as the columns of your dataset. Type this into the R console:
colors = c(rep("dark grey",2),rep("white",2),rep("light grey",2))
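
As a side note, rep() repeats a value a given number of times, so the line above yields six entries. A slightly shorter way to build the same vector, assuming every category has the same number of boxes, is the each argument of rep(). The name colors_each is just for this illustration.

```r
# The vector from the post: two entries per category.
colors <- c(rep("dark grey", 2), rep("white", 2), rep("light grey", 2))

# Same result: repeat each of the three fills twice, in order.
colors_each <- rep(c("dark grey", "white", "light grey"), each = 2)

print(colors_each)
identical(colors, colors_each)
```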

Now, draw the boxplot:
boxplot(YOUR_DATASET_NAME,col=colors,xlab="LABEL_FOR_X_AXIS",ylab="LABEL_FOR_Y_AXIS",main="BOXPLOT_TITLE")

In my case,
boxplot(hydro,col=colors,xlab="Protein",ylab="Number of clusters",main="Hydrophobic Clusters")

So now we have our boxplot image.

4th step ----

Last, create a legend for our boxplot. x and y are the coordinates of the legend's top-left corner, in the units of the plot axes.
legend(x=X_COORDINATE,y=Y_COORDINATE,legend=c("FIRST_LABEL","SECOND_LABEL","THIRD_LABEL"),fill=c("dark grey","white","light grey"))

Here's mine,
legend(x=4.6,y=22,legend=c("High Overlap","Medium Overlap","Low Overlap"),fill=c("dark grey","white","light grey"))
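
Putting steps 3 and 4 together, here is a rough end-to-end sketch. The data is random placeholder values (not my real dataset), the column names are made up, and I draw into a PDF file so the script also works without a screen. Instead of guessing x and y, it uses the keyword "topright", which legend() also accepts for positioning.

```r
# Placeholder data: six columns ordered high, high, medium, medium, low, low.
set.seed(42)  # makes the random placeholder values reproducible
demo <- as.data.frame(matrix(rnorm(6 * 20, mean = 15, sd = 3), ncol = 6))
names(demo) <- c("A_high", "B_high", "A_med", "B_med", "A_low", "B_low")

fills  <- c("dark grey", "white", "light grey")  # one fill per category
colors <- rep(fills, each = 2)                   # one color per box

out_file <- file.path(tempdir(), "boxplot_demo.pdf")
pdf(out_file)  # send the plot to a PDF instead of the screen
boxplot(demo, col = colors, xlab = "Protein", ylab = "Number of clusters",
        main = "Hydrophobic Clusters")
# A position keyword places the legend inside the plot corner.
legend("topright",
       legend = c("High Overlap", "Medium Overlap", "Low Overlap"),
       fill = fills)
dev.off()      # close the file so it is written to disk
```

If the legend covers one of your boxes, try another keyword such as "topleft" or go back to explicit x and y coordinates.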

And we're done! Look at my Boxplot :)

Good luck with yours!

Saturday, July 27, 2013

How to Create Your Own BLAST protein database and use BLAST in Windows

Hi, so this is my second post about science on this blog. I actually have a lot in mind to write here, but I have been extremely busy doing my research. Anyway, what I want to share here is actually very simple; still, for a beginner in protein sequences like me, it should be a useful post.
So here is the thing: I want to find homologs of a protein. BLAST, particularly blastp, can find and score proteins that are potentially homologous to another protein. Let's start.

1st step ---
Download the latest BLAST from the NCBI site. In my case I used BLAST Win-32. Just follow the instructions when installing; it should end up in C:\Program Files\NCBI\blast-2.2.28+ (the folder name carries the version number, so yours may differ).

2nd step ---
Prepare your FASTA sequence file. In my case I tried to find homologs of the protein malate dehydrogenase from Salinibacter ruber (PDB ID: 3NEP). You can download it easily from the Protein Data Bank website by going to the Download Files link and choosing FASTA seq. Here is what the FASTA file looks like:

>3NEP:X|PDBID|CHAIN|SEQUENCE
MKVTVIGAGNVGATVAECVARQDVAKEVVMVDIKDGMPQGKALDMRESSPIHGFDTRVTGTNDYGPTEDSDVCIITAGLP
RSPGMSRDDLLAKNTEIVGGVTEQFVEGSPDSTIIVVANPLDVMTYVAYEASGFPTNRVMGMAGVLDTGRFRSFIAEELD
VSVRDVQALLMGGHGDTMVPLPRYTTVGGIPVPQLIDDARIEEIVERTKGAGGEIVDLMGTSAWYAPGAAAAEMTEAILK
DNKRILPCAAYCDGEYGLDDLFIGVPVKLGAGGVEEVIEVDLDADEKAQLKTSAGHVHSNLDDLQRLRDEGKIG

3rd step ---
Prepare your database. First of all, you need a FASTA file that contains the sequences of all the proteins you would like to search against. In the example below, my FASTA file covers 4 PDB entries (proteins), namely 1EMD, 1IB6, 1Z2I, and 3TL2. Entries with several chains get one record per chain, so the file holds 10 records in total. I saved it as meso.fasta.

>1EMD:A|PDBID|CHAIN|SEQUENCE
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVR
RKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIIRSNTFVAE
LKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVR
ALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGQEFVNK
>3TL2:A|PDBID|CHAIN|SEQUENCE
SNAMTIKRKKVSVIGAGFTGATTAFLLAQKELADVVLVDIPQLENPTKGKALDMLEASPVQGFDANIIGTSDYADTADSD
VVVITAGIARKPGMSRDDLVATNSKIMKSITRDIAKHSPNAIIVVLTNPVDAMTYSVFKEAGFPKERVIGQSGVLDTARF
RTFIAQELNLSVKDITGFVLGGHGDDMVPLVRYSYAGGIPLETLIPKERLEAIVERTRKGGGEIVGLLGNGSAYYAPAAS
LVEMTEAILKDQRRVLPAIAYLEGEYGYSDLYLGVPVILGGNGIEKIIELELLADEKEALDRSVESVRNVMKVLV
>1IB6:A|PDBID|CHAIN|SEQUENCE
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVA
RKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIICSNTFVAE
LKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVR
ALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGQEFVNK
>1IB6:B|PDBID|CHAIN|SEQUENCE
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVA
RKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIICSNTFVAE
LKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVR
ALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGQEFVNK
>1IB6:C|PDBID|CHAIN|SEQUENCE
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVA
RKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIICSNTFVAE
LKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVR
ALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGQEFVNK
>1IB6:D|PDBID|CHAIN|SEQUENCE
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVA
RKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIICSNTFVAE
LKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVR
ALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGQEFVNK
>1Z2I:A|PDBID|CHAIN|SEQUENCE
MAHGNEKATVLARLDELERFCRAVFLAVGTDEETADAATRAMMHGTRLGVDSHGVRLLAHYVTALEGGRLNRRPQISRVS
GFGAVETIDADHAHGARATYAAMENAMALAEKFGIGAVAIRNSSHFGPAGAYALEAARQGYIGLAFCNSDSFVRLHDGAM
RFHGTNPIAVGVPAADDMPWLLDMATSAVPYNRVLLYRSLGQQLPQGVASDGDGVDTRDPNAVEMLAPVGGEFGFKGAAL
AGVVEIFSAVLTGMRLSFDLAPMGGPDFSTPRGLGAFVLALKPEAFLERDVFDESMKRYLEVLRGSPAREDCKVMAPGDR
EWAVAAKREREGAPVDPVTRAAFSELAEKFSVSPPTYH
>1Z2I:B|PDBID|CHAIN|SEQUENCE
MAHGNEKATVLARLDELERFCRAVFLAVGTDEETADAATRAMMHGTRLGVDSHGVRLLAHYVTALEGGRLNRRPQISRVS
GFGAVETIDADHAHGARATYAAMENAMALAEKFGIGAVAIRNSSHFGPAGAYALEAARQGYIGLAFCNSDSFVRLHDGAM
RFHGTNPIAVGVPAADDMPWLLDMATSAVPYNRVLLYRSLGQQLPQGVASDGDGVDTRDPNAVEMLAPVGGEFGFKGAAL
AGVVEIFSAVLTGMRLSFDLAPMGGPDFSTPRGLGAFVLALKPEAFLERDVFDESMKRYLEVLRGSPAREDCKVMAPGDR
EWAVAAKREREGAPVDPVTRAAFSELAEKFSVSPPTYH
>1Z2I:C|PDBID|CHAIN|SEQUENCE
MAHGNEKATVLARLDELERFCRAVFLAVGTDEETADAATRAMMHGTRLGVDSHGVRLLAHYVTALEGGRLNRRPQISRVS
GFGAVETIDADHAHGARATYAAMENAMALAEKFGIGAVAIRNSSHFGPAGAYALEAARQGYIGLAFCNSDSFVRLHDGAM
RFHGTNPIAVGVPAADDMPWLLDMATSAVPYNRVLLYRSLGQQLPQGVASDGDGVDTRDPNAVEMLAPVGGEFGFKGAAL
AGVVEIFSAVLTGMRLSFDLAPMGGPDFSTPRGLGAFVLALKPEAFLERDVFDESMKRYLEVLRGSPAREDCKVMAPGDR
EWAVAAKREREGAPVDPVTRAAFSELAEKFSVSPPTYH
>1Z2I:D|PDBID|CHAIN|SEQUENCE
MAHGNEKATVLARLDELERFCRAVFLAVGTDEETADAATRAMMHGTRLGVDSHGVRLLAHYVTALEGGRLNRRPQISRVS
GFGAVETIDADHAHGARATYAAMENAMALAEKFGIGAVAIRNSSHFGPAGAYALEAARQGYIGLAFCNSDSFVRLHDGAM
RFHGTNPIAVGVPAADDMPWLLDMATSAVPYNRVLLYRSLGQQLPQGVASDGDGVDTRDPNAVEMLAPVGGEFGFKGAAL
AGVVEIFSAVLTGMRLSFDLAPMGGPDFSTPRGLGAFVLALKPEAFLERDVFDESMKRYLEVLRGSPAREDCKVMAPGDR
EWAVAAKREREGAPVDPVTRAAFSELAEKFSVSPPTYH
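
Before building the database, it can help to check that the file has the records you expect. Here is a small sketch in R that counts the records and their sequence lengths. The FASTA lines are toy, made-up sequences (not real proteins) so that the example runs on its own; for a real file you would load the lines with readLines("meso.fasta"). It assumes every header is followed by at least one sequence line.

```r
# Toy FASTA text: made-up short sequences, only for illustration.
fasta <- c(">toy1:A|PDBID|CHAIN|SEQUENCE",
           "MKVAVLGAAG",
           "GIGQALALLL",
           ">toy2:A|PDBID|CHAIN|SEQUENCE",
           "MAHGNEKATV")

is_header <- startsWith(fasta, ">")

# Keep the part of each header between ">" and the first "|".
ids <- sub("^>([^|]+).*$", "\\1", fasta[is_header])

# cumsum(is_header) labels every line with the index of its record.
record <- cumsum(is_header)

# Join the sequence lines of each record into one string.
seqs <- vapply(split(fasta[!is_header], record[!is_header]),
               paste, character(1), collapse = "")

seq_lengths <- setNames(nchar(seqs), ids)
print(seq_lengths)
```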

When you're done creating this file, you can create your database.
1. Open a command prompt and go to your BLAST bin folder:
cd /d "C:\Program Files\NCBI\blast-2.2.28+\bin"
2. Run makeblastdb with your own FASTA file and database name:
makeblastdb -in YOUR_FASTA_FILE -dbtype prot -out YOUR_WISH_DBNAME
In my case,
makeblastdb -in D:\proteins\blastdata\meso.fasta -dbtype prot -out Meso
3. Once it's done, you'll find 3 new files in your BLAST bin folder (the folder you ran the command from): YOUR_WISH_DBNAME.phr, YOUR_WISH_DBNAME.pin, and YOUR_WISH_DBNAME.psq
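
If you want to check for those three files without opening the folder, something like the sketch below should do. It assumes R's working directory is the folder where makeblastdb wrote the database; the name "Meso" is the one from my example.

```r
db <- "Meso"  # the name that was passed to -out

# The three files makeblastdb produced for a protein database in the post.
expected <- paste0(db, c(".phr", ".pin", ".psq"))

# TRUE for every file that exists in the current working directory.
found <- file.exists(expected)
print(data.frame(file = expected, found = found))
```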

4th step ---
Now we are ready to BLAST. Run blastp from the same bin folder so that it can find the database by name:
blastp -query YOUR_FASTA_SEQUENCE_file.txt -db YOUR_WISH_DBNAME -out YOUR_RESULT.txt
blastp -query "D:\proteins\blastdata\3NEP.fasta.txt" -db Meso -out "D:\proteins\blastdata\homo1.txt"
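
Since I ended up using R anyway, here is a rough sketch of launching the same search from R with system2(). The paths are the ones from my example. The script only prints the command unless blastp is on your PATH and the query file exists, and it assumes the database Meso can be found from R's working directory.

```r
# Paths from the post; change them to your own files.
query <- "D:\\proteins\\blastdata\\3NEP.fasta.txt"
out   <- "D:\\proteins\\blastdata\\homo1.txt"
db    <- "Meso"

# shQuote() protects paths that contain spaces.
args <- c("-query", shQuote(query), "-db", db, "-out", shQuote(out))
cat("blastp", args, "\n")

# Only run the search when blastp is available and the query file exists.
if (nzchar(Sys.which("blastp")) && file.exists(query)) {
  status <- system2("blastp", args)  # exit status 0 means success
  print(status)
}
```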

Easy huh? Good luck! ;)

 