Hi, this is my second post about science on this blog. I actually have a lot in mind to write here, but I have been extremely busy with my research. Anyway, what I want to share here is very simple, but for a beginner in protein sequence analysis like me it should be a useful post.
So here is the thing: I want to find homologs of a protein. BLAST, specifically blastp, compares a protein query against a set of protein sequences and scores the ones that are potentially homologous to it. Let's start.
1st step ---
Download the latest BLAST from the NCBI site. In my case I used BLAST Win-32. Just follow the instructions when installing; it should then be installed in C:\Program Files\NCBI\blast-2.2.28+.
2nd step ---
Prepare your FASTA sequence file. In my case I wanted to find homologs of malate dehydrogenase from Salinibacter ruber (PDB ID: 3NEP). You can download its sequence easily from the Protein Data Bank website: go to the Download Files link and choose FASTA seq. This is what the FASTA file looks like:
>3NEP:X|PDBID|CHAIN|SEQUENCE
MKVTVIGAGNVGATVAECVARQDVAKEVVMVDIKDGMPQGKALDMRESSPIHGFDTRVTGTNDYGPTEDSDVCIITAGLP
RSPGMSRDDLLAKNTEIVGGVTEQFVEGSPDSTIIVVANPLDVMTYVAYEASGFPTNRVMGMAGVLDTGRFRSFIAEELD
VSVRDVQALLMGGHGDTMVPLPRYTTVGGIPVPQLIDDARIEEIVERTKGAGGEIVDLMGTSAWYAPGAAAAEMTEAILK
DNKRILPCAAYCDGEYGLDDLFIGVPVKLGAGGVEEVIEVDLDADEKAQLKTSAGHVHSNLDDLQRLRDEGKIG
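If you want to check that a FASTA file is well formed before giving it to BLAST, a few lines of Python are enough. This is only a minimal sketch: the function name read_fasta is my own, and it assumes the simple layout shown above (a header line starting with ">" followed by sequence lines).

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without newlines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The 3NEP query sequence exactly as shown above.
query_text = """>3NEP:X|PDBID|CHAIN|SEQUENCE
MKVTVIGAGNVGATVAECVARQDVAKEVVMVDIKDGMPQGKALDMRESSPIHGFDTRVTGTNDYGPTEDSDVCIITAGLP
RSPGMSRDDLLAKNTEIVGGVTEQFVEGSPDSTIIVVANPLDVMTYVAYEASGFPTNRVMGMAGVLDTGRFRSFIAEELD
VSVRDVQALLMGGHGDTMVPLPRYTTVGGIPVPQLIDDARIEEIVERTKGAGGEIVDLMGTSAWYAPGAAAAEMTEAILK
DNKRILPCAAYCDGEYGLDDLFIGVPVKLGAGGVEEVIEVDLDADEKAQLKTSAGHVHSNLDDLQRLRDEGKIG
"""

query_records = read_fasta(query_text)
for header, sequence in query_records:
    print(header, len(sequence))
```

A file with exactly one record and no stray characters in the sequence is what you want for the query.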
3rd step ---
Prepare your database. First of all, you need a FASTA file that contains the sequences of all the proteins you would like to search against. In the example below, my FASTA file contains 4 PDB entries, namely 1EMD, 1IB6, 1Z2I, and 3TL2, with one record per chain (10 records in total). I saved it as meso.fasta.
>1EMD:A|PDBID|CHAIN|SEQUENCE
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVR
RKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIIRSNTFVAE
LKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVR
ALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGQEFVNK
>3TL2:A|PDBID|CHAIN|SEQUENCE
SNAMTIKRKKVSVIGAGFTGATTAFLLAQKELADVVLVDIPQLENPTKGKALDMLEASPVQGFDANIIGTSDYADTADSD
VVVITAGIARKPGMSRDDLVATNSKIMKSITRDIAKHSPNAIIVVLTNPVDAMTYSVFKEAGFPKERVIGQSGVLDTARF
RTFIAQELNLSVKDITGFVLGGHGDDMVPLVRYSYAGGIPLETLIPKERLEAIVERTRKGGGEIVGLLGNGSAYYAPAAS
LVEMTEAILKDQRRVLPAIAYLEGEYGYSDLYLGVPVILGGNGIEKIIELELLADEKEALDRSVESVRNVMKVLV
>1IB6:A|PDBID|CHAIN|SEQUENCE
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVA
RKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIICSNTFVAE
LKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVR
ALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGQEFVNK
>1IB6:B|PDBID|CHAIN|SEQUENCE
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVA
RKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIICSNTFVAE
LKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVR
ALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGQEFVNK
>1IB6:C|PDBID|CHAIN|SEQUENCE
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVA
RKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIICSNTFVAE
LKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVR
ALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGQEFVNK
>1IB6:D|PDBID|CHAIN|SEQUENCE
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVA
RKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIICSNTFVAE
LKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVR
ALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGQEFVNK
>1Z2I:A|PDBID|CHAIN|SEQUENCE
MAHGNEKATVLARLDELERFCRAVFLAVGTDEETADAATRAMMHGTRLGVDSHGVRLLAHYVTALEGGRLNRRPQISRVS
GFGAVETIDADHAHGARATYAAMENAMALAEKFGIGAVAIRNSSHFGPAGAYALEAARQGYIGLAFCNSDSFVRLHDGAM
RFHGTNPIAVGVPAADDMPWLLDMATSAVPYNRVLLYRSLGQQLPQGVASDGDGVDTRDPNAVEMLAPVGGEFGFKGAAL
AGVVEIFSAVLTGMRLSFDLAPMGGPDFSTPRGLGAFVLALKPEAFLERDVFDESMKRYLEVLRGSPAREDCKVMAPGDR
EWAVAAKREREGAPVDPVTRAAFSELAEKFSVSPPTYH
>1Z2I:B|PDBID|CHAIN|SEQUENCE
MAHGNEKATVLARLDELERFCRAVFLAVGTDEETADAATRAMMHGTRLGVDSHGVRLLAHYVTALEGGRLNRRPQISRVS
GFGAVETIDADHAHGARATYAAMENAMALAEKFGIGAVAIRNSSHFGPAGAYALEAARQGYIGLAFCNSDSFVRLHDGAM
RFHGTNPIAVGVPAADDMPWLLDMATSAVPYNRVLLYRSLGQQLPQGVASDGDGVDTRDPNAVEMLAPVGGEFGFKGAAL
AGVVEIFSAVLTGMRLSFDLAPMGGPDFSTPRGLGAFVLALKPEAFLERDVFDESMKRYLEVLRGSPAREDCKVMAPGDR
EWAVAAKREREGAPVDPVTRAAFSELAEKFSVSPPTYH
>1Z2I:C|PDBID|CHAIN|SEQUENCE
MAHGNEKATVLARLDELERFCRAVFLAVGTDEETADAATRAMMHGTRLGVDSHGVRLLAHYVTALEGGRLNRRPQISRVS
GFGAVETIDADHAHGARATYAAMENAMALAEKFGIGAVAIRNSSHFGPAGAYALEAARQGYIGLAFCNSDSFVRLHDGAM
RFHGTNPIAVGVPAADDMPWLLDMATSAVPYNRVLLYRSLGQQLPQGVASDGDGVDTRDPNAVEMLAPVGGEFGFKGAAL
AGVVEIFSAVLTGMRLSFDLAPMGGPDFSTPRGLGAFVLALKPEAFLERDVFDESMKRYLEVLRGSPAREDCKVMAPGDR
EWAVAAKREREGAPVDPVTRAAFSELAEKFSVSPPTYH
>1Z2I:D|PDBID|CHAIN|SEQUENCE
MAHGNEKATVLARLDELERFCRAVFLAVGTDEETADAATRAMMHGTRLGVDSHGVRLLAHYVTALEGGRLNRRPQISRVS
GFGAVETIDADHAHGARATYAAMENAMALAEKFGIGAVAIRNSSHFGPAGAYALEAARQGYIGLAFCNSDSFVRLHDGAM
RFHGTNPIAVGVPAADDMPWLLDMATSAVPYNRVLLYRSLGQQLPQGVASDGDGVDTRDPNAVEMLAPVGGEFGFKGAAL
AGVVEIFSAVLTGMRLSFDLAPMGGPDFSTPRGLGAFVLALKPEAFLERDVFDESMKRYLEVLRGSPAREDCKVMAPGDR
EWAVAAKREREGAPVDPVTRAAFSELAEKFSVSPPTYH
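In the file above, the four chains of 1IB6 appear to share one sequence, and so do the four chains of 1Z2I, so the same protein would probably show up several times in the BLAST results. If that bothers you, you could group identical sequences before building the database. The sketch below is just one way to do it; the function name group_identical and the toy records are made up for illustration and are not real proteins.

```python
def group_identical(records):
    """Map each distinct sequence to the list of headers that carry it."""
    groups = {}
    for header, sequence in records:
        groups.setdefault(sequence, []).append(header)
    return groups


# Toy (header, sequence) records, made up for illustration only.
records = [
    ("X1:A", "MKVAV"),
    ("X1:B", "MKVAV"),  # same sequence as X1:A
    ("X2:A", "MAHGN"),
]

groups = group_identical(records)

# Keep one representative per distinct sequence (the first header seen).
unique = [(headers[0], sequence) for sequence, headers in groups.items()]

for header, sequence in unique:
    print(">" + header)
    print(sequence)
```

Whether to collapse identical chains is a matter of taste: keeping them all is harmless, it just makes the result list longer.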
When you're done creating this file, you can create your database.
1. Open a command prompt and go to your BLAST bin folder:
cd /d "C:\Program Files\NCBI\blast-2.2.28+\bin"
2. Run makeblastdb -in YOUR_FASTA_FILE -dbtype prot -out YOUR_WISH_DBNAME. In my case:
makeblastdb -in D:\proteins\blastdata\meso.fasta -dbtype prot -out Meso
3. Once it's done, look in your BLAST bin folder and you'll find 3 new files there (they land in the current folder because -out was given without a path): YOUR_WISH_DBNAME.phr, YOUR_WISH_DBNAME.pin, and YOUR_WISH_DBNAME.psq
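If you would rather run this step from a script than type it by hand, something like the following Python sketch could work. It only uses the makeblastdb options shown above (-in, -dbtype, -out); the helper names are my own, and it assumes makeblastdb can be found on your PATH (otherwise pass the full path to the executable).

```python
import os
import subprocess


def makeblastdb_command(fasta_path, db_name, exe="makeblastdb"):
    """Build the argument list for a protein database, as in step 3."""
    return [exe, "-in", fasta_path, "-dbtype", "prot", "-out", db_name]


def expected_db_files(db_name):
    """The three files described above for a protein database."""
    return [db_name + ext for ext in (".phr", ".pin", ".psq")]


def build_database(fasta_path, db_name):
    """Run makeblastdb and report which expected files are missing.

    Needs a working BLAST installation, so it is not called in this sketch.
    """
    subprocess.run(makeblastdb_command(fasta_path, db_name), check=True)
    return [f for f in expected_db_files(db_name) if not os.path.exists(f)]


cmd = makeblastdb_command(r"D:\proteins\blastdata\meso.fasta", "Meso")
print(" ".join(cmd))
print(expected_db_files("Meso"))
```

Passing the arguments as a list (instead of one long string) avoids quoting problems with paths that contain spaces, such as "Program Files".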
4th step ---
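Now we are ready to BLAST. The general form of the command is:
blastp -query YOUR_FASTA_SEQUENCE_file.txt -db YOUR_WISH_DBNAME -out YOUR_RESULT.txt
In my case (run from the same BLAST bin folder, where the Meso database files are):
blastp -query "D:\proteins\blastdata\3NEP.fasta.txt" -db Meso -out "D:\proteins\blastdata\homo1.txt"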
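The result file written by the command above is a human-readable report. If you want to process the hits in a script instead, blastp also has a tabular format: adding -outfmt 6 to the command gives one tab-separated line per hit with 12 default columns. The Python sketch below parses that format. Note that the two sample lines are made up just to show the layout (they are not real results for 3NEP), and the 1e-5 E-value cutoff is only an illustrative choice.

```python
# Default column order of blastp -outfmt 6.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_tabular(text):
    """Parse tab-separated BLAST output into a list of dicts."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, line.split("\t")))
        # Convert the numeric fields we want to compare on.
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


# Made-up sample lines, for illustration only.
sample = (
    "query1\tsubjA\t55.0\t300\t135\t2\t1\t300\t5\t304\t1e-100\t290\n"
    "query1\tsubjB\t25.0\t120\t90\t3\t10\t129\t20\t139\t0.5\t30.1\n"
)

hits = parse_tabular(sample)

# Keep hits below an illustrative E-value cutoff.
significant = [h["sseqid"] for h in hits if h["evalue"] < 1e-5]
print(significant)
```

A lower E-value means the match is less likely to be due to chance, so filtering on it is a common first pass when looking for homologs.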
Easy huh? Good luck! ;)