Help Page

General Recommendations for ChemBioServer usage:

Follow the steps for each application.
Press the processing button only once.
After the processing button is pressed, the data are uploaded to the server and processed.
Do not press the browser Refresh button during processing (while the page is busy).
When the application has finished, the web page is refreshed automatically and the results are displayed on the same page.
Note that the approximate processing time for 1000 compounds is 4 minutes.
It is recommended to use sdf files with fewer than 2000 compounds.
To convert files to sdf or mol format, you can use Open Babel.
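
For larger libraries, the file can be split before upload. The following is a minimal sketch, not part of ChemBioServer, that counts the records in an SD file and splits it into chunks below the recommended size. It assumes that every record ends with the standard `$$$$` terminator line; the function names are illustrative.

```python
from typing import Iterator, List

SDF_DELIMITER = "$$$$"  # standard record terminator in SD files


def iter_sdf_records(text: str) -> Iterator[str]:
    """Yield each molecule record, including its $$$$ terminator line."""
    record: List[str] = []
    for line in text.splitlines():
        record.append(line)
        if line.strip() == SDF_DELIMITER:
            yield "\n".join(record) + "\n"
            record = []


def split_sdf(text: str, max_records: int = 1999) -> List[str]:
    """Split SDF text into chunks holding at most max_records molecules."""
    chunks: List[str] = []
    current: List[str] = []
    for rec in iter_sdf_records(text):
        current.append(rec)
        if len(current) == max_records:
            chunks.append("".join(current))
            current = []
    if current:  # remaining records that did not fill a whole chunk
        chunks.append("".join(current))
    return chunks
```

Each returned chunk can be written to its own sdf file and processed separately.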

Quick Link Instructions
Case Study Work Flow for discovering PI3Kα inhibitors.
Hypothetical Work Flow Example.
Browse Compounds.
Predefined Queries Filtering.
Combined Searching Filtering.
Advanced Substructure Filtering.
Van der Waals Filtering.
Toxicity Filtering.
Docking Re-ranking.
Hierarchical Clustering.
Affinity Propagation Clustering.
Structural Similarity Network Visualization and Analysis.
Custom Pipeline Filtering.
Graphical representation of molecular properties.
MACCS Fingerprints.
Sample Data.
Enable JavaScript Instruction.
References.

Case Study Work Flow for discovering PI3Kα inhibitors

This work has been submitted for publication in the 12th IEEE International Conference on BioInformatics and BioEngineering by:

Paraskevi Gkeka, Emmanouil Athanasiadis, George Spyrou and Zoe Cournia, "Enhancing the effectiveness of virtual screening by using the ChemBioServer: Application to the discovery of PI3Kα inhibitors".

The crystal structure of the mutant PI3Kα (PDB ID: 3HIZ) was complemented for missing parts using a combination of homology and loop modeling in order to create the full atomistic model of the full-length H1047R mutant. The resulting model was solvated in water and employed in Molecular Dynamics (MD) simulations using the NAMD package (James et al., 2005). The CHARMM22 force field (MacKerell et al., 1998; Buck et al., 2006) was used to model all protein interactions and the TIP3P model (Jorgensen et al., 1983) was used for water. The solvated protein system was energy-minimized and gradually heated from 0 to 310 K with constraints of 1 kcal mol⁻¹ Å⁻² applied on the backbone protein atoms under constant volume. An equilibration run was then performed under constant pressure and constant temperature. Non-bonded forces were calculated with a 2-fs time step and a 12 Å cut-off using the CHARMM switch potential between 10 and 12 Å. Bonds involving hydrogen were kept rigid by using the SHAKE algorithm (Ryckaert et al., 1977) for the protein and the SETTLE algorithm (Miyamoto et al., 1992) for water. Periodic boundary conditions were applied and the Particle Mesh Ewald method (Darden et al., 1993) was used to calculate electrostatic interactions every 4 fs. The pressure was maintained at 1 atm with the Langevin piston method (Feller et al., 1995), while the temperature was maintained at 310 K by means of Langevin dynamics with a damping coefficient of 5 ps⁻¹. Atomic coordinates of the systems were saved every 2 ps. The total simulation time was 70 ns.
Following the simulation, binding site analysis was performed using the SiteMap module of Schrödinger v2.4 (Schrödinger, 2011) on the protein conformation corresponding to the last frame of the trajectory (70 ns). An allosteric binding site close to the H1047R mutation was found by SiteMap to be among the top-ranked potential receptor binding sites and was used in the present study. After binding site identification, we performed virtual screening using the docking program Glide 5.7 (Schrödinger, 2011; Glide, 2011; Friesner et al., 2004). In the process of virtual screening, initially, the all-atom protein model was submitted to a series of restrained, partial minimizations using the OPLS-AA force field within the “Protein Preparation” module of Glide. A benzene molecule was placed in the predicted binding cavity and was used for the “Grid Generation” module of Glide, which prepares a grid for ligand docking. For the protein preparation, grid generation, and ligand docking procedures, the default Glide settings were used. The van der Waals (vdW) radii for nonpolar ligand atoms were scaled by a factor of 0.8, thereby decreasing penalties for close contacts. Receptor atoms were not scaled. The drug-like subset of the HitFinder collection from the Maybridge database (www.maybridge.com) was used for the virtual screening (Cournia et al., 2009). All structures were docked and scored using the Glide standard precision (SP) mode (Friesner et al., 2004). The 10,000 top-ranked structures from the SP filter were redocked and rescored using the Glide extra precision (XP) mode (Friesner et al., 2006). The complexes for the top-ranked 1000 compounds resulting from the XP processing were submitted to further postprocessing with the ChemBioServer.
To post-process docking results in order to enhance the experimental hit rate, several functionalities of the ChemBioServer were used. Initially, the vdW filtering was applied to remove compounds with steric clashes. Poses that are far from the energy minimum are unlikely to be adopted in nature and hence should be discarded. In this docking exercise with Glide we observed that the post-docking poses often suffered from vdW clashes; even after Glide post-docking minimization, approximately 20% of the generated poses should be discarded due to unrealistic vdW interactions. The compounds that passed the vdW filtering were then subjected to physicochemical properties/toxicity filtering within the ChemBioServer.  Subsequently, a hierarchical clustering was performed for the remaining compounds using the Tanimoto coefficient and the Ward Clustering Linkage. Finally, the resulting clusters were visually inspected and the most promising compounds were purchased and submitted to in vitro assay testing. The process is described in the following Figure.

Workflow

Molecular Dynamics (MD) simulations of the PI3Kα protein, a common target in cancer, were performed starting from its crystal structure. Binding site identification on the last frame of the MD trajectory revealed an allosteric binding site, which was further used to perform a docking exercise for the identification of PI3Kα inhibitors. Virtual screening results were then post-processed with physicochemical, toxicity, and structural filters in order to enhance the efficiency and accuracy of the docking exercise. Initially, the 1000 top-scored compounds were filtered for steric clashes with the vdW filtering available in ChemBioServer, using a threshold energy of 50 kcal/mol, which resulted in 250 rejected poses. The remaining 750 drug candidates were subjected to physicochemical/toxicity filtering and the 600 accepted compounds were grouped in clusters via hierarchical clustering using Simple Matching Coefficient, Ward Clustering Linkage and distance 0.8. The clustering resulted in twenty clusters. At most one compound per cluster was selected by visual inspection based on a) important ligand interactions with key residues of the binding site and b) promising predicted physicochemical properties. Finally, seven of the most promising compounds were purchased. The compounds were tested with in vitro assays and four inhibited PI3Kα activity in micromolar concentrations, achieving a 57% hit rate and indicating that the workflow described herein can be successfully applied to enhance the hit rate of in silico drug discovery.
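
The numbers quoted in this case study can be checked with a few lines of arithmetic. The sketch below only restates the counts given in the text; the variable names are illustrative.

```python
# Compound counts quoted in the case study
docked_top = 1000    # top-scored compounds sent to ChemBioServer
vdw_rejected = 250   # poses rejected at the 50 kcal/mol vdW threshold
after_tox = 600      # compounds accepted by physicochemical/toxicity filtering
purchased = 7        # compounds bought for in vitro testing
active = 4           # compounds that inhibited PI3Kα activity

after_vdw = docked_top - vdw_rejected                        # 750 remaining
vdw_reject_pct = 100 * vdw_rejected / docked_top             # 25.0 % of poses
tox_reject_pct = 100 * (after_vdw - after_tox) / after_vdw   # 20.0 % of the 750
hit_rate_pct = 100 * active / purchased                      # about 57 %

print(f"vdW rejected: {vdw_reject_pct:.0f}%, hit rate: {hit_rate_pct:.0f}%")
```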


Hypothetical Work Flow Example
In the following figure, a hypothetical work flow for candidate drug compound selection is presented.
An sdf file with N compounds is used.
Using the Lipinski Rule of 5, this set is reduced to M (< N) compounds.
Using successive substructure, van der Waals and toxicity filtering, the set of M compounds is further reduced to L (<< N) compounds that meet the filtering criteria.
Using Affinity Propagation Clustering, the G (G <<< N) most representative compounds of the clusters (exemplars) are proposed as candidate drug compounds.

Workflow


"Browse Compounds" Tutorial

In the Browse Compounds section, the user is able to upload data in either sdf, or mol format, to explore sdf information and to visualize compounds in 2D (if check button is checked) or in 3D (Cao et al., 2008; O'Boyle et al., 2011).

Step 1. Press Choose File Button in order to select a file. Sample Data
Step 2. Press Upload File and wait a few seconds.
After the file is uploaded, the compounds are listed with their sdf details.
The 2D chemical structure, drawn with an internal applet (JChemPaint), is presented along with the additional information included in each sdf record.
A link to an external applet (Jmol) is also provided for each compound; it shows a 3D representation of the molecule based on the x-y-z coordinates given in the sdf file.
For sdf files with more than 50 compounds, the 2D preview is not yet available.

Browse Compounds Tutorial


"Predefined Queries" Tutorial

In the Predefined Queries Search section, the user is able to upload data and apply filtering using the Lipinski Rule of Five. Data that PASS the test are available in sdf format for further use (Lipinski et al., 2001; Guha et al., 2007; Cao et al., 2008; O'Boyle et al., 2011).

Step 1. Press Choose File Button in order to select a file. Sample Data
Step 2. Press Process Data and wait a few seconds.
Results will be displayed as shown in the following Figure.
The compounds that PASS or FAIL the Lipinski Rule of Five criteria are listed, and an sdf file with the compounds that PASS is available.
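
As an illustration of the rule itself, the sketch below checks the four classical Lipinski criteria (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) on precomputed descriptor values. How ChemBioServer computes the descriptors and how many violations it tolerates are not stated on this page, so the strict default (`max_violations=0`) and all names here are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Descriptors:
    mol_weight: float      # molecular weight in Da
    logp: float            # calculated octanol/water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> List[str]:
    """Return the names of the Lipinski criteria that the compound violates."""
    violations = []
    if d.mol_weight > 500:
        violations.append("molecular weight > 500")
    if d.logp > 5:
        violations.append("logP > 5")
    if d.h_bond_donors > 5:
        violations.append("H-bond donors > 5")
    if d.h_bond_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


def passes_lipinski(d: Descriptors, max_violations: int = 0) -> bool:
    """PASS if the number of violations does not exceed max_violations (assumed default: 0)."""
    return len(lipinski_violations(d)) <= max_violations
```

For example, a hypothetical compound described by `Descriptors(320.4, 2.1, 2, 5)` passes, while `Descriptors(612.0, 6.3, 2, 5)` fails on both molecular weight and logP.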

Simple Search Tutorial

"Combined Search" Tutorial

In the Combined Search section, the user is able to upload data and apply advanced custom filtering on physicochemical properties such as molecular weight, charge, number of atoms, number of aromatic rings, etc. The criteria are combined using AND logic. Data that PASS the test are available in sdf format for further use (Guha et al., 2007; Cao et al., 2008; O'Boyle et al., 2011).

Step 1. Press Choose File Button in order to select a file. Sample Data
Step 2 (Optional). The user is able to upload a previously saved txt file with searching criteria. After the file is selected, the searching parameters are automatically changed according to the txt file.
Step 3. Select Searching Parameters, e.g. Molecular Weight < 500
Searching criteria are combined with AND logic.
Step 4. Press Process Data and wait a few seconds.
Results will be displayed as shown in the following Figure.
The compounds that PASS or FAIL the searching criteria, as well as an sdf file with the compounds that PASS, are available. The user is able to download an additional txt file in which the searching settings are stored, so that the filtering can be repeated in the future.
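
The AND logic can be pictured as follows. This is a minimal sketch under stated assumptions: the one-criterion-per-line text format (`property operator value`) and the property names are hypothetical and do not describe the actual layout of the txt file saved by ChemBioServer.

```python
import operator
from typing import Dict, List, Tuple

# Comparison operators accepted in a criterion line
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}

Criterion = Tuple[str, str, float]  # (property name, operator, value)


def parse_criteria(text: str) -> List[Criterion]:
    """Parse lines such as 'molecular_weight < 500' (hypothetical format)."""
    criteria: List[Criterion] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, op, value = line.split()
        if op not in OPS:
            raise ValueError(f"unknown operator: {op}")
        criteria.append((name, op, float(value)))
    return criteria


def passes_all(props: Dict[str, float], criteria: List[Criterion]) -> bool:
    """AND logic: a compound passes only if every criterion holds."""
    return all(OPS[op](props[name], value) for name, op, value in criteria)
```

With the criteria `molecular_weight < 500` and `aromatic_rings >= 1`, a compound with molecular weight 320.4 and two aromatic rings passes, whereas one with molecular weight 612.0 fails regardless of its ring count.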

Combined Search Tutorial

"Substructure Filtering" Tutorial

In the Substructure Filtering section, the user is able to upload two (2) different files and apply perfect-match substructure filtering. Data of File 1 that do not contain data of File 2 PASS the test (Guha et al., 2007; Cao et al., 2008; O'Boyle et al., 2011).

Step 1. Press Choose File Button in order to select file 1 in sdf format. Sample Data
Step 2. Press Choose File Button in order to select file 2 in sdf format. Sample Data
Step 3. Press Process Data and wait a few seconds.
After processing, the following information is available to the user:
- The compounds of file 1.
- The compounds of file 2.
- The compounds of file 2 that do not exist in file 1 (Green Color).
- The compounds of file 2 that are common with file 1 (Red Color).
- A summary txt file with the compounds of file 2 that were present (FAIL) or absent (PASS) in file 1, as well as an sdf file with the compounds that PASS the filtering test.

Substructure Filtering Tutorial

"Van der Waals Filtering" Tutorial

In the Van der Waals Filtering section, the user is able to upload a file and discard molecules by evaluating the van der Waals energies of the poses as well as geometric criteria.
In several docking exercises with Glide (Schrodinger, LLC) we have observed that the post-docking poses often suffer from vdW clashes. In particular, we observed that even after Glide post-docking minimization, approximately 20% of the generated poses should be discarded due to unrealistic vdW interactions, which made it necessary to automate this procedure.
Data that PASS van der Waals criteria are available in sdf format for further use (Jorgensen et al., 1996; Guha et al., 2007; Cao et al., 2008; O'Boyle et al., 2011).



Figures with bad vdW poses after docking experiment.

Step 1. Press Choose File Button in order to select a file to test in sdf format. Sample Data.
Step 2. Select the van der Waals energy threshold and tolerance,
e.g. 50 kcal/mol and 75%, respectively.
Step 3. Press Process Data and wait a few seconds.
After processing, the following information is available to the user:
- The compounds that PASS vdW test (Green Color).
- The compounds that FAIL vdW test (Red Color).
- Results in txt format, as well as passing compounds in sdf format.

Van der Waals Filtering Tutorial

"Toxicity Filtering" Tutorial

In the Toxicity Filtering section, the user is able to upload a file and discard compounds with toxic moieties. Data that PASS Toxicity test are available in sdf format for further use (Guha et al., 2007; Cao et al., 2008; O'Boyle et al., 2011).

Step 1. Press Choose File Button in order to select a file. Sample Data
Step 2. Press Process Data and wait a few seconds.
After processing, the following information is available to the user:
- The compounds of the file.
- The toxic roots.
- The compounds that PASS toxicity test (Green Color).
- The compounds that FAIL toxicity test (Red Color).
- Results in txt format, as well as non-toxic compounds in sdf format.

Toxicity Filtering Tutorial

"Docking Re-ranking" Tutorial

In the docking re-ranking section, compounds arising from virtual screening are post processed to enhance protein selectivity.

The objective of this tool is to re-rank docking results based on screening a compound library against different protein members of the same family and selecting only those compounds that simultaneously have a low energy of binding for the protein of interest and a high energy of binding for all the other proteins.

The input files are docking results stored in the general CSV format with only two conditions: there must be at least two columns named 'Unique SMILES Stereo' and 'docking score'. Other columns can be present in the CSV files but will not be considered during the filtering. If you've used Schrödinger's software to do the docking calculations, click here to find out how to put your results in the correct format.

  • The 'Unique SMILES Stereo' column is the unique SMILES representation of each compound.
  • The 'docking score' column is the calculated energy of interaction for a molecule in kcal/mol units.
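
Before uploading, a CSV file can be checked for the two required columns. The sketch below is illustrative and not part of the re-ranking tool; keeping the lowest score when a SMILES string appears more than once is an assumption made here, and the function name is hypothetical.

```python
import csv
import io
from typing import Dict, TextIO

REQUIRED_COLUMNS = ("Unique SMILES Stereo", "docking score")


def read_docking_scores(handle: TextIO) -> Dict[str, float]:
    """Read one docking-result CSV into a {SMILES: score in kcal/mol} mapping."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    scores: Dict[str, float] = {}
    for row in reader:
        smiles = row["Unique SMILES Stereo"]
        score = float(row["docking score"])
        # Assumption: if a compound appears several times, keep its best (lowest) score
        scores[smiles] = min(score, scores.get(smiles, score))
    return scores


# Usage with an in-memory file; extra columns such as 'Title' are ignored
example = io.StringIO(
    "Title,Unique SMILES Stereo,docking score\n"
    "a,CCO,-5.2\n"
    "b,CCO,-6.1\n"
    "c,c1ccccc1,-4.0\n"
)
print(read_docking_scores(example))
```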
There are three different types of filtering. The compounds that pass the filtering are stored in Excel format for the convenience of a general user base.

Step 1. In this step the user can upload multiple CSV files with the docking results for the protein of interest. More than one file can be used because the protein of interest might have more than one crystal structure. Compounds with low energies of interaction with these structures are sought during the filtering.

Step 2. In this step the user can upload multiple CSV files with the docking results of the rest of the family members. More than one file can be used because multiple different proteins might have been used (and ideally, multiple crystal structures of each protein). Compounds that have high energies of interaction for these proteins will be looked for during the filtering.

Step 3. Three different methods can be selected from the drop-down button. Additionally, a minimal number of compounds to pass the filtering has to be input in the 'Min compounds' text box (allowed minimum is 1).

  • 1. Automatic (default): This method starts by defining the cutoffs as the top 1% best-scoring compounds for the target(s) and the top 1% worst-scoring compounds for the rest of the proteins. Both cutoffs are iteratively relaxed in 1% steps until at least the minimum number of compounds requested by the user meets the filter conditions. This method is guaranteed to always find at least the minimum number of compounds that the user wants.
  • 2. Manual: When this method is selected, two extra input boxes appear. The 'Target(s) cutoff' is the energy (in kcal/mol) below which selective compounds must score for all the proteins uploaded in Step 1. The second text box, 'Other(s) cutoff', is the energy (in kcal/mol) above which the compounds should score for the rest of the proteins in the family (those uploaded in Step 2). This method is not guaranteed to find the minimum number of compounds set by the user. If it does not, the program falls back to the automatic method and reports the results obtained with it instead. We recommend starting with a small (<10) number of compounds when using this method (although this will depend on how many proteins you're screening against).
  • 3. Score Difference: This method provides an alternative way to define compound specificity. Often, the absolute values of the cutoffs may be less important than the actual energy difference between a compound's scores for each protein. The larger this difference, the more selective the compound will be. Therefore, the user can specify a desired energy difference in the 'Difference in kcal/mol' box, and the program will proceed in a similar fashion to the automatic procedure. It will start by defining the top 1% lowest-scoring compounds for the target protein, and the second cutoff will be set above it by the given score difference. Again, if no compounds match this filter, the low-energy cutoff will be gradually increased in 1% steps, while the high-energy cutoff will always be at least the desired number of kcal/mol above it.
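
The automatic method can be sketched as follows. This is one possible reading of the description above, not the tool's actual code: each result set is a mapping from SMILES to docking score, "best" means lowest energy, the percentage is applied to each result set separately, and a compound is kept only if it falls inside the cutoff of every uploaded file. All names are illustrative.

```python
import math
from typing import Dict, List, Set, Tuple

Scores = Dict[str, float]  # SMILES -> docking score (kcal/mol)


def top_fraction(scores: Scores, percent: int, best: bool) -> Set[str]:
    """Compounds in the best (lowest-energy) or worst (highest-energy) percent."""
    n = max(1, math.ceil(len(scores) * percent / 100))
    ranked = sorted(scores, key=scores.get, reverse=not best)
    return set(ranked[:n])


def automatic_rerank(targets: List[Scores], others: List[Scores],
                     min_compounds: int = 1) -> Tuple[int, Set[str]]:
    """Relax both cutoffs in 1% steps until enough compounds pass every filter."""
    if not targets and not others:
        raise ValueError("at least one result set is required")
    selected: Set[str] = set()
    for percent in range(1, 101):
        sets = [top_fraction(s, percent, best=True) for s in targets]
        sets += [top_fraction(s, percent, best=False) for s in others]
        selected = set.intersection(*sets)
        if len(selected) >= min_compounds:
            break
    return percent, selected


# Hypothetical scores for four compounds A-D against one target and one other protein
target = {"A": -10.0, "B": -9.0, "C": -5.0, "D": -4.0}
other = {"A": -9.0, "B": -3.0, "C": -8.0, "D": -2.0}
print(automatic_rerank([target], [other], min_compounds=1))
```

In this toy example compound B is returned: it scores well for the target (-9.0) and poorly for the other protein (-3.0), whereas A binds both proteins strongly and D binds neither.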

If the re-ranking calculation proceeds normally, the results will be written to an Excel file, which will be available for download by clicking the Download Results link. Result files are stored on the server for 24 hours only.

Note for Schrödinger users:

If you've docked a library of compounds with Glide, you will probably have stored your results in a mae or maegz file. For the program to work, you will need to have the SMILES representation of each molecule in a column named 'Unique SMILES Stereo' and save the results as a CSV file.

If you don't have the SMILES representation of each molecule, use the uniquesmiles program inside the utilities/ folder of your schrodinger installation to get the unique SMILES representation directly from your mae file using the following terminal command:

  • $ /your_schrodinger_installation_path/uniquesmiles docking_results.mae docking_results_SMILES.mae

If you open the new mae file in Maestro you will see that there is an additional entry in the Project Table called Unique SMILES Stereo. This column holds the information about the stereoisomerism and tautomerization state of each molecule. Finally, save the results in a CSV file. To do that, open your Project Table if you haven't already done so (Ctrl+T shortcut), go to Table→Export→Spreadsheet and save using the default settings (see image below). Please try to give your CSV files sensible names so they can be easily parsed by the re-ranking algorithm. We recommend something like: PDBID-PROTEINNAME.csv.

Saving the docking results as a CSV file.

This tool has been developed by Juan Eiros Zamora as a summer project for the PRACE programme Summer of HPC.

"Hierarchical" Clustering Tutorial

In the Hierarchical Clustering section, the user is able to upload a file (or a txt file with custom binary fingerprints) and cluster compounds with similar characteristics (Tanimoto, Dice, cosine, simple matching coefficient). In addition, results are available in PDF format (Guha et al., 2007; Cao et al., 2008; O'Boyle et al., 2011; Backman et al., 2011). A PDF reader program should be installed in order to display clustering results. In the present clustering implementation, Open Babel MACCS 166-bit binary fingerprints are used.

Step 1. Press Choose File Button in order to select a file. Sample Data
Step 2. Select the distance, clustering method and threshold,
e.g. Soergel (Tanimoto Coefficient) Distance and Complete linkage.
Step 3. Press Process Data and wait a few seconds.
Step 4. Press Show Results and the clustering result is displayed in pdf format. In order to handle (save, print, zoom, rotate) the dendrogram, left-click once inside the plot and a menu will appear. In addition, sdf files with the clusters are provided, either as a single cluster or as a zip file with each cluster in a separate sdf file.
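
For reference, the four coefficients named above have standard definitions on binary fingerprints, sketched here in plain Python. Fingerprints are assumed to be equal-length sequences of 0/1 values; returning 0.0 when a denominator is zero is a convention chosen for this sketch, and the server's own implementation may differ in such details.

```python
import math
from typing import Sequence, Tuple


def bit_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d): bits on in both, only in x, only in y, off in both."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    a = sum(1 for p, q in zip(x, y) if p and q)
    b = sum(1 for p, q in zip(x, y) if p and not q)
    c = sum(1 for p, q in zip(x, y) if q and not p)
    d = len(x) - a - b - c
    return a, b, c, d


def tanimoto(x, y) -> float:
    a, b, c, _ = bit_counts(x, y)
    return a / (a + b + c) if (a + b + c) else 0.0


def dice(x, y) -> float:
    a, b, c, _ = bit_counts(x, y)
    return 2 * a / (2 * a + b + c) if (2 * a + b + c) else 0.0


def cosine(x, y) -> float:
    a, b, c, _ = bit_counts(x, y)
    denom = math.sqrt((a + b) * (a + c))
    return a / denom if denom else 0.0


def simple_matching(x, y) -> float:
    a, _, _, d = bit_counts(x, y)
    return (a + d) / len(x) if len(x) else 0.0


def soergel_distance(x, y) -> float:
    """For binary fingerprints the Soergel distance equals 1 - Tanimoto."""
    return 1.0 - tanimoto(x, y)
```

For the two six-bit fingerprints `[1, 1, 0, 0, 1, 0]` and `[1, 0, 0, 1, 1, 0]`, two bits are shared, one is set only in each fingerprint and two are off in both, so the Tanimoto coefficient is 2/4 = 0.5 and the Dice, cosine and simple matching coefficients are each 2/3.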

kmeans clustering tutorial

"Affinity Propagation" Clustering Tutorial

In the Affinity Propagation Clustering section, the user is able to upload a file (or a txt file with custom binary fingerprints) and group compounds with similar characteristics (Tanimoto, Dice, cosine, simple matching coefficient). The most representative compounds (exemplars) for each cluster are provided for further investigation (Guha et al., 2007; Frey et al., 2007; Cao et al., 2008; O'Boyle et al., 2011; Bodenhofer et al., 2011). In the present clustering implementation, Open Babel MACCS 166-bit binary fingerprints are used.

Step 1. Press Choose File Button in order to select a file. Sample Data
Step 2. Select the distance method,
e.g. Soergel (Tanimoto Coefficient) Distance.
Step 3. Press Process Data and wait a few seconds.
After processing, the Affinity Propagation clustering method provides the user with exemplars (the most representative compounds) for each cluster.

Affinity Propagation Clustering Tutorial

"Structural Similarity Network" Visualization and Analysis Tutorial

In the "Construct a network of compounds" section of ChemBioServer, the user can upload an sdf file and after choosing a similarity metric and a value cutoff threshold for the edges (similarity values) can either visualize the network or run a network analysis.
In case the user wants to test an sdf file against another reference sdf set, there is a function which combines two sdf files in a network, giving different colors to the two groups. In the "Attach similar-only nodes to Network" tab, a main network is created for the reference set with a given edge threshold and then compounds from the test set are attached to the main network via another edge threshold (e.g. stricter connections). The user can then download the upper triangular adjacency matrix of the whole network, as well as the edgelist of the reference - test edges. Finally, in the "Remove nodes from Network, based on similarity" tab, a main network is created for the reference set with a given edge threshold and then compounds similar to the test set according to another edge threshold are removed from the network together with their edges. Again the user can download the upper triangular adjacency matrix of the new network, as well as the edgelist of the reference - test edges that accounted for the removal of the reference nodes.

Example: Structural Similarity Network Analysis

Step 1. Press Choose File Button in order to select a file.
Step 2. Select Similarity metric (5 options).
Step 3. Select Edge Threshold in [0, 1].
Step 4. Press Process Data and wait a few seconds.
After processing, the Degree, Betweenness and Strength metrics of each network node are presented to the user in tabular form.
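
To illustrate what the edge threshold and two of these metrics mean, the sketch below builds an edge list from a symmetric similarity matrix and computes the degree (number of edges) and strength (sum of edge weights) of each node. Whether the server keeps edges at exactly the threshold value is not stated here, so the `>=` comparison is an assumption; betweenness is omitted because it requires a shortest-path computation.

```python
from typing import List, Tuple

Edge = Tuple[int, int, float]  # (node i, node j, similarity)


def build_edges(sim: List[List[float]], threshold: float) -> List[Edge]:
    """Keep the pairs (i < j) whose similarity reaches the edge threshold."""
    n = len(sim)
    return [
        (i, j, sim[i][j])
        for i in range(n)
        for j in range(i + 1, n)  # upper triangle only; the matrix is symmetric
        if sim[i][j] >= threshold
    ]


def degree_and_strength(n_nodes: int, edges: List[Edge]) -> Tuple[List[int], List[float]]:
    """Degree = number of edges per node; strength = sum of their weights."""
    degree = [0] * n_nodes
    strength = [0.0] * n_nodes
    for i, j, w in edges:
        degree[i] += 1
        degree[j] += 1
        strength[i] += w
        strength[j] += w
    return degree, strength


# Hypothetical similarity matrix for three compounds
sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]
edges = build_edges(sim, threshold=0.5)
print(edges)
print(degree_and_strength(len(sim), edges))
```

With a threshold of 0.5 only the pairs (0, 1) and (1, 2) are connected, so the middle compound has degree 2 and the other two have degree 1.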

Network Analysis Metrics

Example: Combine two sdf files in a Network

Step 1. Press Choose File Button in order to select a reference set sdf file.
Step 2. Press Choose File Button in order to select a test set sdf file.
Step 3. Select Similarity metric (5 options).
Step 4. Select Edge Threshold in [0, 1].
Step 5. Press Process Data and wait a few seconds.
After processing, a combined network with different colors for each of the two groups is presented to the user and the respective adjacency matrix can be downloaded.

Dual Network

Example: Remove nodes from Network, based on similarity

Step 1. Press Choose File Button in order to select a reference set sdf file.
Step 2. Press Choose File Button in order to select a test set sdf file.
Step 3. Select Similarity metric (5 options).
Step 4. Select Edge Threshold for main network in [0, 1].
Step 5. Select Edge Threshold for reference - test edges in [0, 1].
Step 6. Press Process Data and wait a few seconds.
After processing, a network excluding the compounds similar to those in the test set is presented, and the respective upper triangular adjacency matrix and the edgelist file of the reference - test edges can be downloaded.

Dual Network

"Custom Pipeline Filtering" Tutorial

In the Custom Pipeline Filtering section, the user is able to upload a file and create fast custom pipeline filtering method by combining all previously described filtering services. (Guha et al., 2007; Cao et al., 2008; O'Boyle et al., 2011)

Step 1. Press Choose File Button in order to select a file 1. Sample Data
Step 2. Enable at least one filter by first clicking the corresponding checkbox and then selecting the appropriate parameters. When a field is enabled, its background color changes from green to white. To disable a selected field, just uncheck the checkbox.
Step 3. Press Process Data and wait a few seconds. An sdf file with the compounds that pass all tests is provided to the user.

Affinity Propagation Clustering Tutorial

"Graphical representations of molecular properties" Tutorial

In the Graphical representations of molecular properties section, the user is able to upload a file and Visualize Compounds' Properties (Guha et al., 2007; Wehrens et al., 2007; Cao et al., 2008; O'Boyle et al., 2011; Backman et al., 2011).

Step 1. Press Choose File Button in order to select a file with more than two (2) compounds. Sample Data
Step 2. Press Process Data and wait a few seconds.
Step 3. Press the Display Plots button in order to see the following graphs (drawn with the Raphaël JavaScript library):
- [PCA2 vs PCA1] Principal Component Analysis (PCA) first component (PCA1) against the second component (PCA2), based on the Tanimoto coefficient.
- [PSA vs logP] Logarithm of the calculated partition coefficient (logP) against the Polar Surface Area (PSA).
- [PSA vs MW] Molecular Weight (MW) against the Polar Surface Area (PSA).
- [logP vs MW] Molecular Weight (MW) against the logarithm of the calculated partition coefficient (logP).

Affinity Propagation Clustering Tutorial

MACCS Fingerprints

SMARTS definitions for the publicly available MACCS keys used by Open Babel.
Copyright (C) 2001-2008 Greg Landrum and Rational Discovery LLC.
These are SMARTS patterns corresponding to the MDL MACCS keys:

1:('?',0), # ISOTOPE
2:('[#103,#104]',0), # ISOTOPE
3:('[Ge,As,Se,Sn,Sb,Te,Tl,Pb,Bi]',0), # Group IVa,Va,VIa Periods 4-6
4:('[Ac,Th,Pa,U,Np,Pu,Am,Cm,Bk,Cf,Es,Fm,Md,No,Lr]',0), # actinide
5:('[Sc,Ti,Y,Zr,Hf]',0), # Group IIIB,IVB (Sc...)
6:('[La,Ce,Pr,Nd,Pm,Sm,Eu,Gd,Tb,Dy,Ho,Er,Tm,Yb,Lu]',0), # Lanthanide
7:('[V,Cr,Mn,Nb,Mo,Tc,Ta,W,Re]',0), # Group VB,VIB,VIIB
8:('[!#6;!#1]1~*~*~*~1',0), # QAAA@1
9:('[Fe,Co,Ni,Ru,Rh,Pd,Os,Ir,Pt]',0), # Group VIII
10:('[Be,Mg,Ca,Sr,Ba,Ra]',0), # Group IIa (Alkaline earth)
11:('*1~*~*~*~1',0), # 4M Ring
12:('[Cu,Zn,Ag,Cd,Au,Hg]',0), # Group IB,IIB (Cu..)
13:('[#8]~[#7](~[#6])~[#6]',0), # ON(C)C
14:('[#16]-[#16]',0), # S-S
15:('[#8]~[#6](~[#8])~[#8]',0), # OC(O)O
16:('[!#6;!#1]1~*~*~1',0), # QAA@1
17:('[#6]#[#6]',0), #CTC
18:('[B,Al,Ga,In,Tl]',0), # Group IIIA
19:('*1~*~*~*~*~*~*~1',0), # 7M Ring
20:('[Si]',0), #Si
21:('[#6]=[#6](~[!#6;!#1])~[!#6;!#1]',0), # C=C(Q)Q
22:('*1~*~*~1',0), # 3M Ring
23:('[#7]~[#6](~[#8])~[#8]',0), # NC(O)O
24:('[#7]-[#8]',0), # N-O
25:('[#7]~[#6](~[#7])~[#7]',0), # NC(N)N
26:('[#6]=;@[#6](@*)@*',0), # C$=C($A)$A
27:('[I]',0), # I
28:('[!#6;!#1]~[CH2]~[!#6;!#1]',0), # QCH2Q
29:('[#15]',0),# P
30:('[#6]~[!#6;!#1](~[#6])(~[#6])~*',0), # CQ(C)(C)A
31:('[!#6;!#1]~[F,Cl,Br,I]',0), # QX
32:('[#6]~[#16]~[#7]',0), # CSN
33:('[#7]~[#16]',0), # NS
34:('[CH2]=*',0), # CH2=A
35:('[Li,Na,K,Rb,Cs,Fr]',0), # Group IA (Alkali Metal)
36:('[#16R]',0), # S Heterocycle
37:('[#7]~[#6](~[#8])~[#7]',0), # NC(O)N
38:('[#7]~[#6](~[#6])~[#7]',0), # NC(C)N
39:('[#8]~[#16](~[#8])~[#8]',0), # OS(O)O
40:('[#16]-[#8]',0), # S-O
41:('[#6]#[#7]',0), # CTN
42:('F',0), # F
43:('[!C;!c;!#1;!H0]~*~[!C;!c;!#1;!H0]',0), # QHAQH
44:('?',0), # OTHER
45:('[#6]=[#6]~[#7]',0), # C=CN
46:('Br',0), # BR
47:('[#16]~*~[#7]',0), # SAN
48:('[#8]~[!#6;!#1](~[#8])(~[#8])',0), # OQ(O)O
49:('[!+0]',0), # CHARGE
50:('[#6]=[#6](~[#6])~[#6]',0), # C=C(C)C
51:('[#6]~[#16]~[#8]',0), # CSO
52:('[#7]~[#7]',0), # NN
53:('[!#6;!#1;!H0]~*~*~*~[!#6;!#1;!H0]',0), # QHAAAQH
54:('[!#6;!#1;!H0]~*~*~[!#6;!#1;!H0]',0), # QHAAQH
55:('[#8]~[#16]~[#8]',0), #OSO
56:('[#8]~[#7](~[#8])~[#6]',0), # ON(O)C
57:('[#8R]',0), # O Heterocycle
58:('[!#6;!#1]~[#16]~[!#6;!#1]',0), # QSQ
59:('[#16]!:*:*',0), # Snot%A%A
60:('[#16]=[#8]',0), # S=O
61:('*~[#16](~*)~*',0), # AS(A)A
62:('*@*!@*@*',0), # A$!A$A
63:('[#7]=[#8]',0), # N=O
64:('*@*!@[#16]',0), # A$A!S
65:('c:n',0), # C%N
66:('[#6]~[#6](~[#6])(~[#6])~*',0), # CC(C)(C)A
67:('[!#6;!#1]~[#16]',0), # QS
68:('[!#6;!#1;!H0]~[!#6;!#1;!H0]',0), # QHQH
69:('[!#6;!#1]~[!#6;!#1;!H0]',0), # QQH
70:('[!#6;!#1]~[#7]~[!#6;!#1]',0), # QNQ
71:('[#7]~[#8]',0), # NO
72:('[#8]~*~*~[#8]',0), # OAAO
73:('[#16]=*',0), # S=A
74:('[CH3]~*~[CH3]',0), # CH3ACH3
75:('*!@[#7]@*',0), # A!N$A
76:('[#6]=[#6](~*)~*',0), # C=C(A)A
77:('[#7]~*~[#7]',0), # NAN
78:('[#6]=[#7]',0), # C=N
79:('[#7]~*~*~[#7]',0), # NAAN
80:('[#7]~*~*~*~[#7]',0), # NAAAN
81:('[#16]~*(~*)~*',0), # SA(A)A
82:('*~[CH2]~[!#6;!#1;!H0]',0), # ACH2QH
83:('[!#6;!#1]1~*~*~*~*~1',0), # QAAAA@1
84:('[NH2]',0), #NH2
85:('[#6]~[#7](~[#6])~[#6]',0), # CN(C)C
86:('[C;H2,H3][!#6;!#1][C;H2,H3]',0), # CH2QCH2
87:('[F,Cl,Br,I]!@*@*',0), # X!A$A
88:('[#16]',0), # S
89:('[#8]~*~*~*~[#8]',0), # OAAAO
90:('[$([!#6;!#1;!H0]~*~*~[CH2]~*),$([!#6;!#1;!H0;R]1@[R]@[R]@[CH2;R]1),
$([!#6;!#1;!H0]~[R]1@[R]@[CH2;R]1)]',0), # QHAACH2A
91:('[$([!#6;!#1;!H0]~*~*~*~[CH2]~*),$([!#6;!#1;!H0;R]1@[R]@[R]@[R]@[CH2;R]1),
$([!#6;!#1;!H0]~[R]1@[R]@[R]@[CH2;R]1),
$([!#6;!#1;!H0]~*~[R]1@[R]@[CH2;R]1)]',0), # QHAAACH2A
92:('[#8]~[#6](~[#7])~[#6]',0), # OC(N)C
93:('[!#6;!#1]~[CH3]',0), # QCH3
94:('[!#6;!#1]~[#7]',0), # QN
95:('[#7]~*~*~[#8]',0), # NAAO
96:('*1~*~*~*~*~1',0), # 5 M ring
97:('[#7]~*~*~*~[#8]',0), # NAAAO
98:('[!#6;!#1]1~*~*~*~*~*~1',0), # QAAAAA@1
99:('[#6]=[#6]',0), # C=C
100:('*~[CH2]~[#7]',0), # ACH2N
101:('[$([R]@1@[R]@[R]@[R]@[R]@[R]@[R]@[R]1),
$([R]@1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]1),
$([R]@1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]1),
$([R]@1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]1),
$([R]@1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]1),
$([R]@1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]1),
$([R]@1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]1)]',0),
# 8M Ring or larger. This only handles up to ring sizes of 14
102:('[!#6;!#1]~[#8]',0), # QO
103:('Cl',0), # CL
104:('[!#6;!#1;!H0]~*~[CH2]~*',0), # QHACH2A
105:('*@*(@*)@*',0), # A$A($A)$A
106:('[!#6;!#1]~*(~[!#6;!#1])~[!#6;!#1]',0), # QA(Q)Q
107:('[F,Cl,Br,I]~*(~*)~*',0), # XA(A)A
108:('[CH3]~*~*~*~[CH2]~*',0), # CH3AAACH2A
109:('*~[CH2]~[#8]',0), # ACH2O
110:('[#7]~[#6]~[#8]',0), # NCO
111:('[#7]~*~[CH2]~*',0), # NACH2A
112:('*~*(~*)(~*)~*',0), # AA(A)(A)A
113:('[#8]!:*:*',0), # Onot%A%A
114:('[CH3]~[CH2]~*',0), # CH3CH2A
115:('[CH3]~*~[CH2]~*',0), # CH3ACH2A
116:('[$([CH3]~*~*~[CH2]~*),$([CH3]~*1~*~[CH2]1)]',0), # CH3AACH2A
117:('[#7]~*~[#8]',0), # NAO
118:('[$(*~[CH2]~[CH2]~*),$(*1~[CH2]~[CH2]1)]',1), # ACH2CH2A > 1
119:('[#7]=*',0), # N=A
120:('[!#6;R]',1), # Heterocyclic atom > 1
121:('[#7;R]',0), # N Heterocycle
122:('*~[#7](~*)~*',0), # AN(A)A
123:('[#8]~[#6]~[#8]',0), # OCO
124:('[!#6;!#1]~[!#6;!#1]',0), # QQ
125:('?',0), # Aromatic Ring > 1
126:('*!@[#8]!@*',0), # A!O!A
127:('*@*!@[#8]',1), # A$A!O > 1
128:('[$(*~[CH2]~*~*~*~[CH2]~*),$([R]1@[CH2;R]@[R]@[R]@[R]@[CH2;R]1),
$(*~[CH2]~[R]1@[R]@[R]@[CH2;R]1),$(*~[CH2]~*~[R]1@[R]@[CH2;R]1)]',0), # ACH2AAACH2A
129:('[$(*~[CH2]~*~*~[CH2]~*),$([R]1@[CH2]@[R]@[R]@[CH2;R]1),
$(*~[CH2]~[R]1@[R]@[CH2;R]1)]',0), # ACH2AACH2A
130:('[!#6;!#1]~[!#6;!#1]',1), # QQ > 1
131:('[!#6;!#1;!H0]',1), # QH > 1
132:('[#8]~*~[CH2]~*',0), # OACH2A
133:('*@*!@[#7]',0), # A$A!N
134:('[F,Cl,Br,I]',0), # X (HALOGEN)
135:('[#7]!:*:*',0), # Nnot%A%A
136:('[#8]=*',1), # O=A>1
137:('[!C;!c;R]',0), # Heterocycle
138:('[!#6;!#1]~[CH2]~*',1), # QCH2A>1
139:('[O;!H0]',0), # OH
140:('[#8]',3), # O > 3
141:('[CH3]',2), # CH3 > 2
142:('[#7]',1), # N > 1
143:('*@*!@[#8]',0), # A$A!O
144:('*!:*:*!:*',0), # Anot%A%Anot%A
145:('*1~*~*~*~*~*~1',1), # 6M ring > 1
146:('[#8]',2), # O > 2
147:('[$(*~[CH2]~[CH2]~*),$([R]1@[CH2;R]@[CH2;R]1)]',0), # ACH2CH2A
148:('*~[!#6;!#1](~*)~*',0), # AQ(A)A
149:('[C;H3,H4]',1), # CH3 > 1
150:('*!@*@*!@*',0), # A!A$A!A
151:('[#7;!H0]',0), # NH
152:('[#8]~[#6](~[#6])~[#6]',0), # OC(C)C
153:('[!#6;!#1]~[CH2]~*',0), # QCH2A
154:('[#6]=[#8]',0), # C=O
155:('*!@[CH2]!@*',0), # A!CH2!A
156:('[#7]~*(~*)~*',0), # NA(A)A
157:('[#6]-[#8]',0), # C-O
158:('[#6]-[#7]',0), # C-N
159:('[#8]',1), # O>1
160:('[C;H3,H4]',0), # CH3
161:('[#7]',0), # N
162:('a',0), # Aromatic
163:('*1~*~*~*~*~*~1',0), # 6M Ring
164:('[#8]',0), # O
165:('[R]',0), # Ring
166:('?',0), #
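
Each entry in the table above pairs a SMARTS pattern with a count threshold, and the trailing comments ("O > 3", "CH3 > 2") suggest that a key is switched on only when the pattern matches more than that many times. The sketch below illustrates this reading for the four oxygen keys. It is a minimal illustration, not the server's implementation: the function name keys_set and the subset OXYGEN_KEYS are hypothetical, and the match count is supplied by hand instead of coming from a SMARTS matcher.

```python
# Subset of the key table above: key number -> (SMARTS pattern, count threshold).
OXYGEN_KEYS = {
    140: ('[#8]', 3),  # O > 3
    146: ('[#8]', 2),  # O > 2
    159: ('[#8]', 1),  # O > 1
    164: ('[#8]', 0),  # O
}


def keys_set(match_count, keys):
    """Return the sorted key numbers whose bit would be switched on.

    Assumed rule: a bit is on when the pattern matches strictly more
    than `threshold` times in the molecule.
    """
    return sorted(
        key for key, (_smarts, threshold) in keys.items()
        if match_count > threshold
    )


# A molecule with three oxygen atoms (count supplied by hand for illustration).
print(keys_set(3, OXYGEN_KEYS))  # [146, 159, 164]
```

Under this reading, a molecule with three oxygen atoms sets keys 146, 159 and 164 but not key 140, which requires at least four.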

Sample Data

Set 1: Download Dataset with 10 compounds that have undesirable vdW interactions. This dataset resulted from a virtual screening exercise performed in the Cournia lab with the Glide software (Schrödinger, LLC).
Set 2: Download Dataset with 20 toxic moieties. These toxicophores have been selected from the following references: "Blagg, J., (2006) Structure-Activity Relationships for In vitro and In vivo Toxicity, Annual Reports in Medicinal Chemistry 41, pp. 353-368" and "Baell, J.B., Holloway, G.A., (2010) New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays, Journal of Medicinal Chemistry 53 (7), pp. 2719-2740".
Set 3: Download Dataset with 5 compounds that appear as frequent hitters (promiscuous compounds) in many biochemical assays. These compounds have been selected from "Chuprina, A., Lukin, O., Demoiseaux, R., Buzko, A., Shivanyuk, A., (2010) Drug- and lead-likeness, target class, and molecular diversity analysis of 7.9 million commercially available organic compounds provided by 29 suppliers, Journal of Chemical Information and Modeling 50 (4), pp. 470-479".

The following 3 sdf files contain compounds randomly selected from the drug-like subset of the ZINC database.
Set 4: Download Dataset with 557 compounds for general use.
Set 5: Download Dataset with 1549 compounds for general use.
Set 6: Download Dataset with 1965 compounds for general use.


Set 7: Download Dataset with 10 binary random fingerprints, which can be used as an input in the section "Clustering".
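
To give a feel for what such an input looks like, the sketch below generates 10 random binary fingerprints and computes their pairwise Tanimoto distances, one common dissimilarity measure for binary fingerprints. Everything here is an assumption made for illustration: the fingerprint length of 166 bits, the fixed random seed, and the use of the Tanimoto distance are not taken from the server, and the actual file format of Set 7 is not reproduced.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity of two equal-length binary fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits on in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits on in at least one
    # Two all-zero fingerprints are treated as identical.
    return both / either if either else 1.0


random.seed(0)            # fixed seed so the example is reproducible
N_FP, N_BITS = 10, 166    # assumed sizes, for illustration only

# Ten random binary fingerprints, each a list of 0/1 values.
fps = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(N_FP)]

# Pairwise distance matrix: 0.0 on the diagonal, symmetric elsewhere.
dist = [[1.0 - tanimoto(p, q) for q in fps] for p in fps]
```

A matrix like dist is the kind of input a hierarchical clustering routine typically consumes.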


Set 8: Download Dataset with 1459 molecules, which are commercial fragments extracted from FDA-approved drugs. These fragments have been taken from ChemInformatic Tools and Databases.

Enable JavaScript Instruction

Google Chrome (15.0): Click the spanner icon on the browser toolbar. Select Options. Click the Under the Hood tab. Click Content Settings in the 'Privacy' section. Select Allow all sites to run JavaScript in the 'JavaScript' section.

Mozilla Firefox (8.0): Select Tools from the top menu. Choose Options. Choose Content from the top navigation. Select the checkbox next to Enable JavaScript and click OK.

Windows Internet Explorer (9.0): Select Tools from the top menu. Choose Internet Options. Click on the Security tab. Click on Custom Level. Scroll down until you see the section labeled 'Scripting.' Under 'Active Scripting,' select Enable and click OK.

Apple Safari (5.0): Open the Safari menu on your browser's toolbar. Choose Preferences. Choose Security. Select the checkbox next to Enable JavaScript.

References

Ryckaert JP, Ciccotti G, Berendsen HJC (1977) “Numerical integration of the Cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes”. J. Comput. Phys., 23: 327-341.

Jorgensen W. L., Chandrasekhar J., Madura J. D., Impey R. W. and Klein M. L. (1983) “Comparison of simple potential functions for simulating liquid water”. J. Chem. Phys., 79:926-935.

Miyamoto S, Kollman PA (1992) “Settle: An analytical version of the SHAKE and RATTLE algorithm for rigid water models”. J. Comput. Chem., 13: 952-962.

Darden T, York D, Pedersen L (1993) “Particle mesh Ewald: an Nlog(N) method for Ewald sums in large systems”. J. Chem. Phys., 98: 10089-10092.

Feller SE, Zhang YH, Pastor RW, Brooks BR (1995) “Constant pressure molecular dynamics simulation: The Langevin piston method”. J. Chem. Phys., 103: 4613-4621.

Jorgensen, W.L., Maxwell, D.S. and Tirado-Rives, J. (1996) Development and Testing of the OPLS All-Atom Force Field on Conformational Energetics and Properties of Organic Liquids. Journal of the American Chemical Society, 118, 11225-11236.

MacKerell Jr., A.D., Bashford, D., Bellott, M., Dunbrack Jr., R.L., Evanseck, J.D., Field, M.J., Fischer, S., Gao, J., Guo, H., Ha, S., Joseph-McCarthy, D., Kuchnir, L., Kuczera, K., Lau, F.T.K., Mattos, C., Michnick, S., Ngo, T., Nguyen, D.T., Prodhom, B., Reiher III, W.E., Roux, B., Schlenkrich, M., Smith, J.C., Stote, R., Straub, J., Watanabe, M., Wiórkiewicz-Kuczera, J., Yin, D., Karplus, M. (1998) “All-atom empirical potential for molecular modeling and dynamics studies of proteins”. J. Phys. Chem. B, 102: 3586-3616.

Lipinski, C.A., Lombardo, F., Dominy, B.W. and Feeney, P.J. (2001) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced drug delivery reviews, 46, 3-26.

Friesner, R.A., Banks, J.L., Murphy, R.B., Halgren, T.A., Klicic, J.J., Mainz, D.T., Repasky, M.P., Knoll, E.H., Shelley, M., Perry, J.K., Shaw, D.E., Francis, P., Shenkin, P.S. (2004) “Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy”, J. Med. Chem., 47, 1739–1749.

Irwin, J.J. and Shoichet, B.K. (2005) ZINC--a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling, 45, 177-182.

James C. Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D. Skeel, Laxmikant Kale, and Klaus Schulten (2005) Scalable molecular dynamics with NAMD. Journal of Computational Chemistry, 26:1781-1802.

Buck M, Bouguet-Bonnet S, Pastor RW, MacKerell AD Jr. (2006) “Importance of the CMAP correction to the CHARMM22 protein force field: dynamics of hen lysozyme”. Biophys. J., 90(4): L36-L38.

Guha, R. (2007) Chemical Informatics Functionality in R. Journal of Statistical Software, 18 (5), 1-16, ISSN: 15487660.

Friesner, R.A., Murphy, R.B., Repasky, M.P., Frye, L.L., Greenwood, J.R., Halgren, T.A., Sanschagrin, P.C., Mainz, D.T. (2006) “Extra Precision Glide: Docking and Scoring Incorporating a Model of Hydrophobic Enclosure for Protein-Ligand Complexes”, J. Med. Chem., 49, 6177–6196.

Frey, B.J. and Dueck, D. (2007) Clustering by passing messages between data points. Science, 315, 972-976.

R. Wehrens and L.M.C. Buydens (2007) Self- and Super-organising Maps in R: the kohonen package. Journal of Statistical Software, 21 (5), 1-19, ISSN: 15487660.

Cao, Y., Charisi, A., Cheng, L.C., Jiang, T. and Girke, T. (2008) ChemmineR: a compound mining framework for R. Bioinformatics, 24, 1733-1734.

Cournia, Z., Leng, L., Gandavadi, S., Du, X., Bucala, R., Jorgensen, W.L. (2009) “Discovery of human macrophage migration inhibitory factor (MIF)-CD74 antagonists via virtual screening”. Journal of medicinal chemistry, 52, 416-424.

Bodenhofer, U., Kothmeier, A. and Hochreiter, S. (2011) APCluster: an R package for affinity propagation clustering. Bioinformatics, 27, 2463-2464.

O'Boyle, N.M., Banck, M., James, C.A., Morley, C., Vandermeersch, T. and Hutchison, G.R. (2011) Open Babel: An open chemical toolbox. Journal of cheminformatics, 3, 33.

SiteMap, version 2.4, Schrödinger, LLC, New York, NY, 2011.

Glide, version 5.7, Schrödinger, LLC, New York, NY, 2011.