This file contains the structures downloaded from the PubChem FTP site that
have at least one associated assay result obtained in the context of the
NIH Common Fund (previously: NIH Roadmap) Molecular Libraries Probe Production
Centers Network (previously: Molecular Libraries Screening Center Network),
part of the Common Fund’s Molecular Libraries and Imaging program. It is
organized by unique chemical structures (“Compounds” in PubChem parlance);
that is, assay results for possibly multiple different samples (“Substances”
in PubChem parlance) have been combined into the one record representing the
unique chemical structure. Placeholder assays (assays containing only a single
record) have been filtered out.
Explanation of the property data fields in the SD file
(note: properties present in the original PubChem files have been copied
unchanged; for an explanation of those properties, refer to the
appropriate PubChem documentation):
- PUBCHEM_ASSAYID_nnn_NAME - Name of the assay with PubChem assay ID
(AID) nnn. For example, the assay named "qHTS Assay for Inhibitors and
Substrates of Cytochrome P450 3A4" has AID 884, thus the property name for
this assay would be PUBCHEM_ASSAYID_884_NAME.
- PUBCHEM_ASSAYID_nnn_SUBMITTER - Organization that submitted the assay
data to PubChem.
- PUBCHEM_ASSAYID_nnn_RESULT - Assay result (active/inactive/inconclusive/unspecified)
for this compound.
- PUBCHEM_ASSAYID_nnn_SID_RESULT - List of the assay results of the
individual Substances corresponding to this CID.
- PUBCHEM_ASSAYID_nnn_CURVE_CLASS - Indication of the quality of the
titration curve obtained in qHTS assays; used particularly in results
obtained from the NCGC Screening Center. See here for an explanation of
the curve class concept.
- PUBCHEM_ASSAYID_nnn_LOGAC50 - Log of the concentration at which 50% of
the activity was observed. The activity can be either inhibition or activation.
Note that this property does not exist for some structures/assay results.
- PUBCHEM_ASSAY_ACTIVITIES - Per-assay indication of the
assay result for all assays performed with this compound. A “!” in front of
the assay name indicates inactivity in that assay; an assay name listed
without “!” indicates activity, as per the result call made by the original
submitter and deposited in PubChem. A “?” indicates an inconclusive or
unspecified result.
- PUBCHEM_SID_ASSOCIATIONS - List of PubChem Substance
IDs (SID), i.e. in the widest sense, samples, that have the unique chemical
structure of this Compound entry.
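The naming scheme above can be decoded mechanically. The following is a minimal Python sketch, not the tooling used to build the file. It assumes that per-assay property names follow the PUBCHEM_ASSAYID_nnn_SUFFIX pattern listed above, and that PUBCHEM_ASSAY_ACTIVITIES holds one assay name per line (the separator is not stated here, so this is an assumption). The function names `parse_assay_property` and `parse_activities` are hypothetical.

```python
import re

# Suffixes taken from the field list above. Both SID_RESULT and SID_RESULTS
# are accepted, since both spellings appear in this description.
_ASSAY_PROP = re.compile(
    r"^PUBCHEM_ASSAYID_(\d+)_"
    r"(NAME|SUBMITTER|RESULT|SID_RESULTS?|CURVE_CLASS|LOGAC50)$"
)


def parse_assay_property(name):
    """Return (aid, field) for a per-assay property name, else None."""
    match = _ASSAY_PROP.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def parse_activities(value):
    """Map each assay name to its call, based on the "!" / "?" prefix.

    Assumption: the property value lists one assay entry per line.
    """
    calls = {}
    for line in value.splitlines():
        entry = line.strip()
        if not entry:
            continue
        if entry.startswith("!"):
            calls[entry[1:].strip()] = "inactive"
        elif entry.startswith("?"):
            calls[entry[1:].strip()] = "inconclusive or unspecified"
        else:
            calls[entry] = "active"
    return calls
```

For example, `parse_assay_property("PUBCHEM_ASSAYID_884_NAME")` yields the AID 884 together with the field name `NAME`, while a property such as PUBCHEM_SID_ASSOCIATIONS is not a per-assay property and yields `None`.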
461,937 structures in SDF format.
WARNING: This is a 2.1 GB gzipped file.
Use the "Save Link As..." (Firefox) or "Save Target As..." (IE) option
of your web browser to download the file.
1,475 structures with at least one contradictory assay result.
This file was constructed in the following way:
- Download SDF files for "Substances" from the PubChem FTP site
- Combine them all into one big SD file
- Split that file according to PUBCHEM_EXT_DATASOURCE_NAME property
- Combine the following files into one:
  - Emory_University_Molecular_Libraries_Screening_Center
  - NCGC
  - NMMLSC
  - PCMD
  - MLSMR
  - SRMLSC
  - The_Scripps_Research_Institute_Molecular_Screening_Center
  - Columbia_University_Molecular_Screening_Center
  - NIH_Clinical_Collection
  - University_of_Pittsburgh_Molecular_Library_Screening_Center
  - Vanderbilt_Screening_Center_for_GPCRs__Ion_Channels_and_Transporters
  - Burnham_Center_for_Chemical_Genomics
  - Johns_Hopkins_Ion_Channel_Center
  - Southern_Research_Institute
- Download all assays (*.descr.xml and *.data.xml)
- Keep only the assays whose submitter is in the following list and
remove the rest. The submitter is extracted from the .descr.xml file
under the key:
PC-AssaySubmit -> PC-AssaySubmit_assay -> PC-AssaySubmit_assay_descr ->
PC-AssayDescription -> PC-AssayDescription_aid-source ->
PC-Source -> PC-Source_db -> PC-DBTracking -> PC-DBTracking_name
  - Columbia University Molecular Screening Center
  - Emory University Molecular Libraries Screening Center
  - NCGC
  - NMMLSC
  - PCMD
  - SRMLSC
  - The Scripps Research Institute Molecular Screening Center
  - University of Pittsburgh Molecular Library Screening Center
  - Vanderbilt University Molecular Libraries Screening Center (VUMLSC)
  - Burnham Center for Chemical Genomics
  - Johns Hopkins Ion Channel Center
  - Southern Research Specialized Biocontainment Screening Center
- Extract the following properties from the .descr.xml and .data.xml files
and put them into the SD file:
  - PUBCHEM_ASSAYID_".$aid."_NAME
  - PUBCHEM_ASSAYID_".$aid."_SUBMITTER
  - PUBCHEM_ASSAYID_".$aid."_RESULT
  - PUBCHEM_ASSAYID_".$aid."_LOGAC50
  - PUBCHEM_ASSAYID_".$aid."_CURVE_CLASS
- Construct property PUBCHEM_ASSAY_ACTIVITIES as a list of
active/inactive assays
- Remove structures without PUBCHEM_ASSAY_ACTIVITIES property
- Remove LOGAC50 property for "Inactive", "Inconclusive" and
"CURVE_CLASS==4" cases
- Create property PUBCHEM_ASSAYID_".$aid."_SID_RESULTS which contains
results for a particular SID
- Fold all substances with the same compound ID (CID) into one record.
The properties are folded according to the following rules:
  - "Inconclusive" results are deleted, unless they are the only results
    present.
  - LOGAC50 is averaged, unless both "Increasing" and "Decreasing" results
    are present.
  - PUBCHEM_ASSAYID_".$aid."_RESULT is set to "Discrepant" if more than one
    type of result is present, not counting "Inconclusive".
473,965 structures in SDF format.
WARNING: This is a 1.6 GB gzipped file.
466,537 structures in SDF format.
WARNING: This is a 642 MB gzipped file.
Igor Filippov
Last Update: 2016-08-27