[Python] Fetching Compound Data by CID (Using the Official PubChem API)


2023-07-28 23:58 | Source: compiled from the web

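Contents: Introduction, Download, Demo, Installation, Usage, Related (Compound property table, Property API, Synonym API), Packaging, Source code, References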


This script fetches compound data from PubChem by CID, using the PubChem PUG REST API. Data for more than a thousand CIDs can be retrieved in 2 to 3 seconds.
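A likely reason a thousand CIDs can be handled this quickly is that many CIDs go into each request, instead of one request per compound. The following is a minimal sketch of that batching idea, not the script itself: `batched` and `cid_path_segment` are hypothetical helper names, and the default batch size of 300 mirrors the `limit` value in the source code further down.

```python
from typing import Iterator, List


def batched(cids: List[str], size: int = 300) -> Iterator[List[str]]:
    """Yield consecutive slices of `cids` holding at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(cids), size):
        yield cids[start:start + size]


def cid_path_segment(batch: List[str]) -> str:
    """Join one batch into the comma-separated form used in the URL path."""
    return ",".join(batch)


if __name__ == "__main__":
    # 1000 made-up CIDs, used only to show how the list is split.
    cids = [str(n) for n in range(1, 1001)]
    for batch in batched(cids):
        print(len(batch), cid_path_segment(batch)[:20])
```

With 1000 CIDs and a batch size of 300, this produces four batches, so only four requests would be needed per endpoint.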






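Installation

    pip install requests

Usage

1. Clone the repository.

    git clone https://github.com/XavierJiezou/python-pubchem-api.git

2. Change to the repository root.

    cd python-pubchem-api

3. Copy your CID list into cid.txt, one CID per line.

4. Run the script.

    python pubchem.py

5. The results are saved to data.json and data.csv.

You can also edit the variable self.property_list in pubchem.py, choosing names from the compound property table below:

    self.property_list = [
        'IUPACName',
        'IsomericSMILES',
        'MolecularFormula',
        'MolecularWeight',
        'HBondDonorCount',
        'HBondAcceptorCount'
    ]

Related

Compound property table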


| Property | Description |
| --- | --- |
| MolecularFormula | Molecular formula. |
| MolecularWeight | The molecular weight is the sum of all atomic weights of the constituent atoms in a compound, measured in g/mol. In the absence of explicit isotope labelling, averaged natural abundance is assumed. If an atom bears an explicit isotope label, 100% isotopic purity is assumed at this location. |
| CanonicalSMILES | Canonical SMILES (Simplified Molecular Input Line Entry System) string. It is a unique SMILES string of a compound, generated by a "canonicalization" algorithm. |
| IsomericSMILES | Isomeric SMILES string. It is a SMILES string with stereochemical and isotopic specifications. |
| InChI | Standard IUPAC International Chemical Identifier (InChI). It does not allow for user selectable options in dealing with the stereochemistry and tautomer layers of the InChI string. |
| InChIKey | Hashed version of the full standard InChI, consisting of 27 characters. |
| IUPACName | Chemical name systematically determined according to the IUPAC nomenclatures. |
| Title | The title used for the compound summary page. |
| XLogP | Computationally generated octanol-water partition coefficient or distribution coefficient. XLogP is used as a measure of hydrophilicity or hydrophobicity of a molecule. |
| ExactMass | The mass of the most likely isotopic composition for a single molecule, corresponding to the most intense ion/molecule peak in a mass spectrum. |
| MonoisotopicMass | The mass of a molecule, calculated using the mass of the most abundant isotope of each element. |
| TPSA | Topological polar surface area, computed by the algorithm described in the paper by Ertl et al. |
| Complexity | The molecular complexity rating of a compound, computed using the Bertz/Hendrickson/Ihlenfeldt formula. |
| Charge | The total (or net) charge of a molecule. |
| HBondDonorCount | Number of hydrogen-bond donors in the structure. |
| HBondAcceptorCount | Number of hydrogen-bond acceptors in the structure. |
| RotatableBondCount | Number of rotatable bonds. |
| HeavyAtomCount | Number of non-hydrogen atoms. |
| IsotopeAtomCount | Number of atoms with enriched isotope(s). |
| AtomStereoCount | Total number of atoms with tetrahedral (sp3) stereo [e.g., (R)- or (S)-configuration]. |
| DefinedAtomStereoCount | Number of atoms with defined tetrahedral (sp3) stereo. |
| UndefinedAtomStereoCount | Number of atoms with undefined tetrahedral (sp3) stereo. |
| BondStereoCount | Total number of bonds with planar (sp2) stereo [e.g., (E)- or (Z)-configuration]. |
| DefinedBondStereoCount | Number of atoms with defined planar (sp2) stereo. |
| UndefinedBondStereoCount | Number of atoms with undefined planar (sp2) stereo. |
| CovalentUnitCount | Number of covalently bound units. |
| Volume3D | Analytic volume of the first diverse conformer (default conformer) for a compound. |
| XStericQuadrupole3D | The x component of the quadrupole moment (Qx) of the first diverse conformer (default conformer) for a compound. |
| YStericQuadrupole3D | The y component of the quadrupole moment (Qy) of the first diverse conformer (default conformer) for a compound. |
| ZStericQuadrupole3D | The z component of the quadrupole moment (Qz) of the first diverse conformer (default conformer) for a compound. |
| FeatureCount3D | Total number of 3D features (the sum of FeatureAcceptorCount3D, FeatureDonorCount3D, FeatureAnionCount3D, FeatureCationCount3D, FeatureRingCount3D and FeatureHydrophobeCount3D). |
| FeatureAcceptorCount3D | Number of hydrogen-bond acceptors of a conformer. |
| FeatureDonorCount3D | Number of hydrogen-bond donors of a conformer. |
| FeatureAnionCount3D | Number of anionic centers (at pH 7) of a conformer. |
| FeatureCationCount3D | Number of cationic centers (at pH 7) of a conformer. |
| FeatureRingCount3D | Number of rings of a conformer. |
| FeatureHydrophobeCount3D | Number of hydrophobes of a conformer. |
| ConformerModelRMSD3D | Conformer sampling RMSD in Å. |
| EffectiveRotorCount3D | Total number of 3D features (the sum of FeatureAcceptorCount3D, FeatureDonorCount3D, FeatureAnionCount3D, FeatureCationCount3D, FeatureRingCount3D and FeatureHydrophobeCount3D). |
| ConformerCount3D | The number of conformers in the conformer model for a compound. |
| Fingerprint2D | Base64-encoded PubChem Substructure Fingerprint of a molecule. |

Property API


Example: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/property/MolecularFormula,MolecularWeight/JSON
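The example URL above can be assembled from a CID list and a property list. Below is a minimal sketch, assuming the response has the shape the script reads (`PropertyTable`, then `Properties`, one dict per compound carrying a `CID` key). `property_url` and `rows_by_cid` are hypothetical helper names, and `SAMPLE` is a hand-written payload for illustration, not a recorded response.

```python
from typing import Dict, List

# Base path taken from the example URL in this article.
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"


def property_url(cids: List[str], properties: List[str]) -> str:
    """Build a property request URL for several CIDs and properties."""
    return f"{BASE}{','.join(cids)}/property/{','.join(properties)}/JSON"


def rows_by_cid(payload: dict) -> Dict[int, dict]:
    """Index the property rows of one response by their CID."""
    rows = payload["PropertyTable"]["Properties"]
    return {row["CID"]: row for row in rows}


# Hand-written payload in the assumed response shape (values are illustrative).
SAMPLE = {
    "PropertyTable": {
        "Properties": [
            {"CID": 962, "MolecularFormula": "H2O", "MolecularWeight": "18.015"},
        ]
    }
}

if __name__ == "__main__":
    print(property_url(["1", "2", "3", "4", "5"],
                       ["MolecularFormula", "MolecularWeight"]))
    print(rows_by_cid(SAMPLE)[962]["MolecularFormula"])
```

Passing the returned URL to `requests.get(url).json()` would give a payload that `rows_by_cid` can index, provided the assumed shape holds.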



Synonym API

Example: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/synonyms/JSON
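The script pairs property rows and synonym entries by position. A hedged alternative is to match them on CID, so a missing or reordered entry cannot attach synonyms to the wrong compound. The sketch below assumes each entry in both lists carries a `CID` key and that synonym entries may lack the `Synonym` key, as the script's own handling suggests. `merge_by_cid` is a hypothetical helper and the sample values are placeholders.

```python
from typing import Dict, List


def merge_by_cid(properties: List[dict], information: List[dict]) -> List[dict]:
    """Attach each compound's synonym list to its property row, matched on CID."""
    synonyms: Dict[int, list] = {
        item["CID"]: item.get("Synonym", []) for item in information
    }
    merged = []
    for row in properties:
        # Compounds without a synonym entry get an empty list.
        merged.append({**row, "Synonym": synonyms.get(row["CID"], [])})
    return merged


if __name__ == "__main__":
    # Placeholder data: the synonym entries arrive in a different order,
    # and the entry for CID 1 has no "Synonym" key.
    props = [{"CID": 1, "MolecularFormula": "A"}, {"CID": 2, "MolecularFormula": "B"}]
    info = [{"CID": 2, "Synonym": ["name-b"]}, {"CID": 1}]
    for row in merge_by_cid(props, info):
        print(row)
```

The result keeps the order of the property rows, which makes it a drop-in replacement for a positional pairing of the two lists.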

Packaging

    git clone https://github.com/XavierJiezou/python-pubchem-api.git
    cd python-pubchem-api
    pip install pipenv
    pipenv install
    pipenv shell
    pip install requests
    pip install pyinstaller
    pyinstaller -F -i favicon.ico pubchem.py

Source code


    import csv
    import json
    import os

    import requests


    class PubchemCrawlFast:
        def __init__(self, cid_path, out_path):
            """Initialization function.

            Args:
                cid_path (str): Input file path of cid list
                out_path (str): Output file path of crawled data
            """
            self.cid_path = cid_path
            self.out_path = out_path
            self.property_list = [
                'IUPACName',
                'IsomericSMILES',
                'MolecularFormula',
                'MolecularWeight',
                'HBondDonorCount',
                'HBondAcceptorCount'
            ]

        def get_cid_list(self):
            """Get the cid list from the local file, or from stdin if it is missing."""
            if os.path.exists(self.cid_path):
                with open(self.cid_path) as f:
                    # Skip blank lines so no empty CID ends up in the request URL.
                    self.cid_list = [line.strip() for line in f if line.strip()]
            else:
                self.cid_list = []
                cid = input('Please input the CID list below: \n')
                while cid != '':
                    self.cid_list.append(cid)
                    cid = input()
            self.length = len(self.cid_list)

        def get_property_from_cid(self):
            """Get the properties for every cid, `limit` cids per request."""
            limit = 300
            api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/'
            property_str = ','.join(self.property_list)
            return_type = 'json'
            self.prp = []
            for i in range(limit, self.length+limit, limit):
                cid_str = ','.join(self.cid_list[i-limit:i])
                url = f'{api}{cid_str}/property/{property_str}/{return_type}'
                res = requests.get(url).json()
                self.prp += res['PropertyTable']['Properties']

        def get_synonyms_from_cid(self):
            """Get the synonyms for every cid, `limit` cids per request."""
            limit = 300
            api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/'
            return_type = 'json'
            self.syn = []
            for i in range(limit, self.length+limit, limit):
                cid_str = ','.join(self.cid_list[i-limit:i])
                url = f'{api}{cid_str}/synonyms/{return_type}'
                res = requests.get(url).json()
                self.syn += res['InformationList']['Information']
            # Entries without synonyms get an empty list so every row has the key.
            for i in range(len(self.syn)):
                if 'Synonym' not in self.syn[i]:
                    self.syn[i]['Synonym'] = []

        def save_as_csv(self, data):
            """Save the crawled data in CSV format."""
            # splitext keeps directory names that contain dots intact.
            csv_name = os.path.splitext(self.out_path)[0]+'.csv'
            header_list = ['CID']+self.property_list+['Synonym']
            with open(csv_name, 'w', newline='') as f:
                writer = csv.DictWriter(f, header_list)
                writer.writeheader()
                writer.writerows(data)

        def __main__(self):
            print('Getting CID list: ')
            self.get_cid_list()
            print('CID list acquisition is complete!')
            print('--------------------------------------------')
            print('Querying property list: ')
            self.get_property_from_cid()
            print('Property list query is complete!')
            print('--------------------------------------------')
            print('Querying synonym: ')
            self.get_synonyms_from_cid()
            print('Synonym query is complete!')
            print('--------------------------------------------')
            # Merge each property row with the synonym entry at the same position.
            dt = {
                'InfoList': {
                    'Info': [dict(d1, **d2) for d1, d2 in zip(self.prp, self.syn)]
                }
            }
            json_str = json.dumps(dt, indent=2)
            print('The data is being written to the JSON file: ')
            with open(self.out_path, 'w') as f:
                f.write(json_str)
            print('Finished writing the JSON file! ')
            print('--------------------------------------------')
            print('The data is being written to the CSV file: ')
            self.save_as_csv(dt['InfoList']['Info'])
            print('Finished writing the CSV file! ')
            # 'pause' is a Windows console command; skip it on other systems.
            if os.name == 'nt':
                os.system('pause')


    if __name__ == '__main__':
        PubchemCrawlFast('cid.txt', 'data.json').__main__()

References






