[Python] Fetching Compound Data by CID (Using the Official PubChem API)

2023-07-28 23:58 | Source: compiled from the web

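Contents: Introduction; Download; Demo; Installation; Usage; Related (Compound property table, Property API, Synonym API); Packaging; Source code; References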

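Introduction

This tool fetches compound data from PubChem by CID, using the PubChem PUG REST API. It sends the CIDs in batches instead of one request per compound, so data for more than a thousand CIDs can be retrieved in about 2 to 3 seconds.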
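The speed comes from batching: many CIDs go into a single request. The sketch below shows only that splitting step. The batch size of 300 is taken from the `limit = 300` value in the source code further down; the function name `chunk_cids` is a hypothetical helper, not part of the repository.

```python
from typing import Iterator, List, Sequence


def chunk_cids(cids: Sequence[str], size: int = 300) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` CIDs (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(cids), size):
        yield list(cids[start:start + size])


if __name__ == "__main__":
    cids = [str(n) for n in range(1, 1001)]  # 1000 placeholder CIDs
    batches = list(chunk_cids(cids))
    print(len(batches))               # 4 requests instead of 1000
    print([len(b) for b in batches])  # [300, 300, 300, 100]
```

With this split, a list of 1000 CIDs needs four HTTP requests, which is why the total run time stays in the range of seconds.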

Download

I have packaged the program as an executable that works right after download: pubchem-1.0.2-win64.zip

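Demo

[Image: demo of the program]

If you are not a developer, just download and use the packaged program; there is no need to read past this point. If you run into problems, please contact me.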

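Installation

    pip install requests

Usage

Clone the repository.

    git clone https://github.com/XavierJiezou/python-pubchem-api.git

Change to the root directory.

    cd python-pubchem-api

Copy your CID list into cid.txt, one CID per line.

Run the command python pubchem.py.

The results are saved to data.json and data.csv.

You can also change the variable self.property_list in pubchem.py, choosing from the compound property table below.

    self.property_list = [
        'IUPACName',
        'IsomericSMILES',
        'MolecularFormula',
        'MolecularWeight',
        'HBondDonorCount',
        'HBondAcceptorCount'
    ]

Related

Compound property table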

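Multiple properties can be requested by putting a comma-separated list of property tags in the URL. Valid output formats for the property table are XML, ASNT/B, JSON(P), CSV, and TXT (limited to a single property). Available properties include: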

Property: Description

MolecularFormula: Molecular formula.
MolecularWeight: The molecular weight is the sum of all atomic weights of the constituent atoms in a compound, measured in g/mol. In the absence of explicit isotope labelling, averaged natural abundance is assumed. If an atom bears an explicit isotope label, 100% isotopic purity is assumed at this location.
CanonicalSMILES: Canonical SMILES (Simplified Molecular Input Line Entry System) string. It is a unique SMILES string of a compound, generated by a "canonicalization" algorithm.
IsomericSMILES: Isomeric SMILES string. It is a SMILES string with stereochemical and isotopic specifications.
InChI: Standard IUPAC International Chemical Identifier (InChI). It does not allow for user selectable options in dealing with the stereochemistry and tautomer layers of the InChI string.
InChIKey: Hashed version of the full standard InChI, consisting of 27 characters.
IUPACName: Chemical name systematically determined according to the IUPAC nomenclatures.
Title: The title used for the compound summary page.
XLogP: Computationally generated octanol-water partition coefficient or distribution coefficient. XLogP is used as a measure of hydrophilicity or hydrophobicity of a molecule.
ExactMass: The mass of the most likely isotopic composition for a single molecule, corresponding to the most intense ion/molecule peak in a mass spectrum.
MonoisotopicMass: The mass of a molecule, calculated using the mass of the most abundant isotope of each element.
TPSA: Topological polar surface area, computed by the algorithm described in the paper by Ertl et al.
Complexity: The molecular complexity rating of a compound, computed using the Bertz/Hendrickson/Ihlenfeldt formula.
Charge: The total (or net) charge of a molecule.
HBondDonorCount: Number of hydrogen-bond donors in the structure.
HBondAcceptorCount: Number of hydrogen-bond acceptors in the structure.
RotatableBondCount: Number of rotatable bonds.
HeavyAtomCount: Number of non-hydrogen atoms.
IsotopeAtomCount: Number of atoms with enriched isotope(s).
AtomStereoCount: Total number of atoms with tetrahedral (sp3) stereo [e.g., (R)- or (S)-configuration].
DefinedAtomStereoCount: Number of atoms with defined tetrahedral (sp3) stereo.
UndefinedAtomStereoCount: Number of atoms with undefined tetrahedral (sp3) stereo.
BondStereoCount: Total number of bonds with planar (sp2) stereo [e.g., (E)- or (Z)-configuration].
DefinedBondStereoCount: Number of bonds with defined planar (sp2) stereo.
UndefinedBondStereoCount: Number of bonds with undefined planar (sp2) stereo.
CovalentUnitCount: Number of covalently bound units.
Volume3D: Analytic volume of the first diverse conformer (default conformer) for a compound.
XStericQuadrupole3D: The x component of the quadrupole moment (Qx) of the first diverse conformer (default conformer) for a compound.
YStericQuadrupole3D: The y component of the quadrupole moment (Qy) of the first diverse conformer (default conformer) for a compound.
ZStericQuadrupole3D: The z component of the quadrupole moment (Qz) of the first diverse conformer (default conformer) for a compound.
FeatureCount3D: Total number of 3D features (the sum of FeatureAcceptorCount3D, FeatureDonorCount3D, FeatureAnionCount3D, FeatureCationCount3D, FeatureRingCount3D and FeatureHydrophobeCount3D).
FeatureAcceptorCount3D: Number of hydrogen-bond acceptors of a conformer.
FeatureDonorCount3D: Number of hydrogen-bond donors of a conformer.
FeatureAnionCount3D: Number of anionic centers (at pH 7) of a conformer.
FeatureCationCount3D: Number of cationic centers (at pH 7) of a conformer.
FeatureRingCount3D: Number of rings of a conformer.
FeatureHydrophobeCount3D: Number of hydrophobes of a conformer.
ConformerModelRMSD3D: Conformer sampling RMSD in Å.
EffectiveRotorCount3D: The effective rotor count of a compound, a measure of its conformational flexibility.
ConformerCount3D: The number of conformers in the conformer model for a compound.
Fingerprint2D: Base64-encoded PubChem Substructure Fingerprint of a molecule.

Property API

Get properties by CID.

Example: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/property/MolecularFormula,MolecularWeight/JSON
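As a rough illustration of how such a URL is put together and how the response is read, here is a minimal sketch. The helper names `property_url` and `rows_from_response` are hypothetical. The `PropertyTable` / `Properties` nesting follows what the source code below indexes into, and the sample payload is hand-written with illustrative values, not output captured from PubChem.

```python
from typing import Dict, Iterable, List

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"


def property_url(cids: Iterable[int], props: Iterable[str], fmt: str = "JSON") -> str:
    """Build a PUG REST property URL from CIDs and property tags (hypothetical helper)."""
    cid_str = ",".join(str(c) for c in cids)
    prop_str = ",".join(props)
    return f"{BASE}{cid_str}/property/{prop_str}/{fmt}"


def rows_from_response(payload: Dict) -> List[Dict]:
    """Return the per-compound rows from a decoded JSON property response."""
    return payload["PropertyTable"]["Properties"]


if __name__ == "__main__":
    url = property_url(range(1, 6), ["MolecularFormula", "MolecularWeight"])
    print(url)  # same URL as the example above

    # Hand-written sample in the expected shape; values are illustrative only.
    sample = {
        "PropertyTable": {
            "Properties": [
                {"CID": 962, "MolecularFormula": "H2O", "MolecularWeight": "18.015"},
            ]
        }
    }
    for row in rows_from_response(sample):
        print(row["CID"], row["MolecularFormula"])
```

To query the live service, the URL can be passed to `requests.get(url, timeout=60).json()` and the result handed to `rows_from_response`.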

Synonym API

Get synonyms by CID.

Example: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/synonyms/JSON
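Some compounds come back without a `Synonym` entry, which the source code below handles by filling in an empty list. The sketch here shows that normalization and then joins synonyms to property rows by CID. Matching by CID is a variation chosen for this example (the source code pairs the two lists by position with `zip`). The helper names and the sample payloads are hypothetical and hand-written.

```python
from typing import Dict, List


def normalize_synonyms(payload: Dict) -> Dict[int, List[str]]:
    """Map each CID to its synonym list, using [] when none are returned."""
    info = payload["InformationList"]["Information"]
    return {item["CID"]: item.get("Synonym", []) for item in info}


def merge_by_cid(properties: List[Dict], synonyms: Dict[int, List[str]]) -> List[Dict]:
    """Attach a 'Synonym' list to every property row, matched on CID."""
    return [dict(row, Synonym=synonyms.get(row["CID"], [])) for row in properties]


if __name__ == "__main__":
    # Hand-written payloads in the expected shape; names are placeholders.
    synonym_payload = {
        "InformationList": {
            "Information": [
                {"CID": 1, "Synonym": ["name-a", "name-b"]},
                {"CID": 2},  # no synonyms returned for this CID
            ]
        }
    }
    property_rows = [
        {"CID": 1, "MolecularFormula": "X"},
        {"CID": 2, "MolecularFormula": "Y"},
    ]
    merged = merge_by_cid(property_rows, normalize_synonyms(synonym_payload))
    for row in merged:
        print(row)
```

Joining on CID keeps each row correct even if the two responses were to arrive in a different order or with a different number of entries.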

Packaging

    git clone https://github.com/XavierJiezou/python-pubchem-api.git
    cd python-pubchem-api
    pip install pipenv
    pipenv install
    pipenv shell
    pip install requests
    pip install pyinstaller
    pyinstaller -F -i favicon.ico pubchem.py

Source code

https://github.com/XavierJiezou/python-pubchem-api

    import csv
    import json
    import os

    import requests


    class PubchemCrawlFast:
        def __init__(self, cid_path, out_path):
            """Initialization function.

            Args:
                cid_path (str): Input file path of cid list
                out_path (str): Output file path of crawled data
            """
            self.cid_path = cid_path
            self.out_path = out_path
            self.property_list = [
                'IUPACName',
                'IsomericSMILES',
                'MolecularFormula',
                'MolecularWeight',
                'HBondDonorCount',
                'HBondAcceptorCount'
            ]

        def get_cid_list(self):
            """Get the cid list from the local file, or from stdin if it is missing."""
            if os.path.exists(self.cid_path):
                with open(self.cid_path) as f:
                    # Skip blank lines so they do not end up as empty CIDs in the URL.
                    self.cid_list = [i.strip() for i in f if i.strip()]
            else:
                self.cid_list = []
                cid = input('Please input the CID list below: \n')
                while cid != '':
                    self.cid_list.append(cid.strip())
                    cid = input()
            self.length = len(self.cid_list)

        def get_property_from_cid(self):
            """Get the properties for every cid, `limit` cids per request."""
            limit = 300
            api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/'
            property_str = ','.join(self.property_list)
            return_type = 'json'
            self.prp = []
            for i in range(0, self.length, limit):
                cid_str = ','.join(self.cid_list[i:i+limit])
                url = f'{api}{cid_str}/property/{property_str}/{return_type}'
                res = requests.get(url, timeout=60)
                res.raise_for_status()  # fail early on an HTTP error
                self.prp += res.json()['PropertyTable']['Properties']

        def get_synonyms_from_cid(self):
            """Get the synonyms for every cid, `limit` cids per request."""
            limit = 300
            api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/'
            return_type = 'json'
            self.syn = []
            for i in range(0, self.length, limit):
                cid_str = ','.join(self.cid_list[i:i+limit])
                url = f'{api}{cid_str}/synonyms/{return_type}'
                res = requests.get(url, timeout=60)
                res.raise_for_status()
                self.syn += res.json()['InformationList']['Information']
            # Compounds without synonyms have no 'Synonym' key; fill in an empty list.
            for item in self.syn:
                item.setdefault('Synonym', [])

        def save_as_csv(self, data):
            """Save the crawled data in CSV format"""
            # splitext keeps directory names that contain dots intact.
            csv_name = os.path.splitext(self.out_path)[0] + '.csv'
            header_list = ['CID'] + self.property_list + ['Synonym']
            # UTF-8 avoids encoding errors on names with non-ASCII characters.
            with open(csv_name, 'w', newline='', encoding='utf-8') as f:
                writer = csv.DictWriter(f, header_list)
                writer.writeheader()
                writer.writerows(data)

        def main(self):
            print('Getting CID list: ')
            self.get_cid_list()
            print('CID list acquisition is complete!')
            print('--------------------------------------------')
            print('Querying property list: ')
            self.get_property_from_cid()
            print('Property list query is complete!')
            print('--------------------------------------------')
            print('Querying synonym: ')
            self.get_synonyms_from_cid()
            print('Synonym query is complete!')
            print('--------------------------------------------')
            # Property rows and synonym rows are paired by position.
            dt = {
                'InfoList': {
                    'Info': [dict(d1, **d2) for d1, d2 in zip(self.prp, self.syn)]
                }
            }
            json_str = json.dumps(dt, indent=2)
            print('The data is being written to the JSON file: ')
            with open(self.out_path, 'w') as f:
                f.write(json_str)
            print('Finished writing the JSON file! ')
            print('--------------------------------------------')
            print('The data is being written to the CSV file: ')
            self.save_as_csv(dt['InfoList']['Info'])
            print('Finished writing the CSV file! ')
            if os.name == 'nt':
                # Keep the console window open when run as a packaged Windows executable.
                os.system('pause')


    if __name__ == '__main__':
        PubchemCrawlFast('cid.txt', 'data.json').main()

References

https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
