2 🤗 Using the Transformers pipeline

A first look at the Transformers pipeline

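To get a quick feel for Transformers, we can use its pipeline API. It wraps the model together with its pre-processing and post-processing steps, so that once we have named a task we can pass in text and directly get back the result we need. It is a high-level API that shows how powerful and friendly the transformers library is.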

    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")
    classifier("I've been waiting for a HuggingFace course my whole life.")

    [{'label': 'POSITIVE', 'score': 0.9598047137260437}]


    classifier([
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ])

    [{'label': 'POSITIVE', 'score': 0.9598047137260437},
     {'label': 'NEGATIVE', 'score': 0.9994558095932007}]

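Behind the scenes, when we call the pipeline API with a task name, it selects a default pretrained model for that task, then downloads and caches it. Passing text to the pipeline then involves three steps:

1. The input text is preprocessed into a format the model can understand.
2. The preprocessed input is passed to the model.
3. The model's predictions are post-processed into a result that humans can understand.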
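The three steps can be pictured with a toy stand-in written in plain Python. This is only a sketch to show the flow of data: the vocabulary, the `toy_model` scoring rule, and all function names are hypothetical and are not how the library implements its pipelines.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
TOY_VOCAB = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into integer ids the model can consume."""
    words = text.lower().replace("!", " ").replace(".", " ").split()
    return [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in words]


def toy_model(input_ids):
    """Step 2: stand-in for a neural network; returns one raw score (logit) per label."""
    negative = 2.0 * input_ids.count(TOY_VOCAB["hate"])
    positive = 2.0 * input_ids.count(TOY_VOCAB["love"])
    return [negative, positive]


def postprocess(logits):
    """Step 3: apply softmax to the logits and report the most likely label."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, mirroring what a pipeline call does conceptually."""
    return postprocess(toy_model(preprocess(text)))


result = toy_pipeline("I hate this!")
print(result)
```

The returned dict deliberately has the same `label` / `score` shape as the sentiment-analysis output above, so the post-processing step is easy to compare.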

The currently supported pipelines include:

- feature-extraction (get the vector representation of a text)
- fill-mask (mask filling)
- ner (named entity recognition)
- question-answering (reading comprehension)
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification

Zero-shot classification

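Labelling data for text classification is often very time-consuming. Hugging Face provides a zero-shot classification pipeline: the user only passes in the text and the candidate labels, and gets back a probability for each label. These probabilities can serve as suggestions for annotators and substantially speed up labelling.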
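As a rough sketch of how such probabilities could be turned into suggestions for annotators, the helper below filters a result by score. It assumes a result shaped like the dict the zero-shot pipeline returns (parallel `labels` and `scores` lists); the values are copied from the example output in this section, while `suggest_labels` and its `threshold` are hypothetical names chosen for illustration.

```python
# Result copied from the zero-shot example output in this section.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445963859558105, 0.111976258456707, 0.043427448719739914],
}


def suggest_labels(result, threshold=0.5):
    """Return (label, score) pairs whose score reaches the threshold.

    `threshold` is a hypothetical cut-off, not a library parameter.
    """
    pairs = zip(result["labels"], result["scores"])
    return [(label, score) for label, score in pairs if score >= threshold]


suggestions = suggest_labels(result)
print(suggestions)
```

A lower threshold surfaces more candidates for the annotator to confirm or reject; the right value depends on the data and is worth tuning on a small labelled sample.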

    from transformers import pipeline

    classifier = pipeline("zero-shot-classification")
    classifier(
        "This is a course about the Transformers library",
        candidate_labels=["education", "politics", "business"],
    )

    {'sequence': 'This is a course about the Transformers library',
     'labels': ['education', 'business', 'politics'],
     'scores': [0.8445963859558105, 0.111976258456707, 0.043427448719739914]}

Text generation

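In text generation, you provide the beginning of a text (the prompt) and the model automatically completes the rest. Text generation involves randomness, so the result may differ on every call.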
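Why the output varies can be illustrated with a toy sampler: the next word is drawn at random, weighted by its probability. The sketch below is an assumption-laden stand-in, not the library's generation code; `NEXT_WORD_PROBS` and both function names are hypothetical.

```python
import random

# Hypothetical next-word probabilities; a real model computes these itself.
NEXT_WORD_PROBS = {"understand": 0.4, "build": 0.3, "use": 0.2, "debug": 0.1}


def sample_next_word(rng):
    """Draw one word at random, weighted by its probability."""
    words = list(NEXT_WORD_PROBS)
    weights = list(NEXT_WORD_PROBS.values())
    return rng.choices(words, weights=weights, k=1)[0]


def sample_words(seed, count=5):
    """Draw `count` words from a generator initialised with `seed`."""
    rng = random.Random(seed)
    return [sample_next_word(rng) for _ in range(count)]


# Two runs with the same seed reproduce the same draws.
run_a = sample_words(seed=0)
run_b = sample_words(seed=0)
print(run_a == run_b)  # True

# An unseeded generator may produce a different sequence on every run.
print([sample_next_word(random.Random()) for _ in range(5)])
```

For real models, the transformers library offers a `set_seed` helper that serves the same purpose of making sampled output reproducible.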

    from transformers import pipeline

    generator = pipeline("text-generation")
    generator("In this course, we will teach you how to")

    [{'generated_text': 'In this course, we will teach you how to understand and use '
                        'data flow and data interchange when handling user data. We '
                        'will be working with one or more of the most commonly used '
                        'data flows — data flows of various types, as seen by the '
                        'HTTP'}]

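You can set num_return_sequences to choose how many results are returned, and max_length to cap the length (in tokens) of each generated text.

The model can also be chosen with the model argument; here we pick distilgpt2.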

    from transformers import pipeline

    generator = pipeline("text-generation", model="distilgpt2")
    generator(
        "In this course, we will teach you how to",
        max_length=30,
        num_return_sequences=2,
    )

    [{'generated_text': 'In this course, we will teach you how to manipulate the world and '
                        'move your mental and physical capabilities to your advantage.'},
     {'generated_text': 'In this course, we will teach you how to become an expert and '
                        'practice realtime, and with a hands on experience on both real '
                        'time and real'}]

Mask filling

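Mask filling restores a word that has been masked out of a sentence. The top_k argument controls how many of the most probable candidates are returned.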
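The effect of top_k can be pictured as sorting candidate words by probability and keeping the first k. The sketch below uses plain Python rather than the library; the first two scores are rounded from the example output in this section, the other two candidates are hypothetical, and `top_k_candidates` is an illustrative name.

```python
import heapq

# Candidate words for the masked position with their probabilities.
# "statistical" and "linear" are hypothetical entries added for illustration.
candidates = {
    "mathematical": 0.196,
    "computational": 0.041,
    "statistical": 0.030,
    "linear": 0.020,
}


def top_k_candidates(scores, k):
    """Return the k (word, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


top_two = top_k_candidates(candidates, k=2)
print(top_two)
```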


    from transformers import pipeline

    unmasker = pipeline("fill-mask")
    # <mask> marks the position to fill; the exact mask token depends on the model.
    unmasker("This course will teach you all about <mask> models.", top_k=2)

    [{'sequence': 'This course will teach you all about mathematical models.',
      'score': 0.19619831442832947,
      'token': 30412,
      'token_str': ' mathematical'},
     {'sequence': 'This course will teach you all about computational models.',
      'score': 0.04052725434303284,
      'token': 38163,
      'token_str': ' computational'}]

Named entity recognition

Named entity recognition (NER) is the task of identifying entities in a text, such as persons, locations, or organizations:

    from transformers import pipeline

    ner = pipeline("ner", grouped_entities=True)
    ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

    [{'entity_group': 'PER', 'score': 0.99816, 'word': 'Sylvain', 'start': 11, 'end': 18},
     {'entity_group': 'ORG', 'score': 0.97960, 'word': 'Hugging Face', 'start': 33, 'end': 45},
     {'entity_group': 'LOC', 'score': 0.99321, 'word': 'Brooklyn', 'start': 49, 'end': 57}]

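Note that we set grouped_entities=True, which tells the pipeline to group together the parts that belong to the same entity. Here "Hugging" and "Face" form a single organization, so they are merged into one entity. (Recent versions of the library expose the same behaviour through the aggregation_strategy argument.)

During preprocessing, Sylvain is split into four pieces: S, ##yl, ##va, and ##in. Post-processing merges these pieces back together as well.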
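The merging of word pieces can be sketched in a few lines of plain Python. This is a simplified illustration under the assumption that a piece starting with `##` continues the previous word; `merge_wordpieces` is a hypothetical helper, not the pipeline's actual post-processing. The second half shows that the `start` and `end` values in the NER output are character offsets into the input string.

```python
def merge_wordpieces(pieces):
    """Join WordPiece tokens: a piece starting with '##' continues the previous word."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # drop the '##' marker and glue onto the last word
        else:
            words.append(piece)
    return words


merged = merge_wordpieces(["S", "##yl", "##va", "##in"])
print(merged)  # ['Sylvain']

# 'start' and 'end' from the NER output index directly into the input text.
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."
spans = [(11, 18), (33, 45), (49, 57)]
entities = [text[start:end] for start, end in spans]
print(entities)  # ['Sylvain', 'Hugging Face', 'Brooklyn']
```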

Question answering


    from transformers import pipeline

    question_answerer = pipeline("question-answering")
    question_answerer(
        question="Where do I work?",
        context="My name is Sylvain and I work at Hugging Face in Brooklyn",
    )

    {'score': 0.6385916471481323, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

Summarization


    from transformers import pipeline

    summarizer = pipeline("summarization")
    summarizer("""
        America has changed dramatically during recent years. Not only has the number of
        graduates in traditional engineering disciplines such as mechanical, civil,
        electrical, chemical, and aeronautical engineering declined, but in most of
        the premier American universities engineering curricula now concentrate on
        and encourage largely the study of engineering science. As a result, there
        are declining offerings in engineering subjects dealing with infrastructure,
        the environment, and related issues, and greater concentration on high
        technology subjects, largely supporting increasingly complex scientific
        developments. While the latter is important, it should not be at the expense
        of more traditional engineering. Rapidly developing economies such as China
        and India, as well as other industrial countries in Europe and Asia, continue
        to encourage and advance the teaching of engineering. Both China and India,
        respectively, graduate six and eight times as many traditional engineers as
        does the United States. Other industrial countries at minimum maintain their
        output, while America suffers an increasingly serious decline in the number
        of engineering graduates and a lack of well-educated engineers.
    """)

    [{'summary_text': ' America has changed dramatically during recent years . The '
                      'number of engineering graduates in the U.S. has declined in '
                      'traditional engineering disciplines such as mechanical, civil '
                      ', electrical, chemical, and aeronautical engineering . Rapidly '
                      'developing economies such as China and India, as well as other '
                      'industrial countries in Europe and Asia, continue to encourage '
                      'and advance engineering .'}]

As with text generation, you can set max_length or min_length to bound the length of the result.


Translation

For translation, you can find a dedicated translation model on the Model Hub, for example the French-to-English model Helsinki-NLP/opus-mt-fr-en:

    from transformers import pipeline

    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
    translator("Ce cours est produit par Hugging Face.")

    [{'translation_text': 'This course is produced by Hugging Face.'}]

The Inference API

All of these models can be searched for and tested online.


In this section we learned how to use the high-level Transformers pipeline API. You can search the Model Hub for the model you need and run predictions either directly on the web page or locally.






