
Sample Text Data Preprocessing Implementation In SparkNLP


In this study, I will create the annotators one by one and then collect them into a single pipeline that performs the preprocessing. These are the annotators I will create: DocumentAssembler -> Tokenizer -> SentenceDetector -> Normalizer -> StopWordsCleaner -> TokenAssembler -> Stemmer -> Lemmatizer.

First, we import SparkNLP and the necessary libraries, read the data from local storage, and convert it into a Spark DataFrame as follows:

import sparknlp
spark = sparknlp.start()

from sparknlp.base import *
from sparknlp.annotator import *

df = spark.read\
    .option("header", True)\
    .csv("spam_text_messages.csv")\
    .toDF("category", "text")

df.show(5, truncate=30)

>>>
+--------+------------------------------+
|category|                          text|
+--------+------------------------------+
|     ham|Go until jurong point, craz...|
|     ham| Ok lar... Joking wif u oni...|
|    spam|Free entry in 2 a wkly comp...|
|     ham|U dun say so early hor... U...|
|     ham|Nah I don't think he goes t...|
+--------+------------------------------+
only showing top 5 rows

Annotators and transformers come from the base and annotator modules. I won't go into detail about what annotators and transformers are throughout this article.

We have two columns, named category and text. The text column contains the messages, and the category column contains the type of each message: spam or not spam (ham).

1- Document Assembler

DocumentAssembler is the starting point of any SparkNLP project. It creates the first annotation of type Document, which may be used by annotators down the road. DocumentAssembler() comes from SparkNLP's base module. We can use it as follows:

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")\
    .setCleanupMode("shrink")

Parameters:
setInputCol() -> The name of the column that will be converted. We can specify only one column here. Importantly, it can read either a String column or an Array[String].
setOutputCol() -> (optional) The name of the generated column of type Document. We can specify only one column here. The default is 'document'.
setCleanupMode() -> (optional) Cleanup options for the document text.

I chose shrink as the cleanup mode; it removes new lines and tabs and merges multiple spaces and blank lines into a single space. Now we will transform the df with documentAssembler by using the transform() function and then print the schema as follows:

df_doc = documentAssembler.transform(df)
df_doc.printSchema()

>>>
root
 |-- category: string (nullable = true)
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)

All annotators and transformers in SparkNLP come with this universal annotation structure. We can access any of the fields shown above with {column name}.{field name}.
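For example, a minimal sketch (using the df_doc frame from above) that pulls out the annotatorType and metadata fields of the document column:

# Access individual annotation fields with {column name}.{field name}.
df_doc.select("document.annotatorType", "document.metadata").show(5, truncate=30)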

To view the text row by row along with the beginning and end indices of each document:

df_doc.select("document.result", "document.begin", "document.end").show(5, truncate=30)>>>+------------------------------+-----+-----+| result|begin| end|+------------------------------+-----+-----+|[Go until jurong point, cra...| [0]|[110]||[Ok lar... Joking wif u oni...| [0]| [28]||[Free entry in 2 a wkly com...| [0]|[154]||[U dun say so early hor... ...| [0]| [48]||[Nah I don't think he goes ...| [0]| [60]|+------------------------------+-----+-----+only showing top 5 rows

We can print out the first item’s result:

df_doc.select("document.result").take(1)>>>[Row(result=['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'])]

2- Tokenizer

Tokenizer() is used for identifying tokens in SparkNLP. Here is how to form the tokenizer:

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

The tokenizer also provides many parameters to make our task more convenient. For example:

setExceptions(StringArray): Useful if you have a list of composite words that you don't want to split.
setContextChars(StringArray): Useful for controlling which characters, such as parentheses and question marks, are split off around tokens. It takes a string array.
setTargetPattern(): Useful if you want to identify candidates for tokenization with a basic regex rule. Defaults to \S+, which means anything except a space.

These are the most commonly used parameters; there are many more that can be applied case by case.
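As an illustration, here is a hedged sketch of a tokenizer using a couple of those parameters; the exception list and context characters below are made up for the example:

# Hypothetical configuration: keep some composite tokens intact and
# declare which surrounding characters are treated as context characters.
custom_tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")\
    .setExceptions(["e-mail", "New York"])\
    .setContextChars([".", ",", ";", ":", "!", "?", "(", ")"])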

3- Sentence Detector

SentenceDetector() finds sentence boundaries in raw text. Here is how I formed it:

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

A useful parameter here is setCustomBounds(StringArray): it lets you separate sentences by custom characters.
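For instance, a minimal sketch that also treats the semicolon as a sentence boundary (the delimiter choice here is just illustrative):

# Hypothetical example: split sentences on ';' in addition to the default rules.
custom_sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")\
    .setCustomBounds([";"])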

4- Normalizer

Normalizer() cleans dirty characters following a regex pattern and removes words based on a given dictionary. The implementation is as follows:

normalizer = Normalizer()\
    .setInputCols(["token"])\
    .setOutputCol("normalized")\
    .setLowercase(True)\
    .setCleanupPatterns(["[^\w\d\s]"])

Now, let’s explain the parameters.

.setLowercase(True): Lowercases the tokens; the default is False.
.setCleanupPatterns(["[^\w\d\s]"]): Takes a list of regular expressions for normalization; the default is [^A-Za-z]. With the pattern used above, it removes punctuation and keeps alphanumeric characters and whitespace.

5- Stopwords Cleaner

StopWordsCleaner() is used for dropping the stop words from the text. Here is the implementation:

stopwordsCleaner = StopWordsCleaner()\
    .setInputCols(["token"])\
    .setOutputCol("cleaned_tokens")\
    .setCaseSensitive(True)

Parameters:
.setCaseSensitive(True): Whether to do a case-sensitive comparison over the stop words.
.setStopWords(): Lets you supply your own list of words to be filtered out. It takes an Array[String].
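For example, a hedged sketch with a hand-picked stop word list (these words are only for illustration; by default the cleaner ships with a built-in English list):

# Hypothetical example: drop a custom list of stop words, ignoring case.
custom_stopwords_cleaner = StopWordsCleaner()\
    .setInputCols(["token"])\
    .setOutputCol("cleaned_tokens")\
    .setStopWords(["a", "an", "the", "in", "on", "of"])\
    .setCaseSensitive(False)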

6- Token Assembler

TokenAssembler() is used for assembling the cleaned tokens back into sentences. The implementation is as follows:

tokenAssembler = TokenAssembler()\
    .setInputCols(["sentence", "cleaned_tokens"])\
    .setOutputCol("assembled")

7- Stemmer

The goal of stemming is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Here is the implementation:

stemmer = Stemmer()\
    .setInputCols(["token"])\
    .setOutputCol("stem")

8- Lemmatizer

Both stemming and lemmatization reduce a given word to its root, but there are some differences between them. In stemming, the algorithm doesn't actually know the meaning of the word in the language it belongs to. In lemmatization, the algorithm has this knowledge: it refers to a dictionary to understand the meaning of the word before reducing it. For example, a lemmatization algorithm knows that the word "went" is derived from the word "go", and hence the lemma will be "go". A stemming algorithm wouldn't be able to do the same: depending on over-stemming or under-stemming, the word "went" could be reduced to "wen" or kept as "went". So, here is the implementation of Lemmatizer():

We will first pull the lemmatization dictionary from the link, then implement the lemmatization.

lemmatizer = Lemmatizer()\
    .setInputCols(["token"])\
    .setOutputCol("lemma")\
    .setDictionary("AntBNC_lemmas_ver_001.txt", value_delimiter="\t", key_delimiter="->")

We selected the dictionary with the .setDictionary() parameter.
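Given the delimiters above, each line of the dictionary file is expected to hold a lemma, the "->" key delimiter, and its tab-separated word forms. A quick, optional sanity check (assuming the file has already been downloaded to the working directory):

# Peek at the first few lines of the lemma dictionary to see the key/value layout.
with open("AntBNC_lemmas_ver_001.txt", "r", encoding="utf-8") as f:
    for _ in range(3):
        print(f.readline().rstrip("\n"))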

Putting All Processes Into a Spark ML Pipeline

Fortunately, all SparkNLP annotators and transformers can be used within Spark ML Pipelines. We have just created the annotators and transformers that we need. Now we will put them into a pipeline, fit the pipeline, and transform our dataset with it. Let's start!

Importing the Spark ML Pipeline: from pyspark.ml import Pipeline

Putting annotators and transformers into the pipeline:

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    tokenizer,
    sentenceDetector,
    normalizer,
    stopwordsCleaner,
    tokenAssembler,
    stemmer,
    lemmatizer
])

Now, in order to use the pipeline with different DataFrames, we will create an empty DataFrame to fit the pipeline on.

empty_df = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_df)

Nice, we just built the model. It's time to apply it to our dataset. To do this, we will use the transform() function as follows:

result= model.transform(df)
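As a quick sanity check, we can list the columns of the transformed DataFrame; each stage of the pipeline appends its own output column next to the original category and text columns:

# Every annotator's output column (document, token, sentence, normalized,
# cleaned_tokens, assembled, stem, lemma) should now be present.
print(result.columns)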

Examining The Results

Now we will examine the results of what we did. But first, we will import the functions module from pyspark.sql to help with some of the processing:
from pyspark.sql import functions as F

Let’s begin the examination with tokens and normalized tokens.

result.select("token.result" ,"normalized.result")\ .show(5, truncate=30)>>>+------------------------------+------------------------------+| result| result|+------------------------------+------------------------------+|[Go, until, jurong, point, ...|[go, until, jurong, point, ...||[Ok, lar, ..., Joking, wif,...|[ok, lar, joking, wif, u, oni]||[Free, entry, in, 2, a, wkl...|[free, entry, in, 2, a, wkl...||[U, dun, say, so, early, ho...|[u, dun, say, so, early, ho...||[Nah, I, don't, think, he, ...|[nah, ı, dont, think, he, g...|+------------------------------+------------------------------+only showing top 5 rows

The tokens shown in the first column have been normalized in the second column. Next, let's check the data cleaned of stop words:

result.select(F.explode(F.arrays_zip("token.result", "cleaned_tokens.result")).alias("col"))\ .select(F.expr("col['0']").alias("token"), F.expr("col['1']").alias("cleaned_sw")).show(10)>>>+---------+----------+| token|cleaned_sw|+---------+----------+| Go| Go|| until| jurong|| jurong| point|| point| ,|| ,| crazy|| crazy| ..|| ..| Available||Available| bugis|| only| n|| in| great|+---------+----------+only showing top 10 rows

As you can see above, some stop words were dropped by the stop words cleaner.

The token assembler holds the cleaned tokens assembled back into sentences. Now, let's compare the sentence detector result with the token assembler result.

result.select(F.explode(F.arrays_zip("sentence.result", "assembled.result")).alias("col"))\ .select(F.expr("col['0']").alias("sentence"), F.expr("col['1']").alias("assembled")).show(5, truncate=30)>>>+------------------------------+------------------------------+| sentence| assembled|+------------------------------+------------------------------+| Go until jurong point, crazy.| Go jurong point, crazy|| .| ||Available only in bugis n g...|Available bugis n great wor...|| Cine there got amore wat.| Cine got amore wat|| .| |+------------------------------+------------------------------+only showing top 5 rows

We can also check which sentence each assembled piece of text belongs to:

result.withColumn("tmp", F.explode("assembled"))\ .select("tmp.*").select("begin", "end", "result", "metadata.sentence").show(5, truncate=30)>>>+-----+---+------------------------------+--------+|begin|end| result|sentence|+-----+---+------------------------------+--------+| 0| 21| Go jurong point, crazy| 0|| 29| 28| | 1|| 31| 74|Available bugis n great wor...| 2|| 84|101| Cine got amore wat| 3|| 109|108| | 4|+-----+---+------------------------------+--------+only showing top 5 rows

Looks nice! Now we will compare the tokens, stems, and lemmas. In this part, we will also see how easy it is to convert a Spark DataFrame into a pandas DataFrame.
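A minimal sketch of such a comparison, reusing the arrays_zip pattern from above and finishing with toPandas() (column names as defined earlier in the pipeline):

# Zip token, stem and lemma results side by side and convert to a pandas DataFrame.
pdf = result.select(
        F.explode(F.arrays_zip("token.result", "stem.result", "lemma.result")).alias("col"))\
    .select(F.expr("col['0']").alias("token"),
            F.expr("col['1']").alias("stem"),
            F.expr("col['2']").alias("lemma"))\
    .toPandas()
print(pdf.head(10))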

This lets you compare stems and lemmas nicely. For the word "available", the stemmer changed it to "avail" because the stemming algorithm doesn't know what words mean.

Well, we’ve seen some basic preprocessing steps throughout this article. I suggest you visit JohnSnowLabs that is official developer of SparkNLP in order to access more info and detail. Also, there is great introduction as colab notebook .

Thanks for reading and for your support!


