
Getting better acquainted with spaCy's pipeline (or trying to)

Posted: 2020-04-01
Last updated: 2020-04-01

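Note

What follows was written in Jupyter Notebook; the output was downloaded as markdown and pasted here.

https://github.com/booink/spacy-trial1/tree/master
The working environment is captured in this public repository.

I only spent about 30 minutes hands-on and exported the result to markdown as a trial, so this is thin content without much to chew on. Apologies in advance.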


https://spacy.io/usage/processing-pipelines

I'll work through this page from the top, typing the examples out by hand.

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

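In short, when you pass a text to nlp, it tokenizes the text and hands it back as an object of the Doc class.
Before it comes back, that Doc goes through a mechanism called the pipeline: each component processes the Doc and passes it on to the next, bucket-brigade style. At least that is how I read it.
The default pipeline consists of a tagger, a parser, and an entity recognizer (ner).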
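
The hand-off described above can be sketched in plain Python. This is only a conceptual model, not spaCy's actual implementation: `ToyDoc`, `toy_tokenize`, `add_shape`, `add_length`, and `run_pipeline` are all hypothetical names made up for illustration, and the "components" do trivial work.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc: tokens plus per-token annotations."""
    tokens: List[str]
    annotations: Dict[str, List[str]] = field(default_factory=dict)


# A component takes a doc and returns the (annotated) doc.
Component = Callable[[ToyDoc], ToyDoc]


def toy_tokenize(text: str) -> ToyDoc:
    # Step 1: turn raw text into a document (naive whitespace split).
    return ToyDoc(tokens=text.split())


def add_shape(doc: ToyDoc) -> ToyDoc:
    # Toy component: mark each token as title-cased or not.
    doc.annotations["shape"] = [
        "Title" if token.istitle() else "lower" for token in doc.tokens
    ]
    return doc


def add_length(doc: ToyDoc) -> ToyDoc:
    # Toy component: record the length of each token as a string.
    doc.annotations["length"] = [str(len(token)) for token in doc.tokens]
    return doc


def run_pipeline(text: str, pipeline: List[Tuple[str, Component]]) -> ToyDoc:
    # Tokenize first, then pass the same doc through every component in order.
    doc = toy_tokenize(text)
    for _name, component in pipeline:
        doc = component(doc)
    return doc


pipeline = [("shape", add_shape), ("length", add_length)]
doc = run_pipeline("This is a text", pipeline)
print(doc.tokens)
print(doc.annotations)
```

The list of (name, component) pairs is chosen to mirror the shape that nlp.pipeline prints later in this post.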

I see.
Let's look at the type of the doc object.

import spacy  

nlp = spacy.load("en")  
doc = nlp("This is a text")  
type(doc)  
---------------------------------------------------------------------------  

OSError                                   Traceback (most recent call last)  

<ipython-input-4-69cc80a89d2d> in <module>  
      1 import spacy  
      2   
----> 3 nlp = spacy.load("en")  
      4 doc = nlp("This is a text")  
      5 type(doc)  


/usr/local/lib/python3.7/site-packages/spacy/__init__.py in load(name, **overrides)  
     28     if depr_path not in (True, False, None):  
     29         deprecation_warning(Warnings.W001.format(path=depr_path))  
---> 30     return util.load_model(name, **overrides)  
     31   
     32   


/usr/local/lib/python3.7/site-packages/spacy/util.py in load_model(name, **overrides)  
    167     elif hasattr(name, "exists"):  # Path or Path-like to model data  
    168         return load_model_from_path(name, **overrides)  
--> 169     raise IOError(Errors.E050.format(name=name))  
    170   
    171   


OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.  

It complained that there is no en model.

https://spacy.io/usage/models

I'll follow the QuickStart.

!python -m spacy download en_core_web_sm  
Collecting en_core_web_sm==2.2.5  
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)  
     |████████████████████████████████| 12.0 MB 476 kB/s eta 0:00:01  
Requirement already satisfied: spacy>=2.2.2 in /usr/local/lib/python3.7/site-packages (from en_core_web_sm==2.2.5) (2.2.4)  
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.2)  
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (2.0.3)  
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (4.44.1)  
Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.18.2)  
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (2.23.0)  
Requirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (7.4.0)  
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (3.0.2)  
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (0.6.0)  
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.1.3)  
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.0)  
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (46.0.0)  
Requirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (0.4.1)  
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.2)  
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (3.0.4)  
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (2.9)  
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (1.25.8)  
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (2019.11.28)  
Requirement already satisfied: importlib-metadata>=0.20; python_version < "3.8" in /usr/local/lib/python3.7/site-packages (from catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_sm==2.2.5) (1.6.0)  
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/site-packages (from importlib-metadata>=0.20; python_version < "3.8"->catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_sm==2.2.5) (3.1.0)  
Building wheels for collected packages: en-core-web-sm  
  Building wheel for en-core-web-sm (setup.py) ... done  
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.2.5-py3-none-any.whl size=12011738 sha256=4e741a4ef6924b14806dc4789ff4156bf93b98c79d33f5959516f6a04c73f4bb  
  Stored in directory: /tmp/pip-ephem-wheel-cache-yazrb305/wheels/51/19/da/a3885266a3c241aff0ad2eb674ae058fd34a4870fef1c0a5a0  
Successfully built en-core-web-sm  
Installing collected packages: en-core-web-sm  
Successfully installed en-core-web-sm-2.2.5  
✔ Download and installation successful  
You can now load the model via spacy.load('en_core_web_sm')  

The download worked.
Let's run the code.

import spacy  
nlp = spacy.load("en_core_web_sm")  
---------------------------------------------------------------------------  

OSError                                   Traceback (most recent call last)  

<ipython-input-6-14d257ed08ca> in <module>  
      1 import spacy  
----> 2 nlp = spacy.load("en_core_web_sm")  


/usr/local/lib/python3.7/site-packages/spacy/__init__.py in load(name, **overrides)  
     28     if depr_path not in (True, False, None):  
     29         deprecation_warning(Warnings.W001.format(path=depr_path))  
---> 30     return util.load_model(name, **overrides)  
     31   
     32   


/usr/local/lib/python3.7/site-packages/spacy/util.py in load_model(name, **overrides)  
    167     elif hasattr(name, "exists"):  # Path or Path-like to model data  
    168         return load_model_from_path(name, **overrides)  
--> 169     raise IOError(Errors.E050.format(name=name))  
    170   
    171   


OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.  

Hmm.
Does it not work from within Jupyter Notebook?
I'll add the download to the Dockerfile and rebuild.
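
An alternative to rebuilding right away is to check whether the interpreter behind the notebook kernel can see the model package at all. The sketch below uses only the standard library; `model_is_importable` is a hypothetical helper, not part of spaCy. If the package turns out to be importable and spacy.load still fails, one possible cause is a kernel that started before the install and is working from a stale package listing, in which case restarting the kernel may be enough. That explanation is an assumption; this post did not verify it.

```python
import importlib.util
import sys


def model_is_importable(name: str) -> bool:
    """Return True if a top-level package called `name` can be found on sys.path."""
    return importlib.util.find_spec(name) is not None


# Which interpreter is the kernel actually running?
print(sys.executable)

# Is the model package visible to that interpreter?
print(model_is_importable("en_core_web_sm"))
```

Comparing sys.executable with the interpreter that `!python` resolves to also shows whether the download went into the same environment the kernel uses.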

Rebuilt.
Running it again.

import spacy  
nlp = spacy.load("en_core_web_sm")  

No errors. Looks like it worked.
Let's look at the type of doc.

doc = nlp("This is a text")  
type(doc)  
spacy.tokens.doc.Doc  

spacy.tokens.doc.Doc, I see.
Now, which components are set up in the pipeline?

for p in nlp.pipeline:  
    print(p)  
('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fc3c78613d0>)  
('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fc39292ede0>)  
('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fc3928c5360>)  

Right.
tagger, parser, ner: all there, as advertised.
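
Each entry in nlp.pipeline is a (name, component) tuple, as the output above shows, so a loop can unpack it directly instead of printing the raw tuple. A minimal sketch: `describe_pipeline` is a hypothetical helper, and the spaCy part is guarded so the block still runs where spaCy or the model is not installed.

```python
from typing import Any, Iterable, List, Tuple


def describe_pipeline(pipeline: Iterable[Tuple[str, Any]]) -> List[Tuple[str, str]]:
    """Return (component name, class name) pairs for (name, component) tuples."""
    return [(name, type(component).__name__) for name, component in pipeline]


try:
    import spacy

    nlp = spacy.load("en_core_web_sm")
except (ImportError, OSError):
    # ImportError: spaCy is missing; OSError: the model cannot be found (E050).
    nlp = None

if nlp is not None:
    for name, class_name in describe_pipeline(nlp.pipeline):
        print(f"{name}: {class_name}")
```

If only the names are needed, nlp.pipe_names gives them as a plain list.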

Incidentally, the QuickStart on the models page shows that it can also be written like this:

import en_core_web_sm # besides naming the model as a string for spacy.load, it can apparently be imported as a module  
nlp = en_core_web_sm.load() # so load() with no arguments returns nlp  
doc = nlp("This is a text")  
print(doc)  

for p in nlp.pipeline:  
    print(p)  
This is a text  
('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fc3903805d0>)  
('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fc3928bad70>)  
('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fc3928ba9f0>)  

So what is the nlp object?

type(nlp)  
spacy.lang.en.English  

Hmm, okay.
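
The printed type, spacy.lang.en.English, is spaCy's language class for English; it subclasses spacy.language.Language, the class that holds the tokenizer and the pipeline. The sketch below, assuming spaCy is installed, builds a blank English object with no trained components, to show tokenization alone producing a Doc. `blank_english_tokens` is a hypothetical helper name, and the import is guarded so the block also runs without spaCy.

```python
try:
    from spacy.lang.en import English
    from spacy.language import Language
except ImportError:
    # spaCy is not installed in this environment.
    English = Language = None


def blank_english_tokens(text):
    """Tokenize `text` with a blank English object; return None without spaCy."""
    if English is None:
        return None
    nlp = English()  # tokenizer only: no tagger, parser, or ner
    doc = nlp(text)  # calling nlp still returns a Doc
    return [token.text for token in doc]


if English is not None:
    # English is a subclass of the generic Language class.
    print(issubclass(English, Language))

print(blank_english_tokens("This is a text"))
```

No model download is needed for this, since a blank English object carries only the language's tokenization rules.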
