I built an artificial machine-learning writer

Shared by Byrx.net on 2019-03-23 10:03:13


Translated by Ree Ray (编橙之家) and proofread by sunshinebuel. Do not reproduce without permission.
Original English article by Jacob Plaster.

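I recently completed my advanced artificial intelligence module at the University of Hull. It was fantastic. Machine learning techniques fascinated me in particular, and the range of potential applications built on them looks very promising. Once I had climbed the steep learning curve of how (artificial neural) networks work, I decided it was time to write something.

Artificial neural network (ANN) writer

While scouring the internet to research the wonders of machine learning, I stumbled across a project on GitHub that used a recurrent neural network to imitate Shakespeare's writing style. I loved the idea, so I tried to create a version of my own that is almost entirely different. I decided to use the scikit machine learning library, because it is particularly easy to use and configure.

Scikit also has a huge community, with plenty of tutorials and many example datasets you can use to train your own neural network. The writer I built uses multiple support vector machine (SVM) engines: one vector machine structures the sentence, and several smaller vector machines pick words from the vocabulary.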

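Sentence structuring

Sentence structuring was very successful, and the algorithm I currently use already reaches a high accuracy. The biggest obstacle at this stage was normalizing the training data (turning it into normals). I used NLTK, a natural language library, to convert the training data into part-of-speech tags (phrase identifiers), for example NN (noun), DET (determiner), $ (symbol) and so on.

This means I can use these tags to normalize the data, like this: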

["The", "cat", "jumped"] = ['DET', 'NN', 'VP']

Once normalized, it looks like this:

['DET', 'NN', 'VP'] = [0.2237823, 0.82392, 0.342323]
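The article does not say how a tag becomes a number, so the following is only a minimal sketch of one possible scheme: give every distinct tag an evenly spaced value strictly between 0 and 1. The function names `build_tag_table` and `normalise`, the tag list, and the spacing rule are all assumptions made for illustration; they will not reproduce the values shown above.

```python
# Hypothetical sketch: map each part-of-speech tag to a fixed number in (0, 1).
# The evenly spaced values are an assumption, not the author's actual mapping.

def build_tag_table(tags):
    """Assign each distinct tag an evenly spaced value strictly between 0 and 1."""
    unique = sorted(set(tags))
    step = 1.0 / (len(unique) + 1)
    return {tag: (i + 1) * step for i, tag in enumerate(unique)}


def normalise(tag_sequence, table):
    """Replace every tag in the sequence with its numeric value."""
    return [table[tag] for tag in tag_sequence]


# Tags taken from the article's examples.
tag_table = build_tag_table(["DET", "NN", "VP", "PRP"])
encoded = normalise(["DET", "NN", "VP"], tag_table)
print(encoded)
```

Any mapping works as long as it is one-to-one, because the same table has to be used later to turn predicted numbers back into tags.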

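Now I only need a target normal (the normalized value of the target) to feed into the neural network and start training. When reading text from a blob, the training target is simply the next word in the blob, so: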

["The", "cat", "jumped"]["towards"] = ['DET', 'NN', 'VP']["PRP"] = [0.2237823, 0.82392, 0.342323][0.12121212]
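The pairing of three inputs with the next value can be sketched as a sliding window over the normalized text. This is a sketch under assumptions: the window size of 3 is inferred from the examples, `make_training_pairs` is a hypothetical helper name, and the fifth value in the sample list is a made-up continuation added only so that there are two windows.

```python
# Hypothetical sketch: build (input window, target) pairs from a normalized text.

def make_training_pairs(values, window=3):
    """Pair every run of `window` values with the value that follows it."""
    pairs = []
    for i in range(len(values) - window):
        pairs.append((values[i:i + window], values[i + window]))
    return pairs


# First four values come from the article; the fifth is invented for illustration.
encoded_text = [0.2237823, 0.82392, 0.342323, 0.12121212, 0.2237823]
pairs = make_training_pairs(encoded_text)
for inputs, target in pairs:
    print(inputs, "->", target)
```

Each pair is one training vector: the inputs are the features and the target is the value the model should learn to predict.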

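The next step was to get hold of a large amount of J.K. Rowling's "Harry Potter" text and start imitating her sentence structure.

Vocabulary

Vocabulary was without doubt the hardest part of this project. I am well aware that there is no good reason not to use a recurrent neural network, and that predicting each character would be a better approach. However, the method I chose produced results that looked particularly interesting.

The vocabulary is stored in the training blob as a matrix of word sequences. Each word is broken down into its part-of-speech tag and then normalized. The normalized value is kept alongside the word, because the pairs later serve as the lookup table for converting normalized values back into words. The vocabulary looks like this: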

[[(cat, [0.232342]), (bat, [0.2553535]), (dog, [0.345454]), (horse, [0.4544646]) ... ]
[(run, [0.12131]), (jump, [0.232323]), (fall, [0.43443434]) ... ]
...
]
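One way to use such a table to turn a predicted number back into a word is a nearest-value lookup, sketched below. The grouping of rows under the keys "NN" and "VB", the helper name `nearest_word`, and the nearest-value rule are assumptions for illustration; the article only says that the stored pairs act as a mapping.

```python
# Hypothetical sketch: map a predicted normalized value back to a word.

# Words and values are taken from the article; the tag keys are assumed.
vocabulary = {
    "NN": [("cat", 0.232342), ("bat", 0.2553535), ("dog", 0.345454), ("horse", 0.4544646)],
    "VB": [("run", 0.12131), ("jump", 0.232323), ("fall", 0.43443434)],
}


def nearest_word(tag, value, vocab):
    """Return the word under `tag` whose stored value is closest to `value`."""
    candidates = vocab[tag]
    word, _ = min(candidates, key=lambda pair: abs(pair[1] - value))
    return word


chosen = nearest_word("NN", 0.33, vocabulary)
print(chosen)
```

A lookup like this always returns some word from the table, which may be one reason a small vocabulary yields repetitive output.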

Trying it out

Using HarryPotter(small).txt

This dataset contains 346 training vectors. It is the smallest training set.

Input: "Harry dont look"
Output: "Harry dont look had at eleven for witches had been for eleven then with nothing had been for eleven then with nothing had been for eleven then with nothing had been for eleven "

Input: "Wizards sometimes do"
Output: "wizards sometimes do , Harry had been been , Harry had been been , Harry had been been , Harry had been been , Harry had been been , Harry had been been "

You can see that the neural network is trying hard to learn but lacks training data. It almost falls into an infinite loop.
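A likely explanation for the looping, not confirmed by the author, is that a deterministic predictor looking at a fixed window of recent items must repeat itself as soon as a window recurs. The toy below illustrates this with a hand-written lookup table standing in for a trained model. The names `generate`, `toy_model`, and `predict_next` are hypothetical, and the toy predicts words directly for readability, whereas the article's program predicts tags first.

```python
# Hypothetical sketch: why a fixed-window, deterministic predictor can cycle.

def generate(seed, predict_next, length):
    """Extend `seed` by repeatedly predicting from the last three items."""
    out = list(seed)
    for _ in range(length):
        out.append(predict_next(tuple(out[-3:])))
    return out


# Hand-written stand-in for a model trained on very little data.
toy_model = {
    ("had", "been", "for"): "eleven",
    ("been", "for", "eleven"): "then",
    ("for", "eleven", "then"): "with",
    ("eleven", "then", "with"): "nothing",
    ("then", "with", "nothing"): "had",
    ("with", "nothing", "had"): "been",
    ("nothing", "had", "been"): "for",
}


def predict_next(context):
    """Look up the next word; fall back to a default for unseen contexts."""
    return toy_model.get(context, "had")


words = generate(["had", "been", "for"], predict_next, 14)
print(" ".join(words))
```

Once the window ("had", "been", "for") comes around again, every later word is forced to repeat. A larger training set gives the model more distinct contexts and so more ways out of a cycle.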

Using HarryPotter(medium).txt

The dataset contains 2500 training vectors.

Input: "Harry don't look"
Output: "Harry don't look , incredibly that a year for been parents in . followers , Harry , and Potter was been curse . Harry was up a year , Harry was been curse "

Input: "Wizards sometimes do"
Output: "wizards sometimes do parents."

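Now you can see the neural network producing the kind of result we expected. With more training data, its performance improved substantially.

Using HarryPotter(large).txt

The dataset contains 4550 training vectors.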

Input: "Harry don't look"
Output: "Harry don't look Petunia Harry , incredibly had on the address."
Now, apart from a missing full stop between 'Petunia' and 'Harry', the sentence structure is almost perfect.

Input: "Wizards sometimes do"
Output: "Wizards sometimes do the address on the address on the address."
Here, the increase in training data has given a negative result. This is rare.

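Using MacbookAirBlog(large).txt

I grabbed some text from a random blog. Surprisingly, this dataset often produces more accurate results. Let's take a look. The dataset contains 3576 training vectors.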

Input: "Harry don't look"
Output: "Harry don't look , also of a scaling on playing . Use for Control weight goes so cable and they've of placed it . you do to want things at at 2015."

Input: "Wizards sometimes do"
Output: "Wizards sometimes do When ports a scaling the have object , also of a scaling on playing ."

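The results are correct, but the vocabulary is limited. A dataset with more than 10,000 training vectors is coming soon.

Running some unit tests

The program produces very accurate results when predicting only the next word in a sequence, but accuracy drops once it starts generating long sequences. I created a unit test that compares the generated next word with the word J.K. Rowling actually wrote. I got the following results: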

Failed Tests: (6/48)
[('very', 'RB'), ('likely', 'JJ'), ('replace', 'NN')]   Target: [('the', 'DT')]    Prediction: ['.'] 20.0%
[('entirely', 'RB'), ('once', 'RB'), ('Apple', 'NNP')]   Target: [('is', 'VBZ')]    Prediction: ['RBS'] 20.0%
[('once', 'RB'), ('Apple', 'NNP'), ('is', 'VBZ')]   Target: [('able', 'JJ')]    Prediction: ['RBR'] 20.0%
[('able', 'JJ'), ('to', 'TO'), ('bring', 'VB')]   Target: [('its', 'PRP$')]    Prediction: ['RP'] 20.0%
[('down', 'IN'), ('enough', 'RB'), (',', ',')]   Target: [('though', 'IN')]    Prediction: ['VBN'] 20.0%
[('though', 'IN'), ('this', 'DT'), ('may', 'MD')]   Target: [('take', 'VB')]    Prediction: ['.'] 20.0%


Non-Fatal failed Tests: (24/48)
[('The', 'DT'), ('12-inch', 'JJ'), ('Retina', 'NNP')]   Target: [('MacBook', 'NN')]    Prediction: [','] 40.0%
[('Retina', 'NNP'), ('MacBook', 'NNP'), ('is', 'VBZ')]   Target: [('Apple', 'NNP')]    Prediction: ['IN'] 40.0%
[('MacBook', 'NN'), ('is', 'VBZ'), ('Apple', 'NNP')]   Target: [("'", 'POS')]    Prediction: ['IN'] 40.0%
[('Apple', 'NNP'), ("'", 'POS'), ('s', 'NNS')]   Target: [('latest', 'JJS')]    Prediction: ['VBP'] 40.0%
[("'", 'POS'), ('s', 'NNS'), ('latest', 'JJS')]   Target: [('and', 'CC')]    Prediction: ['IN'] 40.0%
[('latest', 'JJS'), ('and', 'CC'), ('greatest', 'JJS')]   Target: [('notebook', 'NN')]    Prediction: ['.'] 60.0%
[('and', 'CC'), ('greatest', 'JJS'), ('notebook', 'NN')]   Target: [(',', ',')]    Prediction: ['NN'] 40.0%
[('greatest', 'JJS'), ('notebook', 'NN'), (',', ',')]   Target: [('and', 'CC')]    Prediction: ['DT'] 40.0%
[('notebook', 'NN'), (',', ','), ('and', 'CC')]   Target: [('will', 'MD')]    Prediction: ['NN'] 40.0%
[('and', 'CC'), ('will', 'MD'), ('very', 'RB')]   Target: [('likely', 'JJ')]    Prediction: [','] 40.0%
[('will', 'MD'), ('very', 'RB'), ('likely', 'JJ')]   Target: [('replace', 'NN')]    Prediction: ['TO'] 40.0%
[('the', 'DT'), ('MacBook', 'NNP'), ('Air', 'NNP')]   Target: [('entirely', 'RB')]    Prediction: ['NN'] 40.0%
[('MacBook', 'NN'), ('Air', 'NNP'), ('entirely', 'RB')]   Target: [('once', 'RB')]    Prediction: ['VBZ'] 60.0%
[('Air', 'NNP'), ('entirely', 'RB'), ('once', 'RB')]   Target: [('Apple', 'NNP')]    Prediction: ['RB'] 40.0%
[('Apple', 'NNP'), ('is', 'VBZ'), ('able', 'JJ')]   Target: [('to', 'TO')]    Prediction: ['NN'] 40.0%
[('to', 'TO'), ('bring', 'VB'), ('its', 'PRP$')]   Target: [('costs', 'NNS')]    Prediction: ['VB'] 40.0%
[('its', 'PRP$'), ('costs', 'NNS'), ('down', 'IN')]   Target: [('enough', 'RB')]    Prediction: ['DT'] 40.0%
[('costs', 'NNS'), ('down', 'RB'), ('enough', 'RB')]   Target: [(',', ',')]    Prediction: ['RB'] 40.0%
[(',', ','), ('though', 'IN'), ('this', 'DT')]   Target: [('may', 'MD')]    Prediction: ['JJS'] 40.0%
[('a', 'DT'), ('few', 'JJ'), ('generations', 'NNS')]   Target: [('.', '.')]    Prediction: ['VBP'] 60.0%
[('few', 'JJ'), ('generations', 'NNS'), ('.', '.')]   Target: [('It', 'PRP')]    Prediction: ['WRB'] 40.0%
[('.', '.'), ('It', 'PRP'), ('is', 'VBZ')]   Target: [('fresh', 'JJ')]    Prediction: ['DT'] 40.0%
[('It', 'PRP'), ('is', 'VBZ'), ('fresh', 'JJ')]   Target: [('on', 'IN')]    Prediction: ['TO'] 60.0%
[('is', 'VBZ'), ('fresh', 'JJ'), ('on', 'IN')]   Target: [('the', 'DT')]    Prediction: ['JJ'] 40.0%


Passed Tests: (14/48)
[('12-inch', 'JJ'), ('Retina', 'NNP'), ('MacBook', 'NNP')]   Target: [('is', 'VBZ')]    Prediction: ['NNP'] 100.0%
[('is', 'VBZ'), ('Apple', 'NNP'), ("'", 'POS')]   Target: [('s', 'NNS')]    Prediction: ['NNP'] 40.0%
[('s', 'NNS'), ('latest', 'VBP'), ('and', 'CC')]   Target: [('greatest', 'JJS')]    Prediction: ['JJ'] 40.0%
[(',', ','), ('and', 'CC'), ('will', 'MD')]   Target: [('very', 'RB')]    Prediction: ['RB'] 20.0%
[('likely', 'JJ'), ('replace', 'NN'), ('the', 'DT')]   Target: [('MacBook', 'NN')]    Prediction: ['NN'] 40.0%
[('replace', 'NN'), ('the', 'DT'), ('MacBook', 'NNP')]   Target: [('Air', 'NNP')]    Prediction: ['NN'] 40.0%
[('is', 'VBZ'), ('able', 'JJ'), ('to', 'TO')]   Target: [('bring', 'VBG')]    Prediction: ['VB'] 60.0%
[('bring', 'VBG'), ('its', 'PRP$'), ('costs', 'NNS')]   Target: [('down', 'IN')]    Prediction: ['IN'] 40.0%
[('enough', 'RB'), (',', ','), ('though', 'IN')]   Target: [('this', 'DT')]    Prediction: ['DT'] 40.0%
[('this', 'DT'), ('may', 'MD'), ('take', 'VB')]   Target: [('a', 'DT')]    Prediction: ['DT'] 40.0%
[('may', 'MD'), ('take', 'VB'), ('a', 'DT')]   Target: [('few', 'JJ')]    Prediction: ['NN'] 80.0%
[('take', 'VB'), ('a', 'DT'), ('few', 'JJ')]   Target: [('generations', 'NNS')]    Prediction: ['NN'] 40.0%
[('generations', 'NNS'), ('.', '.'), ('It', 'PRP')]   Target: [('is', 'VBZ')]    Prediction: ['VBP'] 40.0%
[('fresh', 'JJ'), ('on', 'IN'), ('the', 'DT')]   Target: [('market', 'NN')]    Prediction: ['NN'] 40.0%


Passed: 14   Non-Fatals: 24   Fails: 6
Network accuracy: 13.6%

You can see this from the command line:

python3 main.py -utss -td "Datasets/MacbookAirBlog(large).txt"
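The comparison behind a test like this can be sketched as a loop that slides over tagged text, predicts the tag that follows each three-tag window, and counts exact matches. This is only a sketch under assumptions: `next_tag_accuracy` and `always_dt` are hypothetical names, the predictor is a trivial stand-in, and the article does not explain how the passed, non-fatal and failed buckets or the percentages in the output are computed, so none of that is reproduced here.

```python
# Hypothetical sketch: count exact next-tag matches over a tagged text.

def next_tag_accuracy(tagged_tokens, predict_tag, window=3):
    """Return (correct, total) for next-tag prediction over sliding windows."""
    tags = [tag for _, tag in tagged_tokens]
    correct = 0
    total = 0
    for i in range(len(tags) - window):
        context = tuple(tags[i:i + window])
        target = tags[i + window]
        total += 1
        if predict_tag(context) == target:
            correct += 1
    return correct, total


# Tagged tokens taken from the test output in the article.
sample = [("this", "DT"), ("may", "MD"), ("take", "VB"),
          ("a", "DT"), ("few", "JJ"), ("generations", "NNS")]


def always_dt(context):
    """Trivial stand-in predictor that always answers 'DT'."""
    return "DT"


correct, total = next_tag_accuracy(sample, always_dt)
print(correct, "of", total, "windows predicted exactly")
```

In a real test, `predict_tag` would wrap the trained model; the stand-in is only there to keep the sketch runnable.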

I tested the vocabulary using the same idea:

Failed Tests: (19/46)
(12-inch, JJ)        Target: MacBook        Pred: Retina    20.0%
(latest, JJS)        Target: greatest        Pred: heaviest    20.0%
(and, CC)        Target: notebook        Pred: faster    20.0%
(MacBook, NNP)        Target: entirely        Pred: now    20.0%
(Air, NNP)        Target: once        Pred: now    20.0%
(entirely, RB)        Target: Apple        Pred: micro-USB    20.0%
(once, RB)        Target: is        Pred: theres    20.0%
(Apple, NNP)        Target: able        Pred: want    20.0%
(its, PRP$)        Target: down        Pred: on    20.0%
(costs, NNS)        Target: enough        Pred: portable    20.0%
(enough, JJ)        Target: though        Pred: of    20.0%
(though, IN)        Target: may        Pred: can    20.0%
(this, DT)        Target: take        Pred: have    20.0%
(may, MD)        Target: a        Pred: the    20.0%
(take, VB)        Target: few        Pred: later    20.0%
(a, DT)        Target: generations        Pred: thats    20.0%
(It, PRP)        Target: fresh        Pred: same    20.0%
(is, VBZ)        Target: on        Pred: in    20.0%
(on, IN)        Target: market        Pred: playing    20.0%


Non-Fatal failed Tests: (13/46)
(,, ,)        Target: 12-inch        Pred: many    40.0%
(The, DT)        Target: Retina        Pred: MacBook    40.0%
(MacBook, NNP)        Target: Apples        Pred: X    40.0%
(is, VBZ)        Target: latest        Pred: best    40.0%
(Apples, NNP)        Target: and        Pred: but    40.0%
(,, ,)        Target: will        Pred: can    60.0%
(and, CC)        Target: very        Pred: not    60.0%
(will, MD)        Target: likely        Pred: easy    40.0%
(very, RB)        Target: replace        Pred: power    40.0%
(the, DT)        Target: Air        Pred: MacBook    40.0%
(able, JJ)        Target: bring        Pred: be    40.0%
(to, TO)        Target: its        Pred: my    60.0%
(bring, VB)        Target: costs        Pred: things    40.0%


Passed Tests: (13/46)
(Retina, NNP)        Target: is        Pred: is   40.0%
(greatest, JJS)        Target: ,        Pred: ,   60.0%
(notebook, NN)        Target: and        Pred: and   80.0%
(likely, JJ)        Target: the        Pred: the   100.0%
(replace, NN)        Target: MacBook        Pred: MacBook   20.0%
(is, VBZ)        Target: to        Pred: to   60.0%
(down, IN)        Target: ,        Pred: ,   100.0%
(,, ,)        Target: this        Pred: the   80.0%
(few, JJ)        Target: .        Pred: .   100.0%
(generations, NNS)        Target: It        Pred: It   60.0%
(., .)        Target: is        Pred: is   40.0%
(fresh, JJ)        Target: the        Pred: the   100.0%
(the, DT)        Target: ,        Pred: ,   100.0%


Passed: 13   Non-Fatals: 13   Fails: 19

python3 main.py -utv -td "Datasets/MacbookAirBlog(large).txt"

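A prediction is classed as "passed" if its prediction estimation exceeds 80%.

All of the results above come from an unfinished program, which is why they do not look accurate.

This experiment is for educational purposes only and will never be used commercially.

If you would like to look at the project, you can find it on GitHub.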
