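Huginn is a capable scraping tool: a webhook agent takes the batch input, and a Microsoft Forms form stores the output. This walkthrough scrapes the word origin entries from the Oxford dictionary site as an example.
Overall flow: shell -> webhook agent -> website agent -> post agent -> Microsoft Forms
Besides Microsoft Forms, any other online form system, or an online database with a REST API, would also work.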
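Create the form
Use Long Answer questions so that long scraped content does not cause problems, and choose "Anyone with the link can respond" so that the post agent can submit responses to the form later.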
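Create the agents
The webhook agent receives POST requests, sent here as JSON, so a bash script built around curl can submit them in bulk. Replace the link in the code below with your own webhook link and save the script as webhook.sh in the same directory as task.txt, which lists the pages to scrape, one per line. The while loop reads task.txt line by line into the variable task, and curl sends each value as a JSON POST request.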
    #!/bin/bash
    # Read task.txt line by line and post each line to the webhook as JSON.
    while read -r task
    do
      curl -H "Content-Type: application/json" -X POST -d "{\"url\":\"$task\"}" https://****.herokuapp.com/users/1/web_requests/48/form
    done < task.txt
Run the script:
    bash webhook.sh
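The loop above posts every line as-is, so a blank line or a Windows line ending in task.txt ends up in the url field, and a failed request goes unnoticed. A slightly more defensive variant might look like the sketch below. The WEBHOOK_URL value and the failed.txt log are placeholders of mine, not part of the original setup, and treating any 2xx status as success is an assumption about how the webhook agent responds.

```shell
#!/bin/bash
# Placeholder: replace with your own webhook link.
WEBHOOK_URL="https://example.invalid/users/1/web_requests/48/form"
# Placeholder: requests that did not return a 2xx status are logged here.
FAILED_LOG="failed.txt"

# Remove carriage returns left over from Windows line endings.
normalize_line() {
  printf '%s' "$1" | tr -d '\r'
}

# Post every non-empty line of the given file to the webhook as JSON.
post_tasks() {
  local file=$1 task status
  # The "|| [ -n ... ]" part also handles a last line without a newline.
  while IFS= read -r task || [ -n "$task" ]; do
    task=$(normalize_line "$task")
    [ -z "$task" ] && continue
    # -w '%{http_code}' prints only the HTTP status; the body is discarded.
    status=$(curl -s -o /dev/null -w '%{http_code}' \
      -H "Content-Type: application/json" \
      -X POST -d "{\"url\":\"$task\"}" "$WEBHOOK_URL")
    case $status in
      2??) ;;
      *) printf '%s\t%s\n' "$status" "$task" >> "$FAILED_LOG" ;;
    esac
  done < "$file"
}

# Run only when a task file is given, e.g.: bash webhook.sh task.txt
if [ "$#" -gt 0 ]; then
  post_tasks "$1"
fi
```

Anything listed in failed.txt can be copied into a new task file and submitted again.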
Webhook agent configuration, where payload_path follows JSONPath rules:
    {
      "secret": "form",
      "expected_receive_period_in_days": 1,
      "payload_path": "."
    }
An event produced by running the script:
    {
      "url": "http://www.oxfordlearnersdictionaries.com/search/english/?q=package",
      "web_request": {
        "url": "http://www.oxfordlearnersdictionaries.com/search/english/?q=package"
      }
    }
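Website agent configuration. Setting mode to merge passes the url field from the upstream webhook agent's event through to the downstream post agent. extract defines what to pull from the page, using either CSS selectors or XPath. Some words have no word origin, and extracting only that field creates no event for them, which makes them hard to tell apart later from words that were simply missed. It therefore helps to add a second field that extracts an element present on every page, such as the headword.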
    {
      "expected_update_period_in_days": "2",
      "url": "{{url}}",
      "type": "html",
      "mode": "merge",
      "extract": {
        "word": {
          "xpath": ".//*[@class='webtop-g']/h2",
          "value": "string(.)"
        },
        "wordorigin": {
          "xpath": ".//*[@unbox='wordorigin']",
          "value": "string(.)"
        }
      }
    }
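Press F12 in the browser to open the developer tools, switch to the Network tab, and capture the POST request that Microsoft Forms sends when a response is submitted.
If the scraped content contains an ASCII backslash (\) or double quote ("), replace them with \\ and \" so that Microsoft Forms does not reject the submission (test for yourself whether other online form systems have the same restriction). Post agent configuration: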
    {
      "post_url": "https://forms.office.com/formapi/api/50377c39-02d2-4bbd-adc9-8ded510e8cb4/users/f4298a1d-a4da-4108-b0ff-8a592db3013e/forms('OXw3UNICvUutyY3tUQ6MtB2KKfTapAhBsP-KWS2zAT5UNTFVTDE2QU83RjdTSDk1NzI0S1BZSEFNVC4u')/responses",
      "expected_receive_period_in_days": "1",
      "content_type": "form",
      "method": "post",
      "payload": {
        "answers": "[{\"questionId\":\"r0d501ba965bc4dd8ad204b70b667f6e8\",\"answer1\":\"{{word}}\"},{\"questionId\":\"raec212ddb768410fafe4b584ab1be09a\",\"answer1\":\"{{wordorigin}}\"}]",
        "startDate": "2017-09-14T07:26:26.706Z",
        "submitDate": "2017-09-14T07:26:47.127Z"
      },
      "headers": {
      },
      "emit_events": "true",
      "no_merge": "false",
      "output_mode": "clean"
    }
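The backslash and double-quote replacement is easy to get wrong by hand, so here is a small shell sketch of it, which may be handy when testing the form endpoint with curl before wiring up the post agent. escape_for_json and build_answers are names made up for illustration. The sketch covers only the two characters mentioned in this post and does not handle line breaks or other control characters.

```shell
#!/bin/bash
# Escape backslashes first, then double quotes. Doing it in the other
# order would double the backslashes that the quote step just added.
escape_for_json() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build the "answers" value for two questions.
# Arguments: question id 1, answer 1, question id 2, answer 2.
build_answers() {
  printf '[{"questionId":"%s","answer1":"%s"},{"questionId":"%s","answer1":"%s"}]' \
    "$1" "$(escape_for_json "$2")" "$3" "$(escape_for_json "$4")"
}
```

The output of build_answers can then be sent as the answers form field, for instance with curl's --data-urlencode option, which mirrors the "content_type": "form" setting of the post agent.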
Download the form
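Once the responses are downloaded, the word column makes it possible to check which queries never produced a row at all. Below is a rough sketch under two assumptions: the word column has been saved to a plain-text file with one word per line, and the headword on each page matches the search term in the URL exactly. The missing_words function and the collected.txt file name are placeholders.

```shell
#!/bin/bash
# Print the search terms found in a task file that do not appear in a
# file of collected words. Usage: missing_words task.txt collected.txt
missing_words() {
  local tasks=$1 collected=$2 wanted got
  wanted=$(mktemp)
  got=$(mktemp)
  # Pull the search term out of each URL (the text after "q=").
  tr -d '\r' < "$tasks" \
    | sed -n 's/.*[?&]q=\([^&]*\).*/\1/p' \
    | sort -u > "$wanted"
  tr -d '\r' < "$collected" | sort -u > "$got"
  # comm -23 keeps the lines that appear only in the first file.
  comm -23 "$wanted" "$got"
  rm -f "$wanted" "$got"
}
```

The terms it prints can be turned back into URLs and fed through webhook.sh again.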