mirror of
https://github.com/d0zingcat/infloop.life.git
synced 2026-05-19 15:10:07 +00:00
542 lines
24 KiB
Markdown
542 lines
24 KiB
Markdown
---
|
||
title: "2018-11-18"
|
||
date: 2018-11-18T12:55:09+08:00
|
||
draft: true
|
||
---
|
||
|
||
这周刚巧基友在爬国学网站,爬出来的都是json,于是他想到了存在mongodb中,然后再导出为PDF。因为他跟我提了这件事情,联想到腾讯的招聘要求中有一条加分项就是了解过mongodb,心想自己也得去研究一下(于是???最后也没看mongo是怎么玩的是么?【逃】)。于是准备安装。在mac下直接使用`brew install mongodb`结果炸了:
|
||
```
|
||
php@7.0
|
||
mongodb: A full installation of Xcode.app 8.3.2 is required to compile this software.
|
||
Installing just the Command Line Tools is not sufficient.
|
||
Xcode 8.3.2 cannot be installed on macOS 10.11.
|
||
You must upgrade your version of macOS.
|
||
Error: An unsatisfied requirement failed this build.
|
||
```
|
||
<!--more-->嗯,欺负我是黑苹果还是EI Capitan 也就是10.11没法升级没法安装最新的Xcode。嗯,只能另辟蹊径,那我装个docker吧。`brew install docker` 一切正常,但是当pull的时候提示unix socket未启动,我一想这不就是daemon线程没起么,各种命令敲完之后发现好像自己犯了个错,没有装全组件。brew search一下还真不少,不知道该装哪个了。Google之后发现要装docker-machine,docker-composer还有一堆虚拟化的东西要配,觉得好麻烦,就直接翻docker官网,发现有现成的docker app下载,直接下载解压打开,提示我一定要10.12之后才行,我一口老血。删除之,另找方法。突然转念一想,我不是有虚拟机么,虚拟机里面装个,在宿主机上连接就行啦。果断尝试,刚好我的ubuntu cosmic还是新鲜出炉的流畅地一塌糊涂,装起东西来也是飞快,不过当时用linuxbrew安装docker的时候也陷入了窘境,因为不知道该装哪个,毕竟brew是从OSX移植过来的,毕竟原始是针对Mac平台的。只能去翻docker的官方文档,步骤如下:
|
||
|
||
```bash
|
||
sudo apt install apt-transport-https ca-certificates curl software-properties-common
|
||
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
|
||
sudo apt-key fingerprint 0EBFCD88
|
||
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
|
||
sudo apt update
|
||
sudo apt install docker-ce
|
||
```
|
||
|
||
但是当这些操作完了之后还是没能安装上docker,软件源中压根没有这个。查了下,发现因为ubuntu cosmic用的是debian 10的内核,换言之,lsb_release -cs 这个命令取到的是buster,这个版本太新了docker还没有适配。然后在这边找到了答案,其实很简单,[把系统版本直接写死](https://github.com/docker/for-linux/issues/442),也就是想这样: `sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"` 即可开心地食用,起码我还没遇到啥不能用的bug。另外值得一提的是,parallels使用默认的shared network模式即可实现宿机主机之间通讯,其实通过ifconfig查看也能发现端倪:
|
||
```
|
||
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
|
||
ether c4:17:fe:d4:d9:71
|
||
inet6 fe80::c617:feff:fed4:d971%en0 prefixlen 64 scopeid 0x5
|
||
inet 192.168.50.201 netmask 0xffffff00 broadcast 192.168.50.255
|
||
nd6 options=1<PERFORMNUD>
|
||
media: autoselect
|
||
status: active
|
||
p2p0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 2304
|
||
ether 06:17:fe:d4:d9:71
|
||
media: autoselect
|
||
status: inactive
|
||
vnic0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
|
||
options=3<RXCSUM,TXCSUM>
|
||
ether 00:1c:42:00:00:08
|
||
inet 10.211.55.2 netmask 0xffffff00 broadcast 10.211.55.255
|
||
media: autoselect
|
||
status: active
|
||
vnic1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
|
||
options=3<RXCSUM,TXCSUM>
|
||
ether 00:1c:42:00:00:09
|
||
inet 10.37.129.2 netmask 0xffffff00 broadcast 10.37.129.255
|
||
media: autoselect
|
||
status: active
|
||
```
|
||
|
||
vnic0就是parallels虚拟出来的网卡了,类似的宿机上也有这么一块虚拟网卡。不过一开始调试的时候并不痛,捣鼓了半天才发现是我之前偷懒在linux的.profile文件(用户登录时自动加载一次,但是alias命令貌似无效)中export了http_proxy和https_proxy,所以所有的流量都走代理了,当访问10.xxx的时候走了代理所以就不通了。
|
||
|
||
既然踩完了这几个坑,自然就想这样要总结一下,所以想要写个博客记录一下。突然想到一直以来欠着的就是一个静态资源服务器。先不说现在的各种OSS有多坑,对象存储机制下的key-value管理非常地不方便,当想要迁移的时候也总是因为各种适配性问题,各种组织不同导致以前写博客时的图片全都很难维护或者失去维护,除了这一点我自己的服务器已经是CN2 GIA线路了,速度应该还是可以的很快的,所以完全可以用静态文件托管的方式来维护博客相关的图片和附件。
|
||
|
||
能想到的方案有golang写一个文件服务器,然后自动去掉前缀进行访问,例如https://blog.d0zingcat.xyz/fileserver/aaa/xxx.png 就访问对应的静态目录下的aaa文件夹下的xxx.png文件(自动移除fileserver前缀)。这么做的好处是通过访问网址来判断是否访问静态资源服务器(而对应的文件直接食用rsync命令同步到服务器上即可),而且因为走的还是blog.d0zingcat.xyz这个域名,所以可以沿用这个域名的https证书。但是坏处是静态资源依赖于这个域名,存在误导性。而且还有个问题是这个需要写nginx的location规则,然额我nginx玩的少不会写。所以只能放弃。那么文件就很简单了,只有新搞一个域名,申请上https证书(因为我的博客开了全站https且是强制开启的,如果引入http的静态资源的话会没法加载)。
|
||
|
||
但是主要矛盾是穷(就跟上面想到用docker来玩mongodb一个意思),我觉得再为单独一个域名来申请证书代价有点昂贵,而这个域名还是我心血来潮搞的。所以就想到免费的https证书签发组织[letsencrypt](https://letsencrypt.org/)。但是之前用这个都是用的官方的工具certbot,非常地庞大和臃肿,不够清真。所以就想到了以前在一个博主(现在记不得网址了)那边看到[dehydrated](https://github.com/lukas2511/dehydrated)这个工具,可以很轻便地签发ACME证书。食用方法差不多如下:
|
||
|
||
1. 克隆[dehydrated](https://github.com/lukas2511/dehydrated)项目
|
||
2. 在dehydrated目录下创建config(文本文件,示例如下)、challenge(文件夹)、domains.txt(文本文件,示例如下)
|
||
3. 需要配置对应的nginx。如果没有,可能需要装一个。配置nginx跳转访问well-known目录下的对应文件。示例如下。但是值得注意的是配置了^~这条规则就不要配置普通的 location/ {} 这个块,否则会冲突导致无法正常访问到well-known目录下的文件。当然,也可以在这个目录下新建一个test.txt,随便写点什么,方便测试看是否配置正确。至于是什么机理导致的,我也不清楚。
|
||
4. 需要映射对应的路径到容器中,命令为 `docker run --rm -d -v /home/d0zingcat/nginx/nginx.conf:/etc/nginx/nginx.conf -v /home/d0zingcat/dehydrated/challenge:/var/www/dehydrated -p 80:80 -p 443:443 nginx`,通过这个来启动对应的容器。/home/d0zingcat是我的$BASEDIR,另外包括dehydrated也在这个目录下。
|
||
5. 使用命令 `./dehydrated --register --accept-terms` 来完成注册
|
||
6. 使用命令 `./dehydrated -c` 来完成签约
|
||
|
||
|
||
*我的config*
|
||
值得注意的是CA字段在自己尝试的时候最好改成带“-staging-”的这个,不然可能会因为测试次数频繁而被封(我是没达到过,以前可能有);
|
||
另外就是如果使用了staging的这个CA且测试成功签发了对应的证书,那也没啥用的,不能拿来部署的,只是测试用。但是如果就简单改一下CA再重新签发还是会续签这个测试证书,比较好的解决方法就是直接把dehydrated目录下的accounts、chains、certs目录全部都移除。
|
||
我也试过使用自签的csr和private key,但是签约POST的时候失败了(400,不是合理的JSON),没有深究,也懒得管了。如果你有兴趣那么尽可以复现这个问题,然后去项目下面提一个issue好了。
|
||
|
||
```
|
||
########################################################
|
||
# This is the main config file for dehydrated #
|
||
# #
|
||
# This file is looked for in the following locations: #
|
||
# $SCRIPTDIR/config (next to this script) #
|
||
# /usr/local/etc/dehydrated/config #
|
||
# /etc/dehydrated/config #
|
||
# ${PWD}/config (in current working-directory) #
|
||
# #
|
||
# Default values of this config are in comments #
|
||
########################################################
|
||
|
||
# Which user should dehydrated run as? This will be implictly enforced when running as root
|
||
DEHYDRATED_USER=d0zingcat
|
||
|
||
# Which group should dehydrated run as? This will be implictly enforced when running as root
|
||
DEHYDRATED_GROUP=d0zingcat
|
||
|
||
# Resolve names to addresses of IP version only. (curl)
|
||
# supported values: 4, 6
|
||
# default: <unset>
|
||
#IP_VERSION=
|
||
|
||
# Path to certificate authority (default: https://acme-v02.api.letsencrypt.org/directory)
|
||
CA="https://acme-v02.api.letsencrypt.org/directory"
|
||
#CA="https://acme-staging-v02.api.letsencrypt.org/directory"
|
||
|
||
# Path to old certificate authority
|
||
# Set this value to your old CA value when upgrading from ACMEv1 to ACMEv2 under a different endpoint.
|
||
# If dehydrated detects an account-key for the old CA it will automatically reuse that key
|
||
# instead of registering a new one.
|
||
# default: https://acme-v01.api.letsencrypt.org/directory
|
||
#OLDCA="https://acme-v01.api.letsencrypt.org/directory"
|
||
|
||
# Which challenge should be used? Currently http-01, dns-01 and tls-alpn-01 are supported
|
||
CHALLENGETYPE="http-01"
|
||
|
||
# Path to a directory containing additional config files, allowing to override
|
||
# the defaults found in the main configuration file. Additional config files
|
||
# in this directory needs to be named with a '.sh' ending.
|
||
# default: <unset>
|
||
#CONFIG_D=
|
||
|
||
# Directory for per-domain configuration files.
|
||
# If not set, per-domain configurations are sourced from each certificates output directory.
|
||
# default: <unset>
|
||
#DOMAINS_D=
|
||
|
||
# Base directory for account key, generated certificates and list of domains (default: $SCRIPTDIR -- uses config directory if undefined)
|
||
BASEDIR=$SCRIPTDIR
|
||
|
||
# File containing the list of domains to request certificates for (default: $BASEDIR/domains.txt)
|
||
DOMAINS_TXT="${BASEDIR}/domains.txt"
|
||
|
||
# Output directory for generated certificates
|
||
CERTDIR="${BASEDIR}/certs"
|
||
|
||
# Output directory for alpn verification certificates
|
||
#ALPNCERTDIR="${BASEDIR}/alpn-certs"
|
||
|
||
# Directory for account keys and registration information
|
||
ACCOUNTDIR="${BASEDIR}/accounts"
|
||
|
||
# Output directory for challenge-tokens to be served by webserver or deployed in HOOK (default: /var/www/dehydrated)
|
||
WELLKNOWN="${BASEDIR}/challenge"
|
||
|
||
# Default keysize for private keys (default: 4096)
|
||
#KEYSIZE="4096"
|
||
|
||
# Path to openssl config file (default: <unset> - tries to figure out system default)
|
||
#OPENSSL_CNF=
|
||
|
||
# Path to OpenSSL binary (default: "openssl")
|
||
#OPENSSL="openssl"
|
||
|
||
# Extra options passed to the curl binary (default: <unset>)
|
||
#CURL_OPTS=
|
||
|
||
# Program or function called in certain situations
|
||
#
|
||
# After generating the challenge-response, or after failed challenge (in this case altname is empty)
|
||
# Given arguments: clean_challenge|deploy_challenge altname token-filename token-content
|
||
#
|
||
# After successfully signing certificate
|
||
# Given arguments: deploy_cert domain path/to/privkey.pem path/to/cert.pem path/to/fullchain.pem
|
||
#
|
||
# BASEDIR and WELLKNOWN variables are exported and can be used in an external program
|
||
# default: <unset>
|
||
#HOOK=
|
||
|
||
# Chain clean_challenge|deploy_challenge arguments together into one hook call per certificate (default: no)
|
||
#HOOK_CHAIN="no"
|
||
|
||
# Minimum days before expiration to automatically renew certificate (default: 30)
|
||
#RENEW_DAYS="30"
|
||
|
||
# Regenerate private keys instead of just signing new certificates on renewal (default: yes)
|
||
#PRIVATE_KEY_RENEW="yes"
|
||
|
||
# Create an extra private key for rollover (default: no)
|
||
#PRIVATE_KEY_ROLLOVER="no"
|
||
|
||
# Which public key algorithm should be used? Supported: rsa, prime256v1 and secp384r1
|
||
#KEY_ALGO=rsa
|
||
|
||
# E-mail to use during the registration (default: <unset>)
|
||
#CONTACT_EMAIL=
|
||
|
||
# Lockfile location, to prevent concurrent access (default: $BASEDIR/lock)
|
||
#LOCKFILE="${BASEDIR}/lock"
|
||
|
||
# Option to add CSR-flag indicating OCSP stapling to be mandatory (default: no)
|
||
#OCSP_MUST_STAPLE="no"
|
||
|
||
# Fetch OCSP responses (default: no)
|
||
#OCSP_FETCH="no"
|
||
|
||
# OCSP refresh interval (default: 5 days)
|
||
#OCSP_DAYS=5
|
||
|
||
# Issuer chain cache directory (default: $BASEDIR/chains)
|
||
#CHAINCACHE="${BASEDIR}/chains"
|
||
|
||
# Automatic cleanup (default: no)
|
||
#AUTO_CLEANUP="no"
|
||
|
||
# ACME API version (default: auto)
|
||
#API=auto
|
||
```
|
||
|
||
*我的domains*
|
||
|
||
我只签单域名的,且不涉及别名和通配,所以这边看起来会比较简单一些。你可以看项目的docs下的domains_txt.md这个文档,有很详尽的介绍。
|
||
|
||
```
|
||
files.d0zingcat.xyz
|
||
```
|
||
|
||
*我的nginx.conf*
|
||
|
||
```
|
||
server {
|
||
listen 80;
|
||
server_name files.d0zingcat.xyz;
|
||
location ^~ /.well-known/acme-challenge {
|
||
alias /var/www/dehydrated;
|
||
}
|
||
// start
|
||
location / {
|
||
root /var/www/blog.d0zingcat.xyz;
|
||
index index.html;
|
||
}
|
||
// end
|
||
}
|
||
|
||
```
|
||
|
||
签约s协议完成之后会在dehydrated目录下看到多了account、chains、certs三个文件夹,但是我们只用的到certs文件夹。里面会有注册的域名,比如我的files.d0zingcat.xyz。进去找到privkey.pem(私钥)和fullchain.pem(注册密钥),复制出去到对应的nginx目录下(为了方便区分特地修改了名字)。然后修改docker的启动脚本为`docker run --add-host="localhost:172.17.0.1" --rm -d -v /home/d0zingcat/nginx/nginx.conf:/etc/nginx/nginx.conf -v /home/d0zingcat/nginx/data/files.d0zingcat.xyz.key:/var/www/https/files.d0zingcat.xyz.key -v /home/d0zingcat/nginx/data/files.d0zingcat.xyz.crt:/var/www/https/files.d0zingcat.xyz.crt -v /home/d0zingcat/dehydrated/challenge:/var/www/dehydrated -p 80:80 -p 443:443 nginx`,nginx对应的配置为:
|
||
|
||
```
|
||
user nginx;
|
||
worker_processes 1;
|
||
|
||
error_log /var/log/nginx/error.log warn;
|
||
pid /var/run/nginx.pid;
|
||
|
||
|
||
events {
|
||
worker_connections 1024;
|
||
}
|
||
|
||
|
||
http {
|
||
include /etc/nginx/mime.types;
|
||
default_type application/octet-stream;
|
||
|
||
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
|
||
'$status $body_bytes_sent "$http_referer" '
|
||
'"$http_user_agent" "$http_x_forwarded_for"';
|
||
|
||
access_log /var/log/nginx/access.log main;
|
||
|
||
sendfile on;
|
||
#tcp_nopush on;
|
||
|
||
keepalive_timeout 65;
|
||
|
||
#gzip on;
|
||
|
||
include /etc/nginx/conf.d/*.conf;
|
||
|
||
##
|
||
# SSL Settings
|
||
##
|
||
|
||
ssl_protocols TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
|
||
#server {
|
||
# listen 80;
|
||
# server_name files.d0zingcat.xyz;
|
||
# location ^~ /.well-known/acme-challenge {
|
||
# alias /var/www/dehydrated;
|
||
# }
|
||
# #location / {
|
||
# # root /var/www/blog.d0zingcat.xyz;
|
||
# # index index.html;
|
||
# #}
|
||
#}
|
||
|
||
server {
|
||
listen 80;
|
||
server_name files.d0zingcat.xyz;
|
||
return 302 https://$host$request_uri;
|
||
}
|
||
server {
|
||
listen 443 http2 ssl;
|
||
ssl_certificate /var/www/https/files.d0zingcat.xyz.crt;
|
||
ssl_certificate_key /var/www/https/files.d0zingcat.xyz.key;
|
||
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
|
||
keepalive_timeout 70;
|
||
ssl_session_cache shared:SSL:10m;
|
||
ssl_session_timeout 10m;
|
||
ssl_session_tickets on;
|
||
ssl_stapling on;
|
||
ssl_stapling_verify on;
|
||
ssl_ciphers 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDHE-ECDSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA';
|
||
location / {
|
||
#root /var/www/blog.d0zingcat.xyz;
|
||
#index index.html;
|
||
#proxy_set_header X-Real-IP $remote_addr;
|
||
proxy_pass http://localhost:9000/;
|
||
}
|
||
server_name files.d0zingcat.xyz;
|
||
access_log /var/log/nginx/nginx.vhost.access.log;
|
||
error_log /var/log/nginx/nginx.vhost.error.log;
|
||
}
|
||
|
||
}
|
||
```
|
||
这个尽可照着改就是,不会有太大的问题,然后至此重启服务器就完成了服务器端的https的访问。就是90分钟要激活一次,那也挺累的。虽然隐隐约约还记得有hook机制,但是没精力去搞了,顶多就是到时候90天续期一次嘛。可以看道nginx的配置文件有反代了localhost:9000这个地址,这个就是我拿go写的很简单的静态资源服务器啦,源代码如下:
|
||
|
||
```golang
|
||
package main
|
||
|
||
import (
|
||
"net/http"
|
||
"os"
|
||
)
|
||
|
||
func main() {
|
||
args := os.Args[1:]
|
||
source := ""
|
||
cert := ""
|
||
key := ""
|
||
port := ""
|
||
if len(args) > 1 {
|
||
port = args[0]
|
||
source = args[1]
|
||
cert = args[2]
|
||
key = args[3]
|
||
} else {
|
||
port = "9000"
|
||
source = "tmp"
|
||
cert = "file.d0zingcat.xyz.pem"
|
||
key = "file.d0zingcat.xyz.key"
|
||
}
|
||
handler := getHandler(source)
|
||
http.ListenAndServe(":"+port, handler)
|
||
http.ListenAndServeTLS(":"+port, cert, key, handler)
|
||
}
|
||
|
||
func getHandler(dir string) http.Handler {
|
||
return http.FileServer(http.Dir((dir)))
|
||
}
|
||
|
||
```
|
||
编译了拿到服务器上使用命令`nohup ./directory-browser "static" "/home/d0zingcat/blog/data/blog.d0zingcat.xyz.crt" "/home/d0zingcat/blog/data/blog.d0zingcat.xyz.key" &`来运行静态资源(需要提前建好static文件夹)。然后本以为这个起好或者nginx也启动好,那么久万事大吉了。但是一次又一次的失败(502 bad gateway),让我焦头烂额不知道怎么处理。突然就有了一个要检查的项,不知道咋回事。但是
|
||
|
||
把[GCTT的golang系列教程](https://studygolang.com/subject/2)全部看完了,因为直接看GOPL实在是太慢了,一方面全英文理解起来没有中文快,另外就是大师就是大师,书籍中的知识深度、广度都非常深远、辽阔,涉及了太多的点就算泛读要一个个弄清楚(比如我就没有做书后的exercise)实在太废劲,就先拿一个浅显易懂的教程整体撸一遍。当然,如果英语好的话可以看[原文](https://golangbot.com/learn-golang-series/)。
|
||
|
||
因为总算看到了golang的信号量和协程,就写了个~~爬图的脚本~~看看多线程的威力。记得以前写的爬图脚本是用的python+bs4,单线程爬了要有好几个小时(当然,也跟电脑配置,网络环境有关系)。但是这次写的是直接解析html正则匹配出来之后直接多线程下载,1600张图片一共耗时2分半,非常的快,而且失败率也很低。关键代码可以见这个:
|
||
|
||
```golang
|
||
package spider
|
||
|
||
import (
|
||
"fmt"
|
||
"io/ioutil"
|
||
"net/http"
|
||
"os"
|
||
"path/filepath"
|
||
"regexp"
|
||
"strconv"
|
||
"strings"
|
||
"sync"
|
||
|
||
"github.com/d0zingcat/go-logger/logger"
|
||
)
|
||
|
||
var pagesCount int
|
||
var failedUrls []string
|
||
var mu *sync.Mutex = &sync.Mutex{}
|
||
|
||
func init() {
|
||
logger.SetRollingFile(".", "spider.log", 10, 50, logger.MB)
|
||
logger.SetLevel(logger.DEBUG)
|
||
htmlBytes, err := reqPage(HOME_URL)
|
||
if err != nil {
|
||
logger.Error("Get total page count failed!")
|
||
panic(err)
|
||
}
|
||
re := regexp.MustCompile(`<span aria-current='page' class='page-numbers current'>(\d+)</span>`)
|
||
pagesMatch := re.FindAllStringSubmatch(string(htmlBytes), -1)
|
||
if len(pagesMatch) > 0 && len(pagesMatch[0]) > 1 {
|
||
page := pagesMatch[0][1]
|
||
pagesCount, err = strconv.Atoi(page)
|
||
if err != nil {
|
||
logger.Error("Can not convert page number")
|
||
panic(err)
|
||
}
|
||
}
|
||
}
|
||
|
||
func Process(n int, dir string) {
|
||
count := pagesCount
|
||
flag := make([]int, pagesCount+1)
|
||
ch := make(chan int)
|
||
i := 1
|
||
for ; i <= pagesCount-n; i += n {
|
||
go dispatch(i, i+n, ch, dir)
|
||
}
|
||
go dispatch(i, pagesCount+1, ch, dir)
|
||
for count > 0 {
|
||
flag[<-ch] = 1
|
||
count--
|
||
}
|
||
logger.Info("Fail to get these urls: ", failedUrls)
|
||
}
|
||
|
||
func dispatch(start, end int, ch chan int, dir string) {
|
||
for i := start; i < end; i++ {
|
||
dynUrl := fmt.Sprintf(TEMPLATE_URL, i)
|
||
content, err := reqPage(dynUrl)
|
||
if err != nil {
|
||
logger.Error("Req ", dynUrl, " error!")
|
||
}
|
||
content = strings.Replace(content, "\r\n", "", -1)
|
||
content = strings.Replace(content, "\r", "", -1)
|
||
content = strings.Replace(content, "\n", "", -1)
|
||
re := regexp.MustCompile(`<li class="comment byuser(.*?</li>)`)
|
||
comments := re.FindAllString(content, -1)
|
||
for _, item := range comments {
|
||
re := regexp.MustCompile(`<img src="(.+?)".*?/>`)
|
||
imgs := re.FindAllStringSubmatch(item, -1)
|
||
err := storePic(imgs[0][1], dir, strconv.Itoa(i))
|
||
if err != nil {
|
||
// logger.Error(err)
|
||
continue
|
||
}
|
||
}
|
||
ch <- i
|
||
}
|
||
}
|
||
|
||
func storePic(url, location, prefix string) error {
|
||
if _, err := os.Stat(location); os.IsNotExist(err) {
|
||
err = os.Mkdir(location, 0744)
|
||
if err != nil {
|
||
logger.Error("create dir failed!")
|
||
return fmt.Errorf("Dir create fail")
|
||
}
|
||
}
|
||
ss := strings.Split(url, "/")
|
||
filename := ss[len(ss)-1]
|
||
resp, err := http.Get(url)
|
||
if err != nil {
|
||
logger.Error("Fail to request the pic: ", url)
|
||
failedUrls = conAppendSlice(failedUrls, url)
|
||
return err
|
||
}
|
||
bodyBytes, err := ioutil.ReadAll(resp.Body)
|
||
if err != nil {
|
||
logger.Error("Fail to read pic response: ", url)
|
||
failedUrls = conAppendSlice(failedUrls, url)
|
||
return err
|
||
}
|
||
|
||
err = ioutil.WriteFile(filepath.Join(location, prefix+"-"+filename), bodyBytes, 0744)
|
||
if err != nil {
|
||
logger.Error("Store pic failed: ", url)
|
||
failedUrls = conAppendSlice(failedUrls, url)
|
||
return err
|
||
}
|
||
return nil
|
||
}
|
||
|
||
func conAppendSlice(s []string, e string) []string {
|
||
mu.Lock()
|
||
s = append(s, e)
|
||
mu.Unlock()
|
||
return s
|
||
}
|
||
func reqPage(url string) (string, error) {
|
||
resp, err := http.Get(url)
|
||
if err != nil {
|
||
logger.Error("Fail to request the page")
|
||
return "", err
|
||
}
|
||
htmlBytes, err := ioutil.ReadAll(resp.Body)
|
||
if err != nil {
|
||
logger.Error("Fail to read response")
|
||
return "", err
|
||
}
|
||
return string(htmlBytes), nil
|
||
}
|
||
```
|
||
|
||
在看GOPL的过程中呢,看到了struct中匿名嵌入其他结构的用法,方法自推导的机制,方法也可以作为一个变量实现了类似于托管的机制。不过在往下推进的时候看到了Bit Vector,书中源码如下:
|
||
|
||
```golang
|
||
// An IntSet is a set of small non-negative integers.
|
||
// Its zero value represents the empty set.
|
||
type IntSet struct {
|
||
words []uint64
|
||
}
|
||
// Has reports whether the set contains the non-negative value x.
|
||
func (s *IntSet) Has(x int) bool {
|
||
word, bit := x/64, uint(x%64)
|
||
return word < len(s.words) && s.words[word]&(1<<bit) != 0
|
||
}
|
||
// Add adds the non-negative value x to the set.
|
||
func (s *IntSet) Add(x int) {
|
||
word, bit := x/64, uint(x%64)
|
||
for word >= len(s.words) {
|
||
s.words = append(s.words, 0)
|
||
}
|
||
s.words[word] |= 1 << bit
|
||
}
|
||
// UnionWith sets s to the union of s and t.
|
||
func (s *IntSet) UnionWith(t *IntSet) {
|
||
for i, tword := range t.words {
|
||
if i < len(s.words) {
|
||
s.words[i] |= tword
|
||
} else {
|
||
s.words = append(s.words, tword)
|
||
}
|
||
}
|
||
}
|
||
```
|
||
|
||
有点没看明白这个intset的机制,于是开始查资料。
|
||
|
||
Hashmap
|
||
|
||
- hashmap不线程安全、效率高、key,value可以为null
|
||
- hashset线程安全、效率低、key,value不能为空
|
||
|
||
Hashmap是一个数组链表,容量始终都是2的N次方(大于当前实际负载),阀值=负载因子*容量,当实际容量大于阀值时,就会进行扩容。
|
||
|
||
put时如果key为null,则取第一个bucket的entry,并根据链表一直往下找到key为null的,将value设为对应的值,否则新建一个entry。
|
||
如果key不为null,首先根据hashCode()方法获取key的哈希值,然后使用indexFor(hash,table.length)获取table中要存放的位置并存储。当两个元素的key的hashcode相同时,则在该位置上存放新的entry,并且指向原有的entry。所以对于一个table的某个位置而言,可能会存放多个键值对,但是最新的在最前面。
|
||
|
||
get时如果key为null则调用getForNullKey()方法,否则根据hash函数取到hash值,并根据indexFor()得到i的值,之后便利链表。如果有则返回对应的value。
|
||
|
||
|
||
|
||
|