Crawling user reviews from the Macy website, log (3): design of the comment crawler (implementation details)

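step0. The main function.

          1) Fetch every URL that has not been requested yet from MySQL and build a URL list.

          2) Send the info-crawler request to each URL in turn.

          3) Send the comment-crawler request to each URL in turn.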
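The three steps above can be sketched roughly as follows. This is only a minimal outline under assumptions: the function name `run_crawl` and its three callable parameters are hypothetical placeholders for the real MySQL query and the two crawlers, and running the info pass to completion before the comment pass is one reading of the list, not necessarily the final design.

```python
from typing import Callable, Iterable, List


def run_crawl(
    fetch_pending_urls: Callable[[], Iterable[str]],
    crawl_info: Callable[[str], None],
    crawl_comments: Callable[[str], None],
) -> List[str]:
    """Run the three step0 stages in order and return the URLs processed.

    All three arguments are hypothetical stand-ins: the real versions
    would query MySQL and send HTTP requests.
    """
    # 1) Build the list of URLs that have not been requested yet.
    urls = list(fetch_pending_urls())

    # 2) Send the info-crawler request to each URL in turn.
    for url in urls:
        crawl_info(url)

    # 3) Send the comment-crawler request to each URL in turn.
    for url in urls:
        crawl_comments(url)

    return urls


if __name__ == "__main__":
    # Stub callables: print the URL instead of crawling it.
    run_crawl(lambda: ["https://example.com/product/1"], print, print)
```

Passing the stages in as callables keeps the main function independent of the database driver and the HTTP library, so each stage can be swapped for a stub while the others are still being written.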

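step1. MySQL extraction.

          1) Check the MySQL data crawled at rank3, i.e. the number of product-page URLs. So far I have crawled more than 10000 rows.

               I therefore need to decide how, and in what order, the "rank3 MySQL extraction class" pulls rows out, and whether the number of rows extracted will fit in a Python list.

               ① Consider the capacity of a Python list.

                      1. 32-bit Python: the limit is about 536870912 (2^29) elements.

                      2. 64-bit Python: the limit is 1152921504606846975 (2^60 - 1) elements.

                      These are theoretical ceilings (the largest index divided by the pointer size); in practice available memory runs out long before they are reached. Either way, a list of more than 100000 MySQL rows fits comfortably in 64-bit Python, so for now I will keep using cursor.fetchall().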
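The capacity check and the fetchall() approach can be illustrated with a small sketch. Several things here are assumptions rather than the real project code: the table name `rank3_urls`, the columns `id`, `url` and `requested`, and all three function names are hypothetical, and the standard-library `sqlite3` module stands in for MySQL only so that the example runs without a server (both expose DB-API style cursors with `fetchall()` and `fetchmany()`). The `fetchmany()` variant is not part of the current plan; it is shown only as a possible fallback if the table ever outgrows memory.

```python
import sqlite3
import struct
import sys

# Hypothetical schema: one row per product-page URL, requested = 0 means "not crawled yet".
PENDING_QUERY = "SELECT url FROM rank3_urls WHERE requested = 0 ORDER BY id"


def max_list_length() -> int:
    """Theoretical upper bound on list length for the running interpreter."""
    pointer_size = struct.calcsize("P")  # 4 bytes on 32-bit builds, 8 on 64-bit builds
    return sys.maxsize // pointer_size


def fetch_pending_urls(conn) -> list:
    """Load every pending URL into one list with cursor.fetchall()."""
    cur = conn.cursor()
    cur.execute(PENDING_QUERY)
    rows = cur.fetchall()  # list of 1-tuples, all held in memory at once
    cur.close()
    return [row[0] for row in rows]


def iter_pending_urls(conn, batch_size: int = 1000):
    """Alternative: yield pending URLs batch by batch with cursor.fetchmany()."""
    cur = conn.cursor()
    try:
        cur.execute(PENDING_QUERY)
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:  # an empty batch means the result set is exhausted
                break
            for row in rows:
                yield row[0]
    finally:
        cur.close()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rank3_urls (id INTEGER PRIMARY KEY, url TEXT, requested INTEGER)"
    )
    conn.executemany(
        "INSERT INTO rank3_urls (url, requested) VALUES (?, ?)",
        [("https://example.com/p/1", 0), ("https://example.com/p/2", 1)],
    )
    print(max_list_length())
    print(fetch_pending_urls(conn))
```

With a real MySQL driver the connection object would come from that driver instead of `sqlite3.connect`; the query itself takes no parameters, so the differing placeholder styles between drivers do not matter here.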

step2. Info crawling.

step3. Comment crawling.
