simspider - Net Spider Engine

1.Overview

Simspider is a lightweight cross-platform web spider engine. It provides a set of C function interfaces for quickly building your own spider application, and it also ships an executable spider program that demonstrates how to use the function library.

Simspider depends on only one third-party library: libcurl.

Supported platforms:
* UNIX/Linux
* WINDOWS

The simspider function interface is very easy to use; the main flow is as follows:
* Create a spider engine environment
* Configure the spider engine environment
* Crawl all pages recursively from an entrance url
* Destroy the spider engine environment

There are a number of options for customizing your spider engine environment, as follows:
* Set the request queue size
* Set the file extension names of interest
* Allow or forbid empty file extension names
* Allow or forbid crawling outside of the current web site
* Set the maximum recursion depth
* Set the HTTPS certificate file name
* Set the interval between two crawls
* Set the maximum number of concurrent crawls

The simspider engine implements a flexible processing framework and exposes several callback function pointers, so that spider application designers can add their own processing logic at any point, as follows:
* On building the HTTP request header
* On building the HTTP request body ( POST data etc. )
* After fetching the HTTP response header
* After fetching the HTTP response body ( HTML data etc. )
* Reviewing the done queue after the crawl

2.My first spider

Building a crawler with the simspider engine library is fairly easy. Here is a simple example:

[code]
#include "libsimspider.h"

int main()
{
	struct SimSpiderEnv	*penv = NULL ;
	int			nret = 0 ;

	nret = InitSimSpiderEnv( & penv , NULL ) ;
	if( nret )
	{
		printf( "InitSimSpiderEnv failed[%d]\n" , nret );
		return 1;
	}

	nret = SimSpiderGo( penv , "" , "http://www.curl.haxx.se/" ) ;
	if( nret )
	{
		printf( "SimSpiderGo failed[%d]\n" , nret );
		return 1;
	}

	CleanSimSpiderEnv( & penv );

	return 0;
}
[/code]

The sample itself isn't very useful, because it doesn't output any data, but it does crawl all web pages reachable from the entrance url.

Think crawling is too slow? Enable concurrent crawling ( 5 concurrent transfers here ):

[code]
SetMaxConcurrentCount( penv , 5 );
[/code]

Want to add some content to the HTTP request header to disguise your spider? Write a callback function and set it into the crawler engine environment:

[code]
funcRequestHeaderProc RequestHeaderProc ;
int RequestHeaderProc( struct DoneQueueUnit *pdqu )
{
	CURL			*curl = NULL ;
	struct curl_slist	*header_list = NULL ;
	char			buffer[ 1024 + 1 ] ;

	curl = GetDoneQueueUnitCurl( pdqu ) ;

	header_list = curl_slist_append( header_list , "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0" ) ;
	if( GetDoneQueueUnitRefererUrl(pdqu) )
	{
		memset( buffer , 0x00 , sizeof(buffer) );
		SNPRINTF( buffer , sizeof(buffer) , "Referer: %s" , GetDoneQueueUnitRefererUrl(pdqu) );
		header_list = curl_slist_append( header_list , buffer ) ;
	}
	curl_easy_setopt( curl , CURLOPT_HTTPHEADER , header_list );
	FreeCurlList1Later( pdqu , header_list );

	return 0;
}

InitSimSpiderEnv
...
SetRequestHeaderProc( penv , & RequestHeaderProc );
...
SimSpiderGo
[/code]

Want to output the url currently being crawled to the screen in real time? Write a callback function and set it into the crawler engine environment:

[code]
funcResponseBodyProc ResponseBodyProc ;
int ResponseBodyProc( struct DoneQueueUnit *pdqu )
{
	printf( ">>> [%s]\n" , GetDoneQueueUnitUrl(pdqu) );
	return 0;
}

InitSimSpiderEnv
...
SetResponseBodyProc( penv , & ResponseBodyProc );
...
SimSpiderGo
[/code]

Want to output all urls after the whole site has been crawled? Write a callback function and set it into the crawler engine environment:

[code]
funcTravelDoneQueueProc TravelDoneQueueProc ;
void TravelDoneQueueProc( char *key , void *value , long value_len , void *pv )
{
	struct DoneQueueUnit	*pdqu = (struct DoneQueueUnit *)value ;

	printf( "[%5d] [%2ld] [%s] [%s]\n" , GetDoneQueueUnitStatus(pdqu) , GetDoneQueueUnitRecursiveDepth(pdqu) , GetDoneQueueUnitRefererUrl(pdqu) , GetDoneQueueUnitUrl(pdqu) );

	return;
}

InitSimSpiderEnv
...
SetTravelDoneQueueProc( penv , & TravelDoneQueueProc );
...
SimSpiderGo
[/code]

Easy, isn't it? A complete example is included in the installation package as src/simspider.c

3.Customize the spider engine environment

* Set the request queue size
The spider environment needs two queues: the request queue and the done queue. The request queue size defaults to SIMSPIDER_DEFAULT_REQUESTQUEUE_SIZE; if you want to adjust it, call the function ResizeRequestQueue.

* Set the file extension names of interest
While crawling a site recursively, the spider engine automatically ignores file extensions it is not interested in. The default set is SIMSPIDER_DEFAULT_VALIDFILENAMEEXTENSION; call the function SetValidFileExtnameSet to customize it.

* Allow or forbid empty file extension names
Call the function AllowEmptyFileExtname to enable or disable it.

* Allow or forbid crawling outside of the current web site
Call the function AllowRunOutofWebsite to enable or disable it.

* Set the maximum recursion depth
Call the function SetMaxRecursiveDepth to set the recursion depth.

* Set the HTTPS certificate file name
Call the function SetCertificateFilename to set the certificate file name.

* Set the interval between two crawls
The spider sleeps between two page fetches; call the function SetRequestDelay, in units of seconds.

* Set the maximum number of concurrent crawls
Serial crawling too slow? Use multiplexed parallel crawling: call the function SetMaxConcurrentCount to set the degree of concurrency. When set to SIMSPIDER_CONCURRENTCOUNT_AUTO, it adjusts automatically. (experimental)

A combined sketch of these options follows this list.
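To make the options above concrete, here is a minimal sketch that sets several of them on one engine environment before crawling. Only SetMaxConcurrentCount's usage appears earlier on this page; the parameter types and values passed to the other setters are assumptions inferred from their names and defaults, so check libsimspider.h for the authoritative prototypes.

[code]
#include "libsimspider.h"

int main()
{
	struct SimSpiderEnv	*penv = NULL ;
	int			nret = 0 ;

	nret = InitSimSpiderEnv( & penv , NULL ) ;
	if( nret )
	{
		printf( "InitSimSpiderEnv failed[%d]\n" , nret );
		return 1;
	}

	/* all signatures below except SetMaxConcurrentCount are assumed from the names above */
	ResizeRequestQueue( penv , 10000 ) ;			/* enlarge the request queue */
	SetValidFileExtnameSet( penv , ".html .htm" ) ;		/* assumed format: space-separated extension set */
	AllowEmptyFileExtname( penv , 1 ) ;			/* also crawl urls that have no file extension */
	AllowRunOutofWebsite( penv , 0 ) ;			/* stay inside the entrance web site */
	SetMaxRecursiveDepth( penv , 4 ) ;			/* stop recursing below depth 4 */
	SetCertificateFilename( penv , "spider.pem" ) ;		/* hypothetical certificate file for HTTPS */
	SetRequestDelay( penv , 1 ) ;				/* sleep 1 second between two fetches */
	SetMaxConcurrentCount( penv , 5 ) ;			/* 5 concurrent transfers */

	nret = SimSpiderGo( penv , "" , "http://localhost:80/" ) ;
	if( nret )
	{
		printf( "SimSpiderGo failed[%d]\n" , nret );
		return 1;
	}

	CleanSimSpiderEnv( & penv );

	return 0;
}
[/code]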
4.Callback functions

* On building the HTTP request header
To add some information while the HTTP request header is being built, write a callback function and set it into the crawler engine environment.

* On building the HTTP request body
To add some information while the HTTP request body ( POST data etc. ) is being built, write a callback function and set it into the crawler engine environment.

* After fetching the HTTP response header
To read information after the HTTP response header has been fetched, write a callback function and set it into the crawler engine environment ( see the sketch after this list ).

* After fetching the HTTP response body
To read information after the HTTP response body has been fetched, write a callback function and set it into the crawler engine environment.

* Reviewing the done queue after the crawl
Convenient for saving crawl results, for example updating them into a database.
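The request-header hook is demonstrated in section 2 by SetRequestHeaderProc, and the response-body and done-queue hooks by SetResponseBodyProc and SetTravelDoneQueueProc, but the response-header hook has no example on this page. Below is a minimal sketch of what one could look like. Note that the type name funcResponseHeaderProc and the setter SetResponseHeaderProc are assumptions inferred by symmetry with the setters above, not confirmed API; verify them against libsimspider.h before use.

[code]
/* ASSUMED NAMES: funcResponseHeaderProc and SetResponseHeaderProc are
   inferred by symmetry with SetRequestHeaderProc / SetResponseBodyProc */
funcResponseHeaderProc ResponseHeaderProc ;
int ResponseHeaderProc( struct DoneQueueUnit *pdqu )
{
	CURL	*curl = NULL ;
	long	http_code = 0 ;

	/* the CURL easy handle of this transfer is reachable from the queue unit */
	curl = GetDoneQueueUnitCurl( pdqu ) ;

	/* standard libcurl call: read the HTTP status code parsed from the response header */
	curl_easy_getinfo( curl , CURLINFO_RESPONSE_CODE , & http_code );
	printf( "<<< [%ld] [%s]\n" , http_code , GetDoneQueueUnitUrl(pdqu) );

	return 0;
}

InitSimSpiderEnv
...
SetResponseHeaderProc( penv , & ResponseHeaderProc );	/* assumed setter name */
...
SimSpiderGo
[/code]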
5.Experience

* How to output the engine internal log
Set the log file name when initializing the spider engine environment: InitSimSpiderEnv( struct SimSpiderEnv **ppenv , char *log_file_format , ... ).

* How to preset multiple entrance websites
The function SimSpiderGo sets only one entrance. If there is more than one entry portal, use the function AppendRequestQueue to preset them into the spider engine environment, and then call SimSpiderGo(penv,NULL). A combined sketch of this and the logging tip follows this list.

* If you create a curl list ( struct curl_slist * ) in the request-header building callback function, it must be released later
Call one of the functions FreeCurlList1Later ~ FreeCurlList3Later to store the list into the spider engine, which will release it at the end.
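A minimal sketch combining the first two tips above: engine logging via the printf-style log_file_format argument of InitSimSpiderEnv, and several entrances preset with AppendRequestQueue. The exact argument list of AppendRequestQueue ( referer url , url , recursive depth ) is an assumption, the second entrance host is only an example, and the NULL-url form of the final SimSpiderGo call is shown with the three-argument shape used earlier on this page.

[code]
#include <unistd.h>		/* for getpid() , UNIX/Linux */
#include "libsimspider.h"

int main()
{
	struct SimSpiderEnv	*penv = NULL ;
	int			nret = 0 ;

	/* the second argument is a printf-style log file name format ;
	   "simspider-%d.log" with the process id is only an example */
	nret = InitSimSpiderEnv( & penv , "simspider-%d.log" , getpid() ) ;
	if( nret )
	{
		printf( "InitSimSpiderEnv failed[%d]\n" , nret );
		return 1;
	}

	/* ASSUMED SIGNATURE: preset several entrance urls ( referer url , url , recursive depth ) */
	AppendRequestQueue( penv , "" , "http://192.168.6.79/" , 1 ) ;
	AppendRequestQueue( penv , "" , "http://192.168.6.80/" , 1 ) ;

	/* the tip above abbreviates this as SimSpiderGo(penv,NULL) ;
	   a NULL entrance url is assumed to mean "start from the preset queue" */
	nret = SimSpiderGo( penv , NULL , NULL ) ;
	if( nret )
	{
		printf( "SimSpiderGo failed[%d]\n" , nret );
		return 1;
	}

	CleanSimSpiderEnv( & penv );

	return 0;
}
[/code]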
6.The spider running demo

demo source : src/simspider.c

[code]
$ time ./simspider 192.168.6.79 5
>>> [http://192.168.6.79/]
>>> [http://192.168.6.79/curl-config.html]
>>> [http://192.168.6.79/TheArtOfHttpScripting]
>>> [http://192.168.6.79/libcurl/index.html]
>>> [http://192.168.6.79/index.html]
>>> [http://192.168.6.79/libcurl/libcurl.html]
>>> [http://192.168.6.79/libcurl/libcurl-easy.html]
>>> [http://192.168.6.79/libcurl/libcurl-multi.html]
>>> [http://192.168.6.79/libcurl/libcurl-share.html]
>>> [http://192.168.6.79/libcurl/libcurl-errors.html]
>>> [http://192.168.6.79/curl.html]
>>> [http://192.168.6.79/libcurl/curl_easy_cleanup.html]
>>> [http://192.168.6.79/libcurl/curl_easy_duphandle.html]
>>> [http://192.168.6.79/libcurl/curl_easy_escape.html]
>>> [http://192.168.6.79/libcurl/curl_easy_getinfo.html]
>>> [http://192.168.6.79/libcurl/curl_easy_init.html]
>>> [http://192.168.6.79/libcurl/curl_easy_pause.html]
>>> [http://192.168.6.79/libcurl/curl_easy_perform.html]
>>> [http://192.168.6.79/libcurl/curl_easy_recv.html]
>>> [http://192.168.6.79/libcurl/curl_easy_reset.html]
>>> [http://192.168.6.79/libcurl/curl_easy_strerror.html]
>>> [http://192.168.6.79/libcurl/curl_easy_unescape.html]
>>> [http://192.168.6.79/libcurl/curl_escape.html]
>>> [http://192.168.6.79/libcurl/curl_formadd.html]
>>> [http://192.168.6.79/libcurl/curl_formfree.html]
>>> [http://192.168.6.79/libcurl/curl_formget.html]
>>> [http://192.168.6.79/libcurl/curl_free.html]
>>> [http://192.168.6.79/libcurl/curl_getenv.html]
>>> [http://192.168.6.79/libcurl/curl_easy_send.html]
>>> [http://192.168.6.79/libcurl/curl_global_cleanup.html]
>>> [http://192.168.6.79/libcurl/curl_global_init.html]
>>> [http://192.168.6.79/libcurl/curl_global_init_mem.html]
>>> [http://192.168.6.79/libcurl/curl_mprintf.html]
>>> [http://192.168.6.79/libcurl/curl_multi_add_handle.html]
>>> [http://192.168.6.79/libcurl/curl_multi_assign.html]
>>> [http://192.168.6.79/libcurl/curl_multi_cleanup.html]
>>> [http://192.168.6.79/libcurl/curl_multi_fdset.html]
>>> [http://192.168.6.79/libcurl/curl_getdate.html]
>>> [http://192.168.6.79/libcurl/curl_multi_info_read.html]
>>> [http://192.168.6.79/libcurl/curl_multi_init.html]
>>> [http://192.168.6.79/libcurl/curl_multi_perform.html]
>>> [http://192.168.6.79/libcurl/curl_multi_remove_handle.html]
>>> [http://192.168.6.79/libcurl/curl_multi_setopt.html]
>>> [http://192.168.6.79/libcurl/curl_multi_socket.html]
>>> [http://192.168.6.79/libcurl/curl_multi_socket_action.html]
>>> [http://192.168.6.79/libcurl/curl_multi_strerror.html]
>>> [http://192.168.6.79/libcurl/curl_multi_timeout.html]
>>> [http://192.168.6.79/libcurl/curl_share_cleanup.html]
>>> [http://192.168.6.79/libcurl/curl_share_init.html]
>>> [http://192.168.6.79/libcurl/curl_share_setopt.html]
>>> [http://192.168.6.79/libcurl/curl_share_strerror.html]
>>> [http://192.168.6.79/libcurl/curl_slist_append.html]
>>> [http://192.168.6.79/libcurl/curl_slist_free_all.html]
>>> [http://192.168.6.79/libcurl/curl_strequal.html]
>>> [http://192.168.6.79/libcurl/curl_unescape.html]
>>> [http://192.168.6.79/libcurl/curl_version.html]
>>> [http://192.168.6.79/libcurl/curl_version_info.html]
>>> [http://192.168.6.79/libcurl/]
>>> [http://192.168.6.79/libcurl/libcurl-tutorial.html]
>>> [http://192.168.6.79/libcurl/curl_easy_setopt.html]
[ 200] [ 1] [] [http://192.168.6.79/]
[ 200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/TheArtOfHttpScripting]
[ 200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/curl-config.html]
[ 200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/curl.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/index.html]
[ 200] [ 4] [http://192.168.6.79/libcurl/curl_easy_getinfo.html] [http://192.168.6.79/libcurl/]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_cleanup.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_duphandle.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_escape.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_getinfo.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_init.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_pause.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_perform.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_recv.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_reset.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_send.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_setopt.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_strerror.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_unescape.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_escape.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_formadd.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_formfree.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_formget.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_free.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_getdate.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_getenv.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_global_cleanup.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_global_init.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_global_init_mem.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_mprintf.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_add_handle.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_assign.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_cleanup.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_fdset.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_info_read.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_init.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_perform.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_remove_handle.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_setopt.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_socket.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_socket_action.html]
[ 404] [ 4] [http://192.168.6.79/libcurl/curl_multi_socket.html] [http://192.168.6.79/libcurl/curl_multi_socket_all.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_strerror.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_timeout.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_cleanup.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_init.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_setopt.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_strerror.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_slist_append.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_slist_free_all.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_strequal.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_unescape.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_version.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_version_info.html]
[ 200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/libcurl/index.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-easy.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-errors.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-multi.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-share.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-tutorial.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl.html]

real	0m0.452s
user	0m0.062s
sys	0m0.360s
[/code]

7.Finally

It's time to download it and enjoy yourself. You are welcome to contact me ^_^

Project Homepage : https://github.com/calvinwilliams/simspider
Author Email : calvinwilliams.c@gmail.com