At http://www.warez-dnb.com/ I'm working on a bit of Ruby code that, when finished, will make 20-second samples of every single post.
I decided the simplest way to do this would be to have a bot that scrapes the website for Rapidshare links, downloads the files, extracts them, makes the sample and then uploads it to our FTP server. (I did say simplest, not most elegant.)
Once a file has been uploaded, this PHP script tests whether the samples are available yet: http://www.whysoscared.com/php-function-check-if-a-file-or-url-exists
To make the Ruby script I needed something that could not only screen-scrape the webpage but also log in. I was using hpricot, but then we decided to make Rapidshare links on Warez-DnB show up only for registered users, so I switched to Scrubyt.
require 'rubygems'
require 'scrubyt'

# only the following parts should need editing
baseurl  = "http://www.warez-dnb.com/"
username = "User"
password = "Password"

data = Scrubyt::Extractor.define do
  # log in through the forum's login form
  fetch "#{baseurl}?action=login"
  fill_textfield 'user', username
  fill_textfield 'passwrd', password
  submit

  # fetch a topic and match its code blocks
  fetch "#{baseurl}?topic=672"
  link '//div/code'
end
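This is the script that will actually scan each page. Using Scrubyt is super simple: first it fetches the login page at Warez-DnB, fills in the username and password fields and submits the form. Now it's logged in. It then fetches a topic page and matches the code blocks on it with the XPath //div/code.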
After this, the plan is to use a loop driven by the RSS feed to check whether each post is in the listings category and, if so, extract the Rapidshare links.
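Here is a rough sketch of how that loop might look. It is not the finished script, and several things in it are assumptions: the FeedItem struct is a hypothetical stand-in for whatever an RSS parser returns, the category name 'Listings', the second topic URL, the sample file link and the Rapidshare URL pattern are all made up for illustration, and the scraped text is hard-coded where the real bot would fetch the page with the logged-in scraper.

```ruby
# Hypothetical stand-in for a parsed RSS item; a real script would build
# these from the forum's feed with an RSS parser.
FeedItem = Struct.new(:title, :link, :categories)

# Assumed pattern for Rapidshare file URLs; adjust to the links actually posted.
RAPIDSHARE_LINK = %r{https?://(?:www\.)?rapidshare\.com/files/[^\s"'<>]+}i

# Keep only the feed items filed under the given category (case-insensitive).
def listings_only(items, category = 'Listings')
  items.select do |item|
    item.categories.any? { |c| c.casecmp?(category) }
  end
end

# Pull every distinct Rapidshare URL out of a block of scraped text.
def rapidshare_links(text)
  text.scan(RAPIDSHARE_LINK).uniq
end

# Made-up feed contents, just to exercise the loop.
items = [
  FeedItem.new('Some release', 'http://www.warez-dnb.com/?topic=672', ['Listings']),
  FeedItem.new('Site news', 'http://www.warez-dnb.com/?topic=700', ['News'])
]

listings_only(items).each do |item|
  # Placeholder text; the real bot would scrape item.link while logged in.
  scraped = 'http://rapidshare.com/files/123456/example.rar'
  puts "#{item.title}: #{rapidshare_links(scraped).join(', ')}"
end
```

Keeping the category filter and the link extraction as two small methods means either one can be swapped out later, for example if the feed uses a different category label or the links move to another host.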
It's in fairly simple stages at the moment, but I'll post the majority of the script when it's completed.