Backing up all discussions in a Facebook group with Perl

The Facebook group named “A Consortium of Pub-Going, Loose and Forward Women” has been hacked more than six times in the last week alone. You must have heard of them; they're the Pink Chaddi girls. :p

This script runs as a cron job on my computer and backs up the group every hour (for this group, with about 147 discussions, a run takes about 20 minutes, but then, my ISP sucks). Change the first URL in the script and it should work on any group whose discussion board can be read without logging in. The script fetches the first discussion-list page of the group, follows the page-number links to collect every discussion, then visits each discussion, compiles all its posts one after the other and dumps them into text files in a folder. Finally it zips the folder and emails the zip to a specified address.
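The page-walking part of that description is a small worklist crawl. Here is a minimal sketch of the idea on its own, not the script's actual code: `crawl` and the `$get_links` callback are names I made up for illustration, and the callback stands in for "fetch a page and pull the pager links out of it", so the example runs without touching Facebook. It also keeps the seen links in a hash instead of scanning two arrays.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Walk every page reachable from $first. $get_links is a hypothetical
# callback: given a page link, it returns the links found on that page.
sub crawl {
    my ($first, $get_links) = @_;
    my %seen    = ($first => 1);   # every link ever queued
    my @tonav   = ($first);        # links still to visit
    my @visited = ();              # links visited, in visiting order
    while (@tonav) {
        my $page = pop @tonav;
        push @visited, $page;
        for my $link ($get_links->($page)) {
            next if $seen{$link}++;   # already queued or visited
            push @tonav, $link;
        }
    }
    return @visited;
}

# Fake three-page pager: every page links to every other page.
my %pager = (
    'p1' => ['p2', 'p3'],
    'p2' => ['p1', 'p3'],
    'p3' => ['p1', 'p2'],
);
my @pages = crawl('p1', sub { @{ $pager{ $_[0] } || [] } });
print scalar(@pages), " pages visited: @pages\n";
```

In a real run the callback would do the HTTP request and the regex match; the loop itself stays the same, and each page is fetched only once however many pages link to it.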

To send the email, you need a mail provider that gives you SMTP access. The script authenticates itself with the username and password you set at the top.

You need these dependencies for it to work.

libarchive-zip-perl
Net_SSLeay.pm (OpenSSL)
IO-Socket-SSL
Authen-SASL
Net::SMTP::Multipart

Just search for them with your favorite package manager, or use cpan to install them. For my first Perl script, I am pretty happy with it. :) You can take this script, use it yourself and make any modifications you want. If you do make any improvements, consider posting them back in the comments so I can use them too.

And oh, keep in mind that Facebook officially isn't too happy with you taking content off their site through scripts.

#! /usr/bin/perl
# The following dependencies are required
# libarchive-zip-perl
# Net_SSLeay.pm (OpenSSL)
# IO-Socket-SSL
# Authen-SASL
# Net::SMTP::Multipart; had to install this from cpan on ubuntu

require LWP::UserAgent;
use LWP::Simple;
use HTML::Entities; #used to decode HTML entities in topics and posts
#customize here ################################
#the gmx mail service is used for sending mail.
$backuprecepient = 'admin@bkpman.com'; #to whom the email should be sent
$smtpuser='fbbackup@maildomain.com';
$smtppassword='P4ssw0rdH3r3';
$firstlink='http://www.facebook.com/l.php?u=http%3A%2F%2Fwww.facebook.com%2Fboard.php%3Fuid%3D49641698651&h=9414c6f6348d4f61fe31bf4f46cf9da1'; #the script expects the link to the first page of the discussions. Login *must* not be required.

################################################
$activitylog="";
$activity="";
$ua=LWP::UserAgent->new;
$ua->agent('Mozilla/5.0');
# we have two lists: 'tonav' holds the links that still need to be navigated to,
# and 'hasnav' holds the links that have already been navigated.
# on each navigation, we get all page-number links and check whether they exist in 'tonav' or 'hasnav'; if not, we add them to 'tonav'. The current page goes into 'hasnav'.
# the loop continues as long as there are links in tonav.
@hasnav=();
@tonav=($firstlink);
@discussionlinks=();
while (scalar(@tonav)>0) {
    $discussionlistlink=pop(@tonav);
    push(@hasnav, $discussionlistlink);
    $request=HTTP::Request->new('GET', $discussionlistlink);
    $response=$ua->request($request);
    my($maincontent)=$response->content;
    $maincontent=~ m/<li class="current">(.*?)<\/li><\/ul><\/div><\/div>/;
    $pagenossource= $1;
    #the hrefs contain &amp; instead of &, so strip the 'amp;'. ugly hack, there must be a better way. firefox seems to interpret it properly though.
    $pagenossource=~ s/amp;//gi;
    @links=$pagenossource=~ m/<a href="(.*?)">\d/gis;
    #we find the links to the further discussion list pages here
    for $link (@links) {
        $disclinkexists=0;
        for $tonavitem (@tonav) {
            if ($link eq $tonavitem) { $disclinkexists=1; }
        }
        for $hasnavitem (@hasnav) {
            if ($link eq $hasnavitem) { $disclinkexists=1; }
        }
        if ($disclinkexists==0) { push(@tonav,$link); }
    }
    #we get the actual discussion links here
    @links=$maincontent=~ m/<h2 class="topic_title datawrap"><a href="(.*?)">/gis;
    for $link (@links) {
        $disclinkexists=0;
        for $discitem (@discussionlinks) {
            if ($link eq $discitem) { $disclinkexists=1; }
        }
        if ($disclinkexists==0) {
            my($linkreg)=$link;
            $linkreg=~ s/amp;//gi;
            push(@discussionlinks, $linkreg);
        }
    }
}
@timeData = localtime(time);
$directoryname="fbbackup".join('', @timeData);
mkdir "$directoryname", 0770 unless -d "$directoryname";
$discussioncount=0;
for $discussionlink (@discussionlinks)
{
    $discussioncount++;
    #for each link, go to the discussion page, follow each page and find the page-number links. In the end, hasnav holds the list of pages of this discussion
    @hasnav=();
    @tonav=($discussionlink);
    @discussionpages=();
    @links=();
    $topic = "";
    while (scalar(@tonav)>0) {
        $discussionpage=pop(@tonav);
        push(@hasnav, $discussionpage);
        $request=HTTP::Request->new('GET', $discussionpage);
        $response=$ua->request($request);
        my($maincontent)=$response->content;
        if ($topic eq "") {
            $maincontent=~ m/<h2>Topic: <span>(.*?)<\/span>/gis;
            my($topicwrapper)=$1;
            $topicwrapper=~ m/>(.*?)</gis;
            $topic=$1;
        }
        $maincontent=~ m/<div class="pagerpro_container"><ul class="pagerpro">(.*?)<\/div><\/div>/gis;
        $pagenossource= $1;
        #the hrefs contain &amp; instead of &, so strip the 'amp;'. ugly hack, there must be a better way. firefox seems to interpret it properly though.
        $pagenossource=~ s/amp;//gi;
        @links=$pagenossource=~ m/<a href="(.*?)" onclick/gis;
        #we find the links to the further pages of this discussion here
        for $link (@links) {
            $disclinkexists=0;
            for $tonavitem (@tonav) {
                if ($link eq $tonavitem) { $disclinkexists=1; }
            }
            for $hasnavitem (@hasnav) {
                if ($link eq $hasnavitem) { $disclinkexists=1; }
            }
            if ($disclinkexists==0) { push(@tonav,$link); }
        }
    }
    if ($topic)
    {
        $topic=HTML::Entities::decode_entities($topic);
        #every run of characters other than letters and digits becomes a single underscore
        (my $filename = $topic)=~ tr/a-zA-Z0-9/_/cs;
        $filename=$filename.".txt";
        $activity="=========================================\n$discussioncount of ". scalar(@discussionlinks) . "\nTopic: ".$topic."\n";
        print $activity;
        $activitylog=$activitylog.$activity;
        @hasnav=reverse(@hasnav);
        if (scalar(@hasnav)>1) { pop(@hasnav); }
        $postcount=0;
        $filecontent="";
        $activity= scalar(@hasnav) . " pages in discussion.\n";
        print $activity;
        $activitylog=$activitylog.$activity;
        for $hasnavitem (@hasnav)
        {
            $activity= $hasnavitem." initiated.";
            print $activity;
            $activitylog=$activitylog.$activity;
            $request2=HTTP::Request->new('GET', $hasnavitem);
            $response2=$ua->request($request2);
            my($discussioncontent)=$response2->content;
            $done=0;
            $discussioncontent=~ m/<div id="all_threads">(.*?)<\/div><\/div><\/div><div class="UIWashFrame_SidebarAds">/gis;
            $discussionpageallthreads=$1;
            @posts=$discussionpageallthreads =~ m/<div class="post_index">(.*?)<ul class="actionspro">/gis;
            for $post (@posts)
            {
                $postcount++;
                my($postreg)=$post;
                $postreg=~ m/Post #(.*?)</gis;
                $postindex=$1;
                $postreg=~ m/<span class="author_header"><strong>(.*?)<\/strong>/gis;
                $author=$1;
                $postreg=~ m/timestamp">(.*?)<\/span>/gis;
                $timestamp=$1;
                $postreg=~ m/<div class="post_message">(.*?)<\/div>/gis;
                $postmessage=$1;
                $postmessage=~ s/<br \/>/\n/gi;   #line breaks become newlines
                $postmessage=~ s/<.*?>//gi;       #strip the remaining tags
                $postmessage=~ s/ +/ /g;          #collapse runs of spaces
                $postmessage=HTML::Entities::decode_entities($postmessage);
                $filecontent=$filecontent ."Post #" . $postindex . " by " . $author . " (" . $timestamp . ")\n";
                $filecontent=$filecontent . $postmessage . "\n";
                $filecontent=$filecontent . "-------------------------------------------------------------------------\n";
            }
            $activity = "...completed\n";
            print $activity;
            $activitylog=$activitylog.$activity;
        }
        $filecontent="Topic: " . $topic . "\n" . $postcount . " posts in discussion \n====================================\n" . $filecontent;
        open (DISCNPAGE, ">$directoryname/$filename") or die "could not write $directoryname/$filename: $!\n";
        print DISCNPAGE $filecontent;
        close(DISCNPAGE);
    }
}

# Create a Zip file

use Archive::Zip qw( :ERROR_CODES :CONSTANTS );
my $zip = Archive::Zip->new();

# Add a directory
my $dir_member = $zip->addTree( "$directoryname/", "$directoryname/" );

# Save the Zip file
unless ( $zip->writeToFileNamed("$directoryname.zip") == AZ_OK ) {
    die "unable to save the zip file; the files are probably backed up in directory $directoryname\n";
}
#delete the backup folder
use File::Path;
$delA = "$directoryname";
rmtree([$delA]);

use Net::SMTP::Multipart;
my $to = $backuprecepient;
my $subject = "Backup $directoryname";
my $body = "Backup zip file attached.\n -------------------------\n\n$activitylog";

my $from = $smtpuser;
my $password = $smtppassword;
my $smtp;

if (not $smtp = Net::SMTP::Multipart->new('mail.gmx.com',
        Port => 25,
        Debug => 0)) {
    die "Could not connect to server\n";
}

$smtp->auth($from, $password)
    || die "Authentication failed!\n";
$smtp->Header(To=>$to, Subj=>$subject, From=>"$from");
$smtp->Text($body);
$smtp->FileAttach("$directoryname.zip");
$smtp->End();

unlink("$directoryname.zip");
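
Two small details in the script are easy to get wrong: joining the raw localtime list gives a folder name that starts with the seconds and so doesn't sort by date, and topic titles have to be cleaned up before they can be used as file names. Below is a rough sketch of both, assuming two helper names of my own (`backup_dirname` and `topic_filename`, neither is in the script) and a made-up date format; `strftime` comes from the core POSIX module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# Build a sortable folder name from an epoch time.
# The fbbackup-YYYYMMDD-HHMMSS layout is an assumption, not the script's format.
sub backup_dirname {
    my ($epoch) = @_;
    return strftime('fbbackup-%Y%m%d-%H%M%S', localtime($epoch));
}

# Turn a topic title into a file name: every run of characters
# other than a-z, A-Z and 0-9 becomes a single underscore.
sub topic_filename {
    my ($topic) = @_;
    (my $name = $topic) =~ tr/a-zA-Z0-9/_/cs;
    return $name . '.txt';
}

print backup_dirname(time), "\n";
print topic_filename('Pubs & more: where next?'), "\n";
```

If you want the folders to list in date order, `backup_dirname(time)` could stand in for the `join` over `@timeData`; the rest of the script only needs `$directoryname` to be a valid folder name.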

Aha! I just found that there is a radio button 'Show HTML literally' below the Blogger edit box. Now I can post my code directly into Blogger. :)
