Backing up all discussions in a Facebook group with Perl

The Facebook group named “A Consortium of Pub-Going, Loose and Forward Women” has been hacked more than six times in the last week alone. You must have heard of them: they’re the Pink Chaddi girls. :p

This script runs as a cron job on my computer and backs up the group every hour (for this group, with about 147 discussions, a run takes about 20 minutes, but then, my ISP sucks). Change the first URL in the script and it should work on any group. The script fetches the first discussion list page of the group, follows the page-number links to collect every discussion, then visits each discussion, compiles all its posts one after the other and dumps them into a text file in a folder. Finally it zips the folder and emails the zip to a specified email address.

To send the email, you need a mail provider that gives you SMTP access. The script authenticates itself with the username and password you configure.

You need these dependencies for it to work.

libarchive-zip-perl
Net_SSLeay.pm (OpenSSL)
IO-Socket-SSL
Authen-SASL
Net::SMTP::Multipart

Just search for them with your favorite package manager or use CPAN to install them. For my first Perl script, I am pretty happy with it. :) You can take this script and use it for yourself, and make any modifications you want. If you do make any improvements, consider posting them back in the comments, so I can use them too.

And oh, keep in mind that Facebook officially isn’t too happy with you taking content off their site through scripts.

#!/usr/bin/perl
# The following dependencies are required
# libarchive-zip-perl
# Net_SSLeay.pm (OpenSSL)
# IO-Socket-SSL
# Authen-SASL
# Net::SMTP::Multipart; had to install this from cpan on ubuntu

use LWP::UserAgent;
use HTTP::Request;
use HTML::Entities;   # loaded explicitly; the script decodes entities further down
#customize here ################################
#the gmx mail service is used for sending mail.
$backuprecepient = 'admin@bkpman.com'; #to whom the email should be sent
$smtpuser='fbbackup@maildomain.com';
$smtppassword='P4ssw0rdH3r3';
$firstlink='http://www.facebook.com/l.php?u=http%3A%2F%2Fwww.facebook.com%2Fboard.php%3Fuid%3D49641698651&h=9414c6f6348d4f61fe31bf4f46cf9da1'; #the script expects the link to the first page of the discussions. Login *must* not be required.

################################################
$activitylog="";
$activity="";
$ua=LWP::UserAgent->new;
$ua->agent('Mozilla/5.0');
# we have two lists, one is the list with links that need to be navigated to, say 'tonav'.
# we get the page, then add the page link to the ones already navigated, say 'hasnav'.
# on each navigation, we get all page-number links and check if they exist in 'tonav' or 'hasnav'; if not, add those to 'tonav'. And add the current page to 'hasnav'.
# the loop continues as long as there are links in tonav.
@hasnav=();
@tonav=($firstlink);
@discussionlinks=();
while (scalar(@tonav)>0) {
    $discussionlistlink=pop(@tonav);
    push(@hasnav, $discussionlistlink);
    $request=HTTP::Request->new('GET', $discussionlistlink);
    $response=$ua->request($request);
    my($maincontent)=$response->content;
    # grab the pager markup; fall back to "" so a failed match cannot reuse an old $1
    $pagenossource = ($maincontent =~ m{<li class="current">(.*?)</li></ul></div></div>}) ? $1 : "";
    # the hrefs are HTML-escaped (ampersands carry a trailing "amp;"), so strip that to get usable URLs.
    # ugly hack. must be a better way. firefox seems to interpret it properly though.
    $pagenossource=~ s/amp;//gi;
    @links=$pagenossource=~ m/<a href="(.*?)">\d/gis;
    #we find the further discussion list page links here
    for $link (@links) {
        $disclinkexists=0;
        for $tonavitem (@tonav) {
            if ($link eq $tonavitem) { $disclinkexists=1; }
        }
        for $hasnavitem (@hasnav) {
            if ($link eq $hasnavitem) { $disclinkexists=1; }
        }
        if ($disclinkexists==0) { push(@tonav,$link); }
    }
    #we get the actual discussion links here
    @links=$maincontent=~ m{<h2 class="topic_title datawrap"><a href="(.*?)">}gis;
    for $link (@links) {
        # un-escape first, so the duplicate check compares like with like
        my($linkreg)=$link;
        $linkreg=~ s/amp;//gi;
        $disclinkexists=0;
        for $discitem (@discussionlinks) {
            if ($linkreg eq $discitem) { $disclinkexists=1; }
        }
        if ($disclinkexists==0) { push(@discussionlinks, $linkreg); }
    }
}
@timeData = localtime(time);
# folder name is "fbbackup" followed by all the localtime fields run together
$directoryname="fbbackup".join('', @timeData);
mkdir "$directoryname", 0770 unless -d "$directoryname";
$discussioncount=0;
for $discussionlink (@discussionlinks)
{
    $discussioncount++;
    #for each link, go to the discussion page, follow each page and find the page no. links. Finally hasnav will have the list of discussion pages
    @hasnav=();
    @tonav=($discussionlink);
    @links=();
    $topic = "";
    while (scalar(@tonav)>0) {
        $discussionpage=pop(@tonav);
        push(@hasnav, $discussionpage);
        $request=HTTP::Request->new('GET', $discussionpage);
        $response=$ua->request($request);
        my($maincontent)=$response->content;
        if ($topic eq "")
        {
            # the title is wrapped in another tag; keep only the text between > and <
            my($topicwrapper) = $maincontent=~ m{<h2>Topic: <span>(.*?)</span>}is;
            $topicwrapper = "" unless defined $topicwrapper;
            ($topic) = $topicwrapper=~ m/>(.*?)</is;
            $topic = "" unless defined $topic;
        }
        # grab the pager markup; fall back to "" so a failed match cannot reuse an old $1
        $pagenossource = ($maincontent =~ m{<div class="pagerpro_container"><ul class="pagerpro">(.*?)</div></div>}is) ? $1 : "";
        # the hrefs are HTML-escaped (ampersands carry a trailing "amp;"), so strip that to get usable URLs.
        # ugly hack. must be a better way. firefox seems to interpret it properly though.
        $pagenossource=~ s/amp;//gi;
        @links=$pagenossource=~ m/<a href="(.*?)" onclick/gis;
        #we find the further discussion page links here
        for $link (@links) {
            $disclinkexists=0;
            for $tonavitem (@tonav) {
                if ($link eq $tonavitem) { $disclinkexists=1; }
            }
            for $hasnavitem (@hasnav) {
                if ($link eq $hasnavitem) { $disclinkexists=1; }
            }
            if ($disclinkexists==0) { push(@tonav,$link); }
        }
    }
    if ($topic)
    {
        $topic=HTML::Entities::decode_entities($topic);
        # anything that is not a letter or digit becomes a single underscore
        (my $filename = $topic)=~ tr/a-zA-Z0-9/_/cs;
        $filename=$filename.".txt";
        $activity="=========================================\n$discussioncount of ". scalar(@discussionlinks) . "\nTopic: ".$topic."\n";
        print $activity;
        $activitylog=$activitylog.$activity;
        # reverse the visit order, then drop the last item (the link we started from) when the pager gave us other pages
        @hasnav=reverse(@hasnav);
        if (scalar(@hasnav)>1) { pop(@hasnav); }
        $postcount=0;
        $filecontent="";
        $activity= scalar(@hasnav) . " pages in discussion.\n";
        print $activity;
        $activitylog=$activitylog.$activity;
        for $hasnavitem (@hasnav)
        {
            $activity= $hasnavitem." initiated.";
            print $activity;
            $activitylog=$activitylog.$activity;
            $request2=HTTP::Request->new('GET', $hasnavitem);
            $response2=$ua->request($request2);
            my($discussioncontent)=$response2->content;
            $discussionpageallthreads = ($discussioncontent =~ m{<div id="all_threads">(.*?)</div></div></div><div class="UIWashFrame_SidebarAds">}is) ? $1 : "";
            @posts=$discussionpageallthreads =~ m{<div class="post_index">(.*?)<ul class="actionspro">}gis;
            for $post (@posts)
            {
                $postcount++;
                # list-context matches return the capture, or nothing when the pattern is missing
                ($postindex)   = $post =~ m/Post #(.*?)</is;
                ($author)      = $post =~ m{<span class="author_header"><strong>(.*?)</strong>}is;
                ($timestamp)   = $post =~ m{timestamp">(.*?)</span>}is;
                ($postmessage) = $post =~ m{<div class="post_message">(.*?)</div>}is;
                for ($postindex, $author, $timestamp, $postmessage) { $_ = "" unless defined $_; }
                # line breaks become newlines, remaining tags go, runs of spaces collapse
                $postmessage=~ s{<br />}{\n}gi;
                $postmessage=~ s/<.*?>//g;
                $postmessage=~ s/ {2,}/ /g;
                $postmessage=HTML::Entities::decode_entities($postmessage);
                $filecontent=$filecontent ."Post #" . $postindex . " by " . $author . " (" . $timestamp . ")\n";
                $filecontent=$filecontent . $postmessage . "\n";
                $filecontent=$filecontent . ("-" x 73) . "\n";
            }
            $activity = "...completed\n";
            print $activity;
            $activitylog=$activitylog.$activity;
        }
        $filecontent="Topic: " . $topic . "\n" . $postcount . " posts in discussion \n====================================\n" . $filecontent;
        if (open(my $discnpage, '>', "$directoryname/$filename")) {
            print $discnpage $filecontent;
            close($discnpage);
        } else {
            warn "could not write $directoryname/$filename: $!\n";
        }
    }
}

# Create a Zip file

use Archive::Zip qw( :ERROR_CODES :CONSTANTS );
my $zip = Archive::Zip->new();

# Add the backup directory
unless ( $zip->addTree( "$directoryname/", "$directoryname/" ) == AZ_OK ) {
    die "unable to add $directoryname to the zip file\n";
}

# Save the Zip file
unless ( $zip->writeToFileNamed("$directoryname.zip") == AZ_OK ) {
    die "unable to save the zip file; the files are probably still backed up in directory $directoryname\n";
}
#delete the backup folder
use File::Path;
rmtree($directoryname);

use Net::SMTP::Multipart;
my $to = $backuprecepient;
my $subject = "Backup $directoryname";
my $body = "Backup zip file attached.\n -------------------------\n\n$activitylog";

my $from = $smtpuser;
my $password = $smtppassword;
my $smtp;

if (not $smtp = Net::SMTP::Multipart->new('mail.gmx.com',
        Port => 25,
        Debug => 0)) {
    die "Could not connect to server\n";
}

$smtp->auth($from, $password)
    || die "Authentication failed!\n";
$smtp->Header(To=>$to, Subj=>$subject, From=>"$from");
$smtp->Text($body);
$smtp->FileAttach("$directoryname.zip");
$smtp->End();

unlink("$directoryname.zip");

Aha! I just found that there is a radio button ‘show HTML literally’ below the Blogger edit box. Now I can post my code directly into Blogger. :)
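
The tonav/hasnav bookkeeping in the script checks every new link against both lists, one item at a time. A hash of already-seen links does the same job with a single lookup per link. What follows is only a minimal sketch of that idea, not the code the script runs: %fake_site and links_on_page are made-up stand-ins for the LWP request and the regex match, so the example runs without touching Facebook.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a group: each "page" lists the pager links found on it.
my %fake_site = (
    'page1' => ['page2', 'page3'],
    'page2' => ['page1', 'page3'],
    'page3' => ['page1', 'page2', 'page4'],
    'page4' => ['page3'],
);

# Hypothetical helper: in the real script this would fetch $url and pull out the hrefs.
sub links_on_page {
    my ($url) = @_;
    return @{ $fake_site{$url} || [] };
}

# Visit every page reachable from $start exactly once.
sub crawl {
    my ($start) = @_;
    my %seen  = ($start => 1);   # every link that was ever queued
    my @tonav = ($start);        # links still to visit
    my @visited;                 # links visited, in visit order

    while (@tonav) {
        my $url = pop @tonav;
        push @visited, $url;
        for my $link (links_on_page($url)) {
            next if $seen{$link}++;   # already queued or visited
            push @tonav, $link;
        }
    }
    return @visited;
}

my @pages = crawl('page1');
print join(', ', @pages), "\n";
```

The hash plays the role of tonav and hasnav combined for the "have I met this link before" question, while the array is still used as the work queue.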

Solving the /elgg/action/systemsettings/install not found error in Elgg

While installing Elgg on the Fedora server, I faced this problem when I accessed http://address/elgg

The first installation form came up and on submitting I was presented with the error:

Not Found. The requested URL /elgg/action/systemsettings/install was not found on this server

However, it is a well-documented error and is easy to solve. Edit /etc/httpd/conf/httpd.conf and change the line saying

AllowOverride none 

to

AllowOverride all

in the global section, not in the directory section. You could probably do it per directory too (for just the Elgg directory), but I didn’t bother.

Then edit the .htaccess file in /elgg to point to the correct directory:

RewriteBase /elgg/

(I just needed to uncomment that line.) Now reload the Apache configuration:

/etc/init.d/httpd reload

Works fine now.

Solving Fedora Core 10 troubles: network interface with static IP not brought up automatically

I installed fedora on another old system that was lying around.

Unfortunately, the ASUS board it came with has an Intel graphics bug, because of which the GUI-based Anaconda installer wasn’t displaying correctly. To install in text mode, wait for the first screen that asks you to choose a boot option,
select “Install or upgrade an existing system” and then press the Tab key.
You will see the following line:

vmlinuz initrd=initrd.img

Append the word text to the line so that it reads

vmlinuz initrd=initrd.img text

Hit Enter and the Fedora install should start in text mode.

Most of the setup went smoothly, aside from the following hassles:
1. A normal user account was not created.
2. Fedora will not start up in graphical mode.
3. The static IP I set for my system works, but the ethernet interface is not brought up after boot. So after each boot I have to go to the machine and manually run
ifup eth0
That is not acceptable, since I plan to put this machine in the basement to run as a sync server. I can’t be running down there each time. So let’s solve #3 for now.

So here we go.
First, turn off NetworkManager and turn on the network service:

chkconfig NetworkManager off
chkconfig network on

Then edit the file /etc/sysconfig/network-scripts/ifcfg-eth0 as required. On mine it looks like this:

# Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+
DEVICE=eth0
HWADDR=--snipped--
ONBOOT=yes
BOOTPROTO=none
NETMASK=--snipped--
IPADDR=--snipped--
USERCTL=no
IPV6INIT=no
NM_CONTROLLED=no
GATEWAY=--snipped--
TYPE=Ethernet
DNS1=192.168.1.1

DNS1 and DNS2 can be whatever your DNS addresses are. HWADDR should be your MAC address and is probably already entered. NETMASK, IPADDR and GATEWAY are self-explanatory. ONBOOT and NM_CONTROLLED need to be set as above for the interface to be brought up properly.
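
Since ONBOOT and NM_CONTROLLED are the two settings that matter here, a quick check can be scripted. This is only a rough sketch in Perl under some assumptions: the file is plain KEY=VALUE lines as shown above, and parse_ifcfg and check_ifcfg are names made up for this example. It reads the file and changes nothing.

```perl
use strict;
use warnings;

# Parse KEY=VALUE lines from ifcfg-style text, skipping comments and blank lines.
sub parse_ifcfg {
    my ($text) = @_;
    my %cfg;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*#/ or $line =~ /^\s*$/;
        if ($line =~ /^\s*([A-Za-z0-9_]+)=(.*?)\s*$/) {
            my ($key, $value) = ($1, $2);
            $value =~ s/^(["'])(.*)\1$/$2/;   # drop surrounding quotes, if any
            $cfg{$key} = $value;
        }
    }
    return %cfg;
}

# Return a list of complaints; an empty list means both settings look right.
sub check_ifcfg {
    my (%cfg) = @_;
    my %wanted = (ONBOOT => 'yes', NM_CONTROLLED => 'no');
    my @problems;
    for my $key (sort keys %wanted) {
        my $have = defined $cfg{$key} ? lc $cfg{$key} : '(missing)';
        push @problems, "$key is $have, expected $wanted{$key}"
            unless $have eq $wanted{$key};
    }
    return @problems;
}

# Default path is the one from this post; pass another file as the first argument.
my $path = @ARGV ? $ARGV[0] : '/etc/sysconfig/network-scripts/ifcfg-eth0';
if (open(my $fh, '<', $path)) {
    my $text = do { local $/; <$fh> };   # slurp the whole file
    close($fh);
    $text = '' unless defined $text;
    my @problems = check_ifcfg(parse_ifcfg($text));
    my $report = @problems ? join("\n", @problems) . "\n" : "$path looks fine\n";
    print $report;
} else {
    warn "cannot read $path: $!\n";
}
```

Run it with no arguments on the server, or point it at a copy of the file to test a change before rebooting.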

A shutdown -r now works fine, and I am able to ssh back in within seconds of the reboot.
The next post will be on creating the new user account.

Old/New Fedora 10 Server

There was an old machine lying around in the basement which no one wanted. I had my eyes on it for a long time, but it had only about 64 MB of RAM. Finally I begged for some RAM and got it from Binoy, and installed Fedora Core 10 on it the same day. I had a lot of webapps and servers lying all over my network and wanted to put all my personal and experimental stuff on one system.

Just the default install, then httpd, trac, subversion and glpi and it is still working like a charm. I backed up all data from my office servers and moved it there.

There were, however, a few issues that needed a quick sorting out. I hope this helps someone.

Most services do not start automatically on boot, even after they are installed. You need to set the appropriate chkconfig entries. I had trouble with Apache and MySQL; sshd, however, starts automatically.

chkconfig --level 2345 httpd on
chkconfig --level 2345 mysqld on

Also, the firewall port for HTTP was closed by default. Opening

system-config-firewall 

and enabling the port fixed it quickly.

Everything looks hunky dory. Now for some actual coding.