
|S,p.a,m| Saga


After starting this Weblog in January 2015 and as a new user of WordPress, I tried to customize the settings that would bring pertinent comments to this blog site and discourage trolling, tangential, self-serving, generalized, content-praising, off-topic messages.

In February, I published two posts, one at the beginning of February, and one on February 19. The first post had a Greek letter in the title. The second post had all English words.

Until February 19, comments arrived as a daily trickle in the low single digits. Starting on February 21, that trickle turned into a deluge. Comments were made not only to my blog posts (one of which was WordPress’ “Hello, World”) but also to my “About …” page and a Sample Page. I thought it amusing that the comments on these pages offered praise with the same fervor as those on my actual blog posts.

From February 21 through February 25, 2015, I received nearly 800 comment |S,p.a,m| messages. I was curious to see if I could analyze these messages further beyond their individual hooks to communicate back.

 [1] WHEN: Date and Time

I noticed that they arrived in clumps at differing times each day. So I created an overlapping frequency distribution using 24 hourly intervals and superimposed 4 days, 7 hours of Comment Email notifications based on arrival time (Universal Time) for moderation.

[Figure: Frequency Distributions]

The following table shows some descriptive characteristics:

Characteristic     # Emails   Date          Time Interval(s)
                   13         2-21-2015     8-9PM
                   32         2-22-2015     5-6AM
                   13         2-23-2015     11AM-12PM
                   46         2-24-2015     1-2PM, 2-3PM
                   24         2-25-2015     1-2AM
Total Messages     65         2-21-2015     6AM-Midnight
                   212        2-22-2015
                   116        2-23-2015
                   312        2-24-2015
                   75         2-25-2015     Midnight-7AM
Peak Arrival(s)    46         2-21-2015     5-10PM
                   118        2-22-2015     4AM-12PM
                   61         2-23-2015     8AM-3PM
                   190        2-24-2015     10AM-5PM
                   143        2-24-2015 &   7PM-3AM
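
The hourly binning behind these counts can be sketched in a few lines of shell. This is only an illustration: it assumes the arrival times have already been normalized to one `YYYY-MM-DD HH:MM:SS` Universal Time stamp per line, and the function name `hourly_counts` and the sample stamps are hypothetical.

```shell
#!/bin/bash
# Sketch: count messages per (date, hour) bucket.
# Assumes one timestamp per line on stdin, formatted "YYYY-MM-DD HH:MM:SS" (UT).
hourly_counts() {
    awk '{ split($2, t, ":"); n[$1 " " t[1]]++ }
         END { for (k in n) print k, n[k] }' | sort
}

# Hypothetical sample data, not taken from the actual notifications.
printf '%s\n' \
    "2015-02-22 05:10:00" \
    "2015-02-22 05:45:12" \
    "2015-02-22 06:01:30" \
    "2015-02-23 11:20:13" | hourly_counts
```

Each output line gives a date, an hour and the number of messages that arrived within that hour, which is the raw material for a 24-interval frequency distribution per day.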

The Peak Arrival duration seemed to be between 7 and 8 hours, once it got started on 2-22-15, with somewhat increasingly higher numbers of Emails showing up. It felt like new UTC time zones were activated over the days. Moscow and the Middle East on 2-22-15, Western Europe on 2-23-15, Iceland, Brazil and into New York on 2-24-15 and Vladivostok on the evening of 2-24-15 and continuing into 2-25-15.

Since I was spending a successively greater amount of my day reading the comments (to make sure a genuine comment wasn’t lurking) and isolating these moderation requests, with no expectation that it would taper off soon and with full expectation that my next blog post would produce even greater unwanted feedback, I decided to “pull the plug”.

I updated the comment rules to require all those wishing to comment to register first. This stopped the comments completely by 7:30AM UT on 2-25-15. It appears that the comments I experienced were programmatically generated once postings on my blog site were made available to these sources.

I was curious also about who and where these false commentators came from. At first, I thought I could capture salient characteristics of each comment in an Excel Spreadsheet, but it required me to do repeated copy and paste from my Browser to Excel. As the comments mounted, it appeared that I would invest a large amount of time chasing upwards of 780 messages.

So I decided to review the E-mail message stream sent to me that indicated that a comment just arrived and needed approval moderation. Because such a single E-mail message could be associated with more than one comment, there were 605 such E-mail notifications, still making for a fairly large sample.

In order to process the information in a few minutes rather than a few weeks, I decided to use my Unix/Linux skills on MacOSX via the underlying command-line application called Terminal.

The Characteristics I sought were:

  • [1] Date and Time of arrival of a message
  • [2] Page or Post that was targeted
  • Author (with IP address)
  • [3] Author alone
  • [4] Whois IP address information and Corresponding Cities and Countries
  • Email addresses that were used
  • [5] Email domains on the right side of the @ sign
  • [6] URL with a segregation of https and http communication protocols
  • [7] Selected Key words or phrase characteristics used in messages

I diligently moved all inbox mail sent from WordPress requesting moderation to a mail account |S,p.a,m| folder.

Next, I checked with Google about where the MacOSX keeps the directory of a mail account’s |S,p.a,m| folder.

Once this was known, I began to write a bash script to extract into separate files the information listed above (as well as produce more analyzed reports).

Here is the bash script:

#! /bin/bash
USAGE="Usage: spamscript.bash"
# Created by Robert 2/25/2015 to Analyze Blog Spam
# Note: Missing Mesgs between START and FINISH values; 607 files total/799 spam

# Data of interest: Date and Time [1] [Done]
# Target post [2] [Done] See SortedSpamData2.txt for counts
# Author + IP Addresses [3] [Done] See SpamRawData3A.txt for Author only
# whois for IP address (May be forged) [4] [Done]
# Email Address (Especially Email Domain) [5] [Done] also
# SpamRawData5A.txt for email domains only
# URL [6] [Done] Segregate https: and http:
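# Key Words or Phrases (i.e. 2 words ending in !) [7] [Done]

# Settings. The definitions are not shown in this listing, so every value below
# is an assumed stand-in; adjust them to your own mail folder and file names.
DIR="${DIR:-.}"                    # folder holding the saved notification messages
FILESUFFIX="${FILESUFFIX:-emlx}"   # message file suffix (assumed)
KEYWORDS="${KEYWORDS:-nice pleasant fastidious}"   # sample single words from section [7]
OUTFILE1=SpamRawData1.txt
OUTFILE2=SpamRawData2.txt;   OUTFILE2A=SortedSpamData2.txt
OUTFILE2B=SpamRawData2B.txt; OUTFILE2C=SortedSpamData2C.txt
OUTFILE3=SpamRawData3.txt;   OUTFILE3A=SpamRawData3A.txt
OUTFILE3B=SortedSpamData3B.txt; OUTFILE3C=SortedSpamData3C.txt
OUTFILE4=SpamRawData4.txt;   OUTFILE4A=SortedSpamData4A.txt
OUTFILE5=SpamRawData5.txt;   OUTFILE5A=SpamRawData5A.txt; OUTFILE5B=SortedSpamData5B.txt
OUTFILE6=SpamRawData6.txt;   OUTFILE6A=SortedSpamData6A.txt
OUTFILE7=SpamRawData7.txt;   OUTFILE7A=SpamRawData7A.txt; OUTFILE7B=SortedSpamData7B.txt

# Extract Date and Time and store in $OUTFILE1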

awk '/^Date: /' $DIR/*.$FILESUFFIX | sed -e 's/^Date: //' > $OUTFILE1

# Extract Target post (Use html line in body in case of duplicates)

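# The blog's host name is masked in this listing and the first sed expression is truncated,
# so the host is matched generically; stripping the scheme and host is an assumption.
awk '$0 ~ "^http://[^/]*/blog/2015"' $DIR/*.$FILESUFFIX | sed -e 's#^http://[^/]*/##' | sed -e 's#/$##' > $OUTFILE2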
cat $OUTFILE2 | sort | uniq -c | sort -rn > $OUTFILE2A

# -------------> below finds line in all messages
awk '/^A new comment on the post /' $DIR/*.$FILESUFFIX | sed -e 's/^A new comment on the post //' | sed -e 's/ is waiting for.*$//' > $OUTFILE2B
awk '/^A new pingback on the post /' $DIR/*.$FILESUFFIX | sed -e 's/^A new pingback on the post //' | sed -e 's/ is waiting for.*$//' >> $OUTFILE2B
awk '/^A new trackback on the post /' $DIR/*.$FILESUFFIX | sed -e 's/^A new trackback on the post //' | sed -e 's/ is waiting for.*$//' >> $OUTFILE2B
cat $OUTFILE2B | sort | uniq -c | sort -rn > $OUTFILE2C

# Extract Author Name/Link with IP information (Use Author: line in body)

awk '/^Author : /' $DIR/*.$FILESUFFIX | sed -e 's/^Author : //' > $OUTFILE3
awk '/^Author : /' $DIR/*.$FILESUFFIX | sed -e 's/^Author : //' | sed -e 's/(.*$//' > $OUTFILE3A
sort $OUTFILE3A | uniq -c | sort -rn > $OUTFILE3B
cut -d " " -f1 $OUTFILE3A | sort | uniq -c | sort -rn > $OUTFILE3C

# Extract IP Address from Whois : line

# (the whois host is masked in this listing, so any host is accepted before /rest/ip/)
awk '/^Whois : /' $DIR/*.$FILESUFFIX | sed -e 's#^Whois : http://[^/]*/rest/ip/##' > $OUTFILE4
sort $OUTFILE4 | uniq -c | sort -rn > $OUTFILE4A

# Extract E-mail Address E-mail : line

awk '/^E-mail : /' $DIR/*.$FILESUFFIX | sed -e 's/^E-mail : //' > $OUTFILE5
awk '/^E-mail : /' $DIR/*.$FILESUFFIX | sed -e 's/^E-mail : .*@//' > $OUTFILE5A
sort $OUTFILE5A | uniq -c | sort -rn > $OUTFILE5B

# Extract URL : line

awk '/^URL : /' $DIR/*.$FILESUFFIX | sed -e 's/^URL : //' > $OUTFILE6
sort $OUTFILE6 | uniq -c | sort -rn > $OUTFILE6A

# Extract Keywords from Comment text
# Consider as key words:
# See KEYWORDS Variable above

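for i in $KEYWORDS
do
    awk '/^Comment:/,/^Approve it:/' $DIR/*.$FILESUFFIX | grep "$i"
done > $OUTFILE7

# Count matching lines per keyword (loop input reconstructed from $KEYWORDS)
printf '%s\n' $KEYWORDS | while read -r i
do
    echo "$(grep -c "$i" $OUTFILE7) $i"
done > $OUTFILE7A
sort -rn $OUTFILE7A > $OUTFILE7B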

# END OF spamscript.bash

The following is a partial sample E-mail notification instance to me that allows the script to work properly:

From: WordPress [Masked]
Subject: [Math-Linux Insights] Please moderate: "Permutations Count On Factorials"
Date: February 22, 2015 6:20:13 PM PST

A new comment on the post "Permutations Count On Factorials" is waiting for your approval

Author : Rodent Exterminator St Catherines (IP: ,
E-mail :
Whois :
You need to take part in a contest for one of the greatest blogs on the web.

I am going to recommend this web site!
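
To see how labelled lines such as `Author :` and `E-mail :` in a notification like the sample above can be turned into ranked counts, here is a minimal self-contained sketch. The helper names `extract_field` and `rank_counts` are hypothetical, and the sample lines are invented stand-ins that use reserved example addresses in place of the masked values.

```shell
#!/bin/bash
# Sketch: print the value of every "Label : value" line found on stdin.
extract_field() {
    awk -v label="$1" 'index($0, label " : ") == 1 { print substr($0, length(label) + 4) }'
}

# Rank values by frequency, most common first (the sort | uniq -c | sort -rn idiom),
# then trim the padding that uniq -c puts in front of each count.
rank_counts() {
    sort | uniq -c | sort -rn | awk '{ c = $1; sub(/^[ \t]*[0-9]+[ \t]+/, ""); print c, $0 }'
}

# Hypothetical stand-in notification lines.
sample='Author : Pest Control (IP: 192.0.2.10 , host.example)
E-mail : a@example.com
Author : Pest Control (IP: 192.0.2.11 , host.example)
E-mail : b@example.net
Author : best roof (IP: 192.0.2.10 , host.example)
E-mail : c@example.com'

# Author names only (text before the parenthesis), ranked by count.
printf '%s\n' "$sample" | extract_field "Author" | sed -e 's/ *(.*$//' | rank_counts
# E-mail domains (right of the @ sign), ranked by count.
printf '%s\n' "$sample" | extract_field "E-mail" | sed -e 's/^.*@//' | rank_counts
```

The same two helpers would cover the `Whois :` and `URL :` lines as well, since all of them share the "label, space, colon, space, value" layout.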

[2] WHICH: Post Targets

The following table shows the number of comment Emails sent to the targeted post (with post date) for approval:

# Post Title Post Date
130 Permutations Count On Factorials 2-19-2015
101 Factorials For Fun 1-15-2015
96 π Places 1-21-2015
86 Sample Page 1-15-2015
77 About Math-Linux Insights 1-15-2015
71 Hello world! 1-15-2015
36 π GPS (Greater Precision Solutions) 2-4-2015

It was interesting that the commenters’ “programs” failed to distinguish my original content from WordPress generated content (or perhaps it was deliberate). Normally, Moderator rejection is nearly certain if the comment’s context is inappropriate or misdirected.

The most recent post in February had the largest number of Comment E-mails associated during the 6 days it existed and unregistered comments were allowed. Posts with html characters (i.e. of the form: &xxx; ) had unexpectedly fewer comments, especially since the GPS post came after the places post.

[3] WHO: Author

Each Comment has associated with it an Author name. This can be a userid or a link to a web page, video or web page description or gibberish. From the script-generated files, I manually summarized and counted a family of descriptors under a single “Author” or keyword name. The results were recounted and displayed using a reverse numerical sort. General keywords that are associated with an Author name two or more times are:

25 Pest Control
15 best
12 Atlas Chalet
12 travertine tile
10 Bed Bug
9 Manhattan
9 dental
9 roof
8 שירותי
8 how to repair
8 personal injury
7 Pest
7 Richmond Virginia (VA) best personal injury
6 cleaning
6 cosmetic
5 commercial
5 cost of
5 criminal
5 hail
5 how much
5 plumber
5 residential
5 tile
4 Insect
4 abogados de accidente Miami
4 brooklyn
4 dentist
4 find
4 hvac
4 title loans
3 Ant
3 Carpenter Ant
3 Exterminator
3 Home
3 Rat
3 Wildlife
3 affordable
3 cash
3 dui
3 estimated
3 get
3 marble
3 paid
3 payday
3 replacing
3 sealing
3 teeth
2 24 hr emergency
2 Ants
2 Atlas Chalet Warranty Cobb
2 Brittany
2 Click [Hh]ere
2 Defective Atlas Chalet Shingles
2 Home Pest Control Service
2 Pesticide
2 Residential Pest Control
2 Richmond
2 Roach
2 Rodent
2 Roof
2 TX
2 air conditioner
2 appliance repair
2 atlanta
2 best music
2 best paid survey sites
2 best personal injury attorney Richmond Virginia
2 blitz brigade hack
2 cat toys
2 cheap
2 commercial appliance repair
2 cost to
2 credit repair
2 deer hunter 2014
2 family law
2 free
2 garage door
2 go
2 google
2 great site
2 herpes cure
2 how to find a
2 it
2 jetpack joyride
2 knights and dragons
2 lawyer
2 miami
2 nyc
2 paid survey sites
2 plumbing
2 pou
2 pozyczka
2 queens
2 replacement gas furnace
2 reviews
2 seo
2 seo plugin
2 shingles repair
2 solar panel
2 solar power
2 storm
2 surveys for money
2 teeter hang ups review
2 title
2 try this site
2 tucson dui help
2 zespół na wesele Kraków

[4] WHERE: Country, State, City

WordPress associates an IP (v4) address (e.g. of the form nnn.nnn.nnn.nnn, where each nnn ranges independently between 0 and 255) with every comment. This allows me to pinpoint how many times the same IP address is used to issue a comment. Also, there is a website, What Is My IP? that lets you enter an IP address and it returns, among other things, the Latitude, Longitude, City, State and Country and a map segment associated with that IP address location.
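
Since a forged or mangled address would throw off the counts, a quick check of the nnn.nnn.nnn.nnn form can be useful before tallying. The following is a minimal sketch, with a hypothetical function name `is_ipv4`, that accepts only four dot-separated decimal parts in the range 0 to 255.

```shell
#!/bin/bash
# Sketch: succeed only for a dotted quad whose four parts are each 0-255.
is_ipv4() {
    local ip=$1 part
    # Four groups of one to three digits separated by dots.
    [[ $ip =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]] || return 1
    local IFS=.
    for part in $ip; do
        # 10# forces base 10 so that parts such as 08 are not read as octal.
        (( 10#$part <= 255 )) || return 1
    done
}

# Addresses from reserved documentation ranges plus two malformed ones.
for ip in 192.0.2.10 256.1.1.1 1.2.3; do
    if is_ipv4 "$ip"; then echo "$ip valid"; else echo "$ip invalid"; fi
done
```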

In a file, I extracted the counts of IP addresses associated with the comments and then manually augmented that file with the Country, State and City. The most frequently repeated IP address appeared 24 times, from Phoenix, Arizona. I then added up all the counts from each of the same cities and produced the following table (minus the IP addresses):

# Location
465 Arizona Phoenix
22 New York Buffalo
15 France Paris
12 Sweden Stockholm
7 Germany Frankfort
5 Nevada Henderson
5 California Los Angeles
5 United Kingdom London
4 Texas Dallas
3 Indiana Zionsville
3 Canada Quebec Montreal
3 China Caizi Zhen
3 Delaware Dover
3 New York New York City
3 Netherlands Dronten
3 Romania Media
3 Netherlands Amsterdam
2 Colorado Fort Collins
2 Alabama Montgomery
2 Florida Miami
2 Russia Moscow

# Location
2 Switzerland Zurich
2 Romania Moldova
2 California Fresno
1 Lebanon Beirut
1 China Sichuan Mianzhu
1 Taiwan Taipei
1 China Shaanxi Xian
1 Poland Warsaw
1 Germany Berlin
1 Illinois Chicago
1 Illinois Lombard
1 Maryland Baltimore
1 Russia Dubna
1 Washington Spokane
1 Russia Saint Petersburg
1 Germany Munich
1 North Carolina Greensboro
1 Ukraine Uzhgorod
1 Poland Gdansk
1 Romania Lasl
1 Russia Murmansk Kovdor
1 Iowa Cloud
597  Total

Therefore 465/597, or nearly 78%, of the |S,p.a,m| comes from 44 servers (with unique IP addresses) in Phoenix, Arizona. These 44 unique servers are counted using:

   grep -c 'Arizona Phoenix' SortedSpamData4C.txt
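
Adding up the per-IP counts by city, and turning each total into a share of all messages, can also be left to awk. This is a sketch under an assumed file layout of one line per unique IP address in the form `<count> <ip> <location...>`; the function name `location_totals` and the sample rows are hypothetical.

```shell
#!/bin/bash
# Sketch: total the per-IP counts by location and show each location's share.
# Assumes input lines of the form "<count> <ip> <location...>".
location_totals() {
    awk '{
            loc = $3
            for (i = 4; i <= NF; i++) loc = loc " " $i   # rejoin multi-word locations
            sum[loc] += $1
            total += $1
         }
         END {
            for (loc in sum)
                printf "%d %.1f%% %s\n", sum[loc], 100 * sum[loc] / total, loc
         }' | sort -rn
}

# Hypothetical rows; the addresses come from reserved documentation ranges.
printf '%s\n' \
    "24 192.0.2.1 Arizona Phoenix" \
    "11 192.0.2.2 Arizona Phoenix" \
    "3 198.51.100.7 France Paris" \
    "2 203.0.113.9 New York Buffalo" | location_totals
```

With the real data, the Phoenix row would carry the 465 of 597 messages worked out above.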

[5] REPLY: Email Address

Turning now to the Email addresses associated with the 597 comment Emails, there were 49 unique email addresses used. had the preponderance of Email domains with 185; The primary domains .de (Germany) had 8 and .net had 11.

# Domain
# Domain
597  Total

[6] WHY: Click On URL

Next, we consider the URLs associated with the Comments. In addition to the 597 comments, there were 6 trackbacks (generated by others) and 2 pingbacks (generated by me).

Protocols used: https = 375 http = 230 Total = 605
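
A tally of this kind can be produced by splitting each URL at `://` and counting the scheme. The sketch below assumes one URL per line on stdin; the function name `protocol_counts` and the sample URLs are hypothetical.

```shell
#!/bin/bash
# Sketch: tally URLs on stdin by scheme (the text before "://").
protocol_counts() {
    awk -F'://' '{ s = tolower($1) }
                 s == "https" { https++ }
                 s == "http"  { http++ }
                 END { printf "https = %d http = %d Total = %d\n", https, http, https + http }'
}

# Hypothetical sample URLs.
printf '%s\n' \
    "https://www.example.com/watch?v=1" \
    "http://example.net/" \
    "https://example.org/page" | protocol_counts
```

Lines that carry neither scheme are simply left out of the total.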

Specific Selected URL Domains (enclosed with /s):

17 .*.[Gg]

Keyword URL Domains (No right / )

# URL pattern
16 .*survey.*
8 .*porn.*

As can be seen, youtube and its variant represented 435 URLs that commenters used to represent themselves for viewing purposes.

[7] WHAT: Message Keywords

The following table shows the keywords I determined to be peculiar to |S,p.a,m| comments and their popularity, in descending order. Based on the script output file, the following keywords were manually summarized and commandline counted.

# Keyword
294 nice
271 won
228 ans
199 info
137 pleasant
114 off topic
80 fastidious
47 brussels
45 heads up
42 creative writing
39 useful information
38 convey
28 comeback
27 masterpiece
25 pals
24 subject matter
24 donate
23 whilst
22 you relied on the video
22 go after your heart
21 hyperlink
20 at the glance out
19 peer
19 onderful site
14 loading velocity
14 arena
13 such a lot
11 uncanny
9 vefy
9 bravery
8 must read
7 I say to you
6 un-ambiguity
6 take a signal
6 preserveness
6 energetic article
6 I have a mission
4 precious knowledge
1 what a information
1 unpredicted
1 killing my time
1 did not happened
1 Grrrr…
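
Multi-word entries such as "off topic" do not survive a plain `for i in $KEYWORDS` loop, because the shell splits them at the space. One way around that, shown here only as a sketch, is to keep one phrase per line in a file and count matching lines with a fixed-string, case-insensitive grep. The function name `count_phrases` and the sample text are hypothetical.

```shell
#!/bin/bash
# Sketch: for each phrase in a file (one per line), count the lines of stdin that
# contain it, ignoring case and treating the phrase literally, then rank the phrases.
count_phrases() {
    local phrase_file=$1 text phrase
    text=$(cat)                      # keep stdin so it can be searched once per phrase
    while IFS= read -r phrase; do
        [ -n "$phrase" ] || continue
        printf '%s %s\n' "$(printf '%s\n' "$text" | grep -c -i -F -- "$phrase")" "$phrase"
    done < "$phrase_file" | sort -rn
}

# Hypothetical phrase list and comment text.
phrases=$(mktemp)
printf '%s\n' "off topic" "nice" "fastidious" > "$phrases"
printf '%s\n' \
    "This is a nice post, a bit off topic though." \
    "Nice blog!" \
    "Such a nice read." | count_phrases "$phrases"
rm -f "$phrases"
```

As with `grep -c` in general, this counts matching lines rather than individual occurrences, so a phrase repeated within one line is counted once.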

So there it is. Probably much more than you wanted to know about how comment |S,p.a,m| can find you as a result of normal blog post publicity on WordPress’ part.

Perhaps this data can offer insights to others plagued with similar experiences.

We may have to go back to the old tradition, as when researchers wrote papers in the last century. They invited you to read their paper (and offer improvements). Nothing went viral for years, if ever.

The Internet’s ability to act as a conduit and enabler for many “agents” to visit sites, leave (non)messages or even crash websites through directed, overwhelming traffic produces instances of unwanted popularity. It is analogous to Standup Comedians being heckled or Government Spokespeople/Invited Speakers being vocally protested or the Signal being accompanied by a significant amount of Noise.

The most elegant solution to this, to my mind, lies in creating a virtual lightning rod decoy to attract the unwanted (Linux has a facility called /dev/null), while creating and sheltering a direct conduit for those who wish to offer dialogue via cogent questions, differing opinions or other helpful reactions.