Yeah yeah, don’t carp to me about SQL. I’ll have you know that the language is mathematically derived, and is the correct way to deal with data. God Bless all those elementary school teachers who tried to impress upon us the importance of the intersection of the set of all apples and the set of all thing that Johnny possessed, as they were on to someting, but they didn’t understand it any more than we did, and it showed. They had the teachers’ edition and we did not, and that was that.
That nonsense of apples and Johnnies is a problem which still plagues database instruction to this day. Do you know the name Alfreds Futterkiste? I’ll stab that bastard if I ever see him. I recently saw an example describing how one might wish to filter a sharepoint list combining criteria with some AND and OR operators. Well and good, but the filters they combined were “class = mammal” AND (“species = cat” OR “species = dog”). This is patently stupid — every dog and/or cat is a mammal, so there is no need to filter on class here, like Sillitoe. And so it goes, where fruits and mammals and occasionally make/model/year are trotted out as stultifyingly inappropriate examples to teach us how to use SQL.
So it’s no wonder that I churn out highly successful but appallingly bad SQL such as the following, trying to identify remaining lumps in my quest to sift the junk users up outta here:
SELECT
SUBSTRING(user_email, LOCATE('@', user_email) + 1)
AS da_domain,
REVERSE(SUBSTRING(user_email, LOCATE('@', user_email) + 1))
AS da_reverse,
REVERSE(SUBSTRING(REVERSE(SUBSTRING(user_email, LOCATE('@', user_email) + 1)),
LOCATE('.', REVERSE(SUBSTRING(user_email, LOCATE('@', user_email) + 1))) + 1))
AS da_domainlocal,
REVERSE(SUBSTRING(REVERSE(SUBSTRING(user_email, LOCATE('@', user_email) + 1)),1,
LOCATE('.', REVERSE(SUBSTRING(user_email, LOCATE('@', user_email) + 1))) - 1))
AS da_domaintop,
count(*)
FROM wp_users
GROUP BY da_domainlocal
ORDER BY COUNT(*) DESC
LIMIT 1000;
What we are doing here:
TAKE the email address of a user: ev.il.sp.am.mer.84383@morons.stupidmail.com.
SPLIT it at the ‘@’ symbol and take the latter portion, which is the domain, called here ‘da_domain’ just to avoid namespace collisions. I’m lazy, not stupid. This could give us ‘morons.stupidmail.com’.
And so we REVERSE that, yielding ‘moc.liamdiputs.snorom’. Then SPLIT at the first dot, yielding (on the left of the dot) the top level domain (TLD), which we will call ‘da_domaintop’, and on the right of the dot, the local part of the domain, which we call ‘da_domainlocal’. Good thing we aren’t working with AD or domain controllers etc. BUT we also must REVERSE these, so that our TLD which was ‘moc’ becomes ‘com’, and our local domain which was ‘liamdiputs.snorom’ becomes ‘morons.stupidmail’.
We don’t actually need the ‘da_reverse’ portion, but it is left in there for reference. My reference. So I can understand the rest. Why? Because I am too lazy to figure out CTEs or subqueries or whatever right now. So I have this poor machine rolling and unrolling, reversing and unreversing things all over the place. This is truly awful SQL. You see, I cannot simply define ‘da_reverse’ and then WORK WITH IT, because until the blasted thing runs, ‘da_reverse’ does not exist. So I have to go everywhere that I would have wanted to operate on ‘da_reverse’ and PASTE IN the same operations which generated it. So it’s doing an astounding amount of unnecessary work, like James Joyce. Well better it than me!
ANYWAY, this gives us some powerful tools. We can then group by TLD or local domain, and count the number of bogus users INFESTING MY BLOG who registered from each/any of those domain parts. Here’s one of the possible outputs, grouping by ‘da_domainlocal’:
| da_domain | da_reverse | da_domainlocal | da_domaintop | count(*) ![]() |
|
|---|---|---|---|---|---|
| gmail.com | moc.liamg | gmail | com | 2998 | |
| yahoo.com | moc.oohay | yahoo | com | 963 | |
| hotmail.comom | momoc.liamtoh | hotmail | comom | 423 | |
| meta.ua | au.atem | meta | ua | 375 | |
| all.codyting.com | moc.gnitydoc.lla | all.codyting | com | 202 | |
| aol.com | moc.loa | aol | com | 196 | |
| thefmail.com | moc.liamfeht | thefmail | com | 135 | |
| comcast.net | ten.tsacmoc | comcast | net | 73 |
As you can see, the bulk of the remaining users are here from gmail, but this is out of over eleven thousand remaining users, to that nearly 3K does not tell much of the story. Obviously, I can delete all users coming from the nonsense domain ‘hotmail.comom’. Jut as obviously, I should not delete the populations of yahoo, gmail, aol, and so forth. There could be real people in there. For those, my next bulk-reducing trick will be to count the number of periods in the local part of the email address, which will get a lot for very little effort. These guys: ev.il.sp.am.mer.84383@whatever.
There’s an issue hiding in the data above, which is that I have overspecified (grabbed too many fields), and the results are not what they appear. Because the table above was ordered up with GROUP BY ‘da_domainlocal’ and so forth, the count only applies to THAT column, and any data listed alongside in other columns is only an example, selected essentially at random. That is, YES it is true that there are 423 hotmail.[something] accounts, but I only know that at least one of them is actually hotmail.comom. The rest are very likely hotmail.com. So it is important to know what this barbaric SQL is saying and what it is NOT saying.
So you get domains like yandex, which is the same as the search engine problem for frigging pinterest. I do not ever want pinterest to show up in my search results. Unfortunately, you cannot just block pinterest.com, as that miserable noise content comes from everywhere — pinterest.com, pinterest.co.us, pinterest.co.jp, pinterest.co.cn, pinterest.co.ca, pinterest.co.tk, and it just never ends. Pinterest links are the real cosmic noise background. Well, now DuckDuckGo now supports wildcards in the filters for result domains, but they didn’t always. Which brings us back to yandex, and yahoo for that matter, and probably gmail now that I think about it. Haven’t looked that far yet.
da_domain 1 |
count(*) | |
|---|---|---|
| zuxuhanuvunu.pro | 1 | |
| … | … | |
| yaroslav-samoylov.com | 1 | |
| yandex.ua | 6 | |
| yandex.kz | 10 | |
| yandex.by | 5 | |
| yaicae.fun | 1 | |
| yahoo.it | 1 | |
| yahoo.in | 1 | |
| yahoo.fr | 1 | |
| yahoo.es | 3 | |
| yahoo.de | 1 | |
| yahoo.com.sg | 1 | |
| yahoo.com | 950 | |
| yahoo.co.uk | 9 | |
| yahoo.co.jp | 2 | |
| yahoo.co.in | 1 | |
| yahoo.co | 3 | |
| yahoo.ca | 3 | |
| y.otramovida.net | 6 |
This is when it’s handy to use the domainlocal part, which reduces like this:
| yaroslav-samoylov | 1 |
| yandex | 21 |
| yaicae | 1 |
| yahoo.com | 1 |
| yahoo.co | 12 |
| yahoo | 963 |
| y.otramovida | 6 |
It’s not perfect, but it’s better, and it is informative.
I would really need to get into CTEs, potentially recursive due to the recursive nature of domain names, in order to formalize numbers about domains and so forth.
You may wonder what the point of all this is, as easy answers are not exactly popping out of the woodwork here. Well, I already scooped up a bunch of the easy answers, deleting about 7K bogus users just by looking at obvious nonsense domains right out of the WordPress user management screen and then finding, filtering, and deleting matches for those. Really, the next valuable thing I could do is probably look for users whose email addresses are festooned with too many dot characters. They do that to break up the bad word filters, and wind up creating a far handier tell. but that’s not going so tolve alot of problems. To be most effective that search will be conducted without reference to domains at all :-/
I’ve wanted to develop this capacity for some time — I still have nearly 90K comments on another blog that I have to get rid of. That will of course be done by finding bad users and deleting them, and their comment go with them. But this one is priority.
As I get further into reminding myself how SQL works, eventually I will get brave enough to run DELETE queries. For now, it’s just SELECT to point me toward value-added manual deletion activities, and so it goes, like Kurt Vonnegut. Sigh.
Well, in case you wanted some truly disgusting, abusive SQL (not even in a clever sense), I got you covered.
