
New feature: CRAWL STATS, a free SEO crawler with Safecont

We are pleased to announce more news, because we never stop.
The launch of “CRAWL STATS” is now official: a crawler is included, completely free of charge, in every Safecont account.

A new tab appears in your user dashboard. Once you update or launch an analysis, it will show crawl information and data, with listings for each type of URL and other metrics. Until you update your analysis, no data will appear.
On the main screen of the tab you will see “Crawled stats”, a summary of the state of the domain with a pie chart and some very interesting information:

Unique indexable pages

Non-indexable pages

Pages that return a status code other than 200 (301, 302, 404, 500, …)


Crawlable pages over the limit you set for the analysis (you may have launched an analysis on 5,000 URLs, but our crawler has found more pages)

And a great novelty, our Crawl Score: a metric of our own that evaluates how difficult bots find it to crawl a domain and discover all of its indexable URLs, condensed into a single score in which each depth level is weighted according to its importance. Further down on the same page you have the “Crawled URLs per level” graphic, with information for each depth level.
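The exact Crawl Score formula is Safecont's own and is not public, but the idea of a single score weighted per depth level can be sketched roughly like this. The decay weight, the input counts and the aggregation below are all assumptions for illustration only:

```python
# Hypothetical sketch of a level-weighted crawl score (NOT Safecont's real
# formula). Each depth level gets a weight that decays with depth, and the
# score is the weighted fraction of expected URLs actually reached per level.

def crawl_score(found_per_level, expected_per_level):
    """Both arguments are dicts mapping depth level -> URL count."""
    total_weight = 0.0
    weighted_found = 0.0
    for depth, expected in expected_per_level.items():
        weight = 1.0 / (depth + 1)  # assumed decay: shallow levels matter more
        found = found_per_level.get(depth, 0)
        total_weight += weight
        weighted_found += weight * min(found / expected, 1.0)
    return 100.0 * weighted_found / total_weight

# Toy domain: everything crawlable down to level 1, half of level 2 reachable.
score = crawl_score({0: 1, 1: 50, 2: 300}, {0: 1, 1: 50, 2: 600})
```

Under this toy weighting, missing pages at a deep level hurts the score much less than missing them near the root, which matches the intuition that bots should find your important URLs in few hops.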

All this information can be downloaded as a CSV with many more details to work with, such as each URL's index/follow status, where the URL was found, where it points, its status code, its PageRisk, similarity, PageRank, hub value, authority value, the semantic cluster it belongs to, and much more:

Below you will find three boxes with more information:

Non-indexed URLs: with information on redirections, noindex and other circumstances that can make a page non-indexable. Clicking on any of them takes you to a detail page with the full list of URLs it comprises and other relevant data.

Non 200 status URLs: Where

Wild Query Strings and Duplicate Content

Recently, Robin Rozhon wrote an interesting post about duplicate content and how small changes can reduce the number of indexed pages on a site, increasing the traffic received by landing pages and, thus, the revenue that site obtains.
Rozhon’s thesis is that when the same content is duplicated across several indexed URLs, you lose control over the crawling of your site, relying on Google’s judgment and spreading your potential visitors across several URLs that have to compete with each other for the same content. In the post linked above, the author explains how they reduced their indexed URLs by 80% (from 500,000 pages to only 100,000). Before the change, only 8.55% of indexed URLs generated at least one session in a month; after deindexing, 49.7% of the indexed URLs generated organic traffic.
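Those figures are worth a quick back-of-the-envelope check: the improvement is not just a higher percentage of a smaller index, the absolute number of URLs generating traffic also grew.

```python
# Sanity check of the figures quoted above from Rozhon's post.
before_indexed, before_active_rate = 500_000, 0.0855
after_indexed, after_active_rate = 100_000, 0.497

before_active = before_indexed * before_active_rate  # URLs with >= 1 session
after_active = after_indexed * after_active_rate
```

So roughly 42,750 traffic-generating URLs before versus 49,700 after, out of one fifth as many indexed pages.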
At Safecont we totally agree: duplicate content is something to avoid. The main sources of duplicate content are query parameters used to generate content views, for faceted navigation, or to track users. The last one is a really bad idea. There are better ways to track users, but if you are forced to use tracking parameters, you must keep them out of the index, because each pass of the crawler will generate useless URLs that duplicate the main landing pages of your site.
Regarding content views, you should ask yourself: does this parameter change the content seen by the user? If the answer is no, you should keep it out of the index. If the answer is yes, index it only if the change is substantial. For example, a parameter used to order a listing by price should not be indexed.
Facets pose the same problem. If you combine several facets with several filters, you can generate thousands of URLs that add no content to your site. In general, you should use facets only for individual category pages and use non-indexable filters to refine a search. For example, you could generate a landing page for “adidas jackets” but not for “adidas jackets under $200”.
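One practical step in this cleanup is normalizing URLs so that tracking parameters never produce distinct addresses for the same content. A minimal sketch with Python's standard library, assuming a hypothetical list of tracking parameters that you would adapt to your own site:

```python
# Sketch: strip known tracking parameters from a URL so that duplicate
# detection (or canonical generation) sees one URL per piece of content.
# TRACKING_PARAMS is an assumption; extend it with your site's parameters.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def strip_tracking(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

clean = strip_tracking("https://example.com/jackets?color=red&utm_source=mail")
# clean == "https://example.com/jackets?color=red"
```

The same normalization can be applied to a crawl export before comparing URLs, so that two addresses differing only in tracking noise count as one page.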
The usual method to search URL parameters is looking for them in the code

Semantics for your SEO

This week we talk about one of the parts of Safecont that can help improve the quality of a page: the “SEMANTIC” tab. In this part of the tool we summarize the semantic information of a domain in several different ways.
First we have TFIDF again. Whereas on previous occasions we looked at the TFIDF of each URL of a site, in the SEMANTIC tab we focus on the TFIDF of the domain as a whole. On the one hand, we have the general TFIDF of the most frequent words of the domain, calculated as the average of the TFIDF values across all the URLs of the system. This graphic tells us which words are important in our domain and gives us an overview of how we use them. As in the article on the individual TFIDF of each URL, we can look for words whose TFIDF is too high (values well above the average). Since this graph shows average values for the domain, a high value here implies generally high values across the domain: these are words we are overusing on each of the pages where they appear.
The other way to see the TFIDF of the system is the graph in the same tab that shows, for each word, the relationship between its TFIDF value and the number of URLs in which it appears.
In this graphic, each of the most used words in the domain is represented as a point. The horizontal axis shows the number of URLs in which the word appears, and the vertical axis shows the average TFIDF value the word has across all the pages where it is used. Hovering the mouse over one of these points shows which word it refers to and its values. You can also zoom at will. This graphic can be used for several things:

Find words with a TFIDF close to 0 that are therefore used throughout the whole system
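The two quantities plotted in that graph, URL count and average TFIDF per word, can be computed from a crawl with a minimal sketch. The pages and the simple TF-IDF weighting (relative term frequency times log inverse document frequency) are assumptions; Safecont's exact weighting may differ:

```python
# Sketch of the two axes of the SEMANTIC scatter plot: for each word, the
# number of URLs it appears in and its average TF-IDF over those URLs.
import math
from collections import Counter

docs = {  # hypothetical crawled pages: url -> token list
    "/a": ["adidas", "jacket", "price", "jacket"],
    "/b": ["adidas", "shoes", "price"],
    "/c": ["nike", "shoes"],
}

n_docs = len(docs)
# document frequency: in how many URLs each word appears
df = Counter(w for tokens in docs.values() for w in set(tokens))

stats = {}  # word -> (number of URLs, average TF-IDF)
for word, n_urls in df.items():
    idf = math.log(n_docs / n_urls)
    tfidfs = [Counter(tokens)[word] / len(tokens) * idf
              for tokens in docs.values() if word in tokens]
    stats[word] = (n_urls, sum(tfidfs) / len(tfidfs))
```

With this weighting, a word used on every URL gets an IDF of log(1) = 0, which is exactly the “TFIDF close to 0” case mentioned in the list above: boilerplate vocabulary shared by the whole system.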

Ranking pages: Hubs and Authorities

This week we are still talking about web architecture and how we can work on it with Safecont.
One of the least known parts of our tool is the listing of a domain's pages by their Hub or Authority scores. Although we do not mention them much in our videos, these scores also measure the importance of the pages of a domain and help improve the architecture of a site, as an alternative to the classic PageRank algorithm.

While PageRank sorts pages by the probability that they are visited at random, the HITS (Hyperlink-Induced Topic Search) algorithm is based on the idea that there are two types of pages on the Internet:

Hub-type pages are those that, although they do not provide much information on a topic themselves, link to the pages that do.
Authority-type pages are those that contribute content on a topic to a website and are therefore linked by many Hub pages related to that topic.
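This mutual definition (hubs point to authorities, authorities are pointed to by hubs) is computed by power iteration. A minimal sketch of Kleinberg's HITS on a toy link graph (the graph itself is an assumption for illustration):

```python
# Minimal power-iteration sketch of HITS: a page's authority score sums the
# hub scores of pages linking to it, its hub score sums the authority scores
# of pages it links to, with normalization after each step.
links = {  # toy site: page -> pages it links to
    "home": ["cat1", "cat2", "prod"],
    "cat1": ["prod"],
    "cat2": ["prod"],
    "prod": ["home"],
}

pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):  # iterate until scores stabilize
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    norm = sum(v * v for v in auth.values()) ** 0.5
    auth = {p: v / norm for p, v in auth.items()}
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    norm = sum(v * v for v in hub.values()) ** 0.5
    hub = {p: v / norm for p, v in hub.items()}
```

On this toy graph, "prod" ends up with the top Authority score (three pages link to it) while "home" ends up with the top Hub score (it links to the strongest authorities), illustrating how the two rankings differ from a single PageRank-style ordering.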

It should be emphasized that the two scores (Hub and Authority) are not mutually exclusive. The main page of a site usually has high scores in both Authority (it is linked from the whole site) and Hub (it links to many pages with high Authority scores). Let’s see how we can use these scores to improve the structure of our site.
We have placed the page listings by their Hub or Authority score in the “Architecture” tab of our tool. In that section you can find two links to the lists of URLs ordered by their weight as Hub and as Authority. Let’s see some examples:
This website is the online store of a surfing fashion brand. If we look at its list of Authorities we see the following:
As you can see, the root page has a high Authority weight, which is logical because it is linked from most pages of the site. However, we see something curious: its Hub score is very low.
Normally it would have a score close to 1.0, because what is usual in an e-commerce site is that this page

TFIDF: the quest for normality

A few days ago we talked about TFIDF and used it to measure the quality of a URL, seeing how it was affected by the number of unique words in that URL. In this post we will continue with TFIDF, but this time we will see how to use it to detect words that are overused within the same URL. When this happens, Google may consider that we are trying to over-exploit that word and, as a result, we may be unable to rank for it at all. So the question to ask is: how often can a word be repeated without suffering a penalty or restriction from the search engines that prevents ranking it correctly?
To understand the problem of harmful repetition we must first understand how words are distributed in a “natural” text. The most accepted hypotheses are that the frequency of appearance of a word in a text follows one of these distributions: Zipf, log-normal or normal. For simplicity, we will assume that the distribution of words in a text follows a normal distribution. This means that if we chose random words using a normal distribution, the result would have frequencies similar to those of any language. This is useful because it allows us to check whether the distribution of words in our texts can be considered natural. Let’s see what shape a normal distribution has.
As can be seen, a normal distribution has a mean, μ, around which most values are grouped, and a standard deviation, σ, which determines how steep the curve is. In our case this translates as follows: most words will have a frequency of appearance close to the mean, and a few at the extremes will appear either too little or too much. To measure “too little” and “too much” we can use the 68-95-99.7 rule, which can be seen in the previous graph. It means that 68% of the words will have a frequency in the range [μ−σ, μ+σ], 95% will
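Under that normality assumption, the check is straightforward: flag any word whose frequency falls more than two standard deviations above the mean, i.e. outside the 95% band of the rule just described. A minimal sketch with made-up frequencies:

```python
# Sketch of the 68-95-99.7 check: under an assumed normal model of word
# frequencies, words more than 2 standard deviations above the mean fall
# outside the 95% band and are candidates for over-optimization.
import math

freqs = {  # hypothetical word counts for one URL
    "jacket": 40, "adidas": 12, "price": 9, "shop": 8, "size": 7,
    "color": 6, "ship": 6, "return": 5, "brand": 5, "sale": 4,
}

mean = sum(freqs.values()) / len(freqs)
std = math.sqrt(sum((f - mean) ** 2 for f in freqs.values()) / len(freqs))

overused = [w for w, f in freqs.items() if f > mean + 2 * std]
```

In this made-up example only “jacket” exceeds μ + 2σ, which is exactly the kind of outlier the post suggests trimming back toward a natural frequency.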