
New feature: CRAWL STATS, a free SEO crawler with Safecont

We are pleased to announce more news, because we never stop.
The launch of “CRAWL STATS” is now official: a crawler is included, completely free of charge, in every Safecont account.

A new tab appears in your user dashboard. Once you update or launch an analysis, it will show crawl information and data, with listings for each type of URL and other metrics. Until you update your analysis, no data will appear.
On the main screen of the tab you will see “Crawled stats”, a summary of the state of the domain with a pie chart and some very interesting information:

Unique indexable pages

Non-indexable pages

Pages that return a status code other than 200 (301, 302, 404, 500, …)


Crawlable pages over the limit you set for the analysis (you may have launched an analysis on 5,000 URLs, but our crawler has found more pages)

And a great novelty, our Crawl Score: a metric of our own that evaluates how difficult bots find it to crawl a domain and discover all of its indexable URLs, condensed into a single score in which each depth level is weighted according to its importance. Further down on the same page you have the “Crawled URLs per level” graphic, with information for each depth level.
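The exact Crawl Score formula is Safecont's own and is not public, but the idea of a single score weighted per depth level can be sketched roughly like this. The decay weight, the input counts and the aggregation below are all assumptions for illustration only:

```python
# Hypothetical sketch of a level-weighted crawl score (NOT Safecont's real
# formula). Each depth level gets a weight that decays with depth, and the
# score is the weighted fraction of expected URLs actually reached per level.

def crawl_score(found_per_level, expected_per_level):
    """Both arguments are dicts mapping depth level -> URL count."""
    total_weight = 0.0
    weighted_found = 0.0
    for depth, expected in expected_per_level.items():
        weight = 1.0 / (depth + 1)  # assumed decay: shallow levels matter more
        found = found_per_level.get(depth, 0)
        total_weight += weight
        weighted_found += weight * min(found / expected, 1.0)
    return 100.0 * weighted_found / total_weight

# Toy domain: everything crawlable down to level 1, half of level 2 reachable.
score = crawl_score({0: 1, 1: 50, 2: 300}, {0: 1, 1: 50, 2: 600})
```

Under this toy weighting, missing pages at a deep level hurts the score much less than missing them near the root, which matches the intuition that bots should find your important URLs in few hops.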

All this information can be downloaded as a CSV with many more details to work with, such as each URL's index/follow status, where the URL was found, where it points, its status code, its PageRisk, similarity, PageRank, hub value, authority value, the semantic cluster it belongs to, and much more:

Below you will find three boxes with more information:

Non-indexed URLs: with information on redirections, noindex and other circumstances that can make a page non-indexable. Clicking on any of them takes you to a detail page with the full list of URLs it comprises and other relevant data.

Non 200 status URLs: Where

Wild Query Strings and Duplicate Content

Recently, Robin Rozhon wrote an interesting post about duplicate content and how small changes can reduce the number of indexed pages on a site, increasing the traffic received by landing pages and, thus, the revenue that site obtains.
Rozhon’s thesis is that when the same content is duplicated across several indexed URLs, you lose control over the crawling of your site, relying on Google’s judgment and spreading your potential visitors across several URLs that have to compete with each other for the same content. In the post linked above, the author explains how they reduced their indexed URLs by 80% (from 500,000 pages to only 100,000). Before the change, only 8.55% of indexed URLs generated at least one session in a month; after deindexing, 49.7% of the indexed URLs generated organic traffic.
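Those figures are worth a quick back-of-the-envelope check: the improvement is not just a higher percentage of a smaller index, the absolute number of URLs generating traffic also grew.

```python
# Sanity check of the figures quoted above from Rozhon's post.
before_indexed, before_active_rate = 500_000, 0.0855
after_indexed, after_active_rate = 100_000, 0.497

before_active = before_indexed * before_active_rate  # URLs with >= 1 session
after_active = after_indexed * after_active_rate
```

So roughly 42,750 traffic-generating URLs before versus 49,700 after, out of one fifth as many indexed pages.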
At Safecont we totally agree: duplicate content is something to avoid. The main sources of duplicate content are query parameters used to generate content views, for faceted navigation, or to track users. The last one is a really bad idea. There are better ways to track users, but if you are forced to use tracking parameters, you must keep them out of the index, because each pass of the crawler will generate useless URLs that duplicate the main landing pages of your site.
Regarding content views, you should ask yourself: does this parameter change the content seen by the user? If the answer is no, you should keep it out of the index. If the answer is yes, index it only if the change is substantial. For example, a parameter used to order a listing by price should not be indexed.
Facets pose the same problem. If you combine several facets with several filters, you can generate thousands of URLs that add no content to your site. In general, you should use facets only for individual category pages and use non-indexable filters to refine a search. For example, you could generate a landing page for “adidas jackets” but not for “adidas jackets under $200”.
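One practical step in this cleanup is normalizing URLs so that tracking parameters never produce distinct addresses for the same content. A minimal sketch with Python's standard library, assuming a hypothetical list of tracking parameters that you would adapt to your own site:

```python
# Sketch: strip known tracking parameters from a URL so that duplicate
# detection (or canonical generation) sees one URL per piece of content.
# TRACKING_PARAMS is an assumption; extend it with your site's parameters.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def strip_tracking(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

clean = strip_tracking("https://example.com/jackets?color=red&utm_source=mail")
# clean == "https://example.com/jackets?color=red"
```

The same normalization can be applied to a crawl export before comparing URLs, so that two addresses differing only in tracking noise count as one page.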
The usual method to search URL parameters is looking for them in the code

Semantics for your SEO

This week we talk about one of the parts of Safecont that can help improve the quality of a page: the “SEMANTIC” tab. In this part of the tool we summarize the semantic information of a domain in several different ways.
First we have TFIDF again. Whereas on previous occasions we looked at the TFIDF of each URL of a site, in the SEMANTIC tab we focus on the TFIDF of the domain as a whole. On the one hand, we have the general TFIDF of the most frequent words of the domain, calculated as the average of the TFIDF values across all the URLs of the system. This graphic tells us which words are important in our domain and gives us an overview of how we use them. As in the article on the individual TFIDF of each URL, we can look for words whose TFIDF is too high (values well above the average). Since this graph shows average values for the domain, a high value here implies generally high values across the domain: these are words we are overusing on each of the pages where they appear.
The other way to see the TFIDF of the system is the graph in the same tab that shows, for each word, the relationship between its TFIDF value and the number of URLs in which it appears.
In this graphic, each of the most used words in the domain is represented as a point. The horizontal axis shows the number of URLs in which the word appears, and the vertical axis shows the average TFIDF value the word has across all the pages where it is used. Hovering the mouse over one of these points shows which word it refers to and its values. You can also zoom at will. This graphic can be used for several things:

Find words with a TFIDF close to 0 that are therefore used throughout the whole system
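The two quantities plotted in that graph, URL count and average TFIDF per word, can be computed from a crawl with a minimal sketch. The pages and the simple TF-IDF weighting (relative term frequency times log inverse document frequency) are assumptions; Safecont's exact weighting may differ:

```python
# Sketch of the two axes of the SEMANTIC scatter plot: for each word, the
# number of URLs it appears in and its average TF-IDF over those URLs.
import math
from collections import Counter

docs = {  # hypothetical crawled pages: url -> token list
    "/a": ["adidas", "jacket", "price", "jacket"],
    "/b": ["adidas", "shoes", "price"],
    "/c": ["nike", "shoes"],
}

n_docs = len(docs)
# document frequency: in how many URLs each word appears
df = Counter(w for tokens in docs.values() for w in set(tokens))

stats = {}  # word -> (number of URLs, average TF-IDF)
for word, n_urls in df.items():
    idf = math.log(n_docs / n_urls)
    tfidfs = [Counter(tokens)[word] / len(tokens) * idf
              for tokens in docs.values() if word in tokens]
    stats[word] = (n_urls, sum(tfidfs) / len(tfidfs))
```

With this weighting, a word used on every URL gets an IDF of log(1) = 0, which is exactly the “TFIDF close to 0” case mentioned in the list above: boilerplate vocabulary shared by the whole system.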

Ranking pages: Hubs and Authorities

This week we are still talking about web architecture and how we can work on it with Safecont.
One of the least known parts of our tool is the listing of a domain's pages by their Hub or Authority scores. Although we do not mention them much in our videos, these scores also measure the importance of the pages of a domain and help improve the architecture of a site, as an alternative to the classic PageRank algorithm.

While PageRank sorts pages by the probability that they are visited at random, the HITS (Hyperlink-Induced Topic Search) algorithm is based on the idea that there are two types of pages on the Internet:

Hub-type pages are those that, although they do not provide much information on a topic themselves, link to the pages that do.
Authority-type pages are those that contribute content on a topic to a website and are therefore linked by many Hub pages related to that topic.
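This mutual definition (hubs point to authorities, authorities are pointed to by hubs) is computed by power iteration. A minimal sketch of Kleinberg's HITS on a toy link graph (the graph itself is an assumption for illustration):

```python
# Minimal power-iteration sketch of HITS: a page's authority score sums the
# hub scores of pages linking to it, its hub score sums the authority scores
# of pages it links to, with normalization after each step.
links = {  # toy site: page -> pages it links to
    "home": ["cat1", "cat2", "prod"],
    "cat1": ["prod"],
    "cat2": ["prod"],
    "prod": ["home"],
}

pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):  # iterate until scores stabilize
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    norm = sum(v * v for v in auth.values()) ** 0.5
    auth = {p: v / norm for p, v in auth.items()}
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    norm = sum(v * v for v in hub.values()) ** 0.5
    hub = {p: v / norm for p, v in hub.items()}
```

On this toy graph, "prod" ends up with the top Authority score (three pages link to it) while "home" ends up with the top Hub score (it links to the strongest authorities), illustrating how the two rankings differ from a single PageRank-style ordering.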

It should be emphasized that the two scores (Hub and Authority) are not mutually exclusive. The main page of a site usually has high scores in both Authority (it is linked from the whole site) and Hub (it links to many pages with high Authority scores). Let’s see how we can use these scores to improve the structure of our site.
We have placed the page listings by their Hub or Authority score in the “Architecture” tab of our tool. In that section you can find two links to the lists of URLs ordered by their weight as Hub and as Authority. Let’s see some examples:
This website is the online store of a surfing fashion brand. If we look at its list of Authorities we see the following:
As you can see, the root page has a high Authority weight, which is logical because it is linked from most pages of the site. However, we see something curious: its Hub score is very low.
Normally it would have a score close to 1.0, because what is usual in an e-commerce site is that this page

TFIDF: the quest for normality

A few days ago we talked about TFIDF and used it to measure the quality of a URL, seeing how it was affected by the number of unique words in that URL. In this post we will continue with TFIDF, but this time we will see how to use it to detect words that are overused within the same URL. When this happens, Google may consider that we are trying to over-exploit that word and, as a result, we may be unable to rank for it at all. So the question to ask is: how often can a word be repeated without suffering a penalty or restriction from the search engines that prevents ranking it correctly?
To understand the problem of harmful repetition we must first understand how words are distributed in a “natural” text. The most accepted hypotheses are that the frequency of appearance of a word in a text follows one of these distributions: Zipf, log-normal or normal. For simplicity, we will assume that the distribution of words in a text follows a normal distribution. This means that if we chose random words using a normal distribution, the result would have frequencies similar to those of any language. This is useful because it allows us to check whether the distribution of words in our texts can be considered natural. Let’s see what shape a normal distribution has.
As can be seen, a normal distribution has a mean, μ, around which most values are grouped, and a standard deviation, σ, which determines how steep the curve is. In our case this translates as follows: most words will have a frequency of appearance close to the mean, and a few at the extremes will appear either too little or too much. To measure “too little” and “too much” we can use the 68-95-99.7 rule, which can be seen in the previous graph. It means that 68% of the words will have a frequency in the range [μ−σ, μ+σ], 95% will
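Under that normality assumption, the check is straightforward: flag any word whose frequency falls more than two standard deviations above the mean, i.e. outside the 95% band of the rule just described. A minimal sketch with made-up frequencies:

```python
# Sketch of the 68-95-99.7 check: under an assumed normal model of word
# frequencies, words more than 2 standard deviations above the mean fall
# outside the 95% band and are candidates for over-optimization.
import math

freqs = {  # hypothetical word counts for one URL
    "jacket": 40, "adidas": 12, "price": 9, "shop": 8, "size": 7,
    "color": 6, "ship": 6, "return": 5, "brand": 5, "sale": 4,
}

mean = sum(freqs.values()) / len(freqs)
std = math.sqrt(sum((f - mean) ** 2 for f in freqs.values()) / len(freqs))

overused = [w for w, f in freqs.items() if f > mean + 2 * std]
```

In this made-up example only “jacket” exceeds μ + 2σ, which is exactly the kind of outlier the post suggests trimming back toward a natural frequency.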