Machine Readable Internet
Internet Content
Content served by URIs is structured by design, because it is meant to be processed by browsers and applications to produce a human readable page. Renderable content follows the syntax of the languages it is written in (HTML, CSS and JS). Consumable content is mostly JSON data meant to be processed by a compatible client. But the actual content remains opaque: none of this gives a standardised way to understand the underlying meaning.
Humans vs Machines
Publicly available content on the internet is not consumed only by us; a lot of internet traffic is generated by web crawlers that surf the web 24/7 for one purpose or another. The most common of them are search spiders, programs written to maintain a database of the discoverable internet. We all know how useful these services are: they let consumers find the right information quickly and allow creators to find the right audience easily. Machines are our friends. They try really hard to assign meaning to the things they read and then match it with the questions they get asked. It’s time to help them with some hints, so they can save time and energy and present our content in a more useful manner.
Navigating the Web
For us it’s really simple: go to the address bar, type some keywords, and a list of the most relevant links is instantly presented to us. And if we know the exact public address of a resource, we can go there directly without asking anyone. Machines have it harder, but they still remain polite while roaming our servers. As soon as they come to know about a resource domain, they first check whether they are welcome there by looking at the rules the resource owner has specified in robots.txt. As a resource maintainer you should always keep an updated robots.txt and guide crawlers towards the resources you think will be useful for them. The complete specification is defined here. To allow everything, simply put this in a robots.txt placed at the top level of your domain or subdomain, like makeall.dev.
# for all machines
User-agent: *
# disallow nothing
Disallow:
Once it has permission, the bot dives into the root URL / and starts looking for links to other pages. If it finds some, it catalogues, crawls, parses and repeats. But content changes over time too, so it comes back after a while to refresh what it saw before. This is really tiring even for bots, and it is another place where we can help, by providing a sitemap. Most content management systems generate them automatically like this, but a sitemap can be created manually as well; more information can be found here. A minimal example is sketched below.
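For illustration, here is a rough sketch of a minimal sitemap for a hypothetical page on makeall.dev (the URL and date are made up). The <urlset> root and the <loc> of each <url> entry are what crawlers care about; the file is usually served from a conventional path such as /sitemap.xml, or advertised through a Sitemap: line in robots.txt.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://makeall.dev/</loc>
    <lastmod>2021-01-01</lastmod>
  </url>
</urlset>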
Standardizing Structure
Navigating is one thing, but then comes the actual challenge: understanding the content. Many types of content exist across the internet. schema.org is a corporate-founded, community-driven organization that defines these types and provides a common way in which they can be structured. Site maintainers can add the structure information in a script tag as a JSON-LD object.
<script type="application/ld+json">
[
{
"@context": "http://schema.org",
"@type": "BreadcrumbList",
...
},
{
...
}
]
</script>
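To make that concrete, here is a sketch of what a filled-in object could look like for a hypothetical article page; the property values are invented, but @context and @type are what tell a machine which schema.org type to interpret the rest against.
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Article",
  "headline": "Machine Readable Internet",
  "author": { "@type": "Person", "name": "Jane Doe" },
  "datePublished": "2021-01-01"
}
</script>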
The Happy Bot
A bot is happy when it sees a welcoming robots.txt, a well-maintained sitemap and standardized structured data neatly tucked into the page. Let’s follow the journey of a happy bot; once it gets a domain, these are the steps it takes.
Read Domain Permissions
Looking up and reading robots.txt
import urllib.robotparser

domain = 'https://makeall.dev'
url = "{0}/robots.txt".format(domain) # robots.txt lives at the domain root
rp = urllib.robotparser.RobotFileParser() # create the robots parser
rp.set_url(url) # point the parser at it and read it
rp.read()
user_agent = "*" # use the default or your own custom agent to test
# get politeness parameters (None if the site does not specify them)
rp.request_rate(user_agent)
rp.crawl_delay(user_agent)
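A polite crawler actually uses those values between requests. A minimal sketch, assuming a one-second default when the site specifies no Crawl-delay:
import time

delay = rp.crawl_delay(user_agent) or 1 # fall back to 1 second if no Crawl-delay is set
time.sleep(delay) # wait before hitting the same domain again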
Lookup Sitemaps
# get the list of sitemap XML URLs declared in robots.txt
sitemaps = rp.site_maps()
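site_maps() returns None when robots.txt declares no Sitemap: lines, so a crawler may want a fallback; a small sketch, assuming the conventional /sitemap.xml location:
if sitemaps is None:
    # no Sitemap: directives found; try the conventional location
    sitemaps = ["{0}/sitemap.xml".format(domain)]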
Queue URLs
Read each sitemap and collect the next URLs to visit
import requests
import xml.etree.ElementTree as ET

response = requests.get(sitemap_url) # one of the sitemap URLs found above
sitemap = response.text
urls = []
root = ET.fromstring(sitemap) # parse the XML text directly
for url_entry in root: # each <url> element
    for loc in url_entry: # look at its children
        if 'loc' in loc.tag: # tag names carry the sitemap namespace
            urls.append(loc.text)
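Large sites often publish a sitemap index instead, whose <sitemap> entries point to further sitemap files rather than pages. A rough sketch of handling that, reusing the loop above:
if 'sitemapindex' in root.tag:
    # the collected locations are themselves sitemaps; queue them for another pass
    sitemaps.extend(urls)
    urls = []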
Reading Content
Fetch the queued URLs, but only those robots.txt allows us to fetch
# url is one of the URLs collected from the sitemap
if rp.can_fetch(user_agent, url):
    response = requests.get(url)
    url_text = response.text
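A well-behaved bot also identifies itself so server logs show who is crawling. A minimal sketch, with a made-up bot name:
headers = {"User-Agent": "happybot/0.1 (+https://makeall.dev/bot)"} # hypothetical bot identity
response = requests.get(url, headers=headers)
url_text = response.text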
HTMLParser
An HTMLParser subclass to extract the JSON-LD data and save it to disk
import datetime
import hashlib
import json
import os
from html.parser import HTMLParser

class JsonLdHTMLParser(HTMLParser):
    def __init__(self, url, datadir):
        super().__init__()
        self.found = False # True while inside a JSON-LD script tag
        self.url = url
        self.timestamp = datetime.datetime.now().timestamp()
        self.datadir = datadir

    def get_filename(self):
        # derive a unique, filesystem-safe name from the URL and timestamp
        key = "{0}|{1}".format(self.url, self.timestamp)
        filename = "{0}.json".format(
            hashlib
            .sha224(key.encode('utf-8'))
            .hexdigest()
        )
        return os.path.join(self.datadir, filename)

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            for attr in attrs:
                if attr[0] == 'type' and attr[1] == 'application/ld+json':
                    self.found = True

    def handle_endtag(self, tag):
        if tag == 'script' and self.found:
            self.found = False

    def handle_data(self, data):
        if self.found:
            try:
                jsonld = {}
                jsonld["data"] = json.loads(data)
                jsonld["url"] = self.url
                jsonld["time"] = self.timestamp
                with open(self.get_filename(), 'w+') as fp:
                    fp.write(json.dumps(jsonld))
            except Exception as e:
                print("handle_data : {0}".format(e))
Extract Structured Data
Use the parser to extract the JSON-LD data
parser = JsonLdHTMLParser(url, datadir) # datadir is the directory where the JSON files are written
parser.feed(url_text)
# the parser saves the content in a JSON file
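Each saved file is a small wrapper around the extracted JSON-LD, so a stored record might look roughly like this (URL and timestamp are made-up values):
{
  "data": [ { "@context": "http://schema.org", "@type": "BreadcrumbList" } ],
  "url": "https://makeall.dev/some-page",
  "time": 1609459200.0
}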
Read Data
Parse a saved JSON file
# json_file is the path to one of the files written by the parser
with open(json_file) as fp:
    try:
        data = json.loads(fp.read())
        print("[{0}][{1}]".format(data["time"], data["url"]))
        if type(data["data"]) is list:
            for d in data["data"]:
                print("{0} : {1}".format(d["@context"], d["@type"]))
        else:
            print("{0} : {1}".format(data["data"]["@context"], data["data"]["@type"]))
    except Exception as e:
        print("load files : {0}".format(e))
The @context and @type fields let the bot check its own repository of schemas and map the data appropriately.
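For example, the two fields combine into a schema reference the bot can resolve against its own copy of the vocabulary; a tiny sketch with a hypothetical lookup table:
known_schemas = {"http://schema.org/BreadcrumbList", "http://schema.org/Article"} # hypothetical repository

schema_ref = "{0}/{1}".format(d["@context"].rstrip("/"), d["@type"]) # e.g. http://schema.org/BreadcrumbList
if schema_ref in known_schemas:
    print("bot knows how to interpret", schema_ref)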
Conclusion
Help the bots of the internet learn to read!