web crawler

data pipeline plans

I’ve been tinkering with a couple of things and toying with a couple of other ideas that, when combined, form what I would call a data pipeline. I quite like the idea of building this data pipeline, so I’ve decided that I shall do just that.

The things I’ve been working on already are a web scraper and a database. The other ideas I’ve been keen to explore are a data portal and an orchestrator. In reverse order, the specific technologies are Apache Airflow, CKAN, PostGIS and Python.

For the web scraper I’m going to revisit and rewrite the thoroughly unstructured and messy Python scripts I wrote for my UK Freight Transport System map data. For the database I’m going to continue working on my geolab personal PostGIS server. The CKAN data portal is presently the least explored area, but I like what I see, and look forward to learning more about it. Lastly there is Apache Airflow, which will literally tie all the other components together, in theory. As in, it could run the scraper on a schedule, transform its results and stick them in PostGIS, then something something prepare an export suitable for sharing via CKAN? We shall see.
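To make that flow a bit more concrete, here is a rough sketch of the four steps as plain Python functions, run one after another the way an orchestrator might run them. Everything in it is an assumption for illustration: the function names, the record fields and the sample rows are made up, a dict stands in for PostGIS, and a CSV string stands in for whatever export CKAN would end up hosting. It is not the real scraper, and it does not use Airflow itself.

```python
import csv
import io

# All names and rows below are hypothetical stand-ins, not the real scraper or schema.


def scrape():
    """Pretend to scrape; returns made-up placeholder rows as raw strings."""
    return [
        {"name": " Example Terminal A ", "lon": "-1.50", "lat": "53.80"},
        {"name": "Example Terminal B", "lon": "0.10", "lat": "51.50"},
    ]


def transform(rows):
    """Tidy up names and turn coordinate strings into floats."""
    return [
        {"name": r["name"].strip(), "lon": float(r["lon"]), "lat": float(r["lat"])}
        for r in rows
    ]


def load(rows, store):
    """Stand-in for an INSERT into PostGIS: keep rows in a dict keyed by name."""
    for r in rows:
        store[r["name"]] = r
    return store


def export_csv(store):
    """Prepare a CSV string of the kind a data portal could host."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["name", "lon", "lat"], lineterminator="\n"
    )
    writer.writeheader()
    for name in sorted(store):
        writer.writerow(store[name])
    return buf.getvalue()


def run_pipeline():
    """What an orchestrator would do on a schedule: run each step in order."""
    store = {}
    load(transform(scrape()), store)
    return export_csv(store)


if __name__ == "__main__":
    print(run_pipeline())
```

In a real setup each function would presumably become its own scheduled task, so that a failed scrape doesn’t leave a half-finished export lying around.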

GNU Terry Pratchett through the Looking Glasses

“A man is not dead while his name is still spoken.”

One day I discovered the existence of Looking Glasses. It was so long ago now that I don’t remember exactly what led to it, but I know I found http://www.bgplookingglass.com and became quite intrigued by it.

A Looking Glass is a system that network operators might use to find out Internet routing and BGP-related information. It provides insight into how a particular router connects the Autonomous Systems that make up the Internet.

But this post isn’t about Looking Glasses, it’s about something else I found. So back to my story…