Laying the foundation for an open data catalogue project

As I mentioned in one of my previous posts, Carto’s careers page has a number of tests that are proving to be useful and fun personal projects.

The next project I decided to have a go at was "Laying the foundation stone for an Open Data catalog".

The basic idea is:

  • Get the data: tables containing statistical data from the Spanish census, and the geometry of the municipalities as shapefiles
  • Import the data into a PostgreSQL/PostGIS instance; there should be two tables, one for the geometry and one for the statistical data
  • Combine the data and description files from the previous step in order to create a table of statistical indicators
  • Write a couple of SQL queries to address the following: get the population density of each of the municipalities of Madrid, and get the names of the 10 provinces with the highest percentage of people with university degrees

When building my solution, I wanted to focus on the parts that would be more interesting, so I haven't used Python to download and unzip the data; that step would be straightforward and uninteresting for me, and I just downloaded the files myself.

I have also assumed that the task expects the database to have been built already, so the Python script just needs to create and populate the tables rather than set up the whole database. I created my PostGIS database using a combination of Postgres.app and DBeaver.
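
For completeness, the only database-side setup needed beyond creating the database itself is enabling the PostGIS extension, which can be run from any SQL client:

-- Enable spatial types (geometry etc.) in the target database
CREATE EXTENSION IF NOT EXISTS postgis;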

I have included my script below, and just wanted to point out the libraries I have used and what they were used for in general terms:

  • pandas – wrangling and prepping the census CSVs
  • GeoPandas – wrangling and prepping the shapefiles
  • psycopg2 and SQLAlchemy – getting the data handled by pandas/GeoPandas into the database

import os
import geopandas
import pandas
import psycopg2
from sqlalchemy import create_engine

# database name
db = 'db_name'
# database user
db_user = 'db_user'
# database hostname
db_host = 'db_host'
# database user password
db_password = 'db_password'
engine = create_engine(f'postgresql+psycopg2://{db_user}:{db_password}@{db_host}/{db}')
# path to the folder containing the indicadores_seccion_censal CSVs
path_to_csv_folder = '/indicadores_seccion_censal_csv/'
# path to the folder containing the municipality shapefiles
path_to_muni_shp_folder = '/shapefiles/'

# Read each CSV into a pandas dataframe, keeping the province (cpro) and
# municipality (cmun) codes as strings so leading zeros are preserved
csv_names = os.listdir(path_to_csv_folder)
csv_paths = [path_to_csv_folder + csv_file for csv_file in csv_names]
df_list = [pandas.read_csv(f, dtype={'cpro': str, 'cmun': str}) for f in csv_paths]
# Build the full municipality code (id_ine) from the province and municipality codes
for entry in df_list:
    entry['id_ine'] = entry['cpro'] + entry['cmun']
full_df = pandas.concat(df_list)
# Write the combined dataframe to a database table
full_df.to_sql('indicadores_seccion_censal', con=engine)

# Read each municipality shapefile into a GeoDataFrame in preparation for loading into PostGIS
shp_names = os.listdir(path_to_muni_shp_folder)
shp_paths = [path_to_muni_shp_folder + shp_file for shp_file in shp_names if shp_file.endswith('.shp')]
gdf_list = [geopandas.read_file(path) for path in shp_paths]

# Connect to the database and write the municipality geometries to a PostGIS table
try:
    conn = psycopg2.connect(f"dbname={db} user={db_user} host={db_host} password={db_password}")
except psycopg2.OperationalError as e:
    raise SystemExit(f"Can't connect to database: {e}")
cur = conn.cursor()
# Create the table, then constrain the geometry column to SRID 4326
cur.execute("CREATE TABLE muni_poly (id_ine text, name text, geom geometry);")
cur.execute("ALTER TABLE muni_poly ALTER COLUMN geom TYPE geometry(Geometry, 4326);")
# Populate rows with geometries, passed as hex WKB with an SRID prefix
for shapefile in gdf_list:
    for index, row in shapefile.iterrows():
        in_geom = 'SRID=4326;' + row.geometry.wkb_hex
        in_name = str(row.rotulo)
        in_id = str(row.id_ine)
        SQL = "INSERT INTO muni_poly(geom, name, id_ine) VALUES (%s::geometry, %s, %s);"
        data = (in_geom, in_name, in_id)
        cur.execute(SQL, data)
conn.commit()
cur.close()
conn.close()
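
After running the script, it is worth a quick sanity check that both tables were populated. This is a minimal sketch that reuses the engine object from the script above:

# Quick sanity check: count the rows loaded into each table
for table in ('indicadores_seccion_censal', 'muni_poly'):
    counts = pandas.read_sql(f"SELECT COUNT(*) AS n FROM {table};", con=engine)
    print(table, int(counts['n'].iloc[0]))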

I have also written the SQL queries suggested in the test.

This query returns the population per square kilometre for each of the municipalities of Madrid province (cpro is the attribute containing the province codes, and Madrid's code is 28; id_ine holds the municipality codes).

SELECT 
	p.name, 
	SUM(t1_1), 
	isc.id_ine, 
	st_area(st_geogfromwkb(p.geom))/1000000 AS area_sq_km, 
	ROUND(CAST(SUM(t1_1)/(st_area(st_geogfromwkb(p.geom))/1000000) AS NUMERIC),2) AS pop_per_sq_km
FROM 
	indicadores_seccion_censal isc
INNER JOIN 
	muni_poly p ON isc.id_ine = p.id_ine
WHERE 
	isc.cpro = '28'
GROUP BY
	isc.id_ine, 
	p.name, 
	p.geom
ORDER BY
	pop_per_sq_km DESC;

I have included the top 5 rows from the result of this query, i.e. the municipalities in Madrid province with the highest population densities, in descending order.

name          sum       id_ine   area_sq_km         pop_per_sq_km
Coslada       89065     28049    12.1999175227833   7300.46
Madrid        3186620   28079    605.96972768848    5258.71
Fuenlabrada   196880    28058    39.7331796114383   4955.05
Alcorcón      166635    28007    34.0923554031632   4887.75
Parla         121830    28106    25.0802725155268   4857.6

I have noticed there are some issues with special characters in some of the names; fixing this is a possible improvement if I do further work on this project.
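
The likely cause is the encoding used when reading the source files. A possible fix, sketched below, would be to pass an explicit encoding when reading them; latin-1 is an assumption here, and the files' actual encoding would need to be confirmed first:

import pandas
import geopandas

# Hypothetical fix: read with an explicit encoding instead of the default.
# 'latin-1' is an assumption; check the files' real encoding first.
csv_path = '/indicadores_seccion_censal_csv/example.csv'  # placeholder path
shp_path = '/shapefiles/example.shp'                      # placeholder path
df = pandas.read_csv(csv_path, dtype={'cpro': str, 'cmun': str}, encoding='latin-1')
gdf = geopandas.read_file(shp_path, encoding='latin-1')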

Below is my query that returns the 10 provinces with the highest proportion of people with university degrees.

SELECT 
	SUM(t1_1) AS province_pop, 
	isc.cpro AS province_code, 
	SUM(t12_5) AS tot_degree, 
	ROUND(CAST(SUM(t12_5)/SUM(t1_1)*100 as numeric),2) AS perc_degree
FROM 
	indicadores_seccion_censal isc
INNER JOIN 
	muni_poly p ON isc.id_ine = p.id_ine
WHERE 
	t12_5 IS NOT NULL
GROUP BY 
	isc.cpro
ORDER BY 
	perc_degree DESC
LIMIT 10;

province_pop   province_code   tot_degree   perc_degree
6386740        28              1487375      23.29
1149535        48              221945       19.31
703775         20              128690       18.29
528700         47              96535        18.26
5468955        8               989245       18.09
318120         1               56805        17.86
634830         31              111760       17.6
345730         37              59910        17.33
965535         50              163390       16.92
1038690        33              172905       16.65

I haven’t got the province names loaded into the database at the moment. But once those are, it would just be a simple join on the province code attribute.

Like my last post about one of these Carto test projects, I plan to add more detail about some of the choices and decisions I made, but I was keen to get this post up so I can share it first.