Choice and technical, legal, and ethical analysis of the source datasets
Alignment and creation of LOD datasets and their metadata
Extraction of the most relevant trends and visualization of possible correlations
The process lifecycle is described in dedicated documentation
Please select the reference academic year (default 2015/16):
The data show that no correlation can be found between the amount of fees and scholarships, and the percentage of enrolled international students:
| # | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|
| 1 | Perugia Stranieri (39,57%) | Perugia Stranieri (38,70%) | Perugia Stranieri (36,68%) | Perugia Stranieri (36,63%) |
| 2 | Rozzano (MI) Humanitas Univ. (25,43%) | Rozzano (MI) Humanitas Univ. (27,19%) | Bra Scienze Gastronomiche (28,24%) | Bra Scienze Gastronomiche (27,98%) |
| 3 | Bra Scienze Gastronomiche (23,91%) | Bra Scienze Gastronomiche (26,65%) | Rozzano (MI) Humanitas Univ. (26,68%) | Roma Saint Camillus (27,50%) |
| 4 | Bolzano (15,67%) | Reggio Calabria - Dante Alighieri (17,31%) | Reggio Calabria - Dante Alighieri (18,12%) | Rozzano (MI) Humanitas Univ. (25,05%) |
| 5 | Torino Politecnico (14,12%) | Bolzano (14,64%) | Milano Politecnico (15,28%) | Reggio Calabria - Dante Alighieri (18,69%) |
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2016ContribuzioneMedia
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2017ContribuzioneMedia
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2018ContribuzioneMedia
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2019ContribuzioneMedia
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2016SpesaInterventi
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2017SpesaInterventi
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2018SpesaInterventi
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2019SpesaInterventi
Format: .csv .tsv .json .xml
Metadata: Provided
URI: IscrittiAteneo
International Students 2016-2019
ID: intstudent
Provenance: MIUR (USTAT)
Creation Date: more than 1 year ago
Format: .csv .tsv .json .xml
Metadata: Provided
URI: IscrittiStranieriAA
Initinere 2016
ID: 2016
Creation Date: 13 November 2022
Format: .csv
Metadata: Provided
URI: Initinere2016
Initinere 2017
ID: 2017
Creation Date: 13 November 2022
Format: .csv
Metadata: Provided
URI: Initinere2017
Initinere 2018
ID: 2018
Creation Date: 13 November 2022
Format: .csv
Metadata: Provided
URI: Initinere2018
Initinere 2019
ID: 2019
Creation Date: 13 November 2022
Format: .csv
Metadata: Provided
URI: Initinere2019
In itinere is an open data project analyzing the enrollment of international students in Italian universities over a reference period of four academic years (a.y. 2015/2016 - 2018/2019). In recent decades, Italy has stressed the importance of opening academia to international students and researchers. To this end, exchange projects have been funded and international, English-taught degree courses have been opened. Moreover, dedicated institutions have been founded, namely the "Università per Stranieri" in Perugia (as early as 1925) and Siena (1992), and the "Dante Alighieri" university in Reggio Calabria (2007).
As mentioned in the previous sections of this webpage, the main goal of the project is to analyze the factors that may influence an international student's choice of university. After analyzing the available data, we reformulated this goal as a more circumscribed research question: are scholarships and fees reliable indicators of the presence of international students in Italian universities? For the sake of simplicity (especially in the implementation of the algorithms), we use "scholarships" to cover several kinds of expenditure made by universities (including dorms, canteens, international mobility, etc.), grouped under the Italian term "diritto allo studio".
As regards the availability of open data, we relied on the catalogue dati.gov.it. This platform is part of Italy's ongoing digital transition plan, the "Piano di Crescita Digitale": it includes 56,893 datasets, mainly from local and regional administrations. National datasets are scarce (almost 8%), which makes it difficult to find exhaustive data for our project. All the source datasets were taken from USTAT, the data portal for tertiary education. Its "Open Data" section is the most up to date, yet some datasets do not provide information after a.y. 2018/19. We also considered integrating them with other sources, such as ISTAT: unfortunately, ISTAT either provides information and useful observations that refer to different academic years (often older than the ministerial ones) or groups information geographically, which is incompatible with the per-institution subdivision adopted by the MIUR datasets.
Investigating the question further requires access to more detailed data. In particular, further research would benefit from datasets including observations about:
The project was carried out by:
The workload was divided as follows:
As mentioned, the ten source datasets (downloaded as .csv files) come from the same ministerial portal, USTAT. From now on it will be cited as MIUR, after the domain of the URL, even though the former "Ministero dell'Istruzione, dell'Università e della Ricerca" has been split into the "Ministero dell'Università e della Ricerca" and the Ministry of Education (whose name has changed considerably under recent governments). As listed in the previous sections, the source datasets are:
| Topic | Portal | Dataset(s) |
|---|---|---|
| Contribuzione e interventi atenei. Contribuzione media | MIUR | 2016, 2017, 2018, 2019 |
| Diritto allo Studio Universitario (DSU) Regionale. Spesa per interventi | MIUR | 2016, 2017, 2018, 2019 |
| Iscritti. Iscritti per ateneo | MIUR | 2016-2019 |
| Iscritti. Iscritti stranieri per ateneo | MIUR | 2016-2019 |
Before the mashup, the datasets are filtered: both telematic universities and AFAM institutions (fine and performing arts academies) are excluded. Moreover, especially in the first two topics, not all the universities are present in all the datasets (which explains considerable differences in some of the visualizations): this is the case of three Roman universities,
distElement(dframe1, dframe2) in the Python script used to create temporary datasets for the linechart visualization (see §7).
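The filtering step and the check for institutions missing from a given source can be sketched as follows. This is only an illustration under stated assumptions, not the project's `distElement` implementation: the function names, the `kind` labels and the sample rows are hypothetical, and the actual script works on data frames rather than plain lists.

```python
def keep_in_scope(rows, excluded_kinds=("telematic", "AFAM")):
    """Drop institutions whose (hypothetical) 'kind' label is excluded."""
    return [row for row in rows if row["kind"] not in excluded_kinds]


def missing_institutions(reference, other):
    """Return names present in `reference` but absent from `other`, sorted."""
    other_names = {row["university"] for row in other}
    return sorted({row["university"] for row in reference} - other_names)


# Toy rows; names and labels are illustrative only.
students = [
    {"university": "Bologna", "kind": "state"},
    {"university": "Roma Link Campus", "kind": "private"},
    {"university": "Hypothetical Online Univ.", "kind": "telematic"},
]
scholarships_2016 = [{"university": "Bologna", "kind": "state"}]

in_scope = keep_in_scope(students)
print(missing_institutions(in_scope, scholarships_2016))
```

Running the check once per year and per source gives a quick list of the gaps that later show up as differences between visualizations.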
The four output datasets are obtained using the algorithms described in this Jupyter Notebook: please refer to that document for a more precise description of the rationale and of the computational steps behind this phase. We decided to keep the diachronic distinction in the output datasets as well: as a consequence, the association between source and output datasets is the following (indicated by reference year).
| Output | Scholarships | Paid fees | Students | International |
|---|---|---|---|---|
| 2016 | 2016 | 2016 | 2016-2019 | 2016-2019 |
| 2017 | 2017 | 2017 | 2016-2019 | 2016-2019 |
| 2018 | 2018 | 2018 | 2016-2019 | 2016-2019 |
| 2019 | 2019 | 2019 | 2016-2019 | 2016-2019 |
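The association in the table can be read as a join: for each reference year, the yearly fee and scholarship sources are combined with the matching year of the two multi-year student sources. The sketch below illustrates that idea with plain dictionaries. The field names, the invented figures and the way the percentage is computed (international over total enrolled) are assumptions; the notebook remains the reference for the actual steps.

```python
def mash_up(year, fees, scholarships, students, international):
    """Join the four sources for one reference year, keyed by university.

    `fees` and `scholarships` map university -> value for that year only;
    `students` and `international` map (university, year) -> head count.
    """
    rows = []
    for university in sorted(fees):
        total = students.get((university, year))
        foreign = international.get((university, year))
        if university not in scholarships or not total or foreign is None:
            continue  # skip institutions missing from any source
        rows.append({
            "university": university,
            "year": year,
            "fees": fees[university],
            "scholarships": scholarships[university],
            "international_pct": round(100 * foreign / total, 2),
        })
    return rows


# Invented figures, for illustration only.
fees_2016 = {"Univ A": 1200.0, "Univ B": 900.0}
scholarships_2016 = {"Univ A": 350000.0}
students = {("Univ A", 2016): 2000, ("Univ B", 2016): 1000}
international = {("Univ A", 2016): 150, ("Univ B", 2016): 40}

result = mash_up(2016, fees_2016, scholarships_2016, students, international)
print(result)
```

In this toy run "Univ B" is dropped because it has no scholarship record, which mirrors how an institution absent from one source cannot appear in that year's output.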
The qualitative analysis of the datasets follows the guidelines available on the official Italian government portal "Docs Italia". The four reference categories are accuracy, consistency, completeness and timeliness. They are drawn from AgID Determinazione Commissariale n. 68/2013, which states that (the description of the categories has been translated into English here):
In relation to the specific context of use and to the purposes pursued by the regulation, critical databases must ensure the intrinsic value of the data, so that the attributes of the data are adequate with respect to the "inherent" characteristics defined within the aforementioned ISO/IEC 25012 standard, summarized below:
- accuracy: the data, and its attributes, correctly represents the real value of the concept or event to which it refers;
- currentness (or timeliness of update): the data, and its attributes, is of the "right time" (is up to date) with respect to the procedure to which it refers;
- consistency: the data, and its attributes, does not contradict other data in the context of use of the administration that holds it;
- completeness: the data is exhaustive for all its expected values and with respect to the related entities (sources) that contribute to the definition of the procedure
As stated on the aforementioned portal, it is guaranteed that all the provided data respect these four criteria. Indeed, the datasets are generally of good quality, yet minor flaws can be highlighted. The following table provides a concise description (datasets are analyzed by type and the chronological distinction is not reflected, as data concerning the same field appear to be consistent):
| Dataset | Accuracy | Consistency | Completeness | Timeliness |
|---|---|---|---|---|
| Scholarships | Only in some cases are branch campuses considered separately from the main campus | Missing information about Roma - Link Campus University (2016, 2017) and Roma - Università Europea (2016) | No information available after a.y. 2018/2019 | |
| Fee | No distinction between a university's branch campuses and its main campus | No information available after a.y. 2018/2019 | | |
| Total students | | | | |
| International students | | | | |
To analyse the legal aspects of the original datasets used for this project, the following checklist was adopted to evaluate specific topics: privacy issues, IPR policy, licenses, limitations on public access, economic conditions, and temporal aspects of the dataset.
| To check | Paid Fees | Scholarships | Total Students | International Students |
|---|---|---|---|---|
| Free of any personal data as defined in the Regulation (EU) 2016/679? | Yes | Yes | Yes | Yes |
| Free of any indirect personal data that could be used for identifying the natural person? | Yes | Yes | Yes | Yes |
| Free of any particular personal data (art. 9 GDPR)? | Yes | Yes | Yes | Yes |
| Free of any information that combined with common data available in the web, could identify the person? | Yes | Yes | Yes | Yes |
| Free of any information related to human rights? | Yes | Yes | Yes | Yes |
| Do you use a tool for calculating the range of the risk of de-anonymization? | Not needed | Not needed | Not needed | Not needed |
| Are you using geolocalization capabilities? | Yes | Yes | Yes | Yes |
| Did you check that the open data platform respects all the privacy regulations? | Yes | Yes | Yes | Yes |
| Do you know who the Controller and Processor of the system's privacy data are in your open data platform? | Yes | Yes | Yes | Yes |
| Have you checked the privacy regulation of the country where the datasets are physically stored? | Yes | Yes | Yes | Yes |
| Do you have non-personal data? | Yes | Yes | Yes | Yes |
| To check | Paid Fees | Scholarships | Total Students | International Students |
|---|---|---|---|---|
| Have you created and generated the dataset? | No | No | No | No |
| Are you the owner of the dataset? | No | No | No | No |
| Are the datasets free from third party licenses or patents? | Yes | Yes | Yes | Yes |
| Have you checked whether there are limitations in your national legal system for releasing some kinds of datasets with an open license? | Yes | Yes | Yes | Yes |
| To check | Paid Fees | Scholarships | Total Students | International Students |
|---|---|---|---|---|
| Do you release the dataset with an open data license? | Yes | Yes | Yes | Yes |
| Do you include the clause: "In any case the dataset can’t be used for re-identifying the person"? | No | No | No | No |
| Do you release the API (in case you have one) with an open source license? | Not needed | Not needed | Not needed | Not needed |
| Do you check that the open data/API platform license regime is compliant with your IPR policy? | Not needed | Not needed | Not needed | Not needed |
| To check | Paid Fees | Scholarships | Total Students | International Students |
|---|---|---|---|---|
| Do you check that the dataset concerns your institutional competences, scope and finality? | Yes | Yes | Yes | Yes |
| Do you check the limitations for the publication stated by your national legislation or by the EU directives? | Yes | Yes | Yes | Yes |
| Do you check if there are some limitations connected to the international relations, public security or national defence? | Yes | Yes | Yes | Yes |
| Do you check if there are some limitations concerning the public interest? | Yes | Yes | Yes | Yes |
| Do you check the international law limitations? | Yes | Yes | Yes | Yes |
| Do you check the INSPIRE law limitations for the spatial data? | Yes | Yes | Yes | Yes |
| To check | Paid Fees | Scholarships | Total Students | International Students |
|---|---|---|---|---|
| Do you check that the dataset could be released for free? | Yes | Yes | Yes | Yes |
| Do you check if there are agreements with other partners in order to release the dataset at a reasonable price? | Not needed | Not needed | Not needed | Not needed |
| Do you check if the open data platform terms of service include a clause of “non liability agreement” regarding the dataset and API provided? | Yes | Yes | Yes | Yes |
| In case you decide to release the dataset at a reasonable price, do you check if the limitations imposed by the new directive 2019/1024/EU are respected? | Not needed | Not needed | Not needed | Not needed |
| In case you decide to release the dataset at a reasonable price, do you check the e-Commerce directive and regulation? | Not needed | Not needed | Not needed | Not needed |
| To check | Paid Fees | Scholarships | Total Students | International Students |
|---|---|---|---|---|
| Do you have a temporal policy for updating the dataset? | No | No | No | No |
| Do you have some mechanism for informing the end-user that the dataset is updated at a given time to avoid misuse and so potential risk of damage? | No | No | No | No |
| Did you check if the dataset for some reason can’t be indexed by search engines (e.g. Google, Yahoo, etc.)? | Yes | Yes | Yes | Yes |
| In case of personal data, do you have a reasonable technical mechanism for collecting request of deletion (e.g. right to be forgotten)? | Not needed | Not needed | Not needed | Not needed |
Lastly, a fundamental aspect of the legal analysis concerned the choice of the license under which to publish the project and its data; therefore, the original licenses of the analysed datasets were considered first. Despite having all been published by the same organization, the datasets carry two different licenses: IODL v2.0 for the datasets on Paid Fees and Scholarships, and A1 Public Domain for Students and International Students.
Since these licenses are as restrictive as (IODL v2.0) or less restrictive than (A1 Public Domain) CC-BY 4.0, we decided to publish the output datasets under a CC-BY 4.0 license, which requires only attribution.
To compare the different possible licenses, we referred to the specific documentation of each and to the Licensing Assistant tool provided by data.europa.eu.
| Dataset | Original License | Output License |
|---|---|---|
| MIUR-Paid Fees (2016-19) | IODL v2.0 | CC-BY 4.0 |
| MIUR-Scholarships (2016-19) | IODL v2.0 | CC-BY 4.0 |
| MIUR-Total Students | A1 Public Domain | CC-BY 4.0 |
| MIUR-International Students | A1 Public Domain | CC-BY 4.0 |
As seen in the previous sections, each dataset analysed and re-used in this project was collected from the MIUR-USTAT Open Data platform, which gathers data from Italian universities and publishes them in compliance with Legislative Decree no. 33 of 14 March 2013 on publicity, transparency and dissemination of information by public administrations. As declared on their webpage, these data are undoubtedly extremely sensitive, covering aspects related to the gender, age, residence and citizenship of university students, and the ethical aspects of their handling have to be carefully considered.
As far as we could see, no specific ethical issue was encountered, yet it is worth analysing each dataset in detail:
All original datasets considered for this project were provided by the MIUR-USTAT Open Data Portal and are all described following the same metadata schema, which includes notable temporal information (publication date and subsequent modifications). Nonetheless, other pieces of information were missing, such as license and creator: they need to be retrieved elsewhere on the website, which hinders accessibility for users.
An overview of the results of the metadata analysis is shown below, highlighting weaknesses and differences between the datasets.
| Metadata field | MIUR-Paid Fees 2016 | MIUR-Paid Fees 2017 | MIUR-Paid Fees 2018 | MIUR-Paid Fees 2019 | MIUR-Scholarships 2016 | MIUR-Scholarships 2017 | MIUR-Scholarships 2018 | MIUR-Scholarships 2019 | MIUR-Total Students 2016-19 | MIUR-International Students 2016-19 |
|---|---|---|---|---|---|---|---|---|---|---|
| distribution format | ||||||||||
| license | ||||||||||
| last modify | ||||||||||
| creator | ||||||||||
| format | ||||||||||
| creation time | ||||||||||
| media type | ||||||||||
| datastore active | ||||||||||
| has views | ||||||||||
| id | ||||||||||
| last modified | ||||||||||
| license type | ||||||||||
| on same domain | ||||||||||
| package id | ||||||||||
| position | ||||||||||
| revision id | ||||||||||
| state | ||||||||||
| url type | ||||||||||
All data and information available on this website have been published under a CC-BY 4.0 license and are compliant with the FAIR principles:
Since the project was developed as a final project for the course "Open Access and Digital Ethics" (MA "Digital Humanities and Digital Knowledge"), University of Bologna, a.y. 2022/2023, there is no current intention of updating the resource in the future. By contrast, the source of the original datasets analysed here does provide annual updates and is the most reliable source of information in this field. Consequently, the team would like to direct any interested users to that platform for further reference on how these data may change in the coming years.
Even though the answer to the research question is negative, the data can be used to provide further visualizations. The following section proposes different graphic interpretations of the data: they make it possible to analyse macro- and microtrends on both a chronological and a geographical basis. Unless otherwise stated, they were built with the Google Developers Charts library: algorithms based on this library and on D3.js allow the user to interactively change the displayed information. Please note that most of the charts combine variables with different units of measurement (euros and percentages): for the sake of readability, this made a secondary vertical axis necessary. Unfortunately (especially in time series), this could lead to visualization biases: apparently sharp trends may simply result from the different reference values on the second y axis.
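One way to cross-check a trend that looks sharp on a secondary axis is to rescale each series to a common base, so that euros and percentages can be compared as relative changes. The snippet below is a minimal sketch of that idea and is not part of the project's charts; the function name and the figures are invented for illustration.

```python
def index_to_base(values, base=100.0):
    """Rescale a series so that its first value equals `base`."""
    first = values[0]
    return [round(base * v / first, 1) for v in values]


# Invented series for one university over the four reference years.
fees_eur = [1000.0, 1050.0, 1100.0, 1200.0]
international_pct = [5.0, 5.5, 5.0, 6.0]

fees_index = index_to_base(fees_eur)
pct_index = index_to_base(international_pct)
print(fees_index)
print(pct_index)
```

Both indexed series share one scale, so a steep line reflects a large relative change rather than the range chosen for the second y axis.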
Scatterplot:
Select the variables and the reference year. Then click on the button to visualize the result.
| X Axis: | |
|---|---|
| Y Axis: | |
| Year: |
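Alongside the scatterplot, the strength of a linear relationship between two of the selectable variables can be summarized with a correlation coefficient. The sketch below computes Pearson's r by hand; whether the project used this particular measure is an assumption, and the values are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented values: average fee (euros) and share of international students (%).
fees = [500.0, 1000.0, 1500.0, 2000.0]
international_pct = [4.0, 8.0, 4.0, 8.0]

r = pearson(fees, international_pct)
print(round(r, 3))
```

Values of r close to 0 indicate no linear relationship, while values close to 1 or -1 indicate a strong one; the coefficient says nothing about non-linear patterns, which is where the scatterplot itself remains useful.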
Timeseries:
Select a university and visualize the change of the observations during the reference period.
The creation of the source dataset is implemented in this Python file. The code is also responsible for generating some of the HTML option tags.
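As a rough idea of what such a script might do, the sketch below builds the option tags for the university selector and extracts one university's series. It is not the project's file: `build_options`, `time_series` and the row layout are hypothetical names chosen for this example.

```python
from html import escape


def build_options(universities):
    """Return one <option> tag per university, sorted alphabetically."""
    return "\n".join(
        f'<option value="{escape(name, quote=True)}">{escape(name)}</option>'
        for name in sorted(set(universities))
    )


def time_series(rows, university):
    """Collect (year, value) pairs for one university, ordered by year."""
    return sorted(
        (r["year"], r["value"]) for r in rows if r["university"] == university
    )


# Invented observations, for illustration only.
rows = [
    {"university": "Univ B", "year": 2017, "value": 6.1},
    {"university": "Univ A", "year": 2016, "value": 4.0},
    {"university": "Univ B", "year": 2016, "value": 5.9},
]
options = build_options(r["university"] for r in rows)
print(options)
print(time_series(rows, "Univ B"))
```

Escaping the names matters because some institution names contain characters such as quotes or ampersands that would otherwise break the generated markup.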
Cluster:
A new dataset was created by averaging the values of the time series. This clustering is interesting, especially if we compare the composition of the first cluster with that of the other two: public universities tend to populate the former group, while private ones populate the latter two.
Note that only the institutions present in all four datasets have been considered for clustering (Roma Europea, Roma Link Campus and Roma Saint Camillus are hence excluded).
Please read the following Jupyter Notebook for the reference code and the 3D visualization of the clusters
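The preparation step described in this section (averaging over the four years and keeping only institutions present in every year) can be sketched as below. The data layout and the function name are assumptions made for this example; the clustering itself is not reproduced here and is documented in the notebook.

```python
from statistics import mean

YEARS = (2016, 2017, 2018, 2019)


def average_complete_series(by_year):
    """Average each university's values over YEARS.

    `by_year` maps year -> {university: value}. Universities missing from
    any year are left out, mirroring the exclusion described above.
    """
    complete = set.intersection(*(set(by_year[y]) for y in YEARS))
    return {u: mean(by_year[y][u] for y in YEARS) for u in sorted(complete)}


# Invented percentages of international students.
pct = {
    2016: {"Univ A": 4.0, "Univ B": 10.0},
    2017: {"Univ A": 5.0, "Univ B": 11.0},
    2018: {"Univ A": 6.0, "Univ B": 12.0, "Univ C": 20.0},
    2019: {"Univ A": 7.0, "Univ B": 13.0, "Univ C": 21.0},
}
averages = average_complete_series(pct)
print(averages)
```

Repeating the same averaging for each variable yields one point per institution, which is the kind of input a clustering algorithm expects.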
Heatmap:
These maps can be visualized at the following links (they will redirect you to a new window):
Please read the following Jupyter Notebook for the reference code and customizable visualization of the heatmaps
The metadata of the mashed-up datasets have been described following DCAT-AP version 2.0.0, a specification based on the DCAT RDF vocabulary and designed to facilitate interoperability between data catalogs published on the Web. The resulting Turtle serialization can be accessed and downloaded by selecting the item of interest in the following list.
The datasets have been described both individually, each as a dcat:Dataset, and as a whole, as a dcat:Catalog gathering them all. The main metadata properties used are summarized in the next table, with specific attention to including both mandatory elements and as many recommended and optional properties as possible to meaningfully enrich the collection.
| Property group | Catalog | Datasets |
|---|---|---|
| Identifiers | dcterms:identifier, dcterms:title | dcterms:identifier, dcterms:title |
| Description | dcterms:description, dcat:keyword | dcterms:description, dcat:keyword |
| Temporal | dcterms:issued, dcterms:modified | dcterms:issued, dcterms:modified, dcterms:temporal |
| Spatial | dcterms:spatial, prov:wasDerivedFrom | |
| Composition | dcat:dataset | |
| Agents | dcterms:publisher, dcterms:creator | dcterms:publisher, dcterms:creator |
| Legal | dcterms:rights, dcterms:license | dcterms:license, dcterms:rightsHolder |
| Distribution | | dcat:distribution |
| Language | dcterms:language | dcterms:language |
| Web | foaf:homepage | |
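To give an idea of what a dataset description built from these properties looks like, the sketch below assembles a minimal Turtle fragment with plain string formatting. It covers only a few of the properties in the table, the `example.org` URI is a placeholder, and the function name is hypothetical; the published Turtle files remain the reference. A dedicated RDF library would be preferable in practice, since this sketch does not escape special characters in literals.

```python
PREFIXES = """@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
"""


def dataset_to_turtle(uri, identifier, title, issued, license_uri):
    """Serialize a minimal dcat:Dataset description as Turtle."""
    return PREFIXES + f"""
<{uri}> a dcat:Dataset ;
    dcterms:identifier "{identifier}" ;
    dcterms:title "{title}"@en ;
    dcterms:issued "{issued}"^^xsd:date ;
    dcterms:license <{license_uri}> .
"""


turtle = dataset_to_turtle(
    uri="https://example.org/dataset/Initinere2016",  # placeholder base URI
    identifier="2016",
    title="Initinere 2016",
    issued="2022-11-13",
    license_uri="https://creativecommons.org/licenses/by/4.0/",
)
print(turtle)
```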