The LA Referencia statistics system runs on shared infrastructure in Amazon Web Services (AWS), maintained as part of the services provided by LA Referencia thanks to contributions from member countries and SCOSS support.

The infrastructure is based on a set of open components published on GitHub as part of the commitment to contribute to the Global Open Science Ecosystem.

Database, management, and identifier normalization library components
Storage and event preservation components
Processing, cleaning, normalization, and event aggregation components
Web services components for repositories and aggregators

The source code for all the components is available on GitHub.

Database, management, and identifier normalization library components

Usage Statistics Service DB

Access to code and installation manuals

https://github.com/lareferencia/lareferencia-usage-stats-db

Administration and Orchestration System

Access to code and installation manuals

https://github.com/lareferencia/lareferencia-usage-stats-admin

Storage and event preservation component

AWS S3 Storage and Matomo to S3

Access to code and installation manuals

https://github.com/lareferencia/lareferencia-usage-stats-processor

Processing, cleaning, normalization, and event aggregation component

Processing software developed in Python that filters and normalizes the information stored as Parquet files in S3 and then persists it in Elasticsearch/OpenSearch indices.

Loading Parquet files from Amazon S3
This step loads the Parquet files for a specific repository and a given date. During this process, user session data and the associated events are extracted.

Bot filtering
The purpose of this step is to improve the reliability of the statistical data. The filter identifies and removes sessions and events generated by bots, so that only traffic from real users is analyzed.
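
As a rough illustration of the bot filtering step above, the following minimal sketch flags sessions by user agent. The field names (`user_agent`, `events`) and the short pattern list are assumptions made for this example and are not taken from the processor's code; a production filter would rely on a maintained robots list.

```python
import re

# Hypothetical pattern list, for illustration only.
BOT_PATTERN = re.compile(r"bot|crawler|spider|curl|wget", re.IGNORECASE)


def is_bot(session):
    """Return True when the session's user agent looks automated."""
    user_agent = session.get("user_agent") or ""
    # In this sketch an empty user agent is also treated as a bot.
    return not user_agent or bool(BOT_PATTERN.search(user_agent))


def filter_bots(sessions):
    """Keep only the sessions (and therefore their events) not flagged as bots."""
    return [s for s in sessions if not is_bot(s)]


sessions = [
    {"id": "s1", "user_agent": "Mozilla/5.0 (X11; Linux x86_64)", "events": ["view"]},
    {"id": "s2", "user_agent": "Googlebot/2.1 (+http://www.google.com/bot.html)", "events": ["view"]},
    {"id": "s3", "user_agent": "", "events": ["download"]},
]
human_sessions = filter_bots(sessions)
```

Dropping the whole session removes its events with it, which matches the idea of filtering sessions and events together.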

Asset filtering
Like the previous step, this phase further improves the quality of the statistical data. It detects and removes erroneous events such as “thumbnail downloads”, which occur when the statistics collector incorrectly registers a thumbnail view as a download.
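
The asset filter described above could be sketched as follows. The event fields (`action`, `url`) and the thumbnail heuristics (file suffixes such as `.pdf.jpg` and a `/thumbnail/` path segment) are assumptions for illustration; the actual rules used by the processor may differ.

```python
# Hypothetical suffixes that suggest a generated thumbnail rather than a real file.
THUMBNAIL_SUFFIXES = (".pdf.jpg", ".jpg.jpg", ".png.jpg")


def is_thumbnail_download(event):
    """Return True for download events whose URL looks like a thumbnail."""
    if event.get("action") != "download":
        return False
    # Ignore the query string and letter case when inspecting the URL.
    url = (event.get("url") or "").lower().split("?", 1)[0]
    return url.endswith(THUMBNAIL_SUFFIXES) or "/thumbnail/" in url


def filter_assets(events):
    """Drop events that were wrongly recorded as downloads of thumbnails."""
    return [e for e in events if not is_thumbnail_download(e)]


events = [
    {"action": "download", "url": "https://repo.example.org/bitstream/1/2/thesis.pdf"},
    {"action": "download", "url": "https://repo.example.org/bitstream/1/2/thesis.pdf.jpg?sequence=3"},
    {"action": "view", "url": "https://repo.example.org/handle/1/2"},
]
clean_events = filter_assets(events)
```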

Metric calculation
In this step, the views, downloads, and links associated with a specific session and item identifier are calculated. Additionally, a new metric called “conversions” is introduced, which is based on combinations of views with downloads or views with links.
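
A minimal sketch of the metric calculation above, counting actions per session and identifier. The action names (`view`, `download`, `outlink`) and field names are hypothetical, and the conversion rule used here (one conversion when a session has at least one view combined with a download or a link for the same item) is only one possible reading of the description, not the confirmed definition.

```python
from collections import Counter, defaultdict


def session_metrics(events):
    """Count actions per (session_id, identifier) pair and derive conversions."""
    counts = defaultdict(Counter)
    for event in events:
        counts[(event["session_id"], event["identifier"])][event["action"]] += 1

    metrics = {}
    for key, c in counts.items():
        views, downloads, outlinks = c["view"], c["download"], c["outlink"]
        # Assumed rule: a view combined with a download or an outlink converts.
        conversions = 1 if views > 0 and (downloads > 0 or outlinks > 0) else 0
        metrics[key] = {
            "views": views,
            "downloads": downloads,
            "outlinks": outlinks,
            "conversions": conversions,
        }
    return metrics


events = [
    {"session_id": "s1", "identifier": "oai:repo:1", "action": "view"},
    {"session_id": "s1", "identifier": "oai:repo:1", "action": "download"},
    {"session_id": "s1", "identifier": "oai:repo:2", "action": "view"},
    {"session_id": "s2", "identifier": "oai:repo:1", "action": "download"},
]
metrics = session_metrics(events)
```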

Data aggregation
This phase aggregates data by item (identifier), calculating views, downloads, links, and conversions of an item, regardless of the sessions that accessed it. These data are also aggregated by the country of origin of the event.
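
The aggregation described above can be sketched as a sum over per-session rows, keyed first by item and then by country. The row layout and the metric names are assumptions made for this example.

```python
METRICS = ("views", "downloads", "outlinks", "conversions")


def aggregate_by_item(rows):
    """Sum per-session metrics into per-item totals and per-country totals."""
    totals = {}
    for row in rows:
        item = totals.setdefault(
            row["identifier"],
            {"total": dict.fromkeys(METRICS, 0), "by_country": {}},
        )
        country = item["by_country"].setdefault(
            row["country"], dict.fromkeys(METRICS, 0)
        )
        for metric in METRICS:
            item["total"][metric] += row[metric]
            country[metric] += row[metric]
    return totals


rows = [
    {"identifier": "oai:repo:1", "country": "AR", "views": 2, "downloads": 1, "outlinks": 0, "conversions": 1},
    {"identifier": "oai:repo:1", "country": "BR", "views": 1, "downloads": 0, "outlinks": 1, "conversions": 1},
    {"identifier": "oai:repo:1", "country": "AR", "views": 1, "downloads": 0, "outlinks": 0, "conversions": 0},
]
totals = aggregate_by_item(rows)
```

Session identifiers play no role at this stage, which reflects the point that item totals are independent of the sessions that accessed the item.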

Identifier normalization
The goal of this step is to homogenize and standardize the item identifiers coming from different repositories, improving data consistency.
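
To make the idea of identifier normalization concrete, here is a small sketch that assumes identifiers of the form `oai:<domain>:<local id>`. The specific rules (trimming whitespace, lowercasing the `oai` scheme and the domain, leaving the local part untouched) are illustrative assumptions and not the rules implemented in the LA Referencia libraries.

```python
def normalize_identifier(raw):
    """Return a canonical form for OAI-style identifiers; pass others through."""
    value = raw.strip()
    parts = value.split(":", 2)
    if len(parts) == 3 and parts[0].lower() == "oai":
        _, domain, local_id = parts
        # Only the scheme and the domain are lowercased; the local id is kept as is.
        return f"oai:{domain.lower()}:{local_id}"
    # Identifiers in any other format are only trimmed in this sketch.
    return value


normalized = normalize_identifier("  OAI:Repo.Example.org:123/456 ")
```

With rules like these, two spellings of the same identifier collapse to a single key before aggregation and indexing.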

Elasticsearch/OpenSearch indexing
In this final step, the data receives its last adjustments so that it can be indexed effectively and efficiently in OpenSearch or Elasticsearch.
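
As an illustration of preparing documents for indexing, the sketch below builds a newline-delimited body in the format accepted by the `_bulk` endpoint of Elasticsearch and OpenSearch (an action line followed by the document, with a trailing newline). The index name, the document fields, and the `_id` scheme are hypothetical, and the sketch only builds the payload without sending it.

```python
import json


def to_bulk_body(docs, index_name):
    """Serialize documents as NDJSON action/document pairs for a _bulk request."""
    lines = []
    for doc in docs:
        # Hypothetical id scheme: one document per item and day.
        doc_id = f'{doc["identifier"]}|{doc["date"]}'
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"identifier": "oai:repo:1", "date": "2024-01-15", "views": 4, "downloads": 1},
    {"identifier": "oai:repo:2", "date": "2024-01-15", "views": 1, "downloads": 0},
]
body = to_bulk_body(docs, "usage-stats-example")
```

A deterministic `_id` like the one assumed here means that reprocessing the same day overwrites documents instead of duplicating them.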

S3 to Elastic/OpenSearch Pipeline

Access to code and installation manuals

https://github.com/lareferencia/lareferencia-usage-stats-processor

Web services components for repositories and aggregators

Usage Statistics Web Services

Access to code and installation manuals

https://github.com/lareferencia/lareferencia-usage-services