ΕΕΛΛΑΚ - Λίστες Ταχυδρομείου

Re: Flexible GovDoc Scanner

Subject: Re: Flexible GovDoc Scanner
From: "Giannis E. Skitsas" <iskitsas [ at ] gmail [ dot ] com>
Date: Mon, 7 Apr 2025 09:27:59 +0300

Hi Bekarys,

Feel free to join our discord server <https://discord.gg/EgerE5nzvk> for
more information and discussions on GovDoc Scanner. We have also
created a Frequently
Asked Questions <https://github.com/flexivian/govdoc-scanner/discussions/1> in
the Github repository discussions section. You can find most asked
questions there.

Regarding your question, robots.txt on https://publicity.businessportal.gr/
stops us from crawling on this website, so you can update/modify your
proposals assuming that the crawler is out of scope. Alternatively, we can
provide you with some specific PDFs, e.g. from 20 companies, downloaded
directly from GEMI portal, and we can store them in our own storage server
to be used by the scanner tool.

In order to find a PDF for a company you can follow those steps:
1. Navigate to https://publicity.businessportal.gr/
2. Enter 'udev' in the search bar and select the company

[image: image.png]
3. Select "History of Greek Business Registry announcements" ("Ιστορικό
ανακοινώσεων Γ.Ε.ΜΗ." in Greek) to find related PDFs
[image: image.png]

4. Download the PDF 'Announcement of establishment by OSS'

[image: image.png]

The goal is to be able to parse specific data from PDFs (expecting various
templates to be used by the lawyers of each company) and extract
information for legal representatives, eg. I was representing UDEV, and
this info can be found under section 6

[image: image.png]

So the challenge here is the language, various doc templates, the specific
info we want to extract from files which probably have been exported to
PDFs in various ways, and on top of that to keep version history of the
extracted data.

Regards,
Giannis
--
Ioannis Skitsas

On Thu, Apr 3, 2025 at 5:22 PM Бекарыс Сериков <bekaryss03 [ at ] gmail [ dot ] com> wrote:

> Hello, I’m Bekarys Serikov, a third-year student at Nazarbayev University,
> and I’m excited about contributing to the "Flexible GovDoc Scanner" project
> for GSoC 2025. I’ve been exploring the ΓΕΜΗ portal (
> publicity.businessportal.gr) to understand the crawling task, which
> mentions gathering "PDF documents." However, when I search the site, I
> mostly see web-based results rather than PDFs. Could someone clarify where
> these PDF documents are located or how they’re accessed on the portal? I’d
> really appreciate any guidance!
>
> Sincerely,
> Bekarys Serikov
> ----
> Λαμβάνετε αυτό το μήνυμα απο την λίστα: Λίστα αλληλογραφίας και συζητήσεων
> που απευθύνεται σε φοιτητές developers \& mentors έργων του Google Summer
> of Code - A discussion list for student developers and mentors of Google
> Summer of Code projects.,
> https://lists.ellak.gr/gsoc-developers/listinfo.html
> Μπορείτε να απεγγραφείτε από τη λίστα στέλνοντας κενό μήνυμα ηλ.
> ταχυδρομείου στη διεύθυνση <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
>

----
Λαμβάνετε αυτό το μήνυμα απο την λίστα: Λίστα αλληλογραφίας και συζητήσεων που απευθύνεται σε φοιτητές developers \& mentors έργων του Google Summer of Code - A discussion list for student developers and mentors of Google Summer of Code projects.,
https://lists.ellak.gr/gsoc-developers/listinfo.html
Μπορείτε να απεγγραφείτε από τη λίστα στέλνοντας κενό μήνυμα ηλ. ταχυδρομείου στη διεύθυνση <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.

αναφορές

Flexible GovDoc Scanner, Бекарыс Сериков

πλοήγηση μηνυμάτων

προηγούμενο ημερολογιακά: Interest in Generative AI Agent for Personalized Music Recommendations
επόμενο ημερολογιακά: Zhengfei Li (Alex) eCodeOrama Proposal
προηγούμενο βάσει θέματος: Flexible GovDoc Scanner
επόμενο βάσει θέματος: Expressing Interest in Exploring Triplestore Alternatives