Dear Ioakim,

As Dr. Karounos wrote, Python provides helpful libraries both for machine
learning (scikit-learn) and for text processing and NLP (e.g. NLTK). Java
can definitely be used instead.

Candidate references to Greek government entities in the text (these can be
the named entities, e.g. General Secretariat of ..., or Mayor of ...) can be
found either with regular expressions or with machine learning (once
training samples are available), and the same holds for the assigned
responsibilities. More details can be discussed once the project begins.

Stanford's CoreNLP https://stanfordnlp.github.io/CoreNLP/ and Apache's
OpenNLP https://opennlp.apache.org/ are the two tools to check if you are
going to work with Java.

Iraklis

On Sat, Mar 24, 2018 at 6:20 PM, Theodoros G. Karounos <t [ dot ] karounos [ at ] gmail [ dot ] com > wrote:

> Please find my answers in-line.
>
> 2018-03-24 11:08 GMT+02:00 ioaktheo <ioaktheo [ at ] teiser [ dot ] gr>:
>
>> Dear Sirs,
>>
>> I am writing to you regarding my interest in the project named
>> «Extraction of Responsibilities per unit in public sector
>> organizations from the Government Gazette». Having read through the
>> details of the project, I would like to ask some questions so that I can
>> better understand the requirements. I would be very grateful if you
>> could answer them before I submit my proposal.
>>
>> First, I see that the knowledge prerequisites include Python, Java and
>> Machine Learning. I am more familiar with Java, Machine Learning and
>> Data Mining. I haven't worked with Python, but I am willing to learn and
>> practise the language before Google Summer of Code starts. Is Python
>> going to be used for Machine Learning purposes?
>>
>
> *Python is preferred for machine learning, but Java does the job as well.*
>
>> Secondly, am I right in understanding that using Machine Learning to
>> automatically find and match «specific Named Entities types with
>> references to assigned responsibilities-services per unit and links
>> between the two must be extracted» is one of the main issues of this
>> project?
>>
>
> *Yes, one of the main tasks of this project is to extract, in hierarchical
> order, the assigned responsibilities-services for each unit of an
> institution from the text of the PDFs of the laws that define the
> governance of Greek government entities.*
>
>>
>> If so, am I right in thinking the steps required include: preprocessing
>> the data, data integration, hierarchical or partitioned clustering,
>> categorization and correlation rules?
>>
>
> *Yes, this is the approach in a few words; you should expand on it in your
> proposal. But we will discuss this extensively with all the mentors (
> https://ellak.gr/wiki/index.php?title=GSOC2018_Projects#Extraction_of_Responsibilities_per_unit_in_public_sector_organizations_from_the_Government_Gazette
> ) once we have the project approved.*
>
>>
>> Finally, I am a bit confused about the NER module. Is there any more
>> information on this subject?
>>
>
> *Please read this ( https://nlp.stanford.edu/software/CRF-NER.html ).
> There are plenty more resources; search Google Scholar (
> https://scholar.google.gr/scholar?hl=el&as_sdt=0%2C5&q=Named+Entity+Recognizer&btnG=
> ), etc.*
>
>
>>
>> Thank you in advance.
>> Best regards
>> Ioakeim
>>
>>
>> ----
>> You are receiving this message from the list: A general discussion list
>> for developers/contributors of open-source projects,
>> https://lists.ellak.gr/opensource-devs/listinfo.html
>>
>> You can unsubscribe from the list by sending an empty email to
>> <opensource-devs+unsubscribe [ at ] ellak [ dot ] gr>.
>>
>
>
> --
> Jiddu Krishnamurti: If we can really understand the problem, the answer
> will come out of it, because the answer is not separate from the problem.
>
> http://karounos.gr/blog/, Key-ID: 85AE3458
>
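
As a rough illustration of the regex route mentioned above for finding
candidate entity references, here is a minimal sketch in Python. The trigger
phrases, the find_candidates helper and the English sample sentence are all
assumptions made for illustration; real Government Gazette text is Greek, so
the trigger list and the capital-letter character class would have to be
adapted, and the matches are only candidates to be filtered later.

```python
import re

# Hypothetical trigger phrases (assumption); a real list would be collected
# from Government Gazette documents and written in Greek.
TRIGGERS = ["General Secretariat of", "Mayor of", "Ministry of", "Directorate of"]

# A trigger phrase followed by one or more capitalized words, which may be
# joined by the connectors "and", "of" or "for".
CANDIDATE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in TRIGGERS) + r")"
    r"(?:\s+(?:(?:and|of|for)\s+)?[A-Z][\w-]*)+"
)


def find_candidates(text):
    """Return the candidate entity mentions found in text, in order."""
    return [m.group(0) for m in CANDIDATE_RE.finditer(text)]


if __name__ == "__main__":
    sample = (
        "The General Secretariat of Public Revenue reports to the "
        "Ministry of Finance. The Mayor of Athens signs the decision."
    )
    for mention in find_candidates(sample):
        print(mention)
```

Matches collected this way could also serve as a first batch of training
samples to label for a machine learning model later on.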