ΕΕΛΛΑΚ - Mailing Lists

Re: [opensource-devs] Adding Greek language to Spacy - Project Proposal

Hello again,
Thanks for the detailed and informative feedback.
I have updated my proposal to include the ideas and resources suggested by
both mentors.

Moreover, I wrote a detailed step-by-step algorithm describing how I plan to
implement sentiment analysis of Greek documents. I guess it was a bit
unclear in my previous email, so I decided to explain it further. I
looked at approaches used for other languages and adjusted mine to combine
their best features and to take advantage of the work that will be done to
integrate the Greek language into the spaCy platform (the algorithm makes
use of the dependency graph).
Note: the greek-sentiment-lexicon Mr Louridas mentioned is really useful
for the successful implementation of this approach.
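
A minimal sketch of the lexicon-plus-dependency idea, assuming a toy lexicon and hand-written parses. `TOY_LEXICON`, `NEGATIONS`, `Token`, `is_negated` and `sentence_polarity` are hypothetical names, the scores are invented for illustration, and the head indices stand in for what a dependency parser would supply. It is not the algorithm from the proposal, only one way the pieces could fit together.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Invented polarity scores in [-1, 1], keyed by lemma. A real run would load
# them from a resource such as greek-sentiment-lexicon (format not assumed here).
TOY_LEXICON: Dict[str, float] = {"καλός": 0.8, "κακός": -0.7}

# Assumed list of Greek negation particles that flip polarity.
NEGATIONS: Set[str] = {"δεν", "δε", "μην", "μη", "όχι"}


@dataclass
class Token:
    text: str
    lemma: str
    head: int  # index of the syntactic head; the root points to itself


def is_negated(tokens: List[Token], index: int) -> bool:
    """Return True if a negation particle depends on the token or on its head."""
    targets = {index, tokens[index].head}
    return any(t.lemma in NEGATIONS and t.head in targets for t in tokens)


def sentence_polarity(tokens: List[Token]) -> float:
    """Average the lexicon polarities, flipping the sign of negated words."""
    scores = []
    for index, token in enumerate(tokens):
        if token.lemma in TOY_LEXICON:
            score = TOY_LEXICON[token.lemma]
            scores.append(-score if is_negated(tokens, index) else score)
    return sum(scores) / len(scores) if scores else 0.0


# "Το φαγητό δεν είναι καλό" ("The food is not good"), parsed by hand with
# the adjective as root.
negated = [
    Token("Το", "ο", 1),
    Token("φαγητό", "φαγητό", 4),
    Token("δεν", "δεν", 4),
    Token("είναι", "είμαι", 4),
    Token("καλό", "καλός", 4),
]
# The same sentence without the negation particle.
plain = [
    Token("Το", "ο", 1),
    Token("φαγητό", "φαγητό", 3),
    Token("είναι", "είμαι", 3),
    Token("καλό", "καλός", 3),
]
print(sentence_polarity(negated), sentence_polarity(plain))
```

Looking for the negation through the dependency head, rather than among neighbouring words, is what lets the flip reach an adjective that sits several tokens away from the particle.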

As Mr Gogoulos asked, I looked further into the structure of FEK documents
and have now included some use cases and testing metrics we could use. I
also included details and examples of language-dependent punctuation
rules.

In addition, I slightly altered my approach to the lemmatizer. It is now
less data-dependent and thus more likely to work.

I also revised some parts in which I found inconsistencies and I
replaced the unreliable sources with more trustworthy ones. *Last but not
least, I noticed that, for a reason I don't understand, at some point the
abstract and the problem statement were removed from my proposal. I don't
know which version you have seen, but the current version has restored that
content.*

In general I think that the proposal is now even more detailed and easier to
follow. Bearing in mind your initial positive feedback, I feel confident
that it is ready for submission.
Of course, I am open to further suggestions and improvements, but if you
don't have any, please let me know so that I can submit it for evaluation.

Thanks again for your time and your valuable feedback. I am waiting for
your response.
Have a nice day,
Ioannis Daras

(PS: I updated the link to the proposal in order to enable comments on the
document, so feel free to comment on whatever you like).


On 18 March 2018 at 17:11, Panos Louridas <louridas [ at ] grnet [ dot ] gr> wrote:

> Two things:
>
> * There exists a Greek sentiment lexicon, developed in the context of an
> EU-funded project: https://github.com/MKLab-ITI/greek-sentiment-lexicon
>
> * The blog on Eurotalk mentions that Greek has 5 million words, while
> English has 170,000. The reference for Greek is given to be the Guinness
> Book of Records. I don’t know if it really says anything like that, but it
> is not a scientific publication. Anyway, it is not easy to count words, see
> https://www.economist.com/blogs/johnson/2010/06/counting_words. Moreover,
> if you include words like those found in the Homeric poems, I am sure you
> can come up with an impressive number for a language that is more difficult
> for most Greeks to understand than English. For what it’s worth, you can
> also look at the Wikipedia entry on the number of words in different
> languages, keeping in mind that what counts as a “word” varies:
> https://en.wikipedia.org/wiki/List_of_dictionaries_by_number_of_words.
>
> * BTW, a source of new words may be online forums. I doubt existing
> dictionaries have words like “μπανάρω”.
>
> Best,
>
> πλ
>
> > On 18 Mar 2018, at 16:19, Giannhs Daras <daras [ dot ] giannhs [ at ] gmail [ dot ] com> wrote:
> >
> > Hello!
> > I would like to thank both of you for your time and your detailed
> feedback. I really appreciate it and I will take it into account in order
> to improve my application.
> > Mr Gogoulos, thanks for your nice words. Regarding your questions:
> > 1) Fortunately, I will have finished the majority of my obligations
> before the start of GSOC. The only constraint will be my exams but this
> period lasts less than a month and with good organization of my time, I
> will manage to devote some time then. In any case, I will compensate for
> any loss of time during this period with hard work in the months following.
> > 3) As Mr Louridas noticed, there are some reliability issues with this
> source because it is not written in demotic Greek; the main reason for
> mentioning it is that it provides us with a rough, but still measurable,
> estimate of how large or small this list can be for the Greek language.
> Still, I agree with your idea; there is no point in trying to reinvent the
> wheel. We should use whatever is available in order to save time and focus
> on programming rather than collecting data. For that reason, I believe
> that we should take advantage of the organization's name in order to
> obtain the information we want; it is far more persuasive than a single
> person asking for data in digital format.
> > 4) I haven't done much research on the web but I will soon. I guess that
> for other languages the data are provided by institutions that do
> language research, but I will look into it and post an update here. As
> for my own approach, I think it will get the job done. It seems complicated
> but the idea behind it is really simple: find some origin words (which are
> easy to find on the web or in other sources) and then relate every new word
> you come across to the most similar word in the initial list, meaning
> the word with the smallest Hamming distance. For example, "αμπελοφιλοσοφία"
> may be a word in the initial list. The word "αμπελοφιλόσοφος" is not in the
> initial list, but its Hamming distance from the word in the list is small
> (smaller than from any other word in the list), so these words are going to
> be grouped together. Of course, there are cases in which this approach may
> lead to wrong groupings, but these cases are very rare compared to the
> size of the list we need to construct, so this approach is a good place to
> start. Of course, I am open to other approaches and I am going to rethink
> it; maybe I will find something else to do. But generally, I firmly
> believe that it is a promising strategy.
> > 5) Have a look here:
> https://github.com/sloria/TextBlob/blob/90cc87ab0f9e25f37379079840ec43aba59af440/textblob/en/en-sentiment.xml
> It is what TextBlob does. Each adjective has a specific polarity, which
> depends on its sense. Each word can have many different senses based on
> context. A simple "not" can change the sense of a word. I guess that we can
> find words like "not" that can change polarity, and then, using the
> dependency parser spaCy provides, we can find the sense a word has based on
> the context in which it is used. The rest is easy: we can average the
> polarities of adjectives or even use some simple classifier (take a look
> here:
> https://github.com/sloria/TextBlob/blob/90cc87ab0f9e25f37379079840ec43aba59af440/textblob/en/sentiments.py).
> > About FEK, give me some time to read about the structure of the document
> and then I will return with updates. One easy feature could be recognizing
> named entities, but I think that together we can think of something more
> sophisticated than that.
> > 6) Fixed. Sorry for that, I changed username recently.
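
The grouping idea in point 4 can be sketched as follows. This is a minimal sketch under stated assumptions, not the lemmatizer from the proposal: `padded_hamming` and `group_by_nearest_seed` are hypothetical names, and because Hamming distance is strictly defined only for strings of equal length, the sketch assumes that any difference in length counts as extra mismatches.

```python
from typing import Dict, List


def padded_hamming(a: str, b: str) -> int:
    """Count positions where a and b differ.

    Assumption: characters beyond the end of the shorter string count as
    mismatches, since plain Hamming distance needs equal lengths.
    """
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def group_by_nearest_seed(seeds: List[str], words: List[str]) -> Dict[str, List[str]]:
    """Attach every new word to the seed ("origin") word closest to it."""
    groups: Dict[str, List[str]] = {seed: [] for seed in seeds}
    for word in words:
        nearest = min(seeds, key=lambda seed: padded_hamming(word, seed))
        groups[nearest].append(word)
    return groups


# Seed list and new words; the first pair is the example from point 4.
seeds = ["αμπελοφιλοσοφία", "θάλασσα"]
new_words = ["αμπελοφιλόσοφος", "θάλασσες"]
groups = group_by_nearest_seed(seeds, new_words)
print(groups)
```

Comparing position by position is sensitive to shifts: a single letter inserted near the start of a word makes most later positions mismatch, which is one way the wrong groupings mentioned in point 4 can arise.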
> >
> > Mr Louridas, thanks for your feedback and your suggestions. They seem
> pretty interesting and I will take them into account. About your questions
> related to the sizes of the Greek and English languages, I compared these
> two sources:
> https://en.oxforddictionaries.com/explore/how-many-words-are-there-in-the-english-language
> and http://eurotalk.com/blog/2013/02/08/so-did-you-know-you-can-speak-greek/ .
> They say that the English language has 170,000 words while Greek has 5
> million. I am not quite sure about the quality of this information, but at
> first sight there is a significant difference that we should not overlook in
> the design and the implementation of the project. About the norm-exception
> numbers, as you correctly mentioned, my source is not reliable, but the
> point of mentioning it was to illustrate that the numbers of
> norm-exceptions in Greek and English are comparable. Concerning sources for
> norm-exceptions in Greek, your email is really informative. I will have a
> look at them and write back to you with my findings. I will also
> update my application in order to include the sources you mentioned (Greek
> schoolbooks, Wikipedia, Hunspell dictionary). Thanks again for your time.
> >
> > I hope that this e-mail answers your questions. I will try to include
> most of the information in my application.
> > Please don't hesitate to ask me if you have any further questions. Any
> comments to my answers are more than welcome.
> >
> > Thanks again and I look forward to hearing from you soon.
> > Kind regards,
> > Ioannis Daras
> >
> >
> > On 17 March 2018 at 23:53, Panos Louridas <louridas [ at ] grnet [ dot ] gr> wrote:
> > I second Markos’s comments and add some of my own:
> >
> > * Apart from crawlers, you can get dumps of the Greek Wikipedia to
> obtain a large Greek corpus.
> >
> > * You may also find the dictionary provided by Hunspell useful.
> >
> > * Where did you find the figure that Greek is 29 times richer than
> English?
> >
> > * Same for the number of Greek norm exceptions?
> >
> > * Also, take into account that your reference is written in
> katharevousa, not demotic Greek, and looks more oriented to phrasal verbs,
> so YMMV. In general, norm exceptions in English are an issue because of the
> global nature of the language; compare British vs. American English. Norm
> exceptions in Greek are more likely to be found in manuals of good usage,
> for example “έχω απηυδήσει” vs. “έχω απαυδήσει”. The correct form is the
> second one, contrary to what most people think. There are plenty of such
> manuals, like the one written by Prof. Babiniotis (Λεξικό των Δυσκολιών και
> των Λαθών στη Χρήση της Ελληνικής). That said, I wouldn’t lose too much
> sleep over such things; language and language rules evolve. There has been
> considerable research on this in English: "A rule is the tombstone of a
> thousand exceptions" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2460562/).
> >
> > * Regarding the lemmatizer, again Hunspell may be of use.
> >
> > * Apart from the official government gazette, another source of data is
> the official Greek schoolbooks, all of which are available online. They are
> also a good test for checking the performance on language targeted at
> different age and education levels.
> >
> > Keep up the good work,
> >
> > πλ
> >
> > > On 17 Mar 2018, at 17:44, Markos Gogoulos <mgogoulos [ at ] mist [ dot ] io> wrote:
> > >
> > > Hi Ioannis,
> > >
> > > I had a look at your presentation+cv and I have to admit I'm really
> impressed by the wide range of your research interests + knowledge.
> > > I'm very positive that you can deliver the proposal you've written. It
> is very well written and I'll try in the next few days to help improve it
> and maximize its chances of getting accepted. I'd like to encourage other
> participants on the list to do so; please have a look as well.
> > >
> > > So for now just some very quick feedback:
> > >
> > > 1) The plan seems very nice to me; a question is whether you're
> confident you'll have the time to cope with GSOC work (which is full time)
> along with your other research + entrepreneurial projects.
> > >
> > > 2) I like what you suggest about a bigger number of stop words for the
> Greek language (I was also wondering why the English version in spaCy has so
> few)
> > >
> > > 3) Regarding norm-exceptions, you mention
> http://www2.media.uoa.gr/language/gafl/details.php?bid=352 , I think that
> getting this list would save some time (plus produce accurate results) so
> may I suggest that we contact the author and kindly ask if this can be
> provided in a digital format?
> > > Please consider this for other tasks as well, for example the
> lemmatization. There should be open research on the Greek web for these
> things; maybe it's not public (due to the fact that GitHub and similar
> resources weren't in common use when the research was produced) but
> I'm positive that the authors would be willing to provide their work if we
> ask them. The idea is that if there are good sources of data, we should
> profit from work already done and not redo it (obviously the work remains
> to get it into a format spaCy can understand, but this is really easier than
> reproducing everything). I'd go one step further and even consider asking
> authors of good books on the subject (related to the Greek language) for
> this type of data. Or even contact organizations such as the Institute for
> Language and Speech Processing, and ask them.
> > >
> > > 4) Regarding the lemmatizer, spaCy doesn't mention much on the
> adding-language page, but I'm sure there must be discussion on the GitHub
> issues page. Is what you suggest your own approach, or what is advised?
> Don't get me wrong, this seems highly interesting, I'm just wondering if it
> is a bit too complicated (and thus a lot of noise in the results).
> Nevertheless, the approach seems very interesting and we should make sure
> this experience is well documented and transferred to the spaCy community
> > >
> > > In general, for all parts of the translation process it would make
> sense to check how they did it for other languages, just to profit from
> that experience (I'm sure you did work on this)
> > >
> > > I would also like to see a little more about text normalization and
> preprocessing, and how you plan to handle punctuation (if you end up
> collecting the list yourself). Dictionary numbers are huge (e.g. I see
> 800k lemmas for Polish and 1.4m for Turkish!), still we can do some manual
> review of the results.
> > >
> > > 5) On the 'extra' features, I'd like a very brief mention of how you
> would approach the categorization of the FEKs (btw that would make a really
> useful example of a Greek text classification how-to!)
> > >
> > > Sentiment analysis is also an interesting addition to the proposal;
> I'm wondering what you've tried here and what you have in mind.
> > >
> > > 6) Fix your github link on the CV, I think it's wrong
> > >
> > >
> > > Best,
> > > Markos
> > >
> > > On Fri, Mar 16, 2018 at 9:38 PM, Giannhs Daras <daras [ dot ] giannhs [ at ] gmail [ dot ] com> wrote:
> > > Hello Mr Gogoulos.
> > > Thanks for your answer!
> > > In my draft application, which is uploaded to the Google Summer of Code
> platform, there is a detailed timetable for the project implementation.
> Please confirm that you can see it. There is also a short bio. I will send
> you a private message with the link to that proposal in case you don't
> have access right now, but please let me know whether you have access
> to the one that I have uploaded to the platform.
> > > In this e-mail, I also include a personal CV in case you or someone
> else on the list wants something more detailed than the one I wrote in my
> application.
> > > Thanks again for your message and I hope to hear from you soon.
> > > Kind regards,
> > > Ioannis Daras
> > >
> > >
> > > On 16 March 2018 at 20:42, Markos Gogoulos <mgogoulos [ at ] mist [ dot ] io> wrote:
> > > Hi Giannhs,
> > >
> > > thanks for the proposal, can you also include a CV and a timetable of
> how you'd complete the project?
> > >
> > > Best,
> > > Markos
> > >
> > >
> > >
> > > On Fri, Mar 16, 2018 at 1:24 PM, Giannhs Daras <daras [ dot ] giannhs [ at ] gmail [ dot ] com> wrote:
> > > Dear Sir/Madam,
> > > My name is Ioannis Daras and I am a third-year student of Electrical and
> Computer Engineering at the National Technical University of Athens. Some
> days ago, I came across a very interesting GFOSS project for Google Summer
> of Code: adding the Greek language to the spaCy platform.
> > > I read the documentation and the challenge fascinated me. I also had
> an idea about an interesting feature we could add to the project. Because
> of that, I read the student manual and wrote my proposal accordingly. I
> saved it as a draft, so hopefully the organization has access to it.
> > > I would be really thankful if you could provide me with feedback on
> my application. I am deeply interested in the project, so I would really
> like to adjust my proposal in order to meet the organization's requirements.
> Please be as strict or benevolent as you like, but I would love real feedback.
> > > Thanks again and I look forward to your response,
> > > Ioannis Daras
> > >
> > >
> > >
> > > ----
> > > You are receiving this message from the list: A general discussion list
> for developers/contributors of open-source projects,
> > > https://lists.ellak.gr/opensource-devs/listinfo.html
> > >
> > > You can unsubscribe from the list by sending an empty email to
> <opensource-devs+unsubscribe [ at ] ellak [ dot ] gr>.
> > >
> > >
> > >
> > >
> > >
> >
> >
>
>
 
