Make sure that you don't have different-case email duplicates in `src/cncf-config/email-map`: `cd src`, `./lower_unique.sh cncf-config/email-map`.
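The dedupe step above can be sketched in plain shell. This is a toy stand-in for `./lower_unique.sh` (its assumed behavior: lowercase every mapping line, then drop duplicates), run against a throwaway file rather than the real map:

```shell
# Toy stand-in for ./lower_unique.sh (assumed behavior): lowercase all
# lines of the email map, then keep only the unique ones.
printf 'Alice@Example.COM company1\nalice@example.com company1\nbob@example.com company2\n' > /tmp/email-map
tr '[:upper:]' '[:lower:]' < /tmp/email-map | sort -u > /tmp/email-map.new
mv /tmp/email-map.new /tmp/email-map
```

After this pass the two different-case `alice` entries collapse into one line.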
- If you generated a new email map using `./import_affs.sh`, then: `mv email-map cncf-config/email-map`.
- To generate the `git.log` file and make sure it includes all orgs used by `devstats`, use cncf/devstats's `GHA2DB_PROJECTS_OVERRIDE="+cncf,+opencontainers,+istio,+spinnaker,+knative,+linux,+zephyr" PG_PASS=... GHA2DB_EXTERNAL_INFO=1 GHA2DB_PROCESS_REPOS=1 ./get_repos` and then run the final command line it generates. Make it `uniq`.
- To get repos from CDF use: `PG_PASS=... GHA2DB_EXTERNAL_INFO=1 GHA2DB_PROCESS_REPOS=1 GHA2DB_PROJECTS_YAML=cdf_projects.yaml get_repos`.
- To get GraphQL repos use: `AWS_PROFILE=... KUBECONFIG=... helm install ./devstats-helm-graphql --set skipSecrets=1,skipPVs=1,skipProvisions=1,skipCrons=1,skipGrafanas=1,skipServices=1,skipPostgres=1,skipIngress=1,bootstrapPodName=debug,bootstrapCommand=sleep,bootstrapCommandArgs={36000s}`, `AWS_PROFILE=... KUBECONFIG=... ../devstats-k8s-lf/util/pod_shell.sh debug`, `GHA2DB_EXTERNAL_INFO=1 GHA2DB_PROCESS_REPOS=1 GHA2DB_PROJECTS_YAML=gql/projects.yaml GHA2DB_LOCAL=1 get_repos`, `AWS_PROFILE=... KUBECONFIG=... kubectl delete po debug`.
- To get LF repos use: `AWS_PROFILE=... KUBECONFIG=... helm install ./devstats-helm --set skipSecrets=1,skipPVs=1,skipProvisions=1,skipCrons=1,skipGrafanas=1,skipServices=1,bootstrapPodName=debug,bootstrapCommand=sleep,bootstrapCommandArgs={36000s}`, `AWS_PROFILE=... KUBECONFIG=... ../devstats-k8s-lf/util/pod_shell.sh debug`, `ONLY='iovisor mininet opennetworkinglab opensecuritycontroller openswitch p4lang openbmp tungstenfabric cord' GHA2DB_PROPAGATE_ONLY_VAR=1 GHA2DB_EXTERNAL_INFO=1 GHA2DB_PROCESS_REPOS=1 GHA2DB_PROJECTS_YAML=k8s/projects.yaml GHA2DB_LOCAL=1 get_repos`, `AWS_PROFILE=... KUBECONFIG=... kubectl delete po debug`.
- Update `repos.txt` to contain all repositories returned by the above commands. Update `all_repos.sh` to include data from CNCF, CDF, LF and GraphQL.
- To run `cncf/gitdm` on a generated `git.log` file run: `cd src/; ~/dev/alt/gitdm/src/cncfdm.py -i git.log -r "^vendor/|/vendor/|^Godeps/" -R -n -b ./ -t -z -d -D -A -U -u -o all.txt -x all.csv -a all_affs.csv > all.out`. A newer approach is `./mtp`, but it does not (yet) have a way to deal with the same emails being mapped to different user names from different per-thread buckets.
- To generate human-readable text affiliation files, first run:
`./enchance_all_affs.sh`, then: `SKIP_COMPANIES="(Unknown)" ./gen_aff_files.sh`.
- If updating via `ghusers.sh` or `ghusers_cached.sh` (step 6), run `generate_actors.sh` too. If you need LF actors, run `AWS_PROFILE=... KUBECONFIG=... ./generate_actors_lf.sh` and `AWS_PROFILE=... KUBECONFIG=... ./generate_actors_gql.sh` prior to running `./generate_actors.sh` and `./generate_actors_cncf.sh`.
- Consider `./ghusers_cached.sh` or `./ghusers.sh` (if you run this, copy the resulting JSON somewhere and get 0-committers from the previous version to save GitHub API points). Sometimes you should just run `./ghusers.sh` without the cache.
- Recommended: `ghusers_partially_cached.sh 2> errors.txt` will refetch repo metadata and commits since the last fetch and get users' data from `github_users.json`, so you can save a lot of API points. You can prepend `NCPUS=N` to override autodetection of the number of available CPU cores.
- To copy the source type from the previous JSON version, run `./copy_source.sh`.
- Run `./company_names_mapping.sh` to fix typical company-name spelling errors, lower/upper case differences, etc. Update `company-names-mapping` before running this (with new typos/correlations data from the last 3 steps).
- To update (enhance) `github_users.json` with new affiliations, run `./enhance_json.sh`. If you run `ghusers` you may need to update `skip_github_logins.txt` with newly found broken GitHub logins. This is optional if you already have an enhanced JSON. You can prepend `NCPUS=N` to override autodetection of the number of available CPU cores.
- To merge with the previous JSON use: `./merge_jsons.sh`.
- To merge data from multiple GitHub logins (for example, to propagate a known affiliation to unknown or not-found entries with the same GitHub login) run: `./merge_github_logins.sh`.
- Because this can find new affiliations, you can now use `./import_from_github_users.sh` to import back from `github_users.json`, then run `./lower_unique.sh cncf-config/email-map` and restart from step 4. This uses the `company-names-mapping` file to import from the GitHub `company` field.
- Run `./correlations.sh` and examine its output `correlations.txt` to normalize company names and remove common suffixes like Ltd., Corp. and downcase/upcase differences.
- Run `./check_spell` to find fuzziness/spelling errors (it uses Levenshtein distance to find bugs).
- Run `./lookup_json.sh` and examine its output JSONs; those GitHub profiles have some useful data directly available, which will save you some manual research work.
- ALWAYS, before any commit to GitHub, run `./handle_forbidden_data.sh` to remove any forbidden affiliations; please also see `FORBIDDEN_DATA.md`.
- You can use `./clear_affiliations_in_json.sh` to clear all affiliations in a generated `github_users.json`.
- To make the JSON unique, call `./unique_json.rb github_users.json`. To sort the JSON by commits, login and email use: `./sort_json.rb github_users.json`.
- You should run genderize/geousers (if needed) before the next step.
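The unique+sort pass above can be sketched as follows. This is a toy stand-in for `./unique_json.rb` and `./sort_json.rb` (the sort direction, commits descending, is an assumption), using python3 from the shell on a throwaway file:

```shell
# Toy stand-in for ./unique_json.rb + ./sort_json.rb: drop duplicate
# records, then sort by commits (descending, assumed), login, email.
cat > /tmp/users.json <<'EOF'
[{"login":"bob","email":"bob@x.org","commits":2},
 {"login":"ann","email":"ann@x.org","commits":5},
 {"login":"ann","email":"ann@x.org","commits":5}]
EOF
python3 - <<'EOF'
import json
users = json.load(open('/tmp/users.json'))
# keep one record per (login, email, commits) triple
unique = {(u['login'], u['email'], u['commits']): u for u in users}
out = sorted(unique.values(),
             key=lambda u: (-u['commits'], u['login'], u['email']))
json.dump(out, open('/tmp/users.json', 'w'))
EOF
```

The duplicate `ann` record is dropped and the remaining two records are ordered by commit count.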
- You can create a smaller final JSON for `cncf/devstats` using: `./delete_json_fields.sh github_users.json; ./check_source.rb github_users.json; ./strip_json.sh github_users.json stripped.json; cp stripped.json ~/dev/go/src/github.com/cncf/devstats/github_users.json`.
- To generate the final `unknowns.csv` manual research task file run: `./gen_aff_task.rb unknowns.txt`. You can also generate all actors: `./gen_aff_task.rb alldevs.txt`. You can prepend `ONLY_GH=1` to skip entries without GitHub, or `ONLY_EMP=1` to skip entries with any affiliation already set.
- To manually edit all affiliation-related files, edit: `cncf-config/email-map`, `all.txt`, `all.csv`, `all_affs.csv`, `github_users.json`, `stripped.json`, `../developers_affiliations.txt`, `../company_developers.txt`, `affiliations.csv`.
- To add all possible entries from `github_users.json` to `cncf-config/email-map` use: `github_users_to_map.sh`. This is optional.
- Finally, copy `github_users.json` to `github_users.old`. You can check whether the JSON fields are correct via `./check_json_fields.sh github_users.json` and `./check_json_fields.sh stripped.json small`.
- If any file displays an error about 'Invalid UTF-8' encoding, scrub it using the Ruby tool: `./scrub.rb filename`.
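As a rough idea of what such a scrub does (this is not the repo's `./scrub.rb`, just a minimal stand-in using `iconv`), invalid bytes can simply be dropped:

```shell
# Minimal stand-in for a UTF-8 scrub: iconv -c silently discards bytes
# that are not valid UTF-8 (here \200 is an intentionally broken byte).
printf 'ok line\n\200bad byte\n' > /tmp/dirty.txt
iconv -f UTF-8 -t UTF-8 -c < /tmp/dirty.txt > /tmp/clean.txt
```

The cleaned file decodes as valid UTF-8, with only the offending byte removed.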
- Example `./all_repos_log.sh` call covering all orgs:

```
./all_repos_log.sh /root/devstats_repos/jenkins-x/* /root/devstats_repos/jenkinsci/* /root/devstats_repos/spinnaker/* /root/devstats_repos/tektoncd/* /root/devstats_repos/Azure/* /root/devstats_repos/BuoyantIO/* /root/devstats_repos/GoogleCloudPlatform/* /root/devstats_repos/OpenObservability/* /root/devstats_repos/RichiH/* /root/devstats_repos/Virtual-Kubelet/* /root/devstats_repos/alibaba/* /root/devstats_repos/apcera/* /root/devstats_repos/appc/* /root/devstats_repos/brigadecore/* /root/devstats_repos/buildpack/* /root/devstats_repos/cdfoundation/* /root/devstats_repos/cloudevents/* /root/devstats_repos/cncf/* /root/devstats_repos/containerd/* /root/devstats_repos/containernetworking/* /root/devstats_repos/coredns/* /root/devstats_repos/coreos/* /root/devstats_repos/cortexproject/* /root/devstats_repos/crosscloudci/* /root/devstats_repos/datawire/* /root/devstats_repos/docker/* /root/devstats_repos/dragonflyoss/* /root/devstats_repos/draios/* /root/devstats_repos/envoyproxy/* /root/devstats_repos/etcd-io/* /root/devstats_repos/falcosecurity/* /root/devstats_repos/fluent/* /root/devstats_repos/goharbor/* /root/devstats_repos/grpc/* /root/devstats_repos/helm/* /root/devstats_repos/istio/* /root/devstats_repos/jaegertracing/* /root/devstats_repos/knative/* /root/devstats_repos/kubeedge/* /root/devstats_repos/kubernetes/* /root/devstats_repos/kubernetes-client/* /root/devstats_repos/kubernetes-csi/* /root/devstats_repos/kubernetes-graveyard/* /root/devstats_repos/kubernetes-helm/* /root/devstats_repos/kubernetes-incubator/* /root/devstats_repos/kubernetes-incubator-retired/* /root/devstats_repos/kubernetes-retired/* /root/devstats_repos/kubernetes-security/* /root/devstats_repos/kubernetes-sig-testing/* /root/devstats_repos/kubernetes-sigs/* /root/devstats_repos/linkerd/* /root/devstats_repos/lyft/* /root/devstats_repos/miekg/* /root/devstats_repos/nats-io/* /root/devstats_repos/open-policy-agent/* /root/devstats_repos/opencontainers/* /root/devstats_repos/openeventing/* /root/devstats_repos/opentracing/* /root/devstats_repos/pingcap/* /root/devstats_repos/prometheus/* /root/devstats_repos/rkt/* /root/devstats_repos/rktproject/* /root/devstats_repos/rook/* /root/devstats_repos/spiffe/* /root/devstats_repos/telepresenceio/* /root/devstats_repos/theupdateframework/* /root/devstats_repos/tikv/* /root/devstats_repos/torvalds/* /root/devstats_repos/uber/* /root/devstats_repos/virtual-kubelet/* /root/devstats_repos/vitessio/* /root/devstats_repos/vmware/* /root/devstats_repos/weaveworks/* /root/devstats_repos/youtube/* /root/devstats_repos/zephyrproject-rtos/* /root/devstats_repos/iovisor/* /root/devstats_repos/mininet/* /root/devstats_repos/open-switch/* /root/devstats_repos/opencord/* /root/devstats_repos/opennetworkinglab/* /root/devstats_repos/opensecuritycontroller/* /root/devstats_repos/p4lang/* /root/devstats_repos/tungstenfabric/*
```
- Open the CNCF projects maintainers list.
- Save the "Name", "Company" and "GitHub name" columns to a new sheet and download it as "maintainers.csv".
- Add a "name,company,login" CSV header.
- Example file.
- Run the `./maintainers.sh` script (optionally prepending `ONLYNEW=1`). Follow its instructions.
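Based on the steps above, the expected shape of `maintainers.csv` is the added header line followed by the sheet's rows; the data row below is made up for illustration:

```shell
# Expected maintainers.csv shape: "name,company,login" header plus data
# rows (the row here is a made-up example; real rows come from the sheet).
cat > /tmp/maintainers.csv <<'EOF'
name,company,login
Jane Doe,Example Corp,janedoe
EOF
```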
Please follow the instructions from `ADD_PROJECT.md`.
To add geo data (`country_id`, `tz`) and gender data (`sex`, `sex_prob`), do the following:
- Download the `allCountries.zip` file from the geonames server.
- Create the `geonames` database via: `sudo -u postgres createdb geonames`, `sudo -u postgres psql -f geonames.sql`. Table details are in `geonames.info`.
- Unzip `allCountries.zip` and run `PG_PASS=... ./geodata.sh allCountries.tsv`; this will populate the DB.
- Create indices on columns to speed up localization: `sudo -u postgres psql -f geonames_idx.sql`.
- If this is the first geousers run, create `geousers_cache.json` via `cp empty.json geousers_cache.json`.
- To use the cache, it is best to have `stripped.json` from the previous run. See step 22.
- Enhance `github_users.json` via `PG_PASS=... ./geousers.sh github_users.json stripped.json geousers_cache.json 2000`. It will add the `country_id` and `tz` fields.
- Go to store.genderize.io and get your `API_KEY`; the basic subscription ($9) allows 100,000 gender lookups per month.
- If this is the first genderize run, create `genderize_cache.json` via `cp empty.json genderize_cache.json`.
- Enhance `github_users.json` via `API_KEY=... ./genderize.sh github_users.json stripped.json genderize_cache.json 2000`. It will add the `sex` and `sex_prob` fields.
- You can skip `API_KEY=...`, but then only 1000 gender lookups per day are allowed.
- Copy the enhanced JSON to devstats: `./strip_json.sh github_users.json stripped.json; cp stripped.json ~/dev/go/src/devstats/github_users.json`.
- Import the new JSON on devstats using the `./import_affs` tool.
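Conceptually, both enhancement steps above attach extra fields to each user record, keyed off data already in the JSON. This is a toy sketch of that idea (not the real `geousers.sh`/`genderize.sh`, and the cache format shown is an assumption), using python3 from the shell:

```shell
# Toy sketch of the enhancement idea: look each user's location up in a
# cache and merge the cached fields (country_id, tz) into the record.
cat > /tmp/gh_users.json <<'EOF'
[{"login":"janedoe","location":"Warsaw, Poland","name":"Jane"}]
EOF
cat > /tmp/geo_cache.json <<'EOF'
{"Warsaw, Poland": {"country_id": "pl", "tz": "Europe/Warsaw"}}
EOF
python3 - <<'EOF'
import json
users = json.load(open('/tmp/gh_users.json'))
cache = json.load(open('/tmp/geo_cache.json'))
for u in users:
    hit = cache.get(u.get('location', ''))
    if hit:
        u.update(hit)  # adds country_id and tz, as the real tool is said to
json.dump(users, open('/tmp/gh_users.json', 'w'))
EOF
```

The genderize step is analogous, merging `sex` and `sex_prob` instead.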
- To import manual affiliations from a Google sheet, save the sheet as `affiliations.csv` and then use the `./affiliations.sh` script.
- Prepend `UPDATE=1` to only import rows marked as changed: column `changes='x'`.
- Prepend `DBG=1` to enable verbose output.
- After finishing the import, add a status line to the `affiliations_import.txt` file and update the online spreadsheet.
- After importing new data, run `./src/burndown.sh` (from src's parent directory). Do this after processing all the data mentioned here, not right after importing a new CSV.
- Import the generated `csv/burndown.csv` data into https://docs.google.com/spreadsheets/d/1RxEbZNefBKkgo3sJ2UQz0OCA91LDOopacQjfFBRRqhQ/edit?usp=sharing.
- To calculate the CNCF/LF ratio, divide (the number of CNCF actors found at the last commit minus the number found at some previous commit) by the same difference computed for all actors.
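The ratio calculation above can be worked through with made-up counts (all four numbers below are illustrative, not real data):

```shell
# Worked example of the CNCF/LF ratio with made-up counts.
cncf_now=1200; cncf_prev=1000   # CNCF actors found: last vs. previous commit
all_now=5000;  all_prev=4500    # the same counts for all actors
ratio=$(awk -v c=$((cncf_now - cncf_prev)) -v a=$((all_now - all_prev)) \
  'BEGIN { printf "%.2f", c / a }')
echo "CNCF/LF ratio: $ratio"
```

Here the CNCF delta is 200, the all-actors delta is 500, so the ratio is 0.40.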